Build a Recommendation System for Online Courses
Your company runs a marketplace of online learning courses. You are tasked with building a recommendation system to generate a top five list tailored for each customer. Describe how you would build a recommendation engine for this task.
Start with content based filtering. How would you select a similarity method?
Now, think about collaborative filtering. How would this improve the content based filtering method?
Users start with no course ratings. How can we handle the cold start problem? (Cold start is the problem where the system cannot yet draw enough inferences about a user to make a recommendation.)
What would happen if we recommended the most popular course in the marketplace to every new potential buyer?
Start building the recommendation system with the first important step of data collection. The data can be collected by two means:
- Explicit input data: user wish lists, previous orders, course ratings
- Implicit input data: search history

As we evaluate what data is available to us, prioritize getting explicit data first. Users' past actions tend to be the best predictor of future behavior. Implicit data is helpful but less reliable, so focus on explicit inputs. Also, do not depend too heavily on course ratings, since many users will use the marketplace often and never leave a rating. With these data considerations in mind, we need to pick a modeling method that lets us analyze the foundational premise of recommendation systems: that there is overlap between user preferences and purchase histories that predicts future behavior.
Content Based Filtering
We try content based filtering as our first approach focused on explicit input from users. Content based filtering starts with a key assumption: the user is more likely to purchase courses that are similar to ones that he or she has rated well in the past. This will recommend courses in the same genre as the ones the user has interacted with or purchased in the past. All information about the user is stored in a user vector representing all courses the user has rated or purchased. It is worth pointing out that ratings are less common in most applications and less trustworthy than purchase data. To handle this, we use fractional weights for ratings data and integers for purchases:
(e.g. User 3’s score of 1.3 for Course 4 means the user action of purchasing the course was +1 and the user’s rating of ⅗ was worth 0.3 weight. In this example, User 3 is more similar in behavior to User 2, so we would recommend Course 5 over Course 2, the course liked by User 1)
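The weighting scheme above can be sketched in a few lines. The interaction log and matrix shape here are hypothetical, chosen only to reproduce the 1.3 score from the example (purchase = +1, a rating of r out of 5 contributes 0.1 × r):

```python
import numpy as np

# Hypothetical interaction log: (user, course, rating out of 5 or None).
# Indices are 0-based, so "User 3 / Course 4" from the text is (2, 3).
interactions = [
    (0, 1, None),  # User 1 bought Course 2 but left no rating
    (2, 3, 3),     # User 3 bought Course 4 and rated it 3/5
]

n_users, n_courses = 3, 5
scores = np.zeros((n_users, n_courses))
for user, course, rating in interactions:
    scores[user, course] += 1.0               # purchase weight (+1)
    if rating is not None:
        scores[user, course] += 0.1 * rating  # fractional rating weight

print(scores[2, 3])  # 1.3 = 1 (purchase) + 0.3 (3/5 rating)
```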
Each course also has a vector containing categorical/genre information about the course. This matrix of course vectors will look like the user vector matrix above, except the rows will be courses and the columns will be course metadata. Mathematically, content-based filtering measures the angle between the user vector and the course vector. This helps compare users because the algorithm tries to find the most similar purchase and rating patterns over the same courses. There are three commonly used methods of comparing the user and course vectors:
- Cosine Similarity method: the cosine of the angle between two vectors. It is the dot product of the two vectors divided by the product of the two vectors' magnitudes.

- Euclidean Distance method: similar courses will be found close to each other if plotted in n-dimensional space (n being the number of descriptive course features used). We can calculate the distance between courses and, based on that distance, recommend the nearest ones to the user.
- Pearson’s Correlation method: rank courses by the correlation of their attributes; the more highly correlated they are, the more similar.
Each method has advantages. The cosine similarities of a subset of the original data are the same as those of the original data, which is not true for the Pearson correlation. Both Pearson correlation and cosine similarity are invariant to scaling. This is an important property: you often don't care that two vectors are similar in absolute terms, only that they vary in the same way. The Euclidean method exaggerates large differences while ignoring small ones (small differences squared become tiny numbers, large differences squared become huge numbers). Thus if two vectors almost agree everywhere, they will be regarded as very similar.
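The three measures can be sketched with plain NumPy. The vectors below are made-up user vectors, included only to demonstrate the scale-invariance property discussed above:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vectors' magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Smaller distance means more similar; sensitive to magnitude.
    return np.linalg.norm(a - b)

def pearson_correlation(a, b):
    # Equivalent to cosine similarity of the mean-centered vectors.
    return cosine_similarity(a - a.mean(), b - b.mean())

u = np.array([1.3, 0.0, 1.0, 0.0])
v = np.array([2.6, 0.0, 2.0, 0.0])  # same pattern, doubled scale

print(cosine_similarity(u, v))    # 1.0 -- scale-invariant
print(pearson_correlation(u, v))  # 1.0 -- also scale-invariant
print(euclidean_distance(u, v))   # > 0 -- penalizes the scale change
```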
The limitation of the content based filtering approach is that it only recommends courses whose metadata matches what the user has already engaged with. It will never recommend a result that the user has never searched for, rated, or purchased. Yes, this is a valid way to build a recommendation system, but it is the simple approach. A better approach would be an algorithm that can also recommend courses based on the behavior of similar users, in addition to the single user’s history.
Collaborative Filtering
The collaborative filtering algorithm builds on the content based algorithm by generalizing across users. Instead of recommending from a single user’s course history, we first build a similar-user model based on each user’s previously purchased courses and group users by similar patterns. In pseudocode, to recommend for user x:
- For all courses C, initialize the score matrix to 0 (score[C] = 0)
- For each other user y:
- If y’s scored courses match x’s preferences, increment score[C] for every course C that user y likes, by +0.1 × rating and by +1 for every purchased course
- Find the course with the highest score and return it

We use the same similarity measure as before, the Pearson correlation, to compare users over the courses both have rated. Based on the Pearson results, recommendations can be made. If calculating all combinations of users and courses is too computationally expensive, clustering can be used to limit the number of “neighbors” (similar users) to be considered.
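A minimal sketch of the user-based approach, combining the pseudocode above with the Pearson correlation as the similarity measure. The score matrix is hypothetical (built with the purchase/rating weights from earlier), and `np.corrcoef` computes the Pearson correlation:

```python
import numpy as np

# Hypothetical score matrix: rows = users, columns = courses.
# 1.0 = purchased; 1.0 + 0.1 * rating if also rated; 0 = no interaction.
scores = np.array([
    [1.3, 0.0, 1.0, 0.0, 0.0],  # target user x
    [1.3, 0.0, 1.2, 1.0, 0.0],  # behaves similarly to x
    [0.0, 1.5, 0.0, 0.0, 1.0],  # behaves differently from x
])

def recommend(scores, x, min_similarity=0.0):
    """User-based collaborative filtering following the pseudocode:
    weight each neighbor's scores by their Pearson correlation with
    user x, then return the best course x has not interacted with."""
    target = scores[x]
    totals = np.zeros(scores.shape[1])
    for y, other in enumerate(scores):
        if y == x:
            continue
        sim = np.corrcoef(target, other)[0, 1]  # Pearson correlation
        if sim > min_similarity:                # keep similar neighbors only
            totals += sim * other
    totals[target > 0] = -np.inf  # don't re-recommend owned courses
    return int(np.argmax(totals))

print(recommend(scores, x=0))  # course 3, purchased by the similar user
```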
Next, we continue with the course recommendation. We assume in our answer that there are more users than courses. This motivates extending our user-based collaborative filtering to course-course collaborative filtering, computing the correlation between each pair of courses as defined by the score function in the pseudocode above. Courses scored similarly by many users are correlated. The fundamental difference from content filtering is that we are now summing scores across the entire vector of courses to measure similarity, instead of evaluating course metadata. Performing this weighted sum requires a history of user purchases and ratings.
The best recommendation engines provide well-received options but struggle with new users or new courses, since both are represented as empty vectors (all values equal to zero). For users with no history (user cold start), we must pick an approach to fill in the empty values. One approach is to recommend the most popular courses along any attribute we know about the user. After a couple of ratings or purchases, we can switch to personalized recommendations. The same consideration applies to any new course added to the platform: use the content and classification of the course to build a similar-course recommendation, encouraging users to purchase and then rate the course until there are enough ratings for good personalization. Consider using aggregate information about existing courses to provide a starting recommendation based on the category to which the new course belongs.
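The popularity fallback for user cold start can be sketched as follows. The score matrix and the empty-vector check are illustrative assumptions, not a production-ready design:

```python
import numpy as np

# Hypothetical score matrix: rows = users, columns = courses.
scores = np.array([
    [1.3, 0.0, 1.0],
    [1.0, 1.2, 0.0],
    [0.0, 0.0, 0.0],  # brand-new user: empty vector (cold start)
])

def top_courses(scores, user, k=2):
    """Fall back to marketplace-wide popularity when the user's
    vector is empty; otherwise rank by the personalized scores."""
    if scores[user].sum() == 0:            # cold start detected
        popularity = scores.sum(axis=0)    # aggregate over all users
        ranking = np.argsort(-popularity)
    else:
        ranking = np.argsort(-scores[user])
    return [int(c) for c in ranking[:k]]

print(top_courses(scores, user=2))  # most popular courses: [0, 1]
print(top_courses(scores, user=0))  # personalized ranking: [0, 2]
```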
Comparison of Methods
To evaluate the success of our new model, the most common evaluation metrics for recommendation engines during testing (offline mode) are the following:
- Recall (how many relevant courses were recommended, out of all the courses the user actually found relevant)
- Precision (how many recommended courses were relevant, out of all courses recommended)
- RMSE (Root Mean Squared Error)
- MRR (Mean Reciprocal Rank)
- MAP at k (Mean Average Precision at cutoff k)
- NDCG (Normalized Discounted Cumulative Gain)
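Precision and recall at a cutoff k, the first two offline metrics above, can be computed per user as follows. The recommended list and relevant set are made-up values for illustration:

```python
def precision_recall_at_k(recommended, relevant, k=5):
    """Offline precision@k and recall@k for one user.
    recommended: ranked list of course ids from the engine.
    relevant: set of course ids the user actually engaged with."""
    top_k = recommended[:k]
    hits = sum(1 for course in top_k if course in relevant)
    precision = hits / len(top_k)                     # of what we showed
    recall = hits / len(relevant) if relevant else 0  # of what they liked
    return precision, recall

p, r = precision_recall_at_k([4, 7, 1, 9, 3], {4, 1, 8}, k=5)
print(p, r)  # 0.4 precision; 2 of the 3 relevant courses surfaced
```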
In the online scenario, business metrics are much more important, such as:
- Customer lifetime value
- A/B testing
- Product/marketplace ROI
- Difficulty of implementation and ability to tune
- Click-through-rate
- Conversion rate
- BEST: create a new session with no cookies and check your intuition: does your recommendation engine return courses you think match the history you gave your test user?