Imagine a world where we do not receive recommendations of any kind. Given today's abundance of information, it would be very difficult to find something that might interest us. That is why we so often learn about movies, books, and music we like through a friend or by searching for something that suits our tastes.
This series covers key topics in Recommender Systems, which are solutions that aim to personalize items or topics of any kind for a user based on their preferences. In this first part of the Recommender System Series, we introduce some concepts and focus on one type of Recommender System: Content-Based Filtering (CBF). This post is accompanied by a Jupyter Notebook that provides practical examples on a real-world dataset (Amazon Reviews).
Real-World Applications
Since the popularization of the web, where users provide feedback implicitly or explicitly about what they like or dislike, Recommender Systems have proven successful at providing users with personalized content or service recommendations while helping companies increase their sales. For example, Lee and Hosanagar[1] found that in their experimental setting, using Collaborative Filtering (CF) methods led to a 35% increase in item purchases over a control group where no recommender was in place.
Real-world applications of Recommender Systems are easy to find in our daily lives. There is a wide variety of systems that recommend many types of content. We receive friendship recommendations on social media, video and music recommendations from streaming services, product recommendations from digital commerce platforms, restaurant recommendations, and many more.
Data Collection
A Recommender System depends on user, item, and user-item interaction data to do its job. This data can be collected in diverse ways, for instance, when the user explicitly rates an item. However, explicit feedback is not always available. In these cases, we can infer implicitly whether a user liked or disliked an item by tracking their interactions on the web, e.g., click or visit data that represent user behavior.
Types of Recommender Systems
There are two main approaches for implementing Recommender Systems.
Collaborative Filtering (CF): These methods collect preferences in the form of ratings or signals from many users (hence the name), and then recommend items to a user based on the item interactions of people who had similar tastes to that user in the past. In other words, these methods assume that if a person X likes a subset of the items that a person Y likes, then X is more likely to share Y's opinion on a given item than the opinion of a randomly chosen person, who may or may not have the same preferences.
Content-Based Filtering (CBF): These methods use attributes and descriptions of items and/or textual profiles of users to recommend content similar to what they like. This way, items that are close to what the user has liked in the past, or that match what they explicitly say they like, are recommended. The assumption here is that items with similar attributes will be rated similarly; that is, if a user liked item X in the past, it is likely that they will also like a similar item Y.
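The Collaborative Filtering assumption (people whose past ratings agree tend to agree on new items) can be illustrated with a minimal user-based sketch. The users, items, and ratings below are hypothetical, and cosine similarity over co-rated items is only one of several possible similarity choices, not necessarily the one used in the accompanying notebook.

```python
import math

# Hypothetical explicit ratings (user -> {item: rating}); illustrative values only.
ratings = {
    "X": {"item_a": 5, "item_b": 4},
    "Y": {"item_a": 5, "item_b": 4, "item_c": 5},
    "Z": {"item_a": 1, "item_b": 5, "item_c": 1},
}

def cosine_on_shared_items(u, v):
    """Cosine similarity restricted to the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_on_shared_items(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None

sim_xy = cosine_on_shared_items(ratings["X"], ratings["Y"])
sim_xz = cosine_on_shared_items(ratings["X"], ratings["Z"])
predicted = predict("X", "item_c")
print(sim_xy, sim_xz, predicted)
```

Here X's ratings match Y's on the shared items, so Y's opinion of item_c weighs more in the prediction than Z's. Cosine on raw ratings is a simplification; mean-centering each user's ratings first is a common refinement.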
Advantages and Disadvantages
In the case of Collaborative Filtering, users only interact with a fraction of the items, so most user-item interactions are unobserved, which can lead to imprecise recommendations. There is also the well-known cold-start problem: a new item has no interaction data because no one has rated it yet. For instance, a movie recommendation application using this method would not recommend a new movie until it starts getting reviews from the user base.
Unlike CF methods, CBF methods do not suffer from the cold-start problem for new items, since they base their recommendations on the item content itself. However, content-based methods also have their fair share of problems, such as overspecialization, which refers to recommending very similar items to the user due to a limited user profile. This is problematic because offering little or no diversity in the recommendations increases the risk of the user not liking any of the recommended items. The disadvantages of these methods will be explored in the Jupyter Notebook with examples.
Since both methods have advantages and disadvantages, there are hybrid methods that combine the strengths of both to offset each other's weaknesses. For instance, a hybrid method could use content information in a cold-start setting and avoid overspecialization once a user has provided a considerable number of ratings. There are different ways to implement this combination of methods, and they will be explored in a later post.
Content-Based Filtering
There are different approaches to implementing CBF models. In general, they revolve around creating item attributes by using Text-Mining techniques. It is possible to use item features alone and then proceed with finding similarities between items to drive the recommendations. For this post, though, we will build a user profile based on user-rated content and item attributes.
Bobadilla et al.[2] explained the mechanism through which CBF models operate:
- Extract the attributes of items for recommendation
- Compare the attributes of items with the preferences of the active user
- Recommend items with characteristics that fit the user’s interests
Step 1: It is common practice to extract relevant keywords from content (e.g., item descriptions and other textual fields) to form the item's attributes. One way to encode textual data so that a Machine Learning (ML) model can leverage it is the Term Frequency-Inverse Document Frequency (TF-IDF) measure, which weights the importance of a word within a document (in this case, the item's content). It is computed by counting the frequency of a term relative to all terms in an item (TF) and then weighting it inversely by the number of items that contain the term (IDF). This reduces noise by giving more importance to terms that are frequent in the item at hand but rare across the item dataset, so common words that are irrelevant to the recommendation context, such as 'the', 'a', and 'is', receive less weight. The formula for TF-IDF is as follows:
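A common formulation consistent with this description is sketched below. The symbols are assumptions introduced here: f_{t,d} is the number of times term t occurs in item d, N is the total number of items, and n_t is the number of items containing t. The natural logarithm and the absence of smoothing are also assumptions; libraries often implement slightly different variants.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\mathrm{idf}(t) = \log\frac{N}{n_t},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this form, a term that appears in every item has idf(t) = log(1) = 0, so it contributes nothing to any item's attributes.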
For example, imagine we want to recommend books from the Amazon dataset, and for simplicity, we use four book titles as the item catalog.
- Think and Grow Rich
- Rich Dad Poor Dad
- Without Remorse
- How to Get Rich
By applying TF-IDF to this corpus, we obtain the following item matrix I, where each row represents an item and each column represents one of its attributes (keywords).
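One way to reproduce such a matrix is sketched below in plain Python. It assumes lowercase whitespace tokenization, no stop-word removal, and the unsmoothed natural-log IDF, so the exact values may differ from those produced in the notebook or by a library implementation.

```python
import math
from collections import Counter

titles = [
    "Think and Grow Rich",
    "Rich Dad Poor Dad",
    "Without Remorse",
    "How to Get Rich",
]

# Assumption: lowercase + whitespace tokenization, no stop-word removal.
docs = [title.lower().split() for title in titles]
vocab = sorted({term for doc in docs for term in doc})

# Number of items (documents) that contain each term.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

def tfidf_row(doc):
    """TF-IDF weights of one item over the whole vocabulary."""
    counts = Counter(doc)
    return [
        (counts[term] / len(doc)) * math.log(len(docs) / doc_freq[term])
        for term in vocab
    ]

# Item matrix I: one row per title, one column per keyword in `vocab`.
item_matrix = [tfidf_row(doc) for doc in docs]

print(vocab)
for title, row in zip(titles, item_matrix):
    print(f"{title:<20}", [round(weight, 3) for weight in row])
```

Because "rich" occurs in three of the four titles, it receives a lower weight than keywords that are unique to a single title, such as "think" or "remorse".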
Step 2: There are multiple ways to build a representation of user tastes: data can be collected explicitly (from questionnaires) or implicitly (from the user's past behavior, such as likes or clicks on products). One common way to represent user preferences is to combine user ratings with the attributes of the rated items. The simplest approach is to multiply each user rating by the corresponding row of the matrix of user-rated items, which in this case contains weights for the importance of each keyword in the item's content, and then average the resulting rows to obtain a user profile vector.
For instance, say a user has rated the first two items of matrix I with ratings 5 and 4, respectively. To combine ratings with item attributes, we multiply each rating by the corresponding row among the first two rows of matrix I to obtain the U' matrix:
Then, average the values of each attribute (i.e., keywords of the item’s content) to obtain the user profile vector U:
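In symbols, these two steps can be sketched as follows. The notation is introduced here for illustration: R is the set of items the user has rated, r_j is the rating given to item j, and I_j is the j-th row of the item matrix I.

```latex
U'_j = r_j \, I_j \quad (j \in R),
\qquad
U = \frac{1}{|R|} \sum_{j \in R} U'_j
```

For the example above, with ratings 5 and 4 on the first two items, this gives U'_1 = 5 I_1, U'_2 = 4 I_2, and U = (5 I_1 + 4 I_2) / 2.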
Step 3: This profile vector can then be compared against each row (item) of matrix I by using a similarity measure. In this case, we use Cosine Similarity, which measures the cosine of the angle between two vectors, regardless of their magnitudes. This means that vectors pointing in the same direction yield a similarity of 1 and are treated as identical, whatever their lengths. This property is relevant for our use case because we don't need to normalize the user profile vector before computing the similarities. Cosine Similarity is computed as follows:
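The standard definition, written with the user profile vector U and an item row I_j (the index k over keywords is notation introduced here), is:

```latex
\cos(U, I_j) = \frac{U \cdot I_j}{\lVert U \rVert \, \lVert I_j \rVert}
= \frac{\sum_{k} U_k \, I_{jk}}{\sqrt{\sum_{k} U_k^2} \; \sqrt{\sum_{k} I_{jk}^2}}
```

Scaling U by any positive constant multiplies the numerator and the denominator by the same factor, which is why the profile does not need to be normalized beforehand.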
We can now compute the similarity between the user profile and the items not rated by the user:
Finally, we can return the items most similar to those that the user liked in the past. In this example, the vector in the fourth row of the item matrix (How to Get Rich) is more similar to the user profile vector than the third (Without Remorse), since it shares the keyword "Rich" with the rated items.
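The three steps can be put together in a short end-to-end sketch. As before, this is an illustration under stated assumptions (whitespace tokenization, no stop-word removal, unsmoothed natural-log IDF) rather than the notebook's implementation, and the helper names are hypothetical.

```python
import math
from collections import Counter

titles = ["Think and Grow Rich", "Rich Dad Poor Dad", "Without Remorse", "How to Get Rich"]
ratings = {0: 5, 1: 4}  # row index in the item matrix -> rating given by the user

# Step 1: item matrix I (assumed TF-IDF variant: relative frequency * ln(N / n_t)).
docs = [title.lower().split() for title in titles]
vocab = sorted({term for doc in docs for term in doc})
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    return [(counts[t] / len(doc)) * math.log(len(docs) / doc_freq[t]) for t in vocab]

item_matrix = [tfidf_row(doc) for doc in docs]

# Step 2: rating-weighted rows (U'), averaged column-wise into the profile U.
weighted_rows = [[rating * w for w in item_matrix[i]] for i, rating in ratings.items()]
user_profile = [sum(column) / len(weighted_rows) for column in zip(*weighted_rows)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Step 3: score only the items the user has not rated, most similar first.
scores = {
    titles[i]: cosine_similarity(user_profile, row)
    for i, row in enumerate(item_matrix)
    if i not in ratings
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['How to Get Rich', 'Without Remorse']
```

"Without Remorse" shares no keyword with the rated titles, so its similarity is exactly 0, while "How to Get Rich" scores above 0 through the shared keyword "rich".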
The disadvantage of using the mean to derive a profile is that it yields recommendations that are close to the average of the preferred items. As a result, the Recommender System tends not to recommend items that are in the tails of the distribution.
Conclusion
We have covered what recommender systems are, some real-world applications, the different types of methods with their advantages and disadvantages, how data can be collected, and how CBF works through a simple example. Remember to check out the Jupyter Notebook to see a Python implementation that highlights the concepts explored in this post. We will discuss CF and Neighborhood-Based Methods in the following post.
References
[1] Lee, D., & Hosanagar, K. (2014). Impact of recommender systems on sales volume and diversity. Proceedings of the 2014 International Conference on Information Systems.
[2] Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46, 109-132.
Acknowledgment
This piece was written by Daniel Pinheiro Franco, Innovation Expert at Encora’s Data Science & Engineering Technology Practices group. Thanks to João Caleffi and Caio Dadauto for reviews and insights.