Recommendation System Series Part 8: The 14 Properties To Take Into Account When Evaluating Real-World Recommendation Systems

Update: This article is part of a series where I explore recommendation systems in academia and industry. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, and Part 7.

In information retrieval, evaluation metrics are used to judge and compare the performance of recommendation models on benchmark datasets. Good quantitative assessments of their accuracy are crucial to building successful recommendation systems.

  • For a typical offline recommendation problem, we randomly select the training and test samples from the dataset. Then we build a predictive model on the training samples, evaluate it on the test samples, and measure its performance using selected evaluation metrics.

  • For an online recommendation problem, we want to conduct online evaluations of the best-performing offline models on a controlled A/B testing platform. We evaluate the differences in online user behavior and other performance metrics to determine whether such differences are statistically significant before choosing a new model; a minimal significance-test sketch follows this list.
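
As a rough illustration of the online case, here is a minimal sketch of a two-proportion z-test comparing the click-through rates of a control and a treatment recommender. All user counts, click counts, and names are made up for illustration; this is a sketch, not a prescription for a full A/B testing pipeline.

```python
# Hypothetical example: comparing click-through rates of a control and a
# treatment recommender in an A/B test with a two-proportion z-test.
import math

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Return the z statistic and two-sided p-value for the difference
    between two observed proportions (e.g., click-through rates)."""
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)   # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (clicks_b / users_b - clicks_a / users_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))             # two-sided p-value
    return z, p_value

# Illustrative numbers only: 10,000 users per bucket.
z, p = two_proportion_z_test(clicks_a=820, users_a=10_000,
                             clicks_b=880, users_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # deploy only if p is below the chosen threshold
```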

Various properties are commonly considered when choosing the recommendation approach, whether for offline or online scenarios. These properties have trade-offs, so it is critical to understand and evaluate their effects on the overall performance and the user experience. During my research on this topic, I came across a comprehensive review from Guy Shani and Asela Gunawardana back in 2009 that discusses how to compare recommendations based on a set of properties that are relevant for the application. This blog post is my attempt to summarize these properties succinctly.

Note

This post is part of my ongoing series on Recommendation Systems:

  • Part 1 provided a high-level overview of recommendation systems, how they are built, and how they can be used to improve businesses across industries.

  • Part 2 provided a nice review of the ongoing research initiatives about these models' strengths and application scenarios.

  • Part 3 provided a couple of research directions relevant to the recommendation system scholar community.

  • Part 4 provided the nitty-gritty mathematical details of 7 variants of matrix factorization that can be constructed: ranging from the use of clever side features to the application of Bayesian methods.

  • Part 5 provided the architecture design of 5 variants of multi-layer perceptron based collaborative filtering models, which are discriminative models that can interpret the features in a non-linear fashion.

  • Part 6 provided a master class on 6 variants of autoencoders based collaborative filtering models, which are generative models that are superior in learning latent feature representation.

  • Part 7 provided a peek into 3 variants of Boltzmann machines based collaborative filtering models, which have remarkable similarities to learning in the brain.

1 — User Preference

An obvious option is to run a user study (within-subjects) and ask the participants to choose one of the systems. This evaluation does not restrict the subjects to specific properties, and it is generally easier for humans to make such judgments than to give scores for the experience. We can then select the system with the largest number of votes.

However, there are concerns that we must be aware of:

  • We assume that all users are equal, which may not be true. Assigning the right importance weights in a user study may not be easy.

  • We also need to find a way to calibrate scores across users since the level of preferences might vary widely.

  • While user satisfaction is important to measure, breaking satisfaction into smaller components helps understand the system and improve it.

2 — Prediction Accuracy

Prediction accuracy is the most discussed property in the recommendation system literature. At the base of the vast majority of recommender systems lies a prediction engine. This engine may predict user opinions over items (e.g., ratings of movies) or the probability of usage (e.g., purchase). A basic assumption in a recommender system is that the user will prefer a system that provides more accurate predictions. Thus, many researchers set out to find algorithms that provide better predictions.

There are three broad classes of prediction accuracy measures:

2.1 — Measuring Ratings Prediction Accuracy

In applications like Netflix or Hulu, we wish to predict a user's rating of an item. In such cases, we wish to measure the accuracy of the system’s predicted ratings. Popular metrics used to evaluate the accuracy of predicted ratings are Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE disproportionately penalizes large errors compared to MAE; a minimal sketch of these metrics follows the list below.

  • Normalized RMSE and Normalized MAE are versions of RMSE and MAE that have been normalized by the rating range.

  • Average RMSE and Average MAE adjust for unbalanced test sets. If we need a measure that represents the prediction error on any item, it is preferable to compute MAE or RMSE separately for each item and then take the average over all items.
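
Here is a minimal sketch of these metrics, assuming we already have arrays of true and predicted ratings plus the item id of each rating; the toy numbers are purely illustrative.

```python
# Rating-accuracy metrics: RMSE, MAE, their normalized versions, and a
# per-item average that keeps heavily rated items from dominating the score.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def normalized(metric_value, r_max, r_min):
    # Normalized RMSE/MAE: divide by the rating range (e.g., 5 - 1 = 4).
    return metric_value / (r_max - r_min)

def per_item_average(y_true, y_pred, item_ids, metric=mae):
    # Average RMSE/MAE: compute the metric separately per item, then average.
    scores = [metric(y_true[item_ids == i], y_pred[item_ids == i])
              for i in np.unique(item_ids)]
    return np.mean(scores)

# Toy data on a 1-5 rating scale.
y_true = np.array([5, 3, 4, 2, 5, 1])
y_pred = np.array([4.5, 3.5, 3.0, 2.5, 4.0, 2.0])
items  = np.array([0, 0, 1, 1, 2, 2])
print(rmse(y_true, y_pred), mae(y_true, y_pred))
print(normalized(mae(y_true, y_pred), r_max=5, r_min=1))
print(per_item_average(y_true, y_pred, items))
```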

2.2 — Measuring Usage Prediction

In many applications, the recommendation system does not predict the user’s preferences for items, such as ratings, but tries to recommend items that users may use. For example, when a user adds a video to their queue, YouTube suggests a set of related videos that may also be interesting, given the added video. In this case, we are interested in whether the system properly predicts that the user will add these videos to the queue (use the items).

In an offline evaluation of usage prediction, we typically have a data set consisting of items each user has used. We then select a test user, hide some of her selections, and ask the recommender to predict a set of items that the user will use. We then have four possible outcomes for the recommended and hidden items: true positives, false positives, false negatives, and true negatives. Using these outcomes, we can compute the Precision and Recall quantities.

Typically we can expect a trade-off between these quantities — while allowing longer recommendation lists typically improves recall, it is also likely to reduce the precision. In applications where the number of recommendations presented to the user is preordained, the most useful measure of interest is Precision at K.
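
A small sketch of Precision@K and Recall@K, assuming we have the ranked recommendation list and the set of hidden items the user actually used; the item ids below are made up.

```python
# Precision@K and Recall@K for a single test user.
def precision_recall_at_k(recommended, relevant, k):
    """`recommended` is a ranked list of item ids; `relevant` is the set of
    hidden items the user actually used."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))        # true positives in the top k
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = [10, 4, 7, 22, 3, 15]
relevant = {4, 3, 9}
for k in (1, 3, 5):
    p, r = precision_recall_at_k(recommended, relevant, k)
    print(f"P@{k}={p:.2f}  R@{k}={r:.2f}")
```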

In other applications where the number of recommendations that are presented to the user is not preordained, it is preferable to evaluate algorithms over a range of recommendation list lengths rather than using a fixed length. Thus, we can compute precision-recall curves or ROC curves. While both curves measure the proportion of preferred items that are actually recommended, precision-recall curves emphasize the proportion of recommended items that are preferred. In contrast, ROC curves emphasize the proportion of not preferred items that end up being recommended.

2.3 — Ranking Measures

In search applications like Google or Bing, we are not interested in predicting an explicit preference, but rather in ordering results/items according to the user’s preferences. This ranking task can be measured in two ways: reference ranking or utility ranking.

  • In a reference ranking, we rank the items in decreasing order of the ratings and/or usage. The Normalized Distance-Based Performance Measure gives a perfect score to systems that correctly predict every preference relation asserted by the reference and the worst score to systems that contradict every reference preference relation.

  • In a utility ranking, we rank the items in decreasing order of utility, which is discounted by a factor that depends on the item’s position in the recommendations list. Popular metrics include R-Score, Normalized Discounted Cumulative Gain (NDCG), and Average Reciprocal Hit Rank; a small NDCG sketch follows this list.
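
A small NDCG sketch under the common log2 position discount, assuming graded relevance scores (for example, the hidden ratings) for the recommended list; the relevance values below are illustrative.

```python
# NDCG: discounted cumulative gain of the shown ranking, normalized by the
# best achievable ranking of the same items.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(positions + 1))

def ndcg(relevances):
    ideal_dcg = dcg(sorted(relevances, reverse=True))   # best possible ordering
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the recommended list, in the order it was shown.
print(ndcg([3, 2, 0, 1]))   # ~0.99: close to the ideal ordering [3, 2, 1, 0]
```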

3 — Utility

Utility is the value that either the system or the user gains from a recommendation. For example, e-commerce websites like Wayfair or Amazon use recommendations to improve their revenue via customer purchases. However, user utilities or preferences are difficult to capture and model, and considerable research has focused on this problem. Furthermore, it is unclear how to aggregate utilities across users when computing a score for a recommender.

In applications where users rate items, we can use the ratings as a utility measurement.

  • For example, in Netflix, we can assume that recommending a 5-star movie has a higher utility for the user than recommending a 4-star movie. As users may interpret ratings differently, user ratings should be normalized before aggregating across users.

  • While we typically only assign positive utilities to successful recommendations, we can also assign negative utilities to unsuccessful recommendations. For example, if the user does not enjoy a particular recommended movie, then Netflix can assign a negative utility to that item (with respect to that user).

For any utility function, the standard evaluation is to compute the expected utility of a recommendation; a minimal sketch of these cases follows the list below.

  • In the scenario where we only predict a single item, the value of a correct recommendation is the item’s utility.

  • In the scenario where we predict n items, we can sum the utilities of the correct recommendations in the list.

  • In the scenario where we incorporate negative utilities for incorrect recommendations, the sum is over all recommendations (correct or incorrect).
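
As a concrete illustration of these three cases, here is a minimal sketch that sums utilities over a recommendation list, with an optional penalty for incorrect recommendations; the item names and utility values are hypothetical.

```python
# Utility of a recommendation list: correct recommendations contribute their
# (positive) utility, incorrect ones can optionally contribute a negative one.
def list_utility(recommended, utilities, penalty=0.0):
    """`utilities` maps item id -> utility for items the user actually liked;
    items not in the map are treated as incorrect recommendations."""
    total = 0.0
    for item in recommended:
        if item in utilities:
            total += utilities[item]      # e.g., normalized rating of the item
        else:
            total += penalty              # 0 or a negative utility
    return total

liked = {"movie_a": 1.0, "movie_b": 0.6}     # hypothetical normalized utilities
print(list_utility(["movie_a", "movie_c", "movie_b"], liked))               # 1.6
print(list_utility(["movie_a", "movie_c", "movie_b"], liked, penalty=-0.2)) # 1.4
```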

4 — Coverage

4.1 — Item Space Coverage

The term coverage refers to the proportion of items that the recommendation system can recommend. The simplest measure for this definition is the percentage of all items that can ever be recommended. Another measure is sales diversity, which captures how unequally different items are chosen by users when a particular recommender system is used; it can be quantified with measures such as the Gini Index or Shannon Entropy.
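
A minimal sketch of both ideas, assuming we have the recommendation lists shown to users and the catalog size; the Gini computation below uses the standard formula over sorted recommendation counts.

```python
# Item-space coverage: catalog coverage (share of items ever recommended)
# and a Gini index over how often each item is recommended.
import numpy as np

def catalog_coverage(recommended_lists, n_items):
    recommended = set(item for rec in recommended_lists for item in rec)
    return len(recommended) / n_items

def gini_index(recommendation_counts):
    """0 = all items recommended equally often, 1 = maximally unequal."""
    x = np.sort(np.asarray(recommendation_counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

lists = [[0, 1, 2], [0, 1, 3], [0, 2, 4]]        # recommendations shown to 3 users
counts = np.bincount(np.concatenate(lists), minlength=6)
print(catalog_coverage(lists, n_items=6))        # 5/6 items ever recommended
print(gini_index(counts))
```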

4.2 — User Space Coverage

Coverage can also be the proportion of users or user interactions for which the system can recommend items. In many applications, the system may not provide recommendations for some users due to low confidence in predictions for that user. In such cases, we may prefer systems that can provide recommendations for a wider range of users. Coverage thus can be measured by the richness of the user profile required to make a recommendation, such as the number of items that a user must rate before receiving recommendations.

4.3 — Cold-Start

The well-known cold-start problem concerns system coverage over a specific set of items and users. A generic way to decide whether an item is “cold” is to use either the amount of time it has existed in the system or the amount of data gathered for it. We can then credit the system more for properly predicting cold items and less for predicting hot items. When computing system accuracy on cold items, it is wise to check whether there is a trade-off with accuracy over the entire system.
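
A small sketch of that check, assuming we know how many training ratings each item has; the cold threshold of 5 ratings and the toy numbers are arbitrary.

```python
# Split test items into "cold" and "warm" by training-data volume and
# compare accuracy (here, MAE) on the two groups.
import numpy as np

def cold_warm_mae(y_true, y_pred, item_ids, train_counts, cold_threshold=5):
    """`train_counts[i]` is the number of training ratings for item i;
    items with fewer than `cold_threshold` ratings are treated as cold."""
    is_cold = np.array([train_counts.get(i, 0) < cold_threshold for i in item_ids])
    abs_err = np.abs(y_true - y_pred)
    return abs_err[is_cold].mean(), abs_err[~is_cold].mean()

y_true = np.array([4, 5, 3, 2, 4])
y_pred = np.array([3.5, 4.0, 3.5, 3.0, 4.5])
items  = np.array([7, 7, 8, 9, 9])
mae_cold, mae_warm = cold_warm_mae(y_true, y_pred, items,
                                   train_counts={7: 120, 8: 2, 9: 1})
print(f"MAE cold={mae_cold:.2f}, warm={mae_warm:.2f}")
```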

5 — Confidence

Confidence in the recommendation can be defined as the system’s trust in its own recommendations or predictions, and it typically grows with the amount of data. For example, if Instacart recommends broccoli with very high confidence and sprouts with the same predicted rating but lower confidence, the shopper may add the broccoli to the cart immediately but read the description of the sprouts, and perhaps their ingredients, before deciding to buy them.

Perhaps the most common measurement of confidence is the probability that the predicted value is indeed true, or a confidence interval around the predicted value within which a predefined portion (say, 95%) of the true values is expected to lie. The most general confidence method is to provide a complete distribution over possible outcomes.

Given two recommenders that perform similarly on other relevant properties, such as prediction accuracy, it can be desirable to choose the one that can provide valid confidence estimates. In this case, given two recommenders with identical accuracy that report confidence bounds in the same way, we will prefer the recommender that better estimates its confidence bounds.

Another application of confidence bounds is filtering recommended items where the confidence in the predicted value is below some threshold. In this scenario, we assume that the recommender cannot predict a score for all values, as is typically the case when presenting top n recommendations. Hence, we can design an experiment around this filtering procedure by comparing the accuracy of two recommenders after their results were filtered by removing low confidence items. We can compute a curve in such experiments, estimating the prediction accuracy for each portion of filtered items or different filtering thresholds. This evaluation procedure does not require both algorithms to agree on the confidence method.
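
A minimal sketch of such a filtering curve, assuming each prediction comes with a confidence score in [0, 1]; the scores and ratings below are made up.

```python
# Accuracy-vs-confidence curve: drop predictions below each confidence
# threshold and track the RMSE of what remains.
import numpy as np

def accuracy_vs_confidence(y_true, y_pred, confidence, thresholds):
    curve = []
    for t in thresholds:
        keep = confidence >= t
        if keep.any():
            rmse = np.sqrt(np.mean((y_true[keep] - y_pred[keep]) ** 2))
            curve.append((t, keep.mean(), rmse))   # threshold, kept fraction, RMSE
    return curve

y_true = np.array([5, 3, 4, 2, 1, 4])
y_pred = np.array([4.0, 3.5, 4.5, 3.5, 1.5, 3.0])
conf   = np.array([0.9, 0.8, 0.95, 0.4, 0.7, 0.5])   # hypothetical confidence scores
for t, kept, rmse in accuracy_vs_confidence(y_true, y_pred, conf, [0.0, 0.5, 0.8]):
    print(f"threshold={t:.1f}  kept={kept:.0%}  RMSE={rmse:.2f}")
```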

6 — Trust

While confidence is the system's trust in its ratings, we refer here to the user’s trust in the system recommendation. For example, it may be beneficial for the system to recommend a few items that the user already knows and likes. This way, even though the user gains no value from this recommendation, she observes that the system provides reasonable recommendations, which may increase her trust in the system recommendations for unknown items. Another common way of enhancing trust in the system is to explain the system's recommendations.

The obvious method for evaluating user trust is by asking users whether the system recommendations are reasonable in a user study. In an online test, one could associate the number of recommendations followed with the trust in the recommender, assuming that higher trust in the recommender would lead to more recommendations being used.

Alternatively, we could assume that trust in the system is correlated with repeat usage, as users who trust the system will return to it when performing future tasks. However, such measurements may not separate trust well from other user-satisfaction factors and may not be accurate.

7 — Serendipity

Serendipity is a measure of how surprising the successful recommendations are. One can think of serendipity as the amount of relevant information new to the user in a recommendation. For example, by following a successful book recommendation on Scribd, the reader learns of a new author that she likes, which can be considered serendipitous.

To avoid human labeling, we could design a distance measurement between items based on content. We can then score a successful recommendation by its distance from a set of previously rated items in a collaborative filtering system or from the user profile in a content-based recommender. Thus, we are rewarding the system for successful recommendations far from the user profile.
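
A minimal sketch of this idea, assuming we have content vectors for items and use cosine distance; the vectors and item names are hypothetical.

```python
# Distance-based serendipity: score successful recommendations by how far
# they are from the user's existing profile items.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def serendipity_score(successful_recs, profile_items, item_vectors):
    """Average distance of each *successful* recommendation from its closest
    item in the user profile; larger means more surprising."""
    scores = []
    for rec in successful_recs:
        dists = [cosine_distance(item_vectors[rec], item_vectors[p])
                 for p in profile_items]
        scores.append(min(dists))          # distance to the nearest profile item
    return float(np.mean(scores)) if scores else 0.0

# Toy 3-dimensional content vectors (hypothetical).
vectors = {"thriller_1": np.array([1.0, 0.1, 0.0]),
           "thriller_2": np.array([0.9, 0.2, 0.0]),
           "poetry_1":   np.array([0.0, 0.1, 1.0])}
print(serendipity_score(["thriller_2"], ["thriller_1"], vectors))  # small: obvious
print(serendipity_score(["poetry_1"],   ["thriller_1"], vectors))  # large: surprising
```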

We can also think of serendipity as the deviation from the “natural” prediction. Given a prediction engine with high accuracy, its recommendations are “obvious.” Therefore, we’ll give higher serendipity scores to successful recommendations that the prediction engine would deem unlikely.

We can evaluate a recommender's serendipity in a user study by asking the users to mark the recommendations that they find unexpected. We can also see whether the user followed these recommendations, which would make them unexpected and successful and, therefore, serendipitous. In an online experiment, we can assume that our distance metric is correct and evaluate only how distance from the user profile affected the probability that a user will follow the recommendation.

8 — Novelty

Novel recommendations are items that the user did not know about. An obvious and easy-to-implement approach is to filter out items that the user has already rated or used. However, in many cases users will not have reported all the items they used in the past, so this method is insufficient to filter out everything the user already knows.

While we can obviously measure novelty in a user study by asking users whether they were already familiar with a recommended item, we can also understand a system’s novelty through offline experiments. For such an experiment, we could split the data set on time, i.e., hide all the user ratings that occurred after a specific point in time. We can also hide some ratings that occurred before that time, simulating the items that the user is familiar with but did not report ratings for. When recommending, the system is rewarded for each item recommended and rated after the split time, but would be punished for each item recommended but rated before the split time.

Another method for evaluating novelty assumes that popular items are less likely to be novel. Novelty can thus be folded into an accuracy metric in which the system gets less credit for correctly predicting popular items than for correctly predicting non-popular ones.

Finally, we can evaluate the amount of new information in a recommendation, together with the relevance of the recommended item. For example, when item ratings are available, we can multiply the hidden rating by some information measurement of the recommended item to produce a novelty score.
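
A small sketch of this popularity-based view, weighting each correct recommendation by its self-information, -log2(popularity), so that rarer items earn more credit; the popularity values are made up.

```python
# Popularity-weighted novelty: rare items carry more "new information".
import numpy as np

def novelty_weighted_hits(hits, popularity):
    """`hits` are correctly recommended items; `popularity[i]` is the fraction
    of users who have interacted with item i."""
    return sum(-np.log2(popularity[item]) for item in hits)

popularity = {"blockbuster": 0.5, "indie_film": 0.02}
print(novelty_weighted_hits(["blockbuster"], popularity))  # 1 bit of new information
print(novelty_weighted_hits(["indie_film"], popularity))   # ~5.6 bits
```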

9 — Diversity

Diversity is generally defined as the opposite of similarity. In some cases, suggesting a set of similar items may not be as useful for the user because it may take longer to explore the range of items. Consider, for example, TripAdvisor, where the system should recommend travel options. Presenting a list of 10 recommendations for the same location (varying only in the hotel) may not be as useful as suggesting 10 different locations. The user can browse various locations and request more details on those relevant to her interests.

The most explored method for measuring diversity uses item-item similarity, typically based on item content. We could then measure the diversity of a list based on the sum, average, min, or max distance between item pairs, or measure the value of adding each item to the recommendation list as the new item’s diversity from the items already in the list. As diversity may come at the expense of other properties (like accuracy), we can compute curves to evaluate the decrease in accuracy versus the increase in diversity.
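
A minimal sketch of intra-list diversity as the average pairwise cosine distance over a content representation; the vectors and item names are hypothetical.

```python
# Intra-list diversity: average pairwise distance between recommended items.
from itertools import combinations
import numpy as np

def intra_list_diversity(recommended, item_vectors):
    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0
    dists = [1.0 - np.dot(item_vectors[a], item_vectors[b]) /
             (np.linalg.norm(item_vectors[a]) * np.linalg.norm(item_vectors[b]))
             for a, b in pairs]
    return float(np.mean(dists))            # average pairwise cosine distance

vectors = {"paris_hotel_1": np.array([1.0, 0.0]),
           "paris_hotel_2": np.array([0.95, 0.05]),
           "tokyo_trip":    np.array([0.0, 1.0])}
print(intra_list_diversity(["paris_hotel_1", "paris_hotel_2"], vectors))  # low
print(intra_list_diversity(["paris_hotel_1", "tokyo_trip"], vectors))     # high
```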

In recommenders that assist in information search, we can assume that more diverse recommendations will result in shorter search iterations. We could use this in an online experiment measuring interaction sequence length as a proxy for diversification. To validate this claim, it is useful to experiment with different diversity thresholds using the same prediction engine before comparing different recommenders.

10 — Robustness

Robustness is the stability of the recommendation in the presence of fake information, typically inserted on purpose to influence the recommendations. For example, an Airbnb host may boost the rating for their listings by injecting fake guest profiles that rate the listings positively.

Such attempts to influence the recommendations are typically called attacks. Coordinated attacks occur when a malicious party intentionally injects fake information, often through many fake profiles, to shift the system's recommendations. In general, creating a system immune to any attack is unrealistic. It is more useful to estimate the cost of influencing a recommendation, which is typically measured by the amount of injected information. One way to carry that out is to simulate attacks by introducing fake information and empirically measuring the average cost of a successful attack.
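
As a toy illustration only (a real study would attack the actual recommender rather than an item-mean predictor), here is a sketch that counts how many fake 5-star profiles are needed to push an item's predicted rating past a target value, as a crude cost-of-attack proxy.

```python
# Simulated attack under a deliberately simple model: the "recommender"
# predicts an item's mean rating, and the attacker injects fake 5-star
# ratings until the prediction crosses a target value.
import numpy as np

def profiles_needed_to_shift(real_ratings, target_prediction, fake_rating=5.0,
                             max_profiles=10_000):
    """Return how many fake ratings are needed before the item's predicted
    (mean) rating reaches `target_prediction`."""
    ratings = list(real_ratings)
    for n_fake in range(max_profiles + 1):
        if np.mean(ratings) >= target_prediction:
            return n_fake
        ratings.append(fake_rating)
    return None                                  # attack "too expensive"

real = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]            # genuine ratings of a listing
print(profiles_needed_to_shift(real, target_prediction=4.0))   # 15 fake profiles
```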

Another type of robustness is the system's stability under extreme conditions, such as a large number of requests. In many cases, system robustness is related to the infrastructure, such as the database software, the hardware specs, and scalability.

11 — Risk

In many scenarios, a recommendation may be associated with a potential risk. For instance, in an application like Robinhood that recommends stocks for purchase, risk-averse users may prefer stocks with lower expected growth and a lower risk of collapsing. On the other hand, risk-seeking users may prefer stocks with a potentially high profit. We would want to evaluate the expected value generated from a recommendation and minimize the risk.

The standard way to evaluate risk-sensitive systems is by considering the expected utility and the utility variance. For example, we may use a parameter q and compare two systems on E[X] + q * Var(X). When q is positive, this approach prefers risk-seeking recommenders. When q is negative, the system prefers risk-averse recommenders.
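
A minimal sketch of this comparison; the return values are made up, and the score simply follows the E[X] + q * Var(X) form above.

```python
# Risk-adjusted comparison of two recommenders via E[X] + q * Var(X).
import numpy as np

def risk_adjusted_score(utilities, q):
    utilities = np.asarray(utilities, dtype=float)
    return utilities.mean() + q * utilities.var()

steady   = [0.05, 0.04, 0.06, 0.05]        # same mean, low variance
volatile = [0.30, -0.20, 0.25, -0.15]      # same mean, much higher variance
for q in (-2.0, 0.0, 2.0):
    print(q, risk_adjusted_score(steady, q), risk_adjusted_score(volatile, q))
# Negative q favors the steady (risk-averse) option; positive q favors the volatile one.
```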

12 — Privacy

In most collaborative filtering systems, a user willingly discloses her preferences over items to receive useful recommendations. However, it is important to most users that their preferences stay private. Otherwise, private information leaks can lead to situations like the well-known Target incident, where purchase patterns revealed a customer's pregnancy.

It is generally considered inappropriate for a recommendation system to disclose private information, even for a single user. For this reason, analysis of privacy tends to focus on a worst-case scenario, illustrating theoretical cases under which users' private information may be revealed. Another alternative is defining different levels of privacy, such as k-identity, and comparing algorithm sensitivity to privacy breaches under varying levels of privacy.

13 — Adaptivity

Real recommendation systems may operate in a setting where the item collection changes rapidly or where trends in interest over items may shift. Obvious examples include news recommendations (New York Times) or hashtag recommendations (Twitter Trends). When an unexpected event occurs, people become interested in articles/tweets that may not have been interesting otherwise. This problem is different from the cold-start problem because old items that were not interesting in the past suddenly become interesting.

This type of adaptation can be evaluated offline by analyzing the amount of information needed before an item is recommended. If we sequentially model the recommendation process, we can record the amount of evidence needed before the algorithm recommends a story, even in an offline test.

Another type of adaptivity is how the system adapts to a user’s personal preferences or changes in the user profile. We can evaluate the changes in the recommendation list in an offline experiment after adding more information to the user profile. We can then evaluate the algorithm by measuring the difference between the recommendation lists before and after the new information was added.
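
A small sketch of that before/after comparison using Jaccard distance between the two top-N lists; any other list-difference measure could be substituted.

```python
# Adaptivity as the difference between recommendation lists before and after
# new information was added to the user profile.
def jaccard_distance(list_a, list_b):
    a, b = set(list_a), set(list_b)
    return 1.0 - len(a & b) / len(a | b)

before = ["item_1", "item_2", "item_3", "item_4"]
after  = ["item_1", "item_5", "item_6", "item_4"]   # list after adding new ratings
print(jaccard_distance(before, after))  # ~0.67: the list adapted substantially
```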

14 — Scalability

Real-world recommendation systems must scale up to millions of items (Google News or YouTube), even at the expense of other properties like accuracy or coverage. A standard computer science approach to evaluating scalability is to measure an algorithm's computational complexity in terms of time or space requirements. Another, more practical approach is to measure resource consumption over large datasets.

Scalability is typically measured by experimenting with growing datasets, showing how speed and resource consumption behave as the task scales up. It is also important to measure the compromises that scalability dictates; for instance, comparing accuracy on smaller data subsets against the full dataset can reveal how much accuracy is traded away as the task scales.

As recommender systems are expected to provide rapid recommendations online, it is also important to measure how fast the system provides recommendations. One such measurement is the system's throughput (the number of recommendations that the system can provide per second). We could also measure the latency (the required time for making a recommendation online).
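
A minimal sketch of measuring both, assuming `recommend(user)` stands in for whatever call produces a recommendation list; the dummy function below exists only to make the snippet runnable.

```python
# Measure per-request latency and overall throughput of a recommendation call.
import time

def measure_latency_throughput(recommend, users):
    latencies = []
    start = time.perf_counter()
    for user in users:
        t0 = time.perf_counter()
        recommend(user)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    throughput = len(users) / elapsed                  # recommendations per second
    p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
    return throughput, p95

def dummy_recommend(user):
    # Hypothetical stand-in for a real model's inference call.
    return sorted(range(1000), key=lambda i: (i * user) % 97)[:10]

tps, p95 = measure_latency_throughput(dummy_recommend, users=range(1, 501))
print(f"throughput={tps:.0f} rec/s, p95 latency={p95 * 1000:.2f} ms")
```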

Conclusion

Recommendation systems are certainly one of the main success stories of machine learning in practice. Despite their successes, there are still many opportunities for future research, especially in ways to evaluate them. The optimal solution is to conduct large-scale industry-academia collaboration, bridging the gap between academic metrics and business values. Another solution is to put more emphasis on user-centric and impact-oriented evaluation metrics. Finally, it is desirable to improve both current offline and online experimentation approaches to work with the complexity of multi-agent environments and take advantage of distributed computing.

Stay tuned for the next blog post of this series that explores recommendation systems in the marketplace!

Reference

Shani, G., & Gunawardana, A. (2011). Evaluating recommendation systems. In Recommender systems handbook (pp. 257–297). Springer, Boston, MA.