Dustin Lennon

Dustin Lennon

Applied Scientist

machine learning recommendation systems algorithms

Recommendation Systems, A Mathematical Overview

This post provides an overview of recommendation systems and algorithms.

Dustin Lennon
July 2019
July 2019

$ \newcommand{\min}{\mathop{\rm min}\nolimits} $
Recommendation Systems

Recommendation Systems

A faster online shopping experience is, in most cases, a preferable choice for most customers; efficiency is paramount. For this reason, recommendation systems are a crucial component of modern e-commerce sites as these systems serve to focus a customer’s attention on relevant products. This promotes efficiency by minimizing uninteresting search trajectories, provides a better overall experience, and has been shown to increase engagement and improve customer retention.

In it’s simplest formulation, the input to a recommendation system is a preference matrix: each user provides a ranking of a subset of items. One of the most famous instances of this problem is due to Netflix. What is the best algorithm for recommending movies to viewers? In 2009, Netflix awarded a million dollar prize to a team of researchers who produced a 10% gain over their existing solution.

Auxilliary data is often available as well. In the Netflix case, competitors were also given the title and release date of each film. This allowed the possibility of introducing additional, external data through knowledge of movie similarities. For example, movies that share cast members, or are of the same genre, may be more similar than movies without such connections. The problem of making a recommendation becomes, one hopes, more informed with the additional movie similarity data.

While it was not the case in the Netflix competition, modern instances of recommendation systems will incorporate similarities across the userbase as well. For example, a friendship indicator, as might be extracted from a social network, could be a viable predictor for estimating the ranking of a particular movie.

In short, then, our input to the general recommendation system is likely to be a preference matrix, an item similarity matrix, and a user similarity matrix; and the metrics used to compute the similarity matrices are open ended. Moreover, for large e-commerce sites, the preference matrix will often be large and sparse, so computational considerations will be important.


The most common solution to the recommendation system problem is known as collaborative filtering. This idea takes only an incomplete user-item preference matrix as input. The output is a complete user-item preference matrix, providing estimates of preferences that were unobserved.

k-Nearest Neighborhood

The simplest approach is to use the preference matrix itself to discern likeness between users. After adjusting for individual skew–some folks might like most movies; others, few films–it is easy to compute pairwise Pearson correlations of item-preference vectors between users. Choosing the top-k most similar users induces a peer group, a k-nearest neighborhood, for each user. Rankings can then be averaged over each peer group.

Clustering Methods

Recommendation systems often rely on clustering methods as a preprocessing step to a k-nearest neighborhood approach. This significantly reduces computational complexity, as k-nearest neighbors no longer need to be computed across the entire user base. While there are ad-hoc clustering methods available, key results are more often attributed to various matrix factorizations of the user-item preference matrix.

The key idea relies on a dimensional reduction. We suppose that users might be well represented in some k-dimensional vector space; and similarly, albeit a different k-dimensional vector space, for items. Let \(d_{ij}\) be the user-item preference for user i and item j; \(\tilde{u}_i\), the preferences for user i in the lower dimensional space; \(\tilde{v}_j\), the preferences associated with item j in its lower dimensional space. Then, \(d_{ij}\) is approximated by the inner product. We show both the scalar and matrix forms of the approximation below:

\[ \begin{align*} d_{ij} & \sim \tilde{u}_i \cdot \tilde{v}_j \\ D & \sim U \cdot V^{\top} \end{align*} \]

This framework calls to mind the well known singular value decomposition, also known as principal component analysis. For the purposes of recommendation systems, the SVD is a useful construct on which to develop more general matrix factorization methods. In particular, methods in this family exist that can handle both the partial missingness and the scaling issues associated with the large size of typical user-item preference matrices.

To be precise, we would want to minimize the Frobenius norm over \(\mathcal{I}\), the index set where \(d_{ij}\) is observed:

\[ \begin{align*} \underset{U,V}{\min} \sum_{i,j\in\mathcal{I}} \left(d_{ij} - \tilde{u}_i \cdot \tilde{v}_j\right)^2 \end{align*} \]

With additional regularization terms, penalizing norms of \(U\) and \(V\), this method is attributed to Simon Funk and was used as part of the Netflix prize-winning algorithm.

Other variations that consider non-negativity constraints are available, and off-the-shelf, stochastic gradient, solvers can be deployed successfully.

Graph Based Methods

The user-item preference matrix, where it contains data, induces a bipartite graph \(\mathcal{B}\), and the goal of our recommendation system, that is, to provide an estimate of unobserved preferences, might be recast as a link-prediction problem. To be precise, any edge missing from \(\mathcal{B}\) is an unknown preference between a specific user and a specific movie, and a signed-network, link-prediciton approach would provide a positive or negative value for such edges.

Interestingly, graph based methods generalize more easily to the scenario above, the one where similarity matrices are provided as auxilliary data. In terms of a graph structure, the similarity matrices induce edges within users as well as within movies. Thus, the bipartite property is lost. However, we can make progress via random walks. In particular, it is possible to generalize the PageRank algorithm with restarts at the user for whom we wish to make recommendations. Such a walk would necessarily alternate regularly between the users and movies, thus leveraging the original bipartite structure.