Introduction

The number of research publications on deep learning-based recommendation systems has increased exponentially in recent years. In particular, RecSys, the leading international conference on recommendation systems, has organized regular workshops on deep learning since 2016. For example, at the 2019 conference in Copenhagen a couple of weeks ago, there was a whole category of papers on deep learning, which promotes research and encourages applications of such methods.

In this post and those to follow, I will be walking through the creation and training of recommendation systems, as I am currently working on this topic for my Master's thesis. In Part 1, I provided a high-level overview of recommendation systems, how they are built, and how they can be used to improve businesses across industries. Part 2 provides a nice review of the ongoing research initiatives with regard to the strengths, weaknesses, and application scenarios of these models.

Why Deep Learning for Recommendation?

Here are the 4 key strengths of deep learning-based recommendation systems compared to traditional content-based and collaborative filtering approaches:

Deep learning can model the non-linear interactions in the data with non-linear activations such as ReLU, Sigmoid, and Tanh. This property makes it possible to capture complex and intricate user-item interaction patterns. Conventional methods such as matrix factorization and factorization machines are essentially linear models. This linear assumption, which underlies many traditional recommenders, is an oversimplification that greatly limits their modeling expressiveness. It is well established that neural networks can approximate any continuous function with arbitrary precision by varying the activation choices and combinations, which allows them to reflect users’ preferences more precisely.
Deep learning can efficiently learn the underlying explanatory factors and useful representations from input data. In general, a large amount of descriptive information about items and users is available in real-world applications. Making use of this information advances our understanding of items and users, which results in a better recommender. As such, it is a natural choice to apply deep neural networks to representation learning in recommendation models. The advantages of using deep neural networks for representation learning are twofold: (1) it reduces the effort spent on hand-crafted feature design; and (2) it enables recommendation models to include heterogeneous content information such as text, images, audio, and even video.
Deep learning is powerful for sequential modeling tasks. In tasks such as machine translation, natural language understanding, speech recognition, etc., RNNs and CNNs play critical roles. They are widely applicable and flexible in mining sequential structure in data. Modeling sequential signals is an important topic for mining the temporal dynamics of user behavior and item evolution. For example, next-item/basket prediction and session-based recommendations are typical applications. As such, deep neural networks become a perfect fit for this sequential pattern mining task.
Deep learning possesses high flexibility. There are many popular deep learning frameworks nowadays, including TensorFlow, Keras, Caffe, MXNet, DeepLearning4j, PyTorch, and Theano. These tools are developed in a modular way and have active community and professional support. This modularity makes development and engineering a lot more efficient. For example, it is easy to combine different neural structures to formulate powerful hybrid models or to replace one module with another. Thus, we can easily build hybrid and composite recommendation models that simultaneously capture different characteristics and factors.

To provide a bird’s-eye view of this field, I will classify the existing models based on the types of employed deep learning techniques.

1> Multi-Layer Perceptron Based Recommendation

MLP is a feed-forward neural network with multiple hidden layers between the input layer and the output layer. You can interpret an MLP as a stack of non-linear transformation layers that learns hierarchical feature representations. It is a concise but effective network that can approximate any measurable function to any desired degree of accuracy. As such, it is the basis of numerous advanced approaches and is widely used in many areas.

MLP can add non-linear transformations to existing recommendation system approaches and reinterpret them as neural extensions.

A recommendation can be viewed as a two-way interaction between users’ preferences and items’ features. For example, matrix factorization decomposes the rating matrix into low-dimensional user/item latent factors. Neural Collaborative Filtering is a representative work that constructs a dual neural network to model this two-way interaction between users and items.
Deep Factorization Machine is an end-to-end model that seamlessly integrates a factorization machine and an MLP. It can model the high-order feature interactions via a deep neural network and the low-order interactions via the factorization machine.
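
To make the "low-order plus high-order" idea concrete, here is a minimal sketch in plain Python. It is not the published DeepFM architecture: in this simplification the MLP reads the raw feature vector rather than shared embeddings, and all names (fm_score, mlp_score, combined_predict) and weights are hypothetical values I picked for illustration.

```python
import math

def fm_score(x, w0, w, V):
    """Factorization machine: bias + linear terms + pairwise (low-order) interactions."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            # Interaction weight of features i and j is the dot product of their factors.
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return linear + pairwise

def mlp_score(x, W1, b1, w2, b2):
    """One hidden ReLU layer and a linear output unit (the 'deep' part)."""
    hidden = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(wk * hk for wk, hk in zip(w2, hidden)) + b2

def combined_predict(x, fm_params, mlp_params):
    """Add both scores and squash the sum into a probability."""
    logit = fm_score(x, *fm_params) + mlp_score(x, *mlp_params)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy input: three binary features and hand-picked weights.
x = [1.0, 0.0, 1.0]
fm_params = (0.1, [0.2, 0.0, -0.1], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
mlp_params = ([[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0], [0.5, -0.5], 0.0)
probability = combined_predict(x, fm_params, mlp_params)
```

In a real model both parts would be trained jointly by gradient descent; the sketch only shows how the two scores are combined at prediction time.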

Using MLP for feature representation is very straightforward and highly efficient, even though it might not be as expressive as autoencoders, CNNs, and RNNs.

Wide and Deep Learning was initially introduced for app recommendation in Google Play and can solve both regression and classification problems. The wide learning component is a single-layer perceptron, which can also be regarded as a generalized linear model. The deep learning component is an MLP. Combining these two learning techniques enables the recommender to capture both memorization and generalization.
Deep Neural Networks for YouTube Recommendations divides the recommendation task into 2 stages: candidate generation and candidate ranking. The candidate generation network retrieves a subset from the entire video corpus. The ranking network generates a top-n list based on the nearest neighbors’ scores from the candidates.
Collaborative Metric Learning replaces the dot product of matrix factorization with Euclidean distance because the dot product does not satisfy the triangle inequality required of a distance function. The user and item embeddings are learned by maximizing the distance between users and their disliked items and minimizing that between users and their preferred items.
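
The metric-learning idea can be sketched as a hinge loss over (user, preferred item, other item) triplets. This is a simplified illustration, not the full Collaborative Metric Learning objective: the use of squared distances, the margin value, and the function names are my assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance, which (unlike the dot product) obeys the triangle inequality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_hinge_loss(user, liked_item, other_item, margin=1.0):
    """Zero once the liked item is closer to the user than the other item by the margin."""
    d_pos = euclidean(user, liked_item)
    d_neg = euclidean(user, other_item)
    return max(0.0, margin + d_pos ** 2 - d_neg ** 2)

# Hypothetical 2-d embeddings.
user = [0.0, 0.0]
liked = [1.0, 0.0]
far_item = [3.0, 0.0]    # already far enough away: no loss
near_item = [0.5, 0.0]   # closer than the liked item: positive loss

loss_far = triplet_hinge_loss(user, liked, far_item)
loss_near = triplet_hinge_loss(user, liked, near_item)
```

Minimizing this loss pulls preferred items toward the user and pushes the others away, so users who like the same items end up close to each other as well.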

2> Autoencoder Based Recommendation

AE is an unsupervised model attempting to reconstruct its input data in the output layer. In general, the bottleneck layer is used as a salient feature representation of the input data. Almost all of its variants (denoising AE, variational AE, contractive AE, and marginalized AE) can be applied to the recommendation task.

AE can be used to learn the lower-dimensional feature representations at the bottleneck layer.

Collaborative Deep Learning is a hierarchical Bayesian model that integrates a stacked denoising autoencoder (SDAE) into probabilistic matrix factorization (PMF). To seamlessly combine deep learning and a recommendation model, the paper proposes a general Bayesian deep learning framework consisting of two tightly hinged components: a perception component (SDAE) and a task-specific component (PMF). This enables the model to balance the influences of side information and interaction history.
Collaborative Deep Ranking is devised specifically in a pairwise framework for top-n recommendation. The paper shows that the pairwise model is more suitable for generating ranking lists.
Deep Collaborative Filtering is a general framework for unifying deep learning approaches with a collaborative filtering model. The framework makes it easier to utilize deep feature learning techniques to build hybrid collaborative models.

AE can be used to fill in the blanks of the user-item interaction matrix directly in the reconstruction layer.

AutoRec takes user/item partial vectors as input and aims to reconstruct them in the output layer.
Collaborative Denoising Auto-encoder is principally used for ranking prediction. The input of CDAE is a user’s partially observed implicit feedback, which can be regarded as a preference vector that reflects the user’s interest in items. The paper also proposes a negative sampling technique to sample a small subset from the negative set (items with which the user has not interacted), which reduces the time complexity substantially without degrading the ranking quality.
Multi-VAE and Multi-DAE propose a variant of a variational autoencoder for recommendation with implicit data. The paper introduces a principled Bayesian inference approach for parameter estimation and shows more favorable results than commonly used likelihood functions.
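
As a rough illustration of "filling in the blanks" with an autoencoder, the sketch below encodes one partially observed rating vector, decodes it, and measures the error on observed entries only. It is in the spirit of AutoRec but is not its reference implementation: the sigmoid encoder, the linear decoder, the zero-filling of missing ratings, and all weights are assumptions for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def autoencoder_reconstruct(ratings, W_enc, b_enc, W_dec, b_dec):
    """Encode a zero-filled rating vector into a bottleneck, then decode it."""
    filled = [0.0 if r is None else r for r in ratings]
    hidden = [sigmoid(h + b) for h, b in zip(matvec(W_enc, filled), b_enc)]
    return [o + b for o, b in zip(matvec(W_dec, hidden), b_dec)]

def masked_squared_error(ratings, reconstruction):
    """Squared error over observed entries only; missing ratings are skipped."""
    return sum((r - p) ** 2 for r, p in zip(ratings, reconstruction) if r is not None)

# Hypothetical user with three items; the second rating is unknown (None).
ratings = [5.0, None, 3.0]
W_enc = [[0.1, 0.2, 0.3], [-0.2, 0.1, 0.0]]   # 2 hidden units x 3 items
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.5], [0.5, 0.5], [0.2, 1.0]]  # 3 items x 2 hidden units
b_dec = [3.0, 3.0, 3.0]

reconstruction = autoencoder_reconstruct(ratings, W_enc, b_enc, W_dec, b_dec)
loss = masked_squared_error(ratings, reconstruction)
predicted_missing = reconstruction[1]  # the model's guess for the unrated item
```

After training, the reconstructed values at the unobserved positions serve as the predicted ratings.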

3> Convolutional Neural Networks based Recommendation

CNN is basically a feed-forward neural network with convolution layers and pooling operations. It can capture the global and local features, thus significantly enhancing the model’s efficiency and accuracy. It is very powerful in processing unstructured multi-media data.

CNN can be used to extract features from images.

What Your Images Reveal investigates the influence of visual features on Point-of-Interest recommendation and proposes a visual content enhanced POI recommender system. The system uses a CNN to extract image features and is built on Probabilistic Matrix Factorization, exploring the interactions between visual content and latent user/location factors.
Comparative Deep Learning of Hybrid Representations for Image Recommendations proposes a comparative deep learning model with CNNs for image recommendation. The network consists of 2 CNNs, which are used for image representation learning, and an MLP for user preference modeling.
ConTagNet is a context-aware tag recommendation system. The image features are learned by CNNs. The context representations are processed by a two-layer fully-connected feedforward neural network. The outputs of the 2 neural networks are concatenated and fed into a softmax function to predict the probability of candidate tags.

CNN can be used to extract features from text.

DeepCoNN adopts 2 parallel CNNs to model user behaviors and item properties from review texts. This model alleviates the sparsity problem and enhances model interpretability by exploiting the rich semantic representation of review texts with CNNs. It uses a word embedding technique to map the review texts into a lower-dimensional semantic space while preserving word-order information. The embedded reviews then pass through a convolutional layer with different kernels, a max-pooling layer, and a fully-connected layer consecutively.
Automatic Recommendation Technology for Learning Resources with Convolutional Neural Network builds an e-learning resources recommendation model, which uses CNNs to extract item features from text information of learning resources such as the introduction and the content of learning material.
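
The convolution and max-pooling steps used on text can be sketched as follows. This is a generic text-CNN building block under my own assumptions (a single kernel, ReLU activation, toy 2-d word vectors), not the exact layer configuration of any paper above.

```python
def conv1d_max_pool(embedded_words, kernel):
    """Slide a kernel over consecutive word vectors, apply ReLU, keep the max response."""
    window = len(kernel)
    responses = []
    for start in range(len(embedded_words) - window + 1):
        total = 0.0
        for offset in range(window):
            word = embedded_words[start + offset]
            total += sum(w * k for w, k in zip(word, kernel[offset]))
        responses.append(max(0.0, total))
    # Texts shorter than the window produce no response at all.
    return max(responses) if responses else 0.0

# Hypothetical embedded review of three words, each a 2-d vector.
review = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One kernel spanning two consecutive words.
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature = conv1d_max_pool(review, kernel)
```

Running several kernels of different widths and concatenating their pooled outputs gives the fixed-length text feature vector that the fully-connected layer consumes.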

CNN can be used to extract features from audio and video.

Deep Content-Based Music Recommendation uses CNN to extract features from music signals. The convolutional kernels and pooling layers allow operations at multiple timescales. This content-based model can alleviate the cold-start problem of music recommendation.
Collaborative Deep Metric Learning for Video Understanding extracts audio features with the prominent CNN-based model ResNet. The recommendation is performed in the collaborative metric learning framework, similar to CML mentioned earlier.

CNN can be applied to vanilla collaborative filtering.

Outer Product-Based Neural Collaborative Filtering uses CNNs to improve Neural Collaborative Filtering. The so-called ConvNCF model uses an outer product instead of a dot product to model the user-item interaction patterns. The paper applies CNNs over the result of the outer product and can thus capture the high-order correlations among embedding dimensions.
Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding presents sequential recommendations with CNNs, where a horizontal and a vertical CNN are used to model the union-level sequential patterns and skip behaviors for the sequence-aware recommendation.
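
To see why an outer product carries more information than a dot product, consider the small sketch below. The embeddings, the kernel, and the helper names are hypothetical; the actual ConvNCF model stacks several learned convolution layers on top of the interaction map.

```python
def outer_product(user_emb, item_emb):
    """Interaction map: entry (i, j) pairs user dimension i with item dimension j."""
    return [[u * v for v in item_emb] for u in user_emb]

def conv2d_valid(matrix, kernel):
    """Plain 2-d convolution (no padding, stride 1) over a square matrix."""
    k = len(kernel)
    size = len(matrix) - k + 1
    return [[sum(matrix[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(size)]
            for r in range(size)]

user_emb = [1.0, 2.0, 3.0]
item_emb = [1.0, 0.0, 2.0]

interaction_map = outer_product(user_emb, item_emb)
# The diagonal of the map sums to the ordinary dot product.
dot = sum(interaction_map[d][d] for d in range(len(user_emb)))
# A hypothetical 2x2 kernel mixes neighboring embedding dimensions.
feature_map = conv2d_valid(interaction_map, [[1.0, 0.0], [0.0, 1.0]])
```

The dot product only uses the diagonal of this map, whereas a convolution over the full map can also exploit the off-diagonal pairs of embedding dimensions.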

Graph-based CNNs can handle the interactions in recommendation tasks.

Graph Convolutional Matrix Completion considers the recommendation problem as a link prediction task with graph CNNs. This framework makes it easy to integrate user/item side information such as social networks and item relationships into the recommendation model.
Graph Convolutional Neural Networks for Web-Scale Recommender Systems uses graph CNNs for recommendations on Pinterest. This model generates item embeddings from both graph structure and item feature information using random walks and graph CNNs, and is thus well suited to large-scale web recommenders.
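
A single graph convolution step can be sketched as "average my neighbors' features together with my own, then transform". The mean aggregator with a self-loop and the ReLU are assumptions for illustration; the two papers above use their own aggregation and normalization schemes, and the node names here are hypothetical.

```python
def graph_conv_layer(features, neighbors, weight):
    """features: node -> vector, neighbors: node -> list of nodes, weight: out x in matrix."""
    output = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors.get(node, [])]
        # Mean of the node's own vector and its neighbors' vectors, per dimension.
        mean = [sum(column) / len(group) for column in zip(*group)]
        # Linear transform followed by ReLU.
        output[node] = [max(0.0, sum(w * m for w, m in zip(row, mean))) for row in weight]
    return output

# Hypothetical user-item graph: user u1 interacted with items i1 and i2.
features = {"u1": [1.0, 0.0], "i1": [0.0, 1.0], "i2": [0.0, 3.0]}
neighbors = {"u1": ["i1", "i2"], "i1": ["u1"], "i2": ["u1"]}
identity = [[1.0, 0.0], [0.0, 1.0]]

embeddings = graph_conv_layer(features, neighbors, identity)
```

Stacking such layers lets information travel several hops, so a user's embedding also reflects other users who interacted with the same items.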

4> Recurrent Neural Network-based Recommendation

RNN is suitable for modeling sequential data. It has loops and memories to remember former computations. Variants of RNNs including LSTM and GRU are deployed to overcome the vanishing gradient problem.

RNN can deal with the temporal dynamics of interactions and sequential patterns of user behaviors in session-based recommendation tasks.

GRU4Rec is a session-based recommendation model. Its input is the actual state of the session in 1-of-N encoding, where N is the number of items: a coordinate is 1 if the corresponding item is active in the session and 0 otherwise. The output is, for each item, the likelihood of being the next item in the session.
Personal Recommendation using Deep Recurrent Neural Networks in NetEase is a session-based recommendation model for a real-world e-commerce website. It utilizes the basic RNNs to predict what the user will buy next based on the click history. To minimize the computation costs, it only keeps a finite number of the latest states while collapsing the older states into a single history state. This method helps to balance the trade-off between computation costs and prediction accuracy.
Recurrent Recommender Network is a non-parametric recommendation model built on RNNs. It can model the seasonal evolution of items and changes in user preferences over time. It uses 2 LSTM networks as the building block to model the dynamic user/item states.
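
The session-based setup can be sketched with a hand-written GRU cell that reads 1-of-N encoded items and outputs a score per item. This is only a toy forward pass under my assumptions: the weights are random and untrained, the softmax output and all parameter names (Wz, Uz, and so on) are hypothetical, and the training losses and mini-batching used by GRU4Rec are not shown.

```python
import math
import random

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h, p):
    """One GRU update: gates decide how much of the old state to keep."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_item_probs(session, p, n_items, hidden_size):
    """Feed the clicked items one by one, then score every item as the next click."""
    h = [0.0] * hidden_size
    for item in session:
        h = gru_step(one_hot(item, n_items), h, p)
    return softmax(matvec(p["Wo"], h))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_items, hidden_size = 5, 3
params = {name: random_matrix(hidden_size, n_items, rng) for name in ("Wz", "Wr", "Wh")}
params.update({name: random_matrix(hidden_size, hidden_size, rng) for name in ("Uz", "Ur", "Uh")})
params["Wo"] = random_matrix(n_items, hidden_size, rng)

probs = next_item_probs([2, 0, 4], params, n_items, hidden_size)
```

Because the hidden state is updated after every click, the prediction depends on the order of the items in the session, not just on which items were seen.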

RNN is also a good choice to learn the side information with sequential patterns.

Recurrent Coevolutionary Latent Feature Processes for Continuous-Time Recommendation presents a co-evolutionary latent model to capture the co-evolving nature of users’ and items’ latent features. The interactions between users and items play an important role in driving the changes in user preferences and item status. To model the historical interactions, the authors propose using RNNs to automatically learn representations of the influences from drift, evolution, and co-evolution of the user and item features.
Ask the GRU proposes using GRUs to encode text sequences into a latent factor model. This hybrid model addresses both warm-start and cold-start problems. Furthermore, the authors adopt a multi-task regularizer to prevent overfitting and alleviate the sparsity of training data. The main task is rating prediction, while the auxiliary task is item meta-data (e.g. tags, genres) prediction.
Embedding-based News Recommendation for Millions of Users proposes using GRUs to learn a more expressive aggregation of user browsing history and recommends news articles with a latent factor model. The results show a significant improvement compared with the traditional word-based approach. The system has been fully deployed to online production services and serves over 10 million unique users every day.

5> Restricted Boltzmann Machine based Recommendation

RBM is a two-layer neural network consisting of a visible layer and a hidden layer. It can be easily stacked into a deep network. The term Restricted indicates that there are no intra-layer connections within the visible layer or the hidden layer.

Restricted Boltzmann Machines for Collaborative Filtering is the first recommendation model that was built on RBM. The visible units of an RBM are limited to binary values, so each rating score is represented as a one-hot vector to adapt to this restriction. Each user has a unique RBM, but the parameters are shared across users and can be learned via the Contrastive Divergence algorithm. The essence here is that the users implicitly tell their preferences by giving ratings, regardless of how they rate items.
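
The one-hot rating trick and the resulting hidden-unit activations can be sketched as follows. The weight layout W[item][level][hidden], the 5-level rating scale, and the toy values are my assumptions; training with Contrastive Divergence is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_hot_rating(rating, n_levels=5):
    """Turn a rating of 1..n_levels into binary visible units."""
    return [1 if level == rating else 0 for level in range(1, n_levels + 1)]

def hidden_activation_probs(user_ratings, W, hidden_bias):
    """Probability that each hidden unit turns on, given only the items this user rated."""
    probs = []
    for j, bias in enumerate(hidden_bias):
        total = bias
        for item, rating in user_ratings.items():
            visible = one_hot_rating(rating, len(W[item]))
            total += sum(v * W[item][k][j] for k, v in enumerate(visible))
        probs.append(sigmoid(total))
    return probs

# Hypothetical shared weights: 2 items x 5 rating levels x 2 hidden units, all zero...
W = {item: [[0.0, 0.0] for _ in range(5)] for item in ("a", "b")}
# ...except that rating item "a" with 4 stars excites hidden unit 0.
W["a"][3][0] = 2.0

user_ratings = {"a": 4, "b": 1}
probs = hidden_activation_probs(user_ratings, W, hidden_bias=[0.0, 0.0])
```

Only the items a user has actually rated contribute to the sum, which is how every user can have an RBM of a different shape while the weights stay shared.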

A Non-IID Framework for Collaborative Filtering with RBMs combines the user-based and item-based RBM-CF in a unified framework. In this case, the visible units are determined both by the user and the item hidden units.
Item Category Aware Conditional Restricted Boltzmann Machine Based Recommendation designs a hybrid RBM-CF which incorporates the item features and is based on conditional RBM. Here, the conditional layer is modeled with the binary item genres, thus affecting both the hidden layer and the visible layer with different connected weights.

6> Neural Attention-based Recommendation

Attentional models are differentiable neural architectures that operate based on soft content addressing over an input sequence or an input image. They are motivated by human visual attention and can filter out uninformative features from raw inputs and reduce the side effects of noisy data. The attention mechanism is ubiquitous in the Computer Vision and Natural Language Processing domains.

In the context of recommendation systems, we can leverage the attention mechanism to filter out noisy content and select the most representative items while providing good interpretability.

Attentive Collaborative Filtering uses a 2-level attention mechanism inside a latent factor model. The model consists of item-level and component-level attention: the item-level attention selects the most representative items to characterize users, and the component-level attention captures the most informative features from multimedia auxiliary information for each user.
Hashtag Recommendation with Topical Attention-Based LSTM uses an attention-based LSTM model for hashtag recommendation. The model takes advantage of both RNNs and an attention mechanism to capture the sequential property and recognize the informative words from microblog posts.
Hashtag Recommendation Using Attention-Based Convolutional Neural Network uses an attention-based CNN model for the same hashtag recommendation task in microblogs, which is treated as a multi-label classification problem. The model consists of a global channel and a local attention channel: the global channel has convolution and max-pooling layers to encode all the words, and the local channel has an attention layer with a given window size and threshold to select informative words.
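
The core of soft attention is a softmax-weighted average, which the following sketch shows for a user's item history. Scoring items by a dot product with a query vector is my simplifying assumption; the models above learn their scoring functions, and the names and vectors here are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(query, items):
    """Weight each item by its relevance to the query, then average the items."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(len(query))]
    return weights, pooled

# Hypothetical history of three item embeddings and a user query vector.
history = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]

weights, user_representation = attention_pool(query, history)
```

The weights themselves are what makes attention interpretable: they show which items in the history contributed most to the user representation.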

7> Neural AutoRegressive based Recommendation

Neural Autoregressive Distribution Estimation (NADE) is an unsupervised neural network built on top of an autoregressive model and feedforward neural networks. It is a tractable and efficient estimator for modeling data distribution and densities, which can be considered a desirable alternative to Restricted Boltzmann Machines.

Based on my review, A Neural Auto-Regressive Approach to Collaborative Filtering is the sole paper that proposes a NADE based collaborative filtering model (CF-NADE) that can model the distribution of user ratings.
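
The autoregressive idea behind NADE-style models is the chain rule of probability applied to a user's ratings. In my own (assumed) notation, where D is the number of items the user has rated and o is some ordering of those items, the joint distribution factorizes as:

```latex
p(\mathbf{r}) = \prod_{i=1}^{D} p\left(r_{o_i} \mid r_{o_1}, \dots, r_{o_{i-1}}\right)
```

Each conditional on the right-hand side is modeled by a feedforward network, which is what keeps the estimator tractable.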

8> Deep Reinforcement Learning based Recommendation

Reinforcement Learning (RL) operates on a trial-and-error paradigm and consists of 5 components (agents, environments, states, actions, and rewards). The combination of deep neural networks and reinforcement learning forms Deep Reinforcement Learning, which has achieved human-level performance across multiple domains such as games and self-driving cars. Deep neural networks enable the agent to get knowledge from raw data and derive efficient representations without handcrafted features and domain heuristics.

Traditionally, most recommendation models consider the recommendation process to be static, making it challenging to capture users’ temporal intentions and to respond in a timely manner. In recent years, Deep Reinforcement Learning has been making its way into personalized recommendation.

Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning proposes a model called DEERS for recommendation with both negative and positive feedback in a sequential interaction setting.
Deep Reinforcement Learning for Page-wise Recommendations explores a framework called DeepPage that can adaptively optimize a page of items based on user’s real-time actions.
DRN: A Deep Reinforcement Learning Framework for News Recommendation is a news recommendation system that uses Deep Reinforcement Learning to detect the dynamic changes of news content and user preference, incorporate return patterns of users, and increase the diversity of recommendation.
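
To illustrate the trial-and-error loop, here is a tabular Q-learning sketch in which recommending an item is an action and a click is a reward. This is the classic non-deep algorithm, used here only because it fits in a few lines; the papers above replace the table with neural networks, and the state names, item names, and hyperparameters are hypothetical.

```python
import random

def q_update(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Move the value of (state, action) toward reward + discounted best future value."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Explore a random item with probability epsilon, otherwise exploit the best one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

actions = ["item_a", "item_b"]
Q = {}

# The user clicked item_a twice after it was recommended in state "s0".
first = q_update(Q, "s0", "item_a", 1.0, "s1", actions)
second = q_update(Q, "s0", "item_a", 1.0, "s1", actions)

# With exploration switched off, the agent now recommends item_a in state "s0".
choice = epsilon_greedy(Q, "s0", actions, epsilon=0.0, rng=random.Random(0))
```

Unlike a static recommender, the agent keeps adjusting its estimates after every interaction, so a shift in user behavior gradually changes what gets recommended.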

9> Adversarial Network-based Recommendation

Adversarial Network is a generative neural network which consists of a discriminator and a generator. These two neural networks are trained simultaneously by competing with each other in a minimax game framework.

IRGAN - A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models is the first model that applies GAN to the information retrieval area. Specifically, the authors show the capability in 3 info retrieval tasks including web search, item recommendation, and question answering.
Adversarial Personalized Ranking for Recommendation proposes an adversarial personalized ranking approach that enhances Bayesian personalized ranking (BPR) with adversarial training. It plays a minimax game between the original BPR objective and an adversary that adds noise or perturbations to maximize the BPR loss.
Generative Adversarial Network Based Heterogeneous Bibliographic Network Representation for Personalized Citation Recommendation uses a GAN-based representation learning approach for a heterogeneous bibliographic network which can effectively address the personalized citation recommendation task.
Neural Memory Streaming Recommender Networks with Adversarial Training uses a GAN to generate negative samples for the memory network-based streaming recommender.
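
The adversarial-training idea behind Adversarial Personalized Ranking can be sketched on a single (user, preferred item, other item) triple. For brevity I perturb only the user embedding, whereas the paper perturbs the embedding parameters more generally; the epsilon value, vectors, and function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bpr_loss(user, pos_item, neg_item):
    """BPR loss: small when the preferred item scores higher than the other item."""
    x = sum(u * (p - n) for u, p, n in zip(user, pos_item, neg_item))
    return -math.log(sigmoid(x))

def adversarial_user_perturbation(user, pos_item, neg_item, epsilon):
    """Step of size epsilon along the loss gradient w.r.t. the user embedding."""
    diff = [p - n for p, n in zip(pos_item, neg_item)]
    x = sum(u * d for u, d in zip(user, diff))
    # Gradient of -log(sigmoid(x)) with respect to the user vector.
    grad = [-(1.0 - sigmoid(x)) * d for d in diff]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(user)
    return [epsilon * g / norm for g in grad]

user = [1.0, 0.5]
pos_item = [1.0, 0.0]
neg_item = [0.0, 1.0]

clean_loss = bpr_loss(user, pos_item, neg_item)
delta = adversarial_user_perturbation(user, pos_item, neg_item, epsilon=0.5)
perturbed_user = [u + d for u, d in zip(user, delta)]
adversarial_loss = bpr_loss(perturbed_user, pos_item, neg_item)
```

Training then minimizes the clean loss plus a weighted adversarial loss, which pushes the model toward embeddings that rank correctly even under small worst-case perturbations.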

10> Deep Hybrid Models based Recommendation

With the good flexibility of deep neural networks, many neural building blocks can be integrated to formalize more powerful and expressive models. Recent research trends suggest that the hybrid model should be reasonably and carefully designed for specific tasks.

Collaborative Knowledge-Based Embedding combines CNNs with autoencoders to extract features from images. It leverages structural content, textual content, and visual content with different embedding techniques.
Quote Recommendation in Dialogue using Deep Neural Network is a hybrid model of RNNs and CNNs to recommend quotes, which entails generating a ranked list of quotes given the query texts or dialogues. It applies CNNs to learn the significant local semantics from tweets and maps them to a distributional vector. These vectors are then processed by LSTM to compute the relevance of target quotes to the given tweet dialogues.
Personalized Key Frame Recommendation integrates CNNs and RNNs for personalized keyframe recommendation within videos, in which CNNs are used to learn feature representations from keyframe images and RNNs are used to process the textual features.
Neural Citation Network for Context-Aware Citation Recommendation integrates CNNs and RNNs in an encoder-decoder framework for citation recommendation. The CNN acts as the encoder that captures the long-term dependencies from the citation context, while the RNN acts as the decoder that learns the probability of a word in the cited paper’s title given all previous words together with the representations produced by the CNN.
Collaborative Recurrent Autoencoder integrates RNNs and a denoising autoencoder to overcome limitations such as a lack of robustness and an inability to model the sequences of text information. The paper designs a generalization of RNNs called robust recurrent networks and proposes a hierarchical Bayesian recommendation model called CRAE. This model consists of encoding and decoding parts and uses feedforward neural layers with RNNs to capture the sequential information of item content.
Supervised Reinforcement Learning with Recurrent Neural Network for Dynamic Treatment Recommendation combines supervised deep reinforcement learning with RNNs for treatment recommendation. This framework can learn the prescription policy from the indicator signal and evaluation signal.

Conclusion

Deep learning has become increasingly popular across many fields, including natural language processing, image and video processing, computer vision, and data mining. This is remarkable, since no single approach has previously been applied to such a wide range of computing problems. Deep learning techniques are not only highly capable of solving complex problems in these fields, but they also form a shared vocabulary and common ground for them. They even help these subfields collaborate with each other, which used to be difficult because of the diversity and complexity of the techniques involved.

This article reviews the existing literature on deep learning-based recommender systems to help new researchers build a comprehensive understanding of the field. I classify the current literature into 10 categories based on the type of deep learning technique employed, which I believe helps the reader form a holistic view of the field.