The Variational Autoencoder model is a neural network that provides collaborative filtering based on implicit feedback, specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of user ID, item ID pairs indicating that the specified user has interacted with, for example, was given a rating to or clicked on, the specified item. This model is optimized implementation of Variational Autoencoder proposed in Variational Autoencoders for Collaborative Filtering paper.
Figure 1. The architecture of an Autoencoder.
The following features were implemented in this model:
- Data-parallel multi-GPU training with Horovod
- Mixed precision support with TensorFlow Automatic Mixed Precision (TF-AMP), which enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environmental variable. To enable the full potential of this feature, your GPU should have Tensor Cores (e.g. NVIDIA Volta series GPU).
The following performance optimizations were implemented in this model:
- Feeding sparse matrix instead of dense to the model
- Loading data using new TF datasets
This model trains for 200 epochs, with default setup:
- Learning rate = 0.001
- Batch size containing 10000 Users
- Annealing with 200000 anneal steps and anneal cap = 0.2
- Regularisation loss factor = 0.01
- Dimensions of consecutive encoder layers =
[ITEMS_COUNT, 600, 200]
- Dimensions of consecutive decoder layers =
[200, 600, ITEMS_COUNT]
This model uses the following data preprocessing:
- Removing users who had less than 5 interactions with items
- Randomally shuffle users, to make more diverse batches
- Split users and items into train, test and validation
- Split items inside test and validation into 80% of items to feed into model and check if it predicts the rest.
The following section list the requirements that you need to meet in order to use the VAE model.
This repository contains Dockerfile which extends the Tensorflow NGC container and encapsulates all dependencies. Aside from these dependencies, ensure you have the following software and hardware:
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- Getting Started Using NVIDIA GPU Cloud,
- Accessing And Pulling From The NGC container registry
- Running Tensorflow.
To train your model using mixed precision with tensor cores, perform the following steps using the default parameters of the VAE model on the MovieLens 20M dataset. The dataset is automatically downloaded and prepared if needed.
git clone https://github.com/mkfilipiuk/recommendation-system-nvidia-zpp
cd recommendation-system-nvidia-zpp
After Docker is correctly set up, you can build the VAE image with:
nvidia-docker build . -t nvidia_vae_running
./scripts/docker/interactive.sh
./scripts/docker/install.sh
python3 main.py --train --use_tf_amp
The model is exported to export_dir and then can be loaded and tested using
python3 main.py --test
The following sections provide greater details of the dataset, running training and inference, and the training results.
To see the full list of available options and their descriptions, use the --help
command line option, for
example:
python main.py --help
Aside from options to set hyperparameters, the relevant options to control the behaviour of the script are:
Command | Description |
---|---|
--train |
Train the model. The training is run at the beginning and might be omitted by loading pretrained model from export_dir. |
--test |
Test the model. The tests are run after the training, and print metric results. If the train option is not specifed then the model is imported from directory export_dir. |
--benchmark |
Benchmark the training. It prints average epoch training time and average epoch validation time. |
--use_tf_amp |
Turn on Automatic Mixed Precision. |
--dataset |
Choose the dataset that the model is going to learn from. See below for details |
--gpu_number |
In case of multiple GPUs available, you can specify its number. |
--number_of_gpus |
Number of GPUs used during multi-GPU training. If equals zero, VAE will use CPU. |
--number_of_epochs |
Number of epochs for training. The default value is 200 epochs and it's enough to reach the maximum accuracy (with default batch size). |
--batch_size_train |
Batch size of training. The default value is 10000. It represents the number of users that is fed into the network at each step. |
--batch_size_validation |
Batch size for validation and test. The default value is 10000. |
--validation_step |
If it's set to n then validation is run once every n epochs. |
--warm_up_epochs |
Number of epochs to omit during benchmark. First epochs perform worse in comparing to next epochs due to framework initialisation. It is used only during benchmark. |
--total_anneal_steps |
Number of annealing steps. |
--anneal_cap |
Annealing cap |
--lam |
Lambda – L2 regularization parameter. |
--lr |
Learning rate. |
Script supports following datasets:
Name | Description | # of users | # of items |
---|---|---|---|
ml-20m |
Explicit user ratings of movies from movielens.org | 138493 | 26744 |
ml-20m_alt |
As above, but with alternative method of processing | 138493 | 26744 |
netflix |
Explicit user rating of movies from Netfix. It was used during Netfix prize contest | 480189 | 17770 |
lastfm |
Artists played by the user | 358868 | 295041 |
pinterest |
Images pinned to boards. We treat boards as users and pins as items. | 46000 | 2565241 |
The following sections provide details on how we achieved our results in training accuracy, performance and inference performance.
Our results were obtained by running the commands listed below in the tensorflow-19.05-py3 Docker container on NVIDIA DGX Station with 4 V100 32GB GPUs.
# Single GPU
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m --use_tf_amp
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m
# Multi GPU
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m --number_of_gpus {2,4} --lr 0.007 --total_anneal_steps 1000 --lam 0.001 --use_tf_amp
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m --number_of_gpus {2,4} --lr 0.007 --total_anneal_steps 1000 --lam 0.001
Number of GPUs | AMP NDCG@100 | AMP Recall@20 | AMP Recall@50 | FP32 NDCG@100 | FP32 Recall@20 | FP32 Recall@50 |
---|---|---|---|---|---|---|
1 | TODO | TODO | TODO | TODO | TODO | TODO |
2 | TODO | TODO | TODO | TODO | TODO | TODO |
4 | TODO | TODO | TODO | TODO | TODO | TODO |
Our results were obtained by running the commands listed below in the tensorflow-19.05-py3 Docker container on NVIDIA DGX Station with 4 V100 32GB GPUs.
# Single GPU
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m --use_tf_amp
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m
# Multi GPU
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m --number_of_gpus {2,4} --lr 0.007 --total_anneal_steps 1000 --lam 0.001 --use_tf_amp
python3 main.py --train --number_of_epochs 200 --batch_size_train 14504 --dataset ml-20m --number_of_gpus {2,4} --lr 0.007 --total_anneal_steps 1000 --lam 0.001
Number of GPUs | AMP training time | FP32 training time | AMP Speedup |
---|---|---|---|
1 | TODO | TODO | TODO |
2 | TODO | TODO | TODO |
4 | TODO | TODO | TODO |
- July 18, 2019
- Initial release
- Underperforming on ML-20M dataset Currently, our model slightly underperforms in comparison with the original implementation. It is clearly visible on the plot beneath:
Figure 2. Metrics comparision between original model(legacy) and our implementation without mixed precision(no AMP)
We are currently working on finding the cause of this issue.