Amazon product data
Julian McAuley, UCSD
Note: This server has been retired! You will be redirected in 5 seconds.
New!: See our updated (2018) version of the Amazon data here
New!: Repository of Recommender Systems Datasets
See a variety of other datasets for recommender systems research on our lab's dataset webpage
Description
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
Files
"Small" subsets for experimentation
If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files. To obtain the larger files you will need to contact me to obtain access.
K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items have k reviews each.
Ratings only: These datasets include no metadata or reviews, but only (user,item,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.
Books | 5-core (8,898,041 reviews) | ratings only (22,507,155 ratings) |
Electronics | 5-core (1,689,188 reviews) | ratings only (7,824,482 ratings) |
Movies and TV | 5-core (1,697,533 reviews) | ratings only (4,607,047 ratings) |
CDs and Vinyl | 5-core (1,097,592 reviews) | ratings only (3,749,004 ratings) |
Clothing, Shoes and Jewelry | 5-core (278,677 reviews) | ratings only (5,748,920 ratings) |
Home and Kitchen | 5-core (551,682 reviews) | ratings only (4,253,926 ratings) |
Kindle Store | 5-core (982,619 reviews) | ratings only (3,205,467 ratings) |
Sports and Outdoors | 5-core (296,337 reviews) | ratings only (3,268,695 ratings) |
Cell Phones and Accessories | 5-core (194,439 reviews) | ratings only (3,447,249 ratings) |
Health and Personal Care | 5-core (346,355 reviews) | ratings only (2,982,326 ratings) |
Toys and Games | 5-core (167,597 reviews) | ratings only (2,252,771 ratings) |
Video Games | 5-core (231,780 reviews) | ratings only (1,324,753 ratings) |
Tools and Home Improvement | 5-core (134,476 reviews) | ratings only (1,926,047 ratings) |
Beauty | 5-core (198,502 reviews) | ratings only (2,023,070 ratings) |
Apps for Android | 5-core (752,937 reviews) | ratings only (2,638,172 ratings) |
Office Products | 5-core (53,258 reviews) | ratings only (1,243,186 ratings) |
Pet Supplies | 5-core (157,836 reviews) | ratings only (1,235,316 ratings) |
Automotive | 5-core (20,473 reviews) | ratings only (1,373,768 ratings) |
Grocery and Gourmet Food | 5-core (151,254 reviews) | ratings only (1,297,156 ratings) |
Patio, Lawn and Garden | 5-core (13,272 reviews) | ratings only (993,490 ratings) |
Baby | 5-core (160,792 reviews) | ratings only (915,446 ratings) |
Digital Music | 5-core (64,706 reviews) | ratings only (836,006 ratings) |
Musical Instruments | 5-core (10,261 reviews) | ratings only (500,176 ratings) |
Amazon Instant Video | 5-core (37,126 reviews) | ratings only (583,933 ratings) |
Complete review data
Please see the per-category files below, and only download these (large!) files if you really need them:
raw review data (20gb) - all 142.8 million reviews
The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the files below:
user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user
product review data (18gb) - duplicate items removed, sorted by product
ratings only (3.2gb) - same as above, in csv form without reviews or metadata
5-core (9.9gb) - subset of the data in which all users and items have at least 5 reviews (41.13 million reviews)
Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:
aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)
Format is one-review-per-line in (loose) json. See examples below for further help reading the data.
Sample review:
where
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
Metadata
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (3.1gb) - metadata for 9.4 million products
Sample metadata:
where
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- price - price in US dollars (at time of crawl)
- imUrl - url of the product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
Visual Features
We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
visual features (141gb) - visual features for all products
The images themselves can be extracted from the imUrl field in the metadata files.
Per-category files
Below are files for individual product categories, which have already had duplicate item reviews removed.
Books | reviews (22,507,155 reviews) | metadata (2,370,585 products) | image features |
Electronics | reviews (7,824,482 reviews) | metadata (498,196 products) | image features |
Movies and TV | reviews (4,607,047 reviews) | metadata (208,321 products) | image features |
CDs and Vinyl | reviews (3,749,004 reviews) | metadata (492,799 products) | image features |
Clothing, Shoes and Jewelry | reviews (5,748,920 reviews) | metadata (1,503,384 products) | image features |
Home and Kitchen | reviews (4,253,926 reviews) | metadata (436,988 products) | image features |
Kindle Store | reviews (3,205,467 reviews) | metadata (434,702 products) | image features |
Sports and Outdoors | reviews (3,268,695 reviews) | metadata (532,197 products) | image features |
Cell Phones and Accessories | reviews (3,447,249 reviews) | metadata (346,793 products) | image features |
Health and Personal Care | reviews (2,982,326 reviews) | metadata (263,032 products) | image features |
Toys and Games | reviews (2,252,771 reviews) | metadata (336,072 products) | image features |
Video Games | reviews (1,324,753 reviews) | metadata (50,953 products) | image features |
Tools and Home Improvement | reviews (1,926,047 reviews) | metadata (269,120 products) | image features |
Beauty | reviews (2,023,070 reviews) | metadata (259,204 products) | image features |
Apps for Android | reviews (2,638,173 reviews) | metadata (61,551 products) | image features |
Office Products | reviews (1,243,186 reviews) | metadata (134,838 products) | image features |
Pet Supplies | reviews (1,235,316 reviews) | metadata (110,707 products) | image features |
Automotive | reviews (1,373,768 reviews) | metadata (331,090 products) | image features |
Grocery and Gourmet Food | reviews (1,297,156 reviews) | metadata (171,760 products) | image features |
Patio, Lawn and Garden | reviews (993,490 reviews) | metadata (109,094 products) | image features |
Baby | reviews (915,446 reviews) | metadata (71,317 products) | image features |
Digital Music | reviews (836,006 reviews) | metadata (279,899 products) | image features |
Musical Instruments | reviews (500,176 reviews) | metadata (84,901 products) | image features |
Amazon Instant Video | reviews (583,933 reviews) | metadata (30,648 products) | image features |
Citation
Please cite one or both of the following if you use the data in any way:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
pdf
Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
pdf
Code
Reading the data
Data can be treated as python dictionary objects. A simple script to read any of the above the data is as follows:
Convert to 'strict' json
The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:
Pandas data frame
This code reads the data into a pandas data frame:
Read image features
Example: compute average rating
Example: latent-factor model in mymedialite
Predicts ratings from a rating-only CSV file