Amazon Product Image

Julian McAuley, UCSD

Description

This dataset contains product reviews and metadata from Amazon, including 143.7 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Files

Complete review data

Please see the per-category files below, and only download these (large!) files if you absolutely need them:

raw review data (20gb) - all 143.7 million reviews

The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the two files below:

user review data (18gb) - duplicate items removed (83.31 million reviews), sorted by user

product review data (19gb) - duplicate items removed, sorted by product

Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks.

aggressively deduplicated data (18gb) - no duplicates whatsoever (83.08 million reviews)
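The exact criteria used to build the aggressively deduplicated file are not stated here. As a rough illustration only, the sketch below assumes a duplicate is any review whose text matches an earlier review after trimming whitespace and lowercasing, regardless of who wrote it. The function name `drop_duplicate_texts` and this matching rule are assumptions, not the author's method.

```python
import hashlib

def drop_duplicate_texts(reviews):
    """Yield reviews whose normalized text has not been seen before.

    Assumption: "duplicate" means identical review text after
    stripping whitespace and lowercasing; the reviewer is ignored.
    """
    seen = set()
    for review in reviews:
        text = review.get('reviewText', '').strip().lower()
        if not text:
            # Reviews without text are passed through, not merged together.
            yield review
            continue
        # Store a fixed-size digest instead of the full text to save memory.
        key = hashlib.sha256(text.encode('utf-8')).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield review

# Made-up records for illustration (not taken from the dataset).
example_reviews = [
    {'reviewerID': 'U1', 'reviewText': 'Great purchase though!'},
    {'reviewerID': 'U2', 'reviewText': 'great purchase though!  '},
    {'reviewerID': 'U3', 'reviewText': 'Hard to read.'},
]
unique_reviews = list(drop_duplicate_texts(example_reviews))
```

Keeping only the first occurrence is one possible choice; which copy to keep is not specified above.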

Format is one-review-per-line in (loose) json. See files below for further help reading the data.
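To illustrate what "loose" means in practice: the lines use Python-style single quotes, which a strict JSON parser is expected to reject. The sketch below shows this on a line shortened from the sample review, and uses `ast.literal_eval` as a safer stand-in for `eval`, assuming every line is a plain dictionary literal. The helper name `parse_loose_line` is hypothetical.

```python
import ast
import json

def parse_loose_line(line):
    """Parse one 'loose json' line into a dict (hypothetical helper)."""
    # literal_eval only accepts Python literals, so it cannot execute code.
    return ast.literal_eval(line)

# Shortened version of the sample review, for illustration.
line = "{'asin': '0000013714', 'overall': 5.0, 'helpful': [2, 3]}"

try:
    json.loads(line)
    strict_ok = True
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    strict_ok = False

record = parse_loose_line(line)
```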

Sample review:

{ 'reviewerID': 'A2SUAM1J3GNN3B', 'asin': '0000013714', 'reviewerName': 'J. McDonald', 'helpful': [2, 3], 'reviewText': 'I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!', 'overall': 5.0, 'summary': 'Heavenly Highway Hymns', 'unixReviewTime': 1252800000, 'reviewTime': '09 13, 2009'}

where

  • reviewerID - ID of the reviewer, e.g. A1RSDE90N6RSZF
  • asin - ID of the product, e.g. 0000013714
  • reviewerName - name of the reviewer
  • helpful - helpfulness rating of the review, e.g. 2/3
  • reviewText - text of the review
  • overall - rating of the product
  • summary - summary of the review
  • unixReviewTime - time of the review (unix time)
  • reviewTime - time of the review (raw)
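As a small sketch of how two of these fields might be used, the helpers below turn `helpful` into a fraction and `unixReviewTime` into a calendar date. This assumes `helpful` is a pair of [helpful votes, total votes], as the "2/3" example suggests, and that the timestamp is in seconds and read as UTC. The function names are hypothetical.

```python
from datetime import datetime, timezone

def helpful_fraction(review):
    """Return helpful votes / total votes, or None if nobody voted."""
    # Assumption: 'helpful' is [helpful_votes, total_votes].
    helpful_votes, total_votes = review['helpful']
    if total_votes == 0:
        return None
    return helpful_votes / total_votes

def review_date(review):
    """Convert 'unixReviewTime' (seconds, read as UTC) to a date."""
    stamp = review['unixReviewTime']
    return datetime.fromtimestamp(stamp, tz=timezone.utc).date()

# Fields copied from the sample review above.
sample = {'helpful': [2, 3], 'unixReviewTime': 1252800000}
fraction = helpful_fraction(sample)
day = review_date(sample)
```

For the sample review, the converted date agrees with its raw `reviewTime` of '09 13, 2009'.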

Metadata

Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

metadata (1.9gb) - metadata for 9.4 million products

Sample metadata:

{ 'asin': '0000031852', 'title': 'Girls Ballet Tutu Zebra Hot Pink', 'price': 3.17, 'imUrl': 'http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg', 'related': { 'also_bought': ['B00JHONN1S', 'B002BZX8Z6', 'B00D2K1M3O', '0000031909', 'B00613WDTQ', 'B00D0WDS9A', 'B00D0GCI8S', '0000031895', 'B003AVKOP2', 'B003AVEU6G', 'B003IEDM9Q', 'B002R0FA24', 'B00D23MC6W', 'B00D2K0PA0', 'B00538F5OK', 'B00CEV86I6', 'B002R0FABA', 'B00D10CLVW', 'B003AVNY6I', 'B002GZGI4E', 'B001T9NUFS', 'B002R0F7FE', 'B00E1YRI4C', 'B008UBQZKU', 'B00D103F8U', 'B007R2RM8W'], 'also_viewed': ['B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00D23MC6W', 'B00AFDOPDA', 'B00E1YRI4C', 'B002GZGI4E', 'B003AVKOP2', 'B00D9C1WBM', 'B00CEV8366', 'B00CEUX0D8', 'B0079ME3KU', 'B00CEUWY8K', 'B004FOEEHC', '0000031895', 'B00BC4GY9Y', 'B003XRKA7A', 'B00K18LKX2', 'B00EM7KAG6', 'B00AMQ17JA', 'B00D9C32NI', 'B002C3Y6WG', 'B00JLL4L5Y', 'B003AVNY6I', 'B008UBQZKU', 'B00D0WDS9A', 'B00613WDTQ', 'B00538F5OK', 'B005C4Y4F6', 'B004LHZ1NY', 'B00CPHX76U', 'B00CEUWUZC', 'B00IJVASUE', 'B00GOR07RE', 'B00J2GTM0W', 'B00JHNSNSM', 'B003IEDM9Q', 'B00CYBU84G', 'B008VV8NSQ', 'B00CYBULSO', 'B00I2UHSZA', 'B005F50FXC', 'B007LCQI3S', 'B00DP68AVW', 'B009RXWNSI', 'B003AVEU6G', 'B00HSOJB9M', 'B00EHAGZNA', 'B0046W9T8C', 'B00E79VW6Q', 'B00D10CLVW', 'B00B0AVO54', 'B00E95LC8Q', 'B00GOR92SO', 'B007ZN5Y56', 'B00AL2569W', 'B00B608000', 'B008F0SMUC', 'B00BFXLZ8M'], 'bought_together': ['B002BZX8Z6'] }, 'salesRank': {'Toys & Games': 211836}, 'brand': 'Coxlures', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance']]}

where

  • asin - ID of the product, e.g. 0000031852
  • title - name of the product
  • price - price in US dollars (at time of crawl)
  • imUrl - url of the product image
  • related - related products (also bought, also viewed, bought together, buy after viewing)
  • salesRank - sales rank information
  • brand - brand name
  • categories - list of categories the product belongs to
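The `related` field can be flattened into an edge list for building the also viewed / also bought graphs. The following is a minimal sketch; the function name `related_edges` is hypothetical, and it assumes that some products may lack the `related` key.

```python
def related_edges(product):
    """Yield (source_asin, relation, target_asin) for one metadata record."""
    # Assumption: 'related' may be missing, so fall back to an empty dict.
    related = product.get('related', {})
    for relation, targets in related.items():
        for target in targets:
            yield product['asin'], relation, target

# Shortened version of the sample metadata, for illustration.
product = {
    'asin': '0000031852',
    'related': {
        'also_bought': ['B00JHONN1S', 'B002BZX8Z6'],
        'bought_together': ['B002BZX8Z6'],
    },
}
edges = list(related_edges(product))
```

Each triple can then be written out or loaded into a graph library of your choice.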

Visual Features

We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.

visual features (141gb) - visual features for all products
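Given the layout described above, each product occupies a fixed-size record, so the number of products in a feature file can be estimated from its size alone. This sketch assumes 4-byte floats (consistent with the reader code further down, which reads 4 bytes per value) and no header or padding in the file. The name `count_products` is hypothetical.

```python
import os

ASIN_BYTES = 10            # product ID, 10 characters
FLOATS_PER_PRODUCT = 4096  # feature dimensions per product
FLOAT_BYTES = 4            # assumption: single-precision floats
RECORD_BYTES = ASIN_BYTES + FLOATS_PER_PRODUCT * FLOAT_BYTES

def count_products(path):
    """Return the number of fixed-size records in a feature file."""
    size = os.path.getsize(path)
    if size % RECORD_BYTES != 0:
        # A remainder suggests a truncated download or a different layout.
        raise ValueError('file size is not a multiple of the record size')
    return size // RECORD_BYTES
```

Checking the file size this way is a cheap test for a truncated download before reading any features.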

Per-category files

Below are files for individual product categories, which have already had duplicate item reviews removed.

Books: reviews, metadata, image features
Electronics: reviews, metadata, image features
Movies and TV: reviews, metadata, image features
CDs and Vinyl: reviews, metadata, image features
Clothing, Shoes and Jewelry: reviews, metadata, image features
Home and Kitchen: reviews, metadata, image features
Kindle Store: reviews, metadata, image features
Sports and Outdoors: reviews, metadata, image features
Cell Phones and Accessories: reviews, metadata, image features
Health and Personal Care: reviews, metadata, image features
Toys and Games: reviews, metadata, image features
Video Games: reviews, metadata, image features
Tools and Home Improvement: reviews, metadata, image features
Beauty: reviews, metadata, image features
Apps for Android: reviews, metadata, image features
Office Products: reviews, metadata, image features
Pet Supplies: reviews, metadata, image features
Automotive: reviews, metadata, image features
Grocery and Gourmet Food: reviews, metadata, image features
Patio, Lawn and Garden: reviews, metadata, image features
Baby: reviews, metadata, image features
Digital Music: reviews, metadata, image features
Musical Instruments: reviews, metadata, image features
Amazon Instant Video: reviews, metadata, image features

Citation

Please cite the following if you use the data in any way:

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015

Code

Reading the data

Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:

    import gzip

    def parse(path):
        # Each line is one dictionary literal, so eval() turns it into a dict.
        with gzip.open(path, 'rt') as g:
            for l in g:
                yield eval(l)

Convert to 'strict' json

The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:

    import gzip
    import json

    def parse(path):
        # Re-serialize each loose-json line as strict json.
        with gzip.open(path, 'rt') as g:
            for l in g:
                yield json.dumps(eval(l))

    with open('output.strict', 'w') as f:
        for l in parse('reviews_Video_Games.json.gz'):
            f.write(l + '\n')

Read image features

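    import struct

    def readImageFeatures(path):
        with open(path, 'rb') as f:
            while True:
                asin = f.read(10)  # 10-character product ID
                if not asin:       # empty read means end of file
                    break
                # 4096 four-byte floats follow each product ID
                feature = list(struct.unpack('4096f', f.read(4 * 4096)))
                yield asin.decode('ascii'), feature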

Example: compute average rating

    ratings = []
    for review in parse('reviews_Video_Games.json.gz'):
        ratings.append(review['overall'])
    print(sum(ratings) / len(ratings))
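The same pattern extends to per-product statistics. As a minimal sketch, the hypothetical helper `average_rating_by_product` below groups ratings by `asin`; it accepts any iterable of review dictionaries, so the output of `parse` can be passed in directly.

```python
from collections import defaultdict

def average_rating_by_product(reviews):
    """Return {asin: mean of 'overall'} for an iterable of review dicts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for review in reviews:
        totals[review['asin']] += review['overall']
        counts[review['asin']] += 1
    return {asin: totals[asin] / counts[asin] for asin in totals}

# Made-up reviews for illustration (not taken from the dataset).
example = [
    {'asin': 'A', 'overall': 5.0},
    {'asin': 'A', 'overall': 3.0},
    {'asin': 'B', 'overall': 2.0},
]
averages = average_rating_by_product(example)
```

Because it keeps only running totals and counts, the helper never holds the full list of reviews in memory, which matters for the larger files.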