Product Reviews for Ordinal Quantification

1. TU Dortmund University
2. Consiglio Nazionale delle Ricerche

This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but the distribution of labels in unlabeled sets of data.
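As a minimal illustration of the quantification target, the following sketch computes the label distribution (prevalence vector) of a sample of 5-star ratings. The function name `prevalences` and the toy ratings are hypothetical and not part of the data set; a quantifier would have to estimate this vector without access to the true labels.

```python
from collections import Counter

def prevalences(labels, classes=(1, 2, 3, 4, 5)):
    """Return the relative frequency of each class in `labels`."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in classes]

# Toy sample of star ratings (made up for illustration).
sample = [5, 4, 5, 3, 1, 5, 4, 2, 5, 4]
print(prevalences(sample))  # [0.1, 0.1, 0.1, 0.3, 0.4]
```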

The data is extracted from the McAuley data set of Amazon product reviews, where the task is to predict the 5-star rating of each textual review. We have sampled this data according to three protocols that are suited for quantification research.

The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ(50%), is a variant thereof, where only the smoothest 50% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task. The third protocol considers "real" distributions of labels. These distributions stem from actual products in the original data set.
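The first two protocols can be sketched roughly as follows. This is an illustration under stated assumptions, not the extraction code from the repository: all function names are hypothetical, and the smoothness score (sum of squared second-order differences between neighboring classes) is an assumed stand-in for whatever measure the actual scripts use. The sketch only draws the target label distributions; a full protocol would then sample instances from a labeled pool to match each distribution.

```python
import random

def sample_app_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex.

    Normalizing i.i.d. Exponential(1) draws yields a Dirichlet(1, ..., 1)
    sample, which is the uniform distribution over the simplex.
    """
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def jaggedness(p):
    """Assumed smoothness score: sum of squared second-order differences.

    Lower values mean a smoother distribution over the ordered classes.
    """
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2
               for i in range(1, len(p) - 1))

def smoothest_fraction(samples, fraction=0.5):
    """Keep the given fraction of samples with the lowest jaggedness."""
    ranked = sorted(samples, key=jaggedness)
    return ranked[:int(len(ranked) * fraction)]

rng = random.Random(0)  # fixed seed for reproducibility
app = [sample_app_prevalences(5, rng) for _ in range(1000)]  # APP
app_oq = smoothest_fraction(app, 0.5)                        # APP-OQ(50%)
print(len(app), len(app_oq))  # 1000 500
```

Filtering after sampling keeps the APP draws untouched, so both protocols can share the same pool of candidate distributions.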

The data is represented by a RoBERTa embedding. In our experience, logistic regression classifiers work well with this representation.

You can also extract our data sets yourself, for instance if you require a raw textual representation. The original McAuley data set is already public, and we provide all of our extraction scripts.

Extraction scripts and experiments: https://github.com/mirkobunse/regularized-oq

Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/

Files

amazon-oq-bk.zip (37.0 GB)
md5: 1f82ff97e92587dd2ca6b752d2db7136
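
Given the size of the archive, it may be worth checking the download against the published MD5 checksum. The sketch below uses only Python's standard `hashlib`; the helper name `md5_of_file` and the local file path in the usage note are assumptions.

```python
import hashlib

# Checksum published alongside amazon-oq-bk.zip.
EXPECTED_MD5 = "1f82ff97e92587dd2ca6b752d2db7136"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that the 37.0 GB archive never has to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved to the current directory, `md5_of_file("amazon-oq-bk.zip") == EXPECTED_MD5` should evaluate to `True` for an intact download.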

Additional details

Funding

SoBigData-PlusPlus – SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics 871042: European Commission
AI4Media – A European Excellence Centre for Media, Society and Democracy 951911: European Commission

References

M. Bunse, A. Moreo, F. Sebastiani, M. Senz (2022). Ordinal Quantification through Regularization.
J. McAuley, C. Targett, Q. Shi, A. van den Hengel (2015). Image-based recommendations on styles and substitutes.

Resource type: Dataset
Publisher: Zenodo
Languages: English
Created: July 24, 2023
Modified: October 4, 2023
