LLP-Bench: A Large Scale Tabular Benchmark for Learning from Label Proportions

Authors: Anand Brahmbhatt, Aravindan Raghuveer

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

Pages 4374 - 4381

https://doi.org/10.1145/3627673.3680032

Published: 21 October 2024

Abstract

As large neural models become increasingly accurate and powerful, they raise privacy and transparency concerns about data usage. In response, data platforms, regulations, and user expectations are rapidly evolving toward enforcing privacy via aggregation. We focus on the use case of online advertising, where the emergence of aggregate data is imminent and can significantly impact the multi-billion-dollar industry. In aggregated datasets, labels are assigned to groups of data points rather than to individual data points. This leads to a weakly supervised task, Learning from Label Proportions (LLP), in which a model is trained on groups (a.k.a. bags) of instances and their corresponding label proportions to predict labels for individual instances. Although learning on aggregate data is becoming increasingly popular because of privacy concerns, there is no large-scale benchmark for measuring performance and guiding improvements on this important task. We propose LLP-Bench, a web-scale benchmark with ~70 datasets and 45 million datapoints. To the best of our knowledge, LLP-Bench is the first large-scale tabular LLP benchmark with extensive diversity in its constituent datasets; it is realistic in terms of both the sponsored search datasets used and the aggregation mechanisms followed. Through more than 3000 experiments, we compare the performance of 9 SOTA methods in detail. To the best of our knowledge, this is the first study that compares diverse approaches in such depth.
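To make the LLP setup concrete, the following is a minimal sketch of training on bags and label proportions. It is not one of the methods evaluated in LLP-Bench and does not reproduce the benchmark's aggregation mechanisms. It assumes binary labels, a plain logistic model, and a squared-error loss between predicted and observed bag proportions. The function names (`bag_proportions`, `bag_mse`, `fit_llp`) and the hyperparameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_proportions(y, bag_ids):
    """Aggregate instance labels (0/1) into one label proportion per bag."""
    bags = np.unique(bag_ids)
    props = np.array([y[bag_ids == b].mean() for b in bags])
    return bags, props

def bag_mse(probs, bag_ids, bags, props):
    """Mean squared gap between predicted and observed bag proportions."""
    pred = np.array([probs[bag_ids == b].mean() for b in bags])
    return float(np.mean((pred - props) ** 2))

def fit_llp(X, bag_ids, bags, props, lr=0.5, steps=300):
    """Gradient descent on bag_mse for a logistic model (illustrative only).

    Only bag-level proportions are used; instance labels are never seen.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = sigmoid(X @ w)
        grad = np.zeros_like(w)
        for b, p in zip(bags, props):
            m = bag_ids == b
            resid = s[m].mean() - p  # predicted minus observed proportion
            # Gradient of the bag's mean prediction: mean of s*(1-s)*x.
            grad += 2.0 * resid * ((s[m] * (1.0 - s[m]))[:, None] * X[m]).mean(axis=0)
        w -= lr * grad / len(bags)
    return w
```

After fitting, instance-level predictions are obtained as `sigmoid(X @ w)`, which matches the goal stated above: training on bag-level proportions, predicting for individual instances.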

Index Terms

1. Computing methodologies
  1. Machine learning
2. Security and privacy


Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

October 2024

5705 pages

ISBN: 9798400704369

DOI: 10.1145/3627673

General Chairs: Edoardo Serra (Boise State University, USA), Francesca Spezzano (Boise State University, USA)

Copyright © 2024 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise, ID, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
