DOI: 10.1145/3627673.3680032
research-article
Open access

LLP-Bench: A Large Scale Tabular Benchmark for Learning from Label Proportions

Published: 21 October 2024

Abstract

As large neural models become increasingly accurate and powerful, they have raised privacy and transparency concerns about data usage. In response, data platforms, regulations, and user expectations are rapidly evolving, leading to privacy being enforced via aggregation. We focus on the use case of online advertising, where the emergence of aggregate data is imminent and can significantly impact the multi-billion-dollar industry. In aggregated datasets, labels are assigned to groups of data points rather than to individual data points. This leads to the formulation of a weakly supervised task, Learning from Label Proportions (LLP), in which a model is trained on groups (a.k.a. bags) of instances and their corresponding label proportions to predict labels for individual instances. While learning on aggregate data for privacy reasons is becoming increasingly popular, there is no large-scale benchmark for measuring performance and guiding improvements on this important task. We propose LLP-Bench, a web-scale benchmark with roughly 70 datasets and 45 million datapoints. To the best of our knowledge, LLP-Bench is the first large-scale tabular LLP benchmark with extensive diversity in its constituent datasets, and it is realistic in both the sponsored search datasets used and the aggregation mechanisms followed. Through more than 3000 experiments, we compare the performance of 9 SOTA methods in detail. To the best of our knowledge, this is the first study that compares diverse approaches in such depth.
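The LLP setup described in the abstract can be illustrated with a minimal sketch. It assumes binary labels, bags formed by grouping instances on a single key, and a plain logistic model scored with a squared-error proportion-matching loss. The function names `make_bags` and `bag_proportion_loss` are hypothetical and chosen for illustration only; this is not the benchmark's code, nor necessarily any of the methods it compares.

```python
import numpy as np

def make_bags(keys, labels):
    """Group instance indices by key and compute each bag's label proportion.

    Assumption: bags are formed by exact match on one grouping key.
    """
    labels = np.asarray(labels, dtype=float)
    groups = {}
    for idx, key in enumerate(keys):
        groups.setdefault(key, []).append(idx)
    bag_indices = [np.array(ix) for ix in groups.values()]
    # Only these per-bag averages are visible to the learner, not the labels.
    proportions = np.array([labels[ix].mean() for ix in bag_indices])
    return bag_indices, proportions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_proportion_loss(w, X, bag_indices, proportions):
    """Mean squared gap between the average predicted probability in each bag
    and that bag's observed label proportion (an illustrative loss choice)."""
    preds = sigmoid(X @ w)
    bag_means = np.array([preds[ix].mean() for ix in bag_indices])
    return float(np.mean((bag_means - proportions) ** 2))

if __name__ == "__main__":
    # Toy data: five instances, grouped into three bags by a categorical key.
    keys = ["a", "b", "a", "b", "c"]
    labels = [1, 0, 0, 0, 1]
    X = np.ones((5, 2))
    bags, props = make_bags(keys, labels)
    print(props)
    print(bag_proportion_loss(np.zeros(2), X, bags, props))
```

The model still produces a prediction for every instance; only the training signal is aggregated, which is what makes the task weakly supervised.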


      Published In

      CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
      October 2024
      5705 pages
      ISBN:9798400704369
      DOI:10.1145/3627673
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. benchmark
      2. classification
      3. learning from label proportions
      4. regression

      Qualifiers

      • Research-article

      Conference

      CIKM '24

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
