PECOS: Prediction for Enormous and Correlated Output Spaces

Authors: Wei Li, Cho-Jui Hsieh

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 4848 - 4849

https://doi.org/10.1145/3534678.3542629

Published: 14 August 2022

Abstract

Unlike traditional machine learning tasks and benchmarks, real-world problems often come with enormous output spaces, from hundreds of thousands of diseases in medical diagnosis to millions of items and billions of websites in product and web search engines. Unfortunately, conventional machine learning tools and libraries cannot handle such large-scale output spaces efficiently and accurately. To address this issue, PECOS (Prediction for Enormous and Correlated Output Spaces) [11] is a state-of-the-art, open-source machine learning library that provides high-level, user-friendly interfaces to both linear and deep learning models while remaining flexible enough to solve diverse machine learning problems. Specifically, PECOS simplifies the semantic indexing used to organize enormous output spaces, which makes training and prediction over correlated output labels orders of magnitude more efficient. PECOS has already been adopted in real-world, large-scale products such as semantic search at Amazon [1], and has achieved state-of-the-art results on public extreme multi-label classification (XMC) benchmarks [2, 11, 12] and in various downstream applications [3, 7, 9].

In this tutorial, we introduce several key functions and features of the PECOS library. Through real-world examples in product recommendation and natural language processing, attendees will learn how to efficiently train large-scale machine learning models for enormous output spaces and obtain predictions in less than 1 millisecond for a single input against a million labels. We will also show how the built-in utilities in PECOS handle diverse machine learning problems and data formats. By the end of the tutorial, attendees should be able to adapt these concepts to their own projects and address different machine learning problems with enormous output spaces.
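The idea of semantic indexing mentioned in the abstract can be illustrated with a small, self-contained sketch. This is not the PECOS API and not its actual algorithm: the function names (`build_index`, `predict`), the synthetic label embeddings, and the single-level spherical k-means are all assumptions made for illustration. The sketch groups correlated labels into clusters, scores the cluster centroids first, and then scores only the labels inside the best few clusters.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v] if norm > 0 else list(v)

def build_index(label_vecs, n_clusters, n_iters=5):
    """Toy semantic index (hypothetical helper): spherical k-means over label embeddings."""
    # Deterministic init: evenly spaced labels serve as the initial centroids.
    step = len(label_vecs) // n_clusters
    centroids = [normalize(label_vecs[i * step]) for i in range(n_clusters)]
    assign = [0] * len(label_vecs)
    for _ in range(n_iters):
        # Assign each label to its most similar centroid.
        assign = [max(range(n_clusters), key=lambda c: dot(v, centroids[c]))
                  for v in label_vecs]
        # Recompute each centroid as the normalized mean of its members.
        for c in range(n_clusters):
            members = [v for v, a in zip(label_vecs, assign) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    clusters = [[i for i, a in enumerate(assign) if a == c]
                for c in range(n_clusters)]
    return centroids, clusters

def predict(query, label_vecs, centroids, clusters, beam=2, topk=3):
    """Score centroids first, then only the labels inside the best `beam` clusters."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:beam] for i in clusters[c]]
    scored = sorted(candidates, key=lambda i: dot(query, label_vecs[i]), reverse=True)
    n_scored = len(centroids) + len(candidates)  # number of dot products computed
    return scored[:topk], n_scored

# Synthetic, correlated label embeddings: 8 groups of 50 labels, each group
# concentrated around one coordinate axis (an assumption for this demo only).
rng = random.Random(0)
N_GROUPS, PER_GROUP = 8, 50
label_vecs = []
for g in range(N_GROUPS):
    for _ in range(PER_GROUP):
        v = [rng.uniform(-0.05, 0.05) for _ in range(N_GROUPS)]
        v[g] += 1.0
        label_vecs.append(normalize(v))

centroids, clusters = build_index(label_vecs, n_clusters=N_GROUPS)
query = label_vecs[123]
top, n_scored = predict(query, label_vecs, centroids, clusters, beam=2, topk=3)
# Searching every cluster is equivalent to exhaustive scoring of all labels.
exact, _ = predict(query, label_vecs, centroids, clusters, beam=N_GROUPS, topk=3)
print(f"top labels: {top}; scored {n_scored} vectors instead of {len(label_vecs)}")
```

On this toy data the restricted search returns the same top labels as exhaustive scoring while computing far fewer dot products. With real data the beam width trades accuracy against speed, and a multi-level tree would reduce the cost further.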

References

[1] Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, and Inderjit S Dhillon. Extreme multi-label learning for semantic matching in product search. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2021.

[2] Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit S Dhillon. Taming pretrained transformers for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3163–3171, 2020.

[3] Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang, Olgica Milenkovic, and Inderjit S Dhillon. Node feature extraction by self-supervised multi-scale neighborhood prediction. In International Conference on Learning Representations (ICLR), 2022.

[4] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, 2020.

[5] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.

[6] Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2020.

[7] Rajat Sen, Alexander Rakhlin, Lexing Ying, Rahul Kidambi, Dean Foster, Daniel N Hill, and Inderjit S Dhillon. Top-k extreme contextual bandits with arm hierarchy. In International Conference on Machine Learning, pages 9422–9433. PMLR, 2021.

[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[9] Nishant Yadav, Rajat Sen, Daniel N Hill, Arya Mazumdar, and Inderjit S Dhillon. Session-aware query auto-completion using extreme multi-label ranking. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3835–3844, 2021.

[10] Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit S Dhillon. A greedy approach for budgeted maximum inner product search. In Advances in Neural Information Processing Systems, pages 5453–5462, 2017.

[11] Hsiang-Fu Yu, Kai Zhong, Jiong Zhang, Wei-Cheng Chang, and Inderjit S Dhillon. PECOS: Prediction for enormous and correlated output spaces. Journal of Machine Learning Research, 2022.

[12] Jiong Zhang, Wei-Cheng Chang, Hsiang-Fu Yu, and Inderjit S Dhillon. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. In Advances in Neural Information Processing Systems, 2021.

Index Terms

PECOS: Prediction for Enormous and Correlated Output Spaces
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN: 9781450393850

DOI: 10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Conference

KDD '22

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

Jiang J, Chang W, Zhang J, Hsieh C, Yu H, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). Entity Disambiguation with Extreme Multi-label Ranking. Proceedings of the ACM Web Conference 2024, pages 4172–4180. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645498

Ye H, Sunderraman R, Ji S (2024). MatchXML: An Efficient Text-Label Matching Framework for Extreme Multi-Label Text Classification. IEEE Transactions on Knowledge and Data Engineering, 36:9, pages 4781–4793. Online publication date: Sep-2024.
https://doi.org/10.1109/TKDE.2024.3374750

Zhao F, Tao R, Wang W, Cui B, Xu Y, Ai Q (2024). Collaborative learning of supervision and correlation for generalized zero-shot extreme multi-label learning. Applied Intelligence, 54:8, pages 6285–6298. Online publication date: 9-May-2024.
https://dl.acm.org/doi/10.1007/s10489-024-05498-8

Chen H, Mason C, Wang Q, Zhao Y (2024). DBSSM: Deep BERT-Based Semantic Skill Matching from Resumes to a Public Skill Taxonomy. AI 2024: Advances in Artificial Intelligence, pages 316–328. Online publication date: 18-Nov-2024.
https://doi.org/10.1007/978-981-96-0348-0_23

Chien E, Zhang J, Hsieh C, Jiang J, Chang W, Milenkovic O, Yu H, Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J (2023). PINA. Proceedings of the 40th International Conference on Machine Learning, pages 5616–5630. Online publication date: 23-Jul-2023.
https://dl.acm.org/doi/10.5555/3618408.3618632

Cai Q, Niu L, Shang X, Ding H (2023). A Self-Supervised Tree-Structured Framework for Fine-Grained Classification. Applied Sciences, 13:7, article 4453. Online publication date: 31-Mar-2023.
https://doi.org/10.3390/app13074453

Balaji B, Vunnava V, Domingo N, Gupta S, Gupta H, Guest G, Srinivasan A (2023). Flamingo: Environmental Impact Factor Matching for Life Cycle Assessment with Zero-shot Machine Learning. ACM Journal on Computing and Sustainable Societies, 1:2, pages 1–23. Online publication date: 6-Dec-2023.
https://dl.acm.org/doi/10.1145/3616385

Zhang J, Wang Y, Chang W, Li W, Jiang J, Hsieh C, Yu H, Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (2023). Build Faster with Less: A Journey to Accelerate Sparse Model Building for Semantic Matching in Product Search. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4960–4966. Online publication date: 21-Oct-2023.
https://dl.acm.org/doi/10.1145/3583780.3614661

Pavlovski M, Ravindran S, Gligorijevic D, Agrawal S, Stojkovic I, Segura-Nunez N, Gligorijevic J, Singh A, Sun Y, Akoglu L, Gunopulos D, Yan X, Kumar R, Ozcan F, Ye J (2023). Extreme Multi-Label Classification for Ad Targeting using Factorization Machines. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4705–4716. Online publication date: 6-Aug-2023.
https://dl.acm.org/doi/10.1145/3580305.3599822

Monath N, Zaheer M, McCallum A, Singh A, Sun Y, Akoglu L, Gunopulos D, Yan X, Kumar R, Ozcan F, Ye J (2023). Online Level-wise Hierarchical Clustering. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1733–1745. Online publication date: 6-Aug-2023.
https://dl.acm.org/doi/10.1145/3580305.3599455
