
PECOS: prediction for enormous and correlated output spaces

Published: 01 January 2022

Abstract

Many large-scale applications amount to finding relevant results from an enormous output space of potential candidates: for example, finding the best-matching product in a large catalog, or suggesting related search phrases on a search engine. The size of the output space for such problems can range from millions to billions and, in some applications, can even be infinite. Moreover, training data is often limited for the "long-tail" items in the output space. Fortunately, items in the output space are often correlated, which presents an opportunity to alleviate this data sparsity issue. In this paper, we propose Prediction for Enormous and Correlated Output Spaces (PECOS), a versatile and modular machine learning framework for solving prediction problems with very large output spaces, and apply it to the eXtreme Multilabel Ranking (XMR) problem: given an input instance, find and rank the most relevant items from an enormous but fixed and finite output space. PECOS operates in three phases: (i) it organizes the output space using a semantic indexing scheme; (ii) it uses the index to narrow the output space down by orders of magnitude with a machine-learned matching scheme; and (iii) it ranks the matched items with a final ranking scheme. The versatility and modularity of PECOS allow easy plug-and-play of different choices for the indexing, matching, and ranking phases. The indexing and matching phases alleviate the data sparsity issue by leveraging correlations across items in the output space. For the critical matching phase, we develop a recursive machine-learned matching strategy with both linear and neural matchers. Applied to XMR problems where the input instances are textual, the recursive Transformer matcher gives state-of-the-art accuracy at the cost of two orders of magnitude more training time than the recursive linear matcher. For example, on a dataset whose output space contains 2.8 million labels, the recursive Transformer matcher yields a 6% increase in precision@1 (from 48.6% to 54.2%) over the recursive linear matcher but takes 100x longer to train. It is thus up to the practitioner to weigh this trade-off and decide whether the increased training time and infrastructure cost are warranted for their application; indeed, the flexibility of the PECOS framework seamlessly allows different strategies to be used. We also develop very fast inference procedures that enable real-time XMR predictions; for example, inference takes less than 1 millisecond per input on the dataset with 2.8 million labels.
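To make the three-phase design concrete, here is a minimal, self-contained sketch of the indexing/matching/ranking pipeline using plain linear models on synthetic data. This is an illustration of the idea only, not the PECOS library's API: the toy data generator, the PIFA-style label embeddings, and the `predict` helper are assumptions made for exposition, and the actual framework applies indexing and matching recursively over a label hierarchy, with sparse data structures and far more scalable solvers.

```python
# Illustrative sketch of the three PECOS phases on a toy problem.
# NOT the PECOS library API; all names and data here are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy problem: 200 instances, 20 features, 16 labels, 4 hidden label groups.
n, d, L, K = 200, 20, 16, 4
X = rng.normal(size=(n, d))
group = rng.integers(0, K, size=L)              # hidden "topic" of each label
Y = np.zeros((n, L), dtype=int)
for i in range(n):
    topic = rng.integers(0, K)                  # each instance favors one topic
    members = np.flatnonzero(group == topic)
    Y[i, members] = rng.integers(0, 2, size=members.size)

# Phase 1 -- semantic indexing: embed each label as the mean feature vector
# of its positive instances (a PIFA-style embedding), then cluster labels.
counts = np.maximum(Y.sum(axis=0), 1)
label_emb = (Y.T @ X) / counts[:, None]
cluster_of = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(label_emb)

# Phase 2 -- matching: one classifier per cluster predicts whether an input
# has any relevant label inside that cluster, shrinking the candidate set.
Yc = np.stack([(Y[:, cluster_of == k].sum(axis=1) > 0).astype(int)
               for k in range(K)], axis=1)
matchers = [LogisticRegression(max_iter=1000).fit(X, Yc[:, k]) for k in range(K)]

# Phase 3 -- ranking: one scorer per label, evaluated at prediction time only
# on the labels that survive the matching phase.
rankers = [LogisticRegression(max_iter=1000).fit(X, Y[:, l]) for l in range(L)]

def predict(x, beam=2, topk=3):
    """Match x to the `beam` best clusters, then rank only their labels."""
    cluster_scores = np.array([m.predict_proba(x[None])[0, 1] for m in matchers])
    keep = np.argsort(-cluster_scores)[:beam]
    candidates = np.flatnonzero(np.isin(cluster_of, keep))
    scores = np.array([rankers[l].predict_proba(x[None])[0, 1] for l in candidates])
    return candidates[np.argsort(-scores)][:topk]

print("top labels for instance 0:", predict(X[0]))
```

The matching phase is what makes inference cheap: with K clusters and beam width b, a query touches roughly b*L/K of the L per-label scorers instead of all of them, which is the mechanism behind sub-millisecond inference even when L is in the millions.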

Published In

The Journal of Machine Learning Research, Volume 23, Issue 1
January 2022
15939 pages
ISSN: 1532-4435
EISSN: 1533-7928
License: CC-BY 4.0

Publisher

JMLR.org

Publication History

Received: 01 January 2021
Revised: 01 January 2022
Accepted: 01 February 2022
Published: 01 January 2022
Published in JMLR Volume 23, Issue 1

Author Tags

1. extreme multi-label text classification
2. large output space learning
3. transformers
