RCV1: A New Benchmark Collection for Text Categorization Research

Authors: David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li

The Journal of Machine Learning Research, Volume 5

Pages 361 - 397

Published: 01 December 2004

Abstract

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection's properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.

Cited By

Zhang J, Li H, Allenby G (2024). Using Text Analysis in Parallel Mediation Analysis. Marketing Science 43:5 (953-970). Online publication date: 1-Sep-2024.
https://dl.acm.org/doi/10.1287/mksc.2023.0045
Zheng H, Zhang R, Wang H, Gurrin C, Kongkachandra R, Schoeffmann K, Dang-Nguyen D, Rossetto L, Satoh S, Zhou L (2024). Deep Image Clustering Based on Curriculum Learning and Density Information. Proceedings of the 2024 International Conference on Multimedia Retrieval (330-338). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3652583.3658081
Geeganage D, Xu Y, Li Y (2024). A Semantics-enhanced Topic Modelling Technique: Semantic-LDA. ACM Transactions on Knowledge Discovery from Data 18:4 (1-27). Online publication date: 12-Feb-2024.
https://dl.acm.org/doi/10.1145/3639409

Reviews

Reviewer: Fabrizio Sebastiani

Reuters-21578 is a familiar name to information retrieval (IR) researchers, since it denotes the standard benchmark of research in text categorization (TC), the subfield of IR concerned with the classification of textual documents into a set of predefined categories. A TC benchmark is a set of pre-classified documents, and it serves two main purposes: first, a text classifier can be automatically built by learning the semantic characteristics of the categories from a subset of the benchmark known as the training set; second, the accuracy of the classifier can be tested by comparing its classification decisions with those encoded in the remaining documents, collectively known as the test set. Throughout the 1990s, the availability of Reuters-21578, a set of 12,902 newswires pre-classified according to 118 topical classes, contributed to stimulating research in TC, by providing a common testing ground for competing TC approaches. It is now time, however, for Reuters-21578 to go into early retirement, especially because its size is now considered too small to warrant significant conclusions from experiments (in other IR subfields, standard benchmarks easily reach up to tens of millions of documents). This paper reports on a newly available test collection, Reuters Corpus Volume I (RCV1), which promises to be the new standard benchmark of TC research, and a better substitute for Reuters-21578, because of sheer size (804,414 documents), Extensible Markup Language (XML) tagging, a smaller amount of noise, and more clearly specified semantics. 
The authors' report on the coding practices that human coders used in preparing the data, on the nature of the data, and on the statistics about it, will be necessary background for anyone wishing to use RCV1 in their TC experiments, while the authors' report on experimental results they obtained by running several known TC techniques (ranging from example-based methods to support vector machines) on the data will provide baselines upon which new techniques will need to improve. This is an essential paper for those wishing to engage in text categorization research.
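
The two purposes of a benchmark described in the review (learning from a training set, then comparing decisions against a held-out test set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy documents, the category names, the word-overlap classifier, and the fixed split are invented, and none of it is taken from the paper or from RCV1 itself. The sketch also assumes one category per document, which is a simplification.

```python
from collections import Counter, defaultdict

# Toy pre-classified "benchmark": (text, category) pairs. All data is invented.
BENCHMARK = [
    ("oil prices rise as crude supply falls", "markets"),
    ("stocks rally on strong bank earnings", "markets"),
    ("shares slide as oil prices drop", "markets"),
    ("team wins cup final after late goal", "sports"),
    ("striker scores twice in league match", "sports"),
    ("coach praises team after cup match", "sports"),
]


def train(training_set):
    """Count how often each word occurs in each category's training documents."""
    word_counts = defaultdict(Counter)
    for text, category in training_set:
        word_counts[category].update(text.split())
    return word_counts


def classify(word_counts, text):
    """Assign the category whose training word counts overlap the text most."""
    words = text.split()
    # sorted() makes tie-breaking deterministic (first category alphabetically).
    return max(sorted(word_counts),
               key=lambda c: sum(word_counts[c][w] for w in words))


def accuracy(word_counts, test_set):
    """Fraction of test documents whose predicted category matches the label."""
    correct = sum(classify(word_counts, text) == category
                  for text, category in test_set)
    return correct / len(test_set)


# Hypothetical fixed split: two training documents per category, one test each.
training_set = BENCHMARK[0:2] + BENCHMARK[3:5]
test_set = [BENCHMARK[2], BENCHMARK[5]]

model = train(training_set)
print(accuracy(model, test_set))
```

The key design point is that the test documents never reach `train`, so the reported score reflects decisions on unseen data. A real experiment would replace the word-overlap rule with one of the learning methods the paper benchmarks and would use the collection's own split and evaluation measures.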

Information

Published In

The Journal of Machine Learning Research, Volume 5 (12/1/2004), 1571 pages

ISSN: 1532-4435

EISSN: 1533-7928

Publisher

JMLR.org

Publication History

Published: 01 December 2004

Qualifiers

Article

Bibliometrics

Total Citations: 802

Total Downloads: 3,670

Downloads (last 12 months): 214

Downloads (last 6 weeks): 30

Reflects downloads up to 18 Nov 2024

Citations

Cited By

Saha S, Imtiaz H (2024). Privacy-Preserving Non-Negative Matrix Factorization with Outliers. ACM Transactions on Knowledge Discovery from Data 18:3 (1-26). Online publication date: 12-Jan-2024.
https://dl.acm.org/doi/10.1145/3632961
Yang E, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Contextualization with SPLADE for High Recall Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2337-2341). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657919
Bin-Hezam R, Stevenson M, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). RLStop: A Reinforcement Learning Stopping Method for TAR. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2604-2608). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657911
Liu Y, Chen C, Huang H, Chen H, Chua T, Ngo C, Kumar R, Lauw H, Ka-Wei Lee R (2024). News-Driven Price Movement Forecasting with Label-Prior Graph Attention. Companion Proceedings of the ACM Web Conference 2024 (569-572). Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589335.3651539
He B, Nag S, Cui L, Wang S, Li Z, Goutam R, Li Z, Zhang H, Chua T, Ngo C, Kumar R, Lauw H, Ka-Wei Lee R (2024). Hierarchical Query Classification in E-commerce Search. Companion Proceedings of the ACM Web Conference 2024 (338-345). Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589335.3648332
Ma C, Yu R, Zhang Y, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). A Fast Similarity Matrix Calibration Method with Incomplete Query. Proceedings of the ACM Web Conference 2024 (1419-1430). Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645456
Wu J, Hu J, Zhang H, Wen Z (2024). Convergence Analysis of an Adaptively Regularized Natural Gradient Method. IEEE Transactions on Signal Processing 72 (2527-2542). Online publication date: 9-May-2024.
https://dl.acm.org/doi/10.1109/TSP.2024.3398496