RCV1: A New Benchmark Collection for Text Categorization Research

Published: 01 December 2004

Abstract

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real-world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection's properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.

Reviews

Fabrizio Sebastiani

Reuters-21578 is a familiar name to information retrieval (IR) researchers, since it denotes the standard benchmark of research in text categorization (TC), the subfield of IR concerned with the classification of textual documents into a set of predefined categories. A TC benchmark is a set of pre-classified documents, and it serves two main purposes: first, a text classifier can be automatically built by learning the semantic characteristics of the categories from a subset of the benchmark known as the training set; second, the accuracy of the classifier can be tested by comparing its classification decisions with those encoded in the remaining documents, collectively known as the test set. Throughout the 1990s, the availability of Reuters-21578, a set of 12,902 newswires pre-classified according to 118 topical classes, contributed to stimulating research in TC, by providing a common testing ground for competing TC approaches. It is now time, however, for Reuters-21578 to go into early retirement, especially because its size is now considered too small to warrant significant conclusions from experiments (in other IR subfields, standard benchmarks easily reach up to tens of millions of documents). This paper reports on a newly available test collection, Reuters Corpus Volume I (RCV1), which promises to be the new standard benchmark of TC research, and a better substitute for Reuters-21578, because of sheer size (804,414 documents), Extensible Markup Language (XML) tagging, a smaller amount of noise, and more clearly specified semantics. 
The authors' report on the coding practices that human coders used in preparing the data, on the nature of the data, and on its statistics will be necessary background for anyone wishing to use RCV1 in TC experiments. Their experimental results, obtained by running several known TC techniques (ranging from example-based methods to support vector machines) on the data, provide baselines that new techniques will need to improve upon. This is an essential paper for those wishing to engage in text categorization research.
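
The training-set/test-set protocol that the review describes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the documents and labels are invented toy data (not taken from RCV1 or Reuters-21578), the function names are hypothetical, and the classifier is a plain 1-nearest-neighbor method over bag-of-words vectors, chosen as a simple stand-in for the "example-based methods" the review mentions. It is also single-label, whereas real newswire collections typically assign several categories per story.

```python
from collections import Counter
import math

# Invented toy documents with one hypothetical label each (not real corpus data).
TRAIN = [
    ("oil prices rise as crude supply tightens", "energy"),
    ("crude oil output cut lifts energy shares", "energy"),
    ("central bank raises interest rates again", "finance"),
    ("bank lending slows as interest rates climb", "finance"),
]
TEST = [
    ("crude oil supply falls", "energy"),
    ("interest rates hit bank profits", "finance"),
]


def vectorize(text):
    """Return bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """For an example-based method, training just stores the vectorized examples."""
    return [(vectorize(text), label) for text, label in examples]


def classify(model, text):
    """Assign the label of the most similar stored training example (1-NN)."""
    query = vectorize(text)
    _, label = max(model, key=lambda example: cosine(query, example[0]))
    return label


def evaluate(model, examples):
    """Fraction of held-out documents whose predicted label matches the coded one."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)


model = train(TRAIN)       # step 1: build the classifier from the training set
accuracy = evaluate(model, TEST)  # step 2: compare its decisions with the test set labels
print(f"test accuracy: {accuracy:.2f}")
```

The point of the sketch is the separation of the two steps: the classifier never sees the test documents while it is being built, so the score on them estimates how it would behave on new stories. Plain accuracy is used here only for brevity; it is not necessarily the measure reported in the paper.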

Published In

The Journal of Machine Learning Research  Volume 5, Issue
12/1/2004
1571 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org

Publication History

Published: 01 December 2004
Published in JMLR Volume 5
