DOI: 10.1145/584792.584878
Article

Strategies for minimising errors in hierarchical web categorisation

Published: 04 November 2002

Abstract

On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and does not scale with the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierarchies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher-level categories when lower-level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
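
The idea of shifting an assignment to a higher-level category when the lower-level choice is uncertain can be illustrated with a minimal sketch. This is not the authors' algorithm, which this page does not describe. It assumes a top-down descent in which "uncertain" means the best child's score leads the runner-up by less than a margin. The names `assign`, `children`, `scores`, and `min_margin`, and the example categories and scores, are all hypothetical.

```python
def assign(children, score, root, min_margin=0.1):
    """Descend from `root`, stopping at the current (higher-level) category
    when the best child is not clearly ahead of the runner-up.

    children:   dict mapping a category to a list of its child categories
    score:      callable returning a document's score for a category
    min_margin: assumed uncertainty threshold (hypothetical parameter)
    """
    node = root
    while children.get(node):
        ranked = sorted(children[node], key=score, reverse=True)
        best = ranked[0]
        # With a single child there is no runner-up, so the margin is infinite.
        runner_up = score(ranked[1]) if len(ranked) > 1 else float("-inf")
        if score(best) - runner_up < min_margin:
            break  # uncertain at this level: keep the higher-level category
        node = best
    return node


# Hypothetical two-level hierarchy and per-category scores for one document.
children = {"Top": ["Science", "Arts"], "Science": ["Physics", "Biology"]}
scores = {"Science": 0.8, "Arts": 0.3, "Physics": 0.52, "Biology": 0.49}

print(assign(children, scores.get, "Top"))                   # Science
print(assign(children, scores.get, "Top", min_margin=0.01))  # Physics
```

In this sketch the document is clearly about Science, but Physics and Biology score almost equally, so with the default margin the assignment stays at Science rather than risking a wrong leaf. A smaller margin lets the descent continue to Physics.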

Published In

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN: 1581134924
DOI: 10.1145/584792
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. categorisation
  2. error reduction
  3. hierarchical categorisation

Qualifiers

  • Article

Conference

CIKM02

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2015) Modelling Progressive Filtering. Fundamenta Informaticae 138:3 (285-320). DOI: 10.5555/2756707.2756708. Online publication date: 1-Jul-2015
  • (2009) Web page classification. ACM Computing Surveys 41:2 (1-31). DOI: 10.1145/1459352.1459357. Online publication date: 23-Feb-2009
  • (2008) Topic taxonomy adaptation for group profiling. ACM Transactions on Knowledge Discovery from Data 1:4 (1-28). DOI: 10.1145/1324172.1324173. Online publication date: 1-Feb-2008
  • (2006) Knowing a web page by the company it keeps. Proceedings of the 15th ACM international conference on Information and knowledge management (228-237). DOI: 10.1145/1183614.1183650. Online publication date: 6-Nov-2006
  • (2006) GO for gene documents. Proceedings of the 1st international workshop on Text mining in bioinformatics (43-51). DOI: 10.1145/1183535.1183546. Online publication date: 10-Nov-2006
  • (2006) Acclimatizing Taxonomic Semantics for Hierarchical Content Classification. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (384-393). DOI: 10.1145/1150402.1150446. Online publication date: 20-Aug-2006
  • (2003) Index construction for linear categorisation. Proceedings of the twelfth international conference on Information and knowledge management (334-341). DOI: 10.1145/956863.956926. Online publication date: 3-Nov-2003
