DOI: 10.1145/956863.956938

Combining link-based and content-based methods for web document classification

Published: 03 November 2003

Abstract

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurately they predict document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone yields gains of up to 46 points in F1 over a traditional content-based classifier. Combining it with content-based methods can further improve the results, but may also introduce too much noise, since the text of Web pages is a much less reliable source of information. This work provides insight into which link-derived measures are most appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
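
The abstract does not name the four similarity measures or spell out the network structure, so the following is only an illustrative sketch of the general idea. It assumes co-citation (overlap of in-linking pages) as one possible link-based similarity, a similarity-weighted vote over already labelled documents, and a noisy-OR rule for merging link and content evidence. The function names (`cocitation`, `link_scores`, `noisy_or`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def cocitation(inlinks, a, b):
    """Assumed measure: shared in-linking pages over all pages linking to a or b."""
    pa, pb = inlinks.get(a, set()), inlinks.get(b, set())
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0

def link_scores(inlinks, labels, doc):
    """Sum similarity to labelled documents per class, scaled so the top class is 1."""
    totals = defaultdict(float)
    for other, cls in labels.items():
        if other != doc:
            totals[cls] += cocitation(inlinks, doc, other)
    top = max(totals.values(), default=0.0)
    return {c: (s / top if top else 0.0) for c, s in totals.items()}

def noisy_or(p_link, p_content):
    """Assumed combination rule: either source of evidence can support the class."""
    return 1.0 - (1.0 - p_link) * (1.0 - p_content)

# Toy graph: each document maps to the set of pages that link to it.
inlinks = {"d1": {"p1", "p2"}, "d2": {"p1", "p2", "p3"}, "d3": {"p4"}}
labels = {"d2": "sports", "d3": "news"}   # documents with known classes

link = link_scores(inlinks, labels, "d1")
content = {"sports": 0.4, "news": 0.3}    # hypothetical content-classifier beliefs
combined = {c: noisy_or(link.get(c, 0.0), content.get(c, 0.0)) for c in content}
print(max(combined, key=combined.get))
```

Under noisy-OR a confident link signal dominates a weak content signal, which is one way to picture why noisy page text can add little once link evidence is strong.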

      Published In

      CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
      November 2003
      592 pages
      ISBN:1581137230
      DOI:10.1145/956863

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Bayesian networks
      2. classification
      3. link analysis
      4. web

      Qualifiers

      • Article

      Conference

      CIKM03

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
