Summary
This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve classification of text documents into predefined categories. We evaluate eight similarity measures, five derived from the citation structure of the collection and three derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments using documents from the ACM digital library and the ACM classification scheme show that we can discover similarity functions that work better than any single type of evidence in isolation, and whose combined performance through simple majority voting is comparable to that of Support Vector Machine classifiers.
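The majority-voting step mentioned in the summary can be illustrated with a minimal sketch. The summary does not say how many classifiers vote or how ties are resolved, so the function name `majority_vote`, the first-seen tie-breaking rule, and the category labels below are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the label predicted most often.

    Assumption: ties go to the tied label that appears first in the input.
    The chapter summary does not specify a tie-breaking rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    # Walk the predictions in their original order so ties resolve deterministically.
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical category labels predicted for one document by three classifiers,
# each built on a different similarity function.
votes = ["H.3", "I.2", "H.3"]
print(majority_vote(votes))
```

Each entry in `votes` stands for the category one classifier assigns to the same document; the document receives whichever category most classifiers agree on.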
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Zhang, B. et al. (2006). A Genetic Programming Approach for Combining Structural and Citation-Based Evidence for Text Classification in Web Digital Libraries. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_4
DOI: https://doi.org/10.1007/3-540-31590-X_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31588-9
Online ISBN: 978-3-540-31590-2
eBook Packages: Engineering, Engineering (R0)