Learning to integrate web taxonomies

Published: 01 December 2004

Abstract

We investigate machine learning methods for automatically integrating objects from different taxonomies into a master taxonomy. This problem is already pervasive on the Web and is also important to the emerging Semantic Web. A straightforward way to automate the process is to train classifiers with machine learning and use them to assign objects from the source taxonomies to categories of the master taxonomy. However, conventional machine learning algorithms ignore the information in the source taxonomies, even though source and master taxonomies often share categories under different names or overlap semantically in more complex ways. We introduce two techniques that exploit this semantic overlap to build better classifiers for the master taxonomy. The first, Cluster Shrinkage, biases the learning algorithm against splitting source categories by making objects in the same category appear more similar to each other. The second, Co-Bootstrapping, aims to facilitate the exploitation of inter-taxonomy relationships by providing category indicator functions as additional features for the objects. Our experiments with real-world Web data show that these add-on techniques can enhance various machine learning algorithms and yield substantial improvements in taxonomy integration performance.
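The two ideas named in the abstract can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes objects are dense feature vectors, that "appearing more similar" means moving each vector part of the way toward the centroid of its source category, and that the indicator functions are 0/1 features, one per source category. The function names and the `lam` parameter are hypothetical, and the sketch covers only the feature construction, not any classifier or iterative procedure the paper may use.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = List[float]


def cluster_shrinkage(
    vectors: Sequence[Vector],
    source_labels: Sequence[str],
    lam: float = 0.5,
) -> List[Vector]:
    """Pull each vector toward the centroid of its source category.

    lam is a hypothetical strength parameter: 0 leaves the vectors
    unchanged, 1 collapses every object onto its category centroid.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    # Group object indices by source category.
    members: Dict[str, List[int]] = defaultdict(list)
    for i, label in enumerate(source_labels):
        members[label].append(i)
    # Compute one centroid per source category.
    centroids: Dict[str, Vector] = {}
    for label, idxs in members.items():
        dim = len(vectors[idxs[0]])
        centroids[label] = [
            sum(vectors[i][d] for i in idxs) / len(idxs) for d in range(dim)
        ]
    # Convex combination of each object and its category centroid.
    return [
        [(1.0 - lam) * x + lam * c for x, c in zip(vec, centroids[label])]
        for vec, label in zip(vectors, source_labels)
    ]


def add_category_indicators(
    vectors: Sequence[Vector],
    source_labels: Sequence[str],
    categories: Sequence[str],
) -> List[Vector]:
    """Append one 0/1 indicator per source category to each vector."""
    return [
        list(vec) + [1.0 if label == cat else 0.0 for cat in categories]
        for vec, label in zip(vectors, source_labels)
    ]


if __name__ == "__main__":
    # Toy data (made up for illustration): two source categories.
    docs = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
    labels = ["Movies", "Movies", "Books"]
    print(cluster_shrinkage(docs, labels, lam=0.5))
    print(add_category_indicators(docs, labels, ["Books", "Movies"]))
```

Either transformed representation could then be passed to an ordinary classifier trained on the master taxonomy. With shrinkage, objects from the same source category sit closer together, so a learner is less likely to split them across master categories; with indicators, the learner can pick up a direct association between a source category and a master category.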



Published In

Web Semantics: Science, Services and Agents on the World Wide Web, Volume 2, Issue 2
December, 2004
76 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Classification
  2. Machine learning
  3. Ontology mapping
  4. Semantic Web
  5. Taxonomy integration

Qualifiers

  • Article

Cited By

  • (2012) PROBABILISTIC HEURISTICS FOR HIERARCHICAL WEB DATA CLUSTERING. Computational Intelligence 28:2 (209-233). DOI: 10.1111/j.1467-8640.2012.00414.x. Online publication date: 1-May-2012
  • (2011) Automatic maintenance of web directories by mining web browsing data. Journal of Web Engineering 10:2 (153-173). DOI: 10.5555/2011114.2011117. Online publication date: 1-Jun-2011
  • (2006) An experimental comparative study of web mining methods for recommender systems. Proceedings of the 6th WSEAS International Conference on Distance Learning and Web Engineering (56-61). DOI: 10.5555/1369827.1369838. Online publication date: 22-Sep-2006
  • (2005) Classifying search engine queries using the web as background knowledge. ACM SIGKDD Explorations Newsletter 7:2 (117-122). DOI: 10.1145/1117454.1117469. Online publication date: 1-Dec-2005
