Learning to integrate web taxonomies

Authors:

Wee Sun Lee

Web Semantics: Science, Services and Agents on the World Wide Web, Volume 2, Issue 2

Pages 131 - 151

https://doi.org/10.1016/j.websem.2004.10.001

Published: 01 December 2004

Abstract

We investigate machine learning methods for automatically integrating objects from different taxonomies into a master taxonomy. This problem is not only currently pervasive on the Web, but is also important to the emerging Semantic Web. A straightforward approach to automating this process would be to build classifiers through machine learning and then use these classifiers to classify objects from the source taxonomies into categories of the master taxonomy. However, conventional machine learning algorithms totally ignore the availability of the source taxonomies. In fact, source and master taxonomies often have common categories under different names or other more complex semantic overlaps. We introduce two techniques that exploit the semantic overlap between the source and master taxonomies to build better classifiers for the master taxonomy. The first technique, Cluster Shrinkage, biases the learning algorithm against splitting source categories by making objects in the same category appear more similar to each other. The second technique, Co-Bootstrapping, tries to facilitate the exploitation of inter-taxonomy relationships by providing category indicator functions as additional features for the objects. Our experiments with real-world Web data show that these proposed add-on techniques can enhance various machine learning algorithms to achieve substantial improvements in performance for taxonomy integration.
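
The two techniques named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the linear interpolation toward a category centroid, and the weight `lam` are hypothetical. The first function makes objects in the same source category look more alike, in the spirit of Cluster Shrinkage. The second shows only the feature-augmentation idea behind Co-Bootstrapping (category indicator functions as extra features), not the bootstrapping procedure or the learner itself.

```python
from collections import defaultdict


def cluster_shrinkage(vectors, source_labels, lam=0.5):
    """Pull each feature vector toward the centroid of its source category.

    vectors: list of equal-length lists of floats
    source_labels: one source-category name per vector
    lam: hypothetical shrinkage weight in [0, 1]; 0 leaves vectors unchanged
    """
    # Group vectors by the source category they came from.
    groups = defaultdict(list)
    for vec, label in zip(vectors, source_labels):
        groups[label].append(vec)

    # Centroid of each source category, computed column by column.
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }

    # Interpolate every vector toward its own category centroid.
    return [
        [lam * c + (1 - lam) * x for x, c in zip(vec, centroids[label])]
        for vec, label in zip(vectors, source_labels)
    ]


def add_category_indicators(vectors, source_labels, categories):
    """Append one 0/1 indicator feature per source category to each vector."""
    return [
        vec + [1.0 if label == cat else 0.0 for cat in categories]
        for vec, label in zip(vectors, source_labels)
    ]


if __name__ == "__main__":
    docs = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]]
    labels = ["a", "a", "b"]
    print(cluster_shrinkage(docs, labels, lam=0.5))
    print(add_category_indicators(docs, labels, ["a", "b"]))
```

Either transformed representation would then be passed to an ordinary classifier trained for the master taxonomy. The sketch keeps the transformation separate from the learner because the abstract describes both techniques as add-ons to various machine learning algorithms.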

Index Terms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches

Published In

Web Semantics: Science, Services and Agents on the World Wide Web Volume 2, Issue 2

December, 2004

76 pages

ISSN: 1570-8268

Copyright © 2004 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands
