DOI: 10.1145/2023607.2023637
research-article

Enhancing automatic term recognition algorithms with HTML tags processing

Published: 16 June 2011

Abstract

We focus on mining relevant information from web pages. Unlike plain text documents, web pages contain another source of potentially relevant information: easily processable mark-up. We propose an approach to keyword extraction that enhances Automatic Term Recognition (ATR) algorithms, which are intended for processing plain text documents, with an analysis of the HTML tags present in the document. We distinguish tags that have semantic potential. We present the results of an experiment conducted on a set of Wikipedia pages. The experiment shows that the enhancement yields better results than using ATR algorithms alone.
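
The abstract does not state how tag information is combined with an ATR score, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes plain term frequency as a stand-in for a real ATR algorithm, and the tag weights in `TAG_WEIGHTS`, the class `TagAwareCounter`, and the function `rank_terms` are all hypothetical names and values chosen for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights for tags assumed to have semantic potential.
# The paper's actual tag set and weights are not given in the abstract.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0,
               "strong": 1.5, "b": 1.5, "em": 1.2, "a": 1.2}


class TagAwareCounter(HTMLParser):
    """Counts words and remembers the strongest tag each word appeared in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []          # stack of currently open tags
        self.frequency = Counter()   # stand-in for a plain-text ATR score
        self.tag_boost = Counter()   # best tag weight seen per word

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, dropping unclosed children.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Text inherits the highest weight among its enclosing tags.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags),
                     default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.frequency[word] += 1
            self.tag_boost[word] = max(self.tag_boost[word], weight)


def rank_terms(html, top_n=5):
    """Return the top_n (word, score) pairs, score = frequency * tag boost."""
    parser = TagAwareCounter()
    parser.feed(html)
    parser.close()
    scores = {w: parser.frequency[w] * parser.tag_boost[w]
              for w in parser.frequency}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


sample = ("<title>Term recognition</title><h1>Term extraction</h1>"
          "<p>We extract a term from text. Plain text.</p>")
print(rank_terms(sample, top_n=3))
```

In this sketch a word that appears in a title or heading outranks an equally frequent word that appears only in body text, which is one simple way mark-up could complement a plain-text score. A realistic version would use a proper ATR algorithm for the base score and handle multi-word terms.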


Cited By

  • (2012) Named entity disambiguation based on explicit semantics. Proceedings of the 38th international conference on Current Trends in Theory and Practice of Computer Science, pp. 456-466. DOI: 10.1007/978-3-642-27660-6_37. Online publication date: 21-Jan-2012

Published In

CompSysTech '11: Proceedings of the 12th International Conference on Computer Systems and Technologies
June 2011
688 pages
ISBN:9781450309172
DOI:10.1145/2023607
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • TELECVB: TELECOMS - Varna, Bulgaria
  • Austrian Comp Soc: Austrian Computer Society
  • BPCSB: BULGARIAN PUBLISHING COMPANY - Sofia, Bulgaria
  • IOMAIBB: INSTITUTE OF MATHEMATICS AND INFORMATICS - BAS, Bulgaria
  • NBUBB: New Bulgarian University - BAS, Bulgaria
  • Technical University of Sofia
  • IOIACTBB: INSTITUTE OF INFORMATION AND COMMUNICATION TECHNOLOGIES - BAS, Bulgaria
  • TSFPS: THE SEVENTH FRAMEWORK PROGRAMME - SISTER
  • ERSVB: EURORISC SYSTEMS - Varna, Bulgaria
  • FOSEUB: FEDERATION OF THE SCIENTIFIC ENGINEERING UNIONS - Bulgaria
  • UORB: University of Ruse, Bulgaria
  • BBPSB: BULGARIAN BUSINESS PUBLICATIONS - Sofia, Bulgaria
  • CASTUVTB: CYRIL AND ST. METHODIUS UNIVERSITY of Veliko Tarnovo, Bulgaria
  • TECHUVB: Technical University of Varna, Bulgaria
  • LLLPET: LIFELONG LEARNING PROGRAMME - ETN TRICE
  • IEEEBSB: IEEE Bulgaria Section, Bulgaria

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ATR
  2. HTML tag
  3. keyword extraction
  4. lightweight semantics
  5. term

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 241 of 492 submissions, 49%
