DOI: 10.1145/1008992.1009040
Article

Web-a-where: geotagging web content

Published: 25 July 2004

Abstract

We describe Web-a-Where, a system for associating geography with Web pages. Web-a-Where locates mentions of places and determines the place each name refers to. In addition, it assigns to each page a geographic focus: a locality that the page discusses as a whole. The tagging process is simple and fast, aimed to be applied to large collections of Web pages and to facilitate a variety of location-based applications and data analyses.

Geotagging involves arbitrating two types of ambiguities: geo/non-geo and geo/geo. A geo/non-geo ambiguity occurs when a place name also has a non-geographic meaning, such as a person name (e.g., Berlin) or a common word (Turkey). Geo/geo ambiguity arises when distinct places have the same name, as in London, England vs. London, Ontario.

An implementation of the tagger within the framework of the WebFountain data mining system is described, and evaluated on several corpora of real Web pages. Precision of up to 82% on individual geotags is achieved. We also evaluate the relative contribution of various heuristics the tagger employs, and evaluate the focus-finding algorithm using a corpus pretagged with localities, showing that as many as 91% of the foci reported are correct up to the country level.
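
The two ambiguity types and the notion of a page focus can be illustrated with a minimal sketch. This is not the paper's algorithm: the toy gazetteer, the `COMMON_WORDS` list, the capitalization test, the population fallback, and the majority-vote focus are all assumptions chosen only to make the ideas concrete.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    region: str       # enclosing region, used as a disambiguating qualifier
    population: int   # rough figure, used only as a fallback prior


# Hypothetical toy gazetteer; entries and populations are illustrative only.
GAZETTEER = {
    "london": [Place("London", "England", 8_000_000),
               Place("London", "Ontario", 400_000)],
    "ontario": [Place("Ontario", "Canada", 14_000_000)],
    "turkey": [Place("Turkey", "Eurasia", 80_000_000)],
}

# Place names that double as ordinary words (an assumed list).
COMMON_WORDS = {"turkey", "mobile"}


def geotag(text):
    """Return one resolved Place per place-name mention in the text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    seen = {t.lower() for t in tokens}
    tags = []
    for tok in tokens:
        key = tok.lower()
        candidates = GAZETTEER.get(key, [])
        if not candidates:
            continue
        # geo/non-geo: a lowercase common word is treated as non-geographic.
        if key in COMMON_WORDS and not tok[0].isupper():
            continue
        # geo/geo: prefer a candidate whose region is also named in the text,
        # otherwise fall back to the most populous candidate.
        qualified = [p for p in candidates if p.region.lower() in seen]
        tags.append(max(qualified or candidates, key=lambda p: p.population))
    return tags


def page_focus(tags):
    """Pick the most frequently tagged region as the page focus (toy rule)."""
    if not tags:
        return None
    return Counter(p.region for p in tags).most_common(1)[0][0]
```

With this sketch, "We flew from London to Toronto, Ontario." resolves London to the Ontario entry because the qualifier appears nearby, while "I ate a turkey sandwich in London." skips the lowercase "turkey" and falls back to the more populous London, England.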

      Reviews

      Wei Tang

      Location-assisted search has been gaining momentum recently. For example, Google has introduced a new service called "Search by Location." (Other search engines offer similar services, for example, Gigablast.com and local-news.net.) However, there remain unanswered issues in this research area, for example, how to increase the precision of name resolving (and provide automatic measurement), find the focus (or foci) of a Web page, and bring the search scope to a broader geographical region worldwide. This paper describes Web-a-Where, a system for associating geography with Web pages. The process includes two steps: geotagging, and focus-finding for each page. The algorithms are implemented in the framework of the WebFountain data mining system. There is a performance evaluation for the geotagging and focus-finding algorithms, using several corpora of real Web pages derived from three categories: arbitrary, pages in the .gov domain, and pages from the "regional" sub-category in the Open Directory Project (ODP). The result shows that geotagging achieves a precision of up to 82 percent, and the focus-finding algorithm correctly finds as many as 91 percent of the foci reported, up to the country level. The authors also note that the main source of errors for geotagging comes from geo/nongeo ambiguity, a case in which a place name has another nongeo meaning, for example, Mobile (Alabama). The work described in the paper is novel, in that the algorithms are not covered in prior research, and they demonstrate promising results. The evaluation platform is also unique, and is effective in measuring the precision of the algorithms. The paper is well structured, with concepts and algorithms clearly explained. The presentation is also clean, with a good balance of tables and figures to describe the performance evaluation results. As a proof-of-concept system, Web-a-Where does a reasonable job in identifying place names in Web pages. 
However, the precision level is not yet satisfactory (especially in the ODP category). More experiments should be done on the effect of the confidence assignments in the disambiguation algorithm. The authors did not explain why they chose the current assignments in the evaluation. More heuristic algorithms may need to be developed (for example, correlation between place names and other terms). It would also be interesting to see the runtime performance of the system when applied to a much larger corpus of Web pages. Overall, this is a solid research paper, with good technical depth, and interesting demonstrated results.

Online Computing Reviews Service

      Published In

      SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
      July 2004
      624 pages
      ISBN:1581138814
      DOI:10.1145/1008992
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. disambiguation
      2. gazetteer
      3. geographic tagging
      4. information retrieval
      5. natural language processing
      6. text mining

      Qualifiers

      • Article

      Conference

      SIGIR04
      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 69
      • Downloads (Last 6 weeks): 6
      Reflects downloads up to 18 Nov 2024

      Cited By

      • (2024) Search Engine for Open Geospatial Consortium Web Services Improving Discoverability through Natural Language Processing-Based Processing and Ranking. ISPRS International Journal of Geo-Information 13:4 (128). DOI: 10.3390/ijgi13040128. Online publication date: 12-Apr-2024
      • (2024) Blending Social Interaction Realms: Harmonizing Online and Offline Interactions through Augmented Reality. Proceedings of the 17th International Symposium on Visual Information Communication and Interaction (1-8). DOI: 10.1145/3678698.3678700. Online publication date: 11-Dec-2024
      • (2024) Utilizing External Knowledge to Enhance Location Prediction for Twitter/X Users in Low Resource Settings. ACM Transactions on Spatial Algorithms and Systems. DOI: 10.1145/3673899. Online publication date: 19-Jun-2024
      • (2024) Geographical and linguistic perspectives on developing geoparsers with generic resources. International Journal of Geographical Information Science 38:10 (2039-2060). DOI: 10.1080/13658816.2024.2369539. Online publication date: 30-Jun-2024
      • (2024) CHTopoNER model-based method for recognizing Chinese place names from social media information. Journal of Geographical Systems 26:1 (149-179). DOI: 10.1007/s10109-023-00433-w. Online publication date: 11-Jan-2024
      • (2023) A Spatially-Aware Data-Driven Approach to Automatically Geocoding Non-Gazetteer Place Names. ACM Transactions on Spatial Algorithms and Systems 10:1 (1-34). DOI: 10.1145/3627987. Online publication date: 11-Dec-2023
      • (2023) Location Reference Recognition from Texts: A Survey and Comparison. ACM Computing Surveys 56:5 (1-37). DOI: 10.1145/3625819. Online publication date: 27-Nov-2023
      • (2023) Formal Verification of Quantum Programs: Theory, Tools and Challenges. ACM Transactions on Quantum Computing. DOI: 10.1145/3624483. Online publication date: 16-Oct-2023
      • (2023) Towards Generating Realistic Geosocial Networks. Proceedings of the 7th ACM SIGSPATIAL Workshop on Location-based Recommendations, Geosocial Networks and Geoadvertising (25-28). DOI: 10.1145/3615896.3628340. Online publication date: 13-Nov-2023
      • (2023) Geographic Information Retrieval Using Wikipedia Articles. Proceedings of the ACM Web Conference 2023 (3331-3341). DOI: 10.1145/3543507.3583469. Online publication date: 30-Apr-2023