DOI: 10.1145/1772690.1772805
poster

Exploiting information redundancy to wring out structured data from the web

Published: 26 April 2010

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but current applications use these data only partially. Although such information is spread across a myriad of sources, the scale of the web implies significant redundancy. We present a domain-independent system that exploits this redundancy to automatically extract and integrate data from the web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data or product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and data integration tasks. Experiments confirmed the quality and feasibility of the approach.
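
As a rough illustration of the general idea of using redundancy across sources, and not the system presented in this poster, the sketch below pairs attributes from two sources when their extracted values agree on the instances both sources cover. The function names, the threshold, and the toy stock data are all hypothetical assumptions introduced for this example.

```python
from itertools import product


def agreement(col_a, col_b):
    """Fraction of shared instances on which two attribute columns agree."""
    shared = col_a.keys() & col_b.keys()
    if not shared:
        return 0.0
    return sum(col_a[k] == col_b[k] for k in shared) / len(shared)


def match_attributes(source_a, source_b, threshold=0.5):
    """Pair attributes of two sources whose values agree on shared instances.

    Each source maps an attribute label to {instance id: extracted value}.
    The threshold is an arbitrary assumption, not a value from the poster.
    """
    matches = []
    for (name_a, col_a), (name_b, col_b) in product(
        source_a.items(), source_b.items()
    ):
        score = agreement(col_a, col_b)
        if score >= threshold:
            matches.append((name_a, name_b, score))
    # Highest agreement first.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical toy data: two sites publishing overlapping stock quotes
# under different attribute labels.
site_a = {
    "Last": {"AAPL": "101.5", "IBM": "88.0", "MSFT": "42.1"},
    "Volume": {"AAPL": "9000", "IBM": "1200", "MSFT": "7700"},
}
site_b = {
    "Price": {"AAPL": "101.5", "IBM": "88.0", "MSFT": "42.3"},
    "Vol.": {"AAPL": "9000", "IBM": "1200"},
}

matches = match_attributes(site_a, site_b)
for name_a, name_b, score in matches:
    print(f"{name_a} <-> {name_b}: {score:.2f}")
```

In this toy setting, labels such as "Last" and "Price" are paired without any knowledge of the domain, purely because their values overlap on shared instances. A real system would also need to handle noisy extraction and value normalization, which this sketch ignores.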

      Published In

      WWW '10: Proceedings of the 19th international conference on World wide web
      April 2010
      1407 pages
      ISBN:9781605587998
      DOI:10.1145/1772690

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data extraction
      2. data integration
      3. wrapper generation

      Qualifiers

      • Poster

      Conference

      WWW '10
      WWW '10: The 19th International World Wide Web Conference
      April 26 - 30, 2010
      Raleigh, North Carolina, USA

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Cited By

      • (2012) Web data reconciliation. Search Computing, pp. 1-15. DOI: 10.5555/2427336.2427338. Online publication date: 1-Jan-2012
      • (2012) Forum Data Extraction without Explicit Rules. Proceedings of the 2012 Second International Conference on Cloud and Green Computing, pp. 460-465. DOI: 10.1109/CGC.2012.72. Online publication date: 1-Nov-2012
      • (2012) Reasoning and Ontologies in Data Extraction. Reasoning Web. Semantic Technologies for Advanced Query Answering, pp. 184-210. DOI: 10.1007/978-3-642-33158-9_5. Online publication date: 2012
      • (2012) DIADEM: Domains to Databases. Database and Expert Systems Applications, pp. 1-8. DOI: 10.1007/978-3-642-32600-4_1. Online publication date: 2012
      • (2012) Flint: From Web Pages to Probabilistic Semantic Data. Semantic Search over the Web, pp. 333-359. DOI: 10.1007/978-3-642-25008-8_13. Online publication date: 28-Jan-2012
      • (2011) Characterizing the uncertainty of web data. Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, pp. 1-8. DOI: 10.1145/1964114.1964116. Online publication date: 28-Mar-2011
