research-article

Data integration for the relational web

Authors: Michael J. Cafarella, Nodira Khoussainova

Proceedings of the VLDB Endowment, Volume 2, Issue 1

Pages 1090 - 1101

https://doi.org/10.14778/1687627.1687750

Published: 01 August 2009

Abstract

The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately.

This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself, and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
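To make the operator descriptions above concrete, here is a minimal sketch of how Search, Context and Extend might compose over a toy in-memory corpus. It is not the paper's implementation: the `Source` class, the function signatures, the keyword-count ranking, the year regex and the exact-match join are all assumptions chosen for illustration, the sample rows are invented placeholders, and the similarity clustering that Search performs is omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for one extracted web table and its enclosing page."""
    page_text: str
    rows: list  # list of dicts, one dict per table row


def search(corpus, query):
    """Rank sources by keyword occurrences in page text and cells (assumed scoring)."""
    terms = query.lower().split()

    def score(src):
        cells = " ".join(str(v) for row in src.rows for v in row.values())
        text = (src.page_text + " " + cells).lower()
        return sum(text.count(t) for t in terms)

    ranked = sorted(corpus, key=score, reverse=True)
    return [s for s in ranked if score(s) > 0]


def context(src, attr="year", pattern=r"\b(?:19|20)\d{2}\b"):
    """Add a column whose value is inferred from the enclosing page text.

    Assumption: the missing attribute is a four-digit year; the first match wins.
    """
    m = re.search(pattern, src.page_text)
    if m is None:
        return [dict(r) for r in src.rows]
    return [{**r, attr: m.group(0)} for r in src.rows]


def extend(rows, key, other_rows, new_attr):
    """Left-join new_attr from other_rows onto rows, matching on a normalized key."""
    def norm(v):
        return str(v).strip().lower()

    lookup = {norm(r[key]): r[new_attr]
              for r in other_rows if key in r and new_attr in r}
    return [{**r, new_attr: lookup.get(norm(r[key]))} for r in rows]


# Invented placeholder data, for illustration only.
corpus = [
    Source("Example Conference 2005 program committee",
           [{"name": "Person One"}, {"name": "Person Two"}]),
    Source("Recipes for soup", [{"dish": "Minestrone"}]),
]
affiliations = [{"name": "person one ", "affiliation": "Example University"}]

hits = search(corpus, "program committee")
table = context(hits[0])
table = extend(table, "name", affiliations, "affiliation")
```

Rows with no join partner keep a `None` value for the new attribute rather than being dropped, which leaves room for the user to correct a best-effort result, in the spirit of the feedback loop the abstract describes.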

Index Terms

Data integration for the relational web
1. Information systems

Published In

Proceedings of the VLDB Endowment Volume 2, Issue 1

August 2009

1293 pages

ISSN:2150-8097

Publisher

VLDB Endowment
