research-article

Stitching web tables for improving matching quality

Published: 01 August 2017

Abstract

HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables.
Although most web tables are known to be very small, the gold standards used to compare web table matching systems consist mostly of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards used previously. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and then match these enlarged tables to the knowledge base or base table. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Stitching also allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
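
The basic idea of stitching can be illustrated with a minimal sketch. It assumes the simplest possible case: tables from the same web site whose header rows are identical after normalisation are unioned into one larger table. This is not the paper's method, which evaluates schema matching methods with holistic correspondence refinement; the function name `stitch_tables`, the dictionary keys, and the sample data are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def stitch_tables(tables):
    """Union tables that share a host and a normalised header row.

    `tables` is an iterable of dicts with the hypothetical keys
    "url" (str), "header" (list of str) and "rows" (list of lists).
    """
    groups = defaultdict(list)
    for table in tables:
        # Restrict stitching to tables from the same web site (host).
        host = urlparse(table["url"]).netloc.lower()
        # Assumption: headers match if equal after trimming and lowercasing.
        header = tuple(h.strip().lower() for h in table["header"])
        groups[(host, header)].extend(table["rows"])
    return [
        {"site": host, "header": list(header), "rows": rows}
        for (host, header), rows in groups.items()
    ]


# Toy input: two small tables from one site, one table from another site.
tables = [
    {"url": "http://example.org/a", "header": ["Name", "Value"], "rows": [["A", "1"]]},
    {"url": "http://example.org/b", "header": ["name ", "value"], "rows": [["B", "2"]]},
    {"url": "http://other.example/c", "header": ["Name", "Value"], "rows": [["C", "3"]]},
]
stitched = stitch_tables(tables)
```

Here the two tables from `example.org` end up in one stitched table with two rows, while the table from `other.example` stays separate because stitching never crosses site boundaries.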



Published In

Proceedings of the VLDB Endowment, Volume 10, Issue 11
August 2017
432 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
