Matching Tabular Data to Knowledge Graph with Effective Core Column Set Discovery

Published: 08 October 2024

Abstract

Matching tabular data to a knowledge graph (KG) is critical for understanding the semantic column types, column relationships, and entities of a table. Existing matching approaches rely heavily on core columns, which represent the primary subject entities on which the other columns in the table depend. However, discovering these core columns before understanding the table’s semantics is challenging. Most prior work uses heuristic rules, such as choosing the leftmost column, to discover a single core column, and often overlooks the discovery of a core column set that accurately captures the dependencies between columns. To address these challenges, we introduce Dependency-aware Core Column Set Discovery (DaCo), an iterative method that uses a novel rough matching strategy to identify both inter-column dependencies and the core column set. DaCo can also be integrated with pre-trained language models, as proposed in the optimization module. Unlike other methods, DaCo requires neither labeled data nor contextual information, making it suitable for real-world scenarios. It can also identify multiple core columns within a table, a common situation in real-world tables. We conduct experiments on six datasets: five with single core columns and one with multiple core columns. Our experimental results show that DaCo outperforms existing core column set detection methods, further improving the effectiveness of table understanding tasks.
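
To make the "leftmost column" heuristic mentioned in the abstract concrete, the following is a minimal Python sketch of one common form of that baseline: pick the leftmost column whose cells are mostly non-numeric and mostly distinct. It is not DaCo and is not taken from the article. The function names, the two thresholds, and the sample tables are all assumptions chosen for illustration.

```python
from typing import List, Optional


def is_numeric(cell: str) -> bool:
    """Return True if the cell parses as a number (thousands commas allowed)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def leftmost_core_column(
    rows: List[List[str]],
    min_unique_ratio: float = 0.8,   # assumed threshold, not from the article
    max_numeric_ratio: float = 0.5,  # assumed threshold, not from the article
) -> Optional[int]:
    """Heuristic baseline: index of the leftmost column that looks like an
    entity column (mostly non-numeric, mostly distinct), or None if no
    column qualifies. It always returns at most ONE column."""
    if not rows:
        return None
    for col in range(len(rows[0])):
        # Ignore empty cells when profiling the column.
        cells = [row[col].strip() for row in rows if row[col].strip()]
        if not cells:
            continue
        numeric_ratio = sum(is_numeric(c) for c in cells) / len(cells)
        unique_ratio = len(set(cells)) / len(cells)
        if numeric_ratio <= max_numeric_ratio and unique_ratio >= min_unique_ratio:
            return col
    return None


# Hypothetical table with a single core column (the city name, index 1).
cities = [
    ["1", "Paris", "France", "2102650"],
    ["2", "Berlin", "Germany", "3769495"],
    ["3", "Madrid", "Spain", "3223334"],
]

# Hypothetical table whose subject is identified by two columns together
# (title and year); neither column alone passes the heuristic.
films = [
    ["Dune", "1984", "David Lynch"],
    ["Dune", "2021", "Denis Villeneuve"],
    ["Solaris", "1972", "Andrei Tarkovsky"],
    ["Solaris", "2002", "Steven Soderbergh"],
]

print(leftmost_core_column(cities))  # index of the city-name column
print(leftmost_core_column(films))   # falls through to the director column
```

On the first table the heuristic skips the numeric row-index column and selects the city names. On the second table the title column repeats and the year column is numeric, so the heuristic falls through to the director column, even though the rows are arguably identified by title and year together. This is the kind of multi-column case that motivates discovering a core column set rather than a single column.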

Published In

ACM Transactions on the Web, Volume 18, Issue 4
November 2024
257 pages
EISSN: 1559-114X
DOI: 10.1145/3613734
Editor: Ryen White

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2024
Online AM: 06 September 2024
Accepted: 19 August 2024
Revised: 04 March 2024
Received: 04 March 2024
Published in TWEB Volume 18, Issue 4

Author Tags

  1. Table understanding
  2. core column set
  3. semantic dependency

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program
  • National Natural Science Foundation of China

Article Metrics

  • Total Citations: 0
  • Total Downloads: 220
  • Downloads (Last 12 months): 220
  • Downloads (Last 6 weeks): 81

Reflects downloads up to 18 Nov 2024
