DOI: 10.1007/978-3-031-68323-7_5
Article

Embedding-Based Data Matching for Disparate Data Sources

Published: 26 August 2024

Abstract

Dealing with heterogeneous sources is an important challenge in knowledge discovery and management. Schema matching methods address this problem using one of three approaches: schema-based, instance-based, or a combination of the two. This paper focuses on mapping between a data source for which only the schema is available and a data source for which both schema and instances are available. Given the lack of suitable methods for aligning these two types of sources, we propose an approach that uses embedding models to represent the sources as vectors and compute similarities between their data. Our solution combines domain-specific and cross-domain embedding models to make data matching between such sources possible and efficient. We conducted several experiments on the Valentine datasets to evaluate our data matching method on a range of disparate tabular data. The results indicate that the method is effective in terms of stability and ablation handling.
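
To make the general idea concrete, the following is a minimal sketch of embedding-based matching between a schema-only source and a source with both schema and instances. It is not the method evaluated in the paper: the two "embedders" (`char_ngrams` and `word_tokens`) are toy stand-ins assumed here in place of real domain-specific and cross-domain embedding models, and the weighted combination controlled by `alpha`, the `serialize` helper, and the sample columns are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Toy stand-in for one embedding model: a bag of character n-grams."""
    t = f" {text.lower().replace('_', ' ')} "
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))


def word_tokens(text):
    """Toy stand-in for a second embedding model: a bag of word tokens."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def serialize(name, values=()):
    """Turn a column into text: its name followed by any sample values."""
    return " ".join([name, *map(str, values)])


def match(schema_only, schema_and_instances, alpha=0.5):
    """Map each schema-only attribute to its most similar column.

    The score is a weighted mix of the two toy similarities; alpha is an
    assumed weighting, not a value taken from the paper.
    """
    result = {}
    for attr in schema_only:
        scores = {}
        for col, values in schema_and_instances.items():
            text = serialize(col, values)
            s1 = cosine(char_ngrams(attr), char_ngrams(text))
            s2 = cosine(word_tokens(attr), word_tokens(text))
            scores[col] = alpha * s1 + (1 - alpha) * s2
        result[attr] = max(scores, key=scores.get)
    return result


# Hypothetical sources: one has attribute names only, the other has
# attribute names together with sample instance values.
source_a = ["customer_name", "zip_code"]
source_b = {
    "client_name": ["Alice Martin", "Bob Chen"],
    "postal_code": ["75001", "10115"],
}
mapping = match(source_a, source_b)
print(mapping)
```

In practice, the two toy functions would be replaced by calls to actual embedding models that return dense vectors; the surrounding logic (serialize each column, embed, compare by cosine similarity, combine the scores) would stay the same under these assumptions.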

Information & Contributors

Information

Published In

Big Data Analytics and Knowledge Discovery: 26th International Conference, DaWaK 2024, Naples, Italy, August 26–28, 2024, Proceedings
Aug 2024
408 pages
ISBN:978-3-031-68322-0
DOI:10.1007/978-3-031-68323-7
Editors:
  • Robert Wrembel
  • Silvia Chiusano
  • Gabriele Kotsis
  • A Min Tjoa
  • Ismail Khalil

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Schema Matching
  2. Disparate Data Source
  3. Embeddings

Qualifiers

  • Article
