Schema Integration on Massive Data Sources

Tianbao Li¹⁰,
Haifeng Guo¹⁰,
Donghua Yang¹⁰,
Mengmeng Li¹⁰,
Bo Zheng¹¹ &
…
Hongzhi Wang¹⁰

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14488)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

As the fundamental phase of collecting and analyzing data, data integration is used in many applications, such as data cleaning, bioinformatics, and pattern recognition. In the big data era, one of the major problems of data integration is obtaining the global schema of the data sources, since the global schema can hardly be derived from massive data sources directly. In this paper, we attempt to solve this schema integration problem. For different scenarios, we develop batch and incremental schema integration algorithms. We consider the differences in how attribute names are represented across data sources and propose the ED Join and Semantic Join algorithms to integrate attributes with different representations. Extensive experimental results demonstrate that the proposed algorithms integrate schemas efficiently and effectively.
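
To make the idea of integrating attributes with different representations concrete, the following is a minimal sketch under stated assumptions. It is not the paper's ED Join or Semantic Join algorithm, whose details are not given in this abstract; it only illustrates grouping attribute names whose normalized spellings lie within a small edit distance, in a batch mode and an incremental mode. The function names (`normalize`, `integrate_incremental`, `integrate_batch`), the `max_dist` threshold, and the sample schemas are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(name: str) -> str:
    """Hypothetical normalization: lowercase and drop common separators."""
    return name.lower().replace("_", "").replace("-", "").replace(" ", "")


def integrate_incremental(groups, schema, max_dist=1):
    """Fold one new schema (a list of attribute names) into existing groups.

    Each group is a list of attribute names treated as one global attribute;
    the first name in the group acts as its representative.
    """
    for attr in schema:
        key = normalize(attr)
        for group in groups:
            if edit_distance(key, normalize(group[0])) <= max_dist:
                if attr not in group:
                    group.append(attr)
                break
        else:
            # No existing group is close enough: start a new global attribute.
            groups.append([attr])
    return groups


def integrate_batch(schemas, max_dist=1):
    """Build the global schema from scratch by folding in every schema."""
    groups = []
    for schema in schemas:
        integrate_incremental(groups, schema, max_dist)
    return groups


if __name__ == "__main__":
    sources = [
        ["user_name", "email"],
        ["username", "e-mail", "phone"],
    ]
    global_schema = integrate_batch(sources)
    # A source arriving later is merged without reprocessing the earlier ones.
    integrate_incremental(global_schema, ["UserNames", "phone_no"])
    for group in global_schema:
        print(group)
```

This sketch compares each attribute against every group representative, which is quadratic and would not scale to massive sources; filtering techniques from the similarity-join literature would be needed for that, and purely semantic matches (for example, synonyms) are out of its reach.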

Supported by The National Key Research and Development Program of China (2020YFB1006104).

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, China
Tianbao Li, Haifeng Guo, Donghua Yang, Mengmeng Li & Hongzhi Wang
ConDB, Beijing, China
Bo Zheng

Corresponding author

Correspondence to Donghua Yang.

Editor information

Editors and Affiliations

Royal Melbourne Institute of Technology, Melbourne, VIC, Australia
Zahir Tari
Tianjin University, Tianjin, China
Keqiu Li
University of Arizona, Tucson, AZ, USA
Hongyi Wu

About this paper

Cite this paper

Li, T., Guo, H., Yang, D., Li, M., Zheng, B., Wang, H. (2024). Schema Integration on Massive Data Sources. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14488. Springer, Singapore. https://doi.org/10.1007/978-981-97-0801-7_11

DOI: https://doi.org/10.1007/978-981-97-0801-7_11
Published: 01 March 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0800-0
Online ISBN: 978-981-97-0801-7
eBook Packages: Computer Science, Computer Science (R0)
