Abstract
As the fundamental phase of collecting and analyzing data, data integration is used in many applications, such as data cleaning, bioinformatics, and pattern recognition. In the big data era, one of the major problems of data integration is obtaining the global schema of the data sources, since the global schema can hardly be derived directly from massive data sources. In this paper, we attempt to solve this schema integration problem. For different scenarios, we develop batch and incremental schema integration algorithms. We consider the differences in how attribute names are represented across data sources and propose the ED Join and Semantic Join algorithms to integrate attributes with different representations. Extensive experimental results demonstrate that the proposed algorithms integrate schemas efficiently and effectively.
Supported by The National Key Research and Development Program of China (2020YFB1006104).
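The idea of integrating attribute names that differ only in their spelling can be illustrated with a minimal sketch. It is not the ED Join algorithm of the paper, whose details the abstract does not give; it is a naive pairwise comparison under assumptions made here for illustration: the function names `edit_distance` and `merge_attributes`, the threshold `tau`, and the sample schemas are all hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def merge_attributes(schemas, tau=1):
    """Group attribute names whose edit distance is at most tau.

    tau is a hypothetical threshold; names are lower-cased first.
    Groups are the connected components of the "within tau" relation.
    """
    names = sorted({name.lower() for schema in schemas for name in schema})
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        # A length gap larger than tau already implies distance > tau.
        if abs(len(a) - len(b)) <= tau and edit_distance(a, b) <= tau:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Two made-up source schemas with spelling variants of the same attributes.
sources = [["name", "phone", "address"], ["Name", "phon", "adress", "email"]]
global_schema = merge_attributes(sources, tau=1)
print(global_schema)
```

Because groups are formed transitively, two names further apart than `tau` can still end up together through a chain of close intermediates. The sketch compares every pair of names, so its cost grows quadratically with the number of distinct attribute names, and it does not address names that differ semantically rather than in spelling.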
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, T., Guo, H., Yang, D., Li, M., Zheng, B., Wang, H. (2024). Schema Integration on Massive Data Sources. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14488. Springer, Singapore. https://doi.org/10.1007/978-981-97-0801-7_11
DOI: https://doi.org/10.1007/978-981-97-0801-7_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0800-0
Online ISBN: 978-981-97-0801-7
eBook Packages: Computer Science (R0)