A survey of approaches to automatic schema matching

Authors:

Philip A. Bernstein

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 10, Issue 4

Pages 334 - 350

https://doi.org/10.1007/s007780100057

Published: 01 December 2001

Abstract

Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
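The distinctions named in the abstract can be made concrete with a small sketch. The Python below is a hypothetical illustration, not code from the survey or from any system it reviews: the `Attribute` class, the three matcher functions, the scoring weights, the threshold, and the sample schemas are all assumptions chosen for brevity. It shows one element-level, language-based matcher (name similarity), one element-level, constraint-based matcher (data type and key agreement), and one instance-level matcher (overlap of data values). Structure-level matching is not covered.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Attribute:
    """A schema element; every field here is a hypothetical illustration."""
    name: str
    data_type: str
    is_key: bool = False
    sample_values: tuple = ()


def name_matcher(a, b):
    """Schema-level, element-level, language-based: compare names as strings."""
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


def constraint_matcher(a, b):
    """Schema-level, element-level, constraint-based: compare type and key-ness."""
    score = 0.0
    if a.data_type == b.data_type:
        score += 0.5  # assumed weight, chosen only for illustration
    if a.is_key == b.is_key:
        score += 0.5  # assumed weight, chosen only for illustration
    return score


def instance_matcher(a, b):
    """Instance-level: Jaccard overlap of sample data values."""
    va, vb = set(a.sample_values), set(b.sample_values)
    if not va or not vb:
        return 0.0
    return len(va & vb) / len(va | vb)


def match(source, target, matcher, threshold=0.5):
    """Return (source name, target name, score) for pairs scoring >= threshold."""
    pairs = []
    for s in source:
        for t in target:
            score = matcher(s, t)
            if score >= threshold:
                pairs.append((s.name, t.name, round(score, 2)))
    return pairs


# Two tiny made-up schemas to run the matchers against.
customers = [
    Attribute("CustID", "int", True, (1, 2, 3)),
    Attribute("CustName", "string", False, ("Ann", "Bob")),
]
clients = [
    Attribute("ClientNo", "int", True, (2, 3, 4)),
    Attribute("Name", "string", False, ("Bob", "Cy")),
]

if __name__ == "__main__":
    for matcher in (name_matcher, constraint_matcher, instance_matcher):
        print(matcher.__name__, match(customers, clients, matcher))
```

Each matcher sees different evidence, so the three can disagree on the same pair of schemas. That is one reason a taxonomy of this kind helps when comparing approaches or deciding which matchers to combine.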

Published In

The VLDB Journal — The International Journal on Very Large Data Bases Volume 10, Issue 4

December 2001

110 pages

ISSN: 1066-8888

Publisher

Springer-Verlag

Berlin, Heidelberg

Bibliometrics & Citations

Article Metrics

Total Citations: 972
Total Downloads: 4,363
Downloads (Last 12 months): 69
Downloads (Last 6 weeks): 11

Reflects downloads up to 05 Feb 2025

Cited By

Trummer I (2024) Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models. Proceedings of the VLDB Endowment 17(11):3511-3523. Online publication date: 30-Aug-2024.
https://doi.org/10.14778/3681954.3682017

Liu X, Wang R, Song Y, Kong L, Baeza-Yates R, Bonchi F (2024) GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5476-5486. Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671602

Bianchini D, De Antonellis V, Garda M (2024) A semantics-enabled approach for personalised Data Lake exploration. Knowledge and Information Systems 66(2):1469-1502. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.1007/s10115-023-02014-1

Meduri V, Quamar A, Lei C, Qin X, Reinwald B (2024) Alfa: active learning for graph neural network-based semantic schema alignment. The VLDB Journal — The International Journal on Very Large Data Bases 33(4):981-1011. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/s00778-023-00822-z

Babalou S, König-Ries B (2023) Towards building knowledge by merging multiple ontologies with CoMerger. Applied Ontology 18(4):307-341. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.3233/AO-230020

Shraga R, Miller R (2023) Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V. Proceedings of the VLDB Endowment 16(6):1587-1600. Online publication date: 20-Apr-2023.
https://dl.acm.org/doi/10.14778/3583140.3583169

Sun Y, Xin H, Chen L (2023) RECA: Related Tables Enhanced Column Semantic Type Annotation Framework. Proceedings of the VLDB Endowment 16(6):1319-1331. Online publication date: 1-Feb-2023.
https://dl.acm.org/doi/10.14778/3583140.3583149

Paton N, Chen J, Wu Z (2023) Dataset Discovery and Exploration: A Survey. ACM Computing Surveys 56(4):1-37. Online publication date: 9-Nov-2023.
https://dl.acm.org/doi/10.1145/3626521

Khan A (2023) Knowledge Graphs Querying. ACM SIGMOD Record 52(2):18-29. Online publication date: 11-Aug-2023.
https://dl.acm.org/doi/10.1145/3615952.3615956

Fathollahzadeh S, Boehm M (2023) GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example. Proceedings of the ACM on Management of Data 1(2):1-26. Online publication date: 20-Jun-2023.
https://dl.acm.org/doi/10.1145/3589265
