A survey of approaches to automatic schema matching

Published: 01 December 2001

Abstract

Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
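
To make the taxonomy's dimensions concrete, here is a minimal sketch of a schema-level, element-level matcher that combines a language-based criterion (name similarity) with a constraint-based one (data type compatibility). It is an illustration only, not an algorithm from the survey: the two toy schemas, the function names, the 0.5 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical toy schemas (assumed for illustration): element name -> declared type.
SOURCE = {"CustName": "varchar", "CustID": "int", "Addr": "varchar"}
TARGET = {"customer_name": "varchar", "customer_id": "int", "address": "varchar"}


def normalize(name):
    """Lowercase a name and drop non-alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Language-based score: string similarity of normalized names, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def type_compatible(t1, t2):
    """Constraint-based check; in this sketch, simply equal declared types."""
    return t1 == t2


def match(source, target, threshold=0.5):
    """Map each source element to its best type-compatible target, or None.

    A candidate is accepted only if its name similarity reaches the
    (assumed) threshold.
    """
    result = {}
    for s_name, s_type in source.items():
        best, best_score = None, threshold
        for t_name, t_type in target.items():
            if not type_compatible(s_type, t_type):
                continue  # constraint-based filter rejects the pair
            score = name_similarity(s_name, t_name)
            if score >= best_score:
                best, best_score = t_name, score
        result[s_name] = best
    return result


if __name__ == "__main__":
    for s_name, t_name in match(SOURCE, TARGET).items():
        print(f"{s_name} -> {t_name}")
```

On these toy schemas the sketch pairs `CustName` with `customer_name`, `CustID` with `customer_id`, and `Addr` with `address`. An instance-level matcher would inspect data values rather than names and types, and a structure-level matcher would compare groups of related elements rather than single ones.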

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 10, Issue 4
December 2001
110 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Graph matching
  2. Machine learning
  3. Model management
  4. Schema integration
  5. Schema matching

Qualifiers

  • Article
