Article

Multi-column substring matching for database schema translation

Authors:

Robert H. Warren,

Frank Wm. Tompa

VLDB '06: Proceedings of the 32nd international conference on Very large data bases

Pages 331 - 342

Published: 01 September 2006

Abstract

We describe a method for discovering complex schema translations involving substrings from multiple database columns. The method does not require a training set of instances linked across databases, and it can handle both fixed- and variable-length field columns. We propose an iterative algorithm that deduces the correct sequence of concatenations of column substrings needed to translate from one database to another. We introduce the algorithm with examples on common database data values and examine its performance on real-world and synthetic datasets.
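To make the problem concrete, the following is a minimal sketch of discovering a concatenation of column substrings without linked instances. It is not the algorithm from the paper: the greedy prefix-matching heuristic, the function names (`apply_formula`, `candidate_pieces`, `score`, `discover_translation`), and the sample names are all assumptions made for illustration. A formula is a list of `(column, start, end)` pieces, where `end=None` takes the rest of a variable-length value.

```python
def apply_formula(row, formula):
    """Concatenate the column substrings named by formula."""
    return "".join(row[col][start:end] for col, start, end in formula)


def candidate_pieces(columns, max_len=3):
    """Hypothetical search space: whole columns plus short fixed-length prefixes."""
    pieces = []
    for col in columns:
        pieces.append((col, 0, None))  # whole (variable-length) value
        pieces.extend((col, 0, end) for end in range(1, max_len + 1))
    return pieces


def score(source_rows, targets, formula):
    """Rank a formula by how many generated strings are a prefix of some
    target value, breaking ties by the total length generated."""
    produced = [apply_formula(row, formula) for row in source_rows]
    hits = sum(any(t.startswith(p) for t in targets) for p in produced)
    return hits, sum(len(p) for p in produced)


def discover_translation(source_rows, target_values, max_pieces=3):
    """Greedily append the best-scoring piece until every source row
    generates a value that occurs in the target column."""
    targets = set(target_values)
    pieces = candidate_pieces(list(source_rows[0]))
    formula = []
    for _ in range(max_pieces):
        if all(apply_formula(row, formula) in targets for row in source_rows):
            break
        best = max(pieces, key=lambda p: score(source_rows, targets, formula + [p]))
        formula.append(best)
    return formula


# Source table with two columns; the target column is unordered and
# contains a value ("jdoe") with no counterpart in the source.
source_rows = [
    {"first": "robert", "last": "warren"},
    {"first": "frank", "last": "tompa"},
    {"first": "alice", "last": "smith"},
]
target_values = ["asmith", "jdoe", "rwarren", "ftompa"]

formula = discover_translation(source_rows, target_values)
print(formula)  # [('first', 0, 1), ('last', 0, None)]
```

The recovered formula reads as "first character of `first`, followed by all of `last`". The sketch only considers prefixes of columns and scores by set membership, so it illustrates the shape of the search rather than the matching techniques evaluated in the paper.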

Information & Contributors

Information

Published In

VLDB '06: Proceedings of the 32nd international conference on Very large data bases

September 2006

1269 pages

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data
K.I.S.S. SIG on Databases
AJU Information Technology Co., Ltd
US Army ITC-PAC Asian Research Office
Google Inc.
The Database Society of Japan
Samsung SOS
Advanced Information Technology Research Center
Naver
Microsoft
Korea Info Sci Society: Korea Information Science Society
SK telecom
Systems Applications Products
ORACLE
International Business Management
Air Force Office of Scientific Research/Asian Office of Aerospace R&D
Kosef
Kaist
LG Electronics
CCF-DBS

Publisher

VLDB Endowment

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 9
Total Downloads: 300
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 30 Nov 2024

Citations

Cited By

Lin Y, He Y, Chaudhuri S (2023). Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. Proceedings of the VLDB Endowment, 16:10 (2578-2590). DOI: 10.14778/3603581.3603596. Online publication date: 1-Jun-2023
https://dl.acm.org/doi/10.14778/3603581.3603596
Li P, Cheng X, Chu X, He Y, Chaudhuri S, Li G, Li Z, Idreos S, Srivastava D (2021). Auto-FuzzyJoin. Proceedings of the 2021 International Conference on Management of Data (1064-1076). DOI: 10.1145/3448016.3452824. Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3452824
Zhang Y, Ng K, Churchill T, Christen P, Cuzzocrea A, Allan J, Paton N, Srivastava D, Agrawal R, Broder A, Zaki M, Candan S, Labrinidis A, Schuster A, Wang H (2018). Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2213-2221). DOI: 10.1145/3269206.3272016. Online publication date: 17-Oct-2018
https://dl.acm.org/doi/10.1145/3269206.3272016
Zhu E, He Y, Chaudhuri S (2017). Auto-join. Proceedings of the VLDB Endowment, 10:10 (1034-1045). DOI: 10.14778/3115404.3115409. Online publication date: 1-Jun-2017
https://dl.acm.org/doi/10.14778/3115404.3115409
Yu M, Li G, Deng D, Feng J (2016). String similarity search and join. Frontiers of Computer Science: Selected Publications from Chinese Universities, 10:3 (399-417). DOI: 10.1007/s11704-015-5900-5. Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1007/s11704-015-5900-5
Singh R, Gulwani S (2012). Learning semantic string transformations from examples. Proceedings of the VLDB Endowment, 5:8 (740-751). DOI: 10.14778/2212351.2212356. Online publication date: 1-Apr-2012
https://dl.acm.org/doi/10.14778/2212351.2212356
Ding G, Dong H, Wang G (2012). Appearance-Order-Based schema matching. Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part I (79-94). DOI: 10.1007/978-3-642-29038-1_8. Online publication date: 15-Apr-2012
https://dl.acm.org/doi/10.1007/978-3-642-29038-1_8
Ding G, Wang G (2011). Discovering implicit categorical semantics for schema matching. Proceedings of the 16th international conference on Database systems for advanced applications: Part II (179-194). DOI: 10.5555/1997251.1997269. Online publication date: 22-Apr-2011
https://dl.acm.org/doi/10.5555/1997251.1997269
Ding G, Wang G, Wang B (2010). Top-k generation of mediated schemas over multiple data sources. Proceedings of the 15th international conference on Database systems for advanced applications (143-155). DOI: 10.5555/1880853.1880871. Online publication date: 1-Apr-2010
https://dl.acm.org/doi/10.5555/1880853.1880871
