Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/1083592.1083630dlproceedingsArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
Article

Approximate matching of hierarchical data using pq-grams

Published: 30 August 2005 Publication History

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.

References

[1]
S. Al-Khalifa, H. V. Jagadish, J. M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. of the Int. Conf. on Data Engineering (ICDE), pages 141--152, San Jose, California, 2002. ACM Press.]]
[2]
N. Augsten, M. Böhlen, and J. Gamper. Reducing the integration of public administration databases to approximate tree matching. In R. Traunmüller, editor, Electronic Government - Third International Conference, EGOV 2004, LNCS 3183, pages 102--107, Zaragoza, Spain, 2004.]]
[3]
N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 310--321, Madison, Wisconsin, June 2002. ACM Press.]]
[4]
J. Celko. Trees, databases and SQL. Database Programming and Design, 7(10):48--57, 1994.]]
[5]
J. Celko. Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann Publishers Inc., 2004.]]
[6]
S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 493--504, Montreal, Canada, June 1996. ACM Press.]]
[7]
W. Chen. New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40(2):135--158, Aug. 2001.]]
[8]
D. DeHaan, D. Toman, M. P. Consens, and M. T. Özsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 623--634, San Diego, California, June 2003. ACM Press.]]
[9]
M. Garofalakis and A. Kumar. XML stream processing using tree-edit distance embeddings. ACM Trans. on Database Systems, 30(1):279--332, 2005.]]
[10]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. of the Int. Conf. on Very Large Databases (VLDB), pages 491--500. Roma, Italy, Sept. 2001. Morgan Kaufmann Publishers Inc.]]
[11]
S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, and T. Yu. Approximate XML joins. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 287--298, Madison, Wisconsin, 2002. ACM Press.]]
[12]
H. Jiang, W. Wang, H. Lu, and J. X. Yu. Holistic twig joins on indexed XML documents. In Proc. of the Int. Conf. on Very Large Databases (VLDB), pages 273--284, Berlin, Germany, Sept. 2003. Morgan Kaufmann Publishers Inc.]]
[13]
T. Jiang, L. Wang, and K. Zhang. Alignment of trees---an alternative to tree edit. Theoretical Computer Science, 143(1):137--148, July 1995.]]
[14]
P. N. Klein. Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 91--102, Venice, Italy, 1998. Springer.]]
[15]
K.-H. Lee, Y.-C. Choy, and S.-B. Cho. An efficient algorithm to compute differences between structured documents. IEEE Transactions on Knowledge and Data Engineering, 16(8):965--979, 2004.]]
[16]
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001.]]
[17]
N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate XML query answers. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 263--274, Paris, France, June 2004. ACM Press.]]
[18]
S. M. Selkow. The tree-to-tree editing problem. Information Processing Letters, 6(6):184--186, Dec. 1977.]]
[19]
K.-C. Tai. The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):422--433, July 1979.]]
[20]
E. Tanaka and K. Tanaka. The tree-to-tree editing problem. Int. Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2(2):221--240, 1988.]]
[21]
E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, Jan. 1992.]]
[22]
W. Yang. Identifying syntactic differences between two programs. Software---Practice & Experience, 21(7):739--755, July 1991.]]
[23]
C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 425--436, Santa Barabara, California, 2001.]]
[24]
K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245--1262, 1989.]]

Cited By

View all
  • (2022)SyncSignatureProceedings of the VLDB Endowment10.14778/3565816.356583316:2(330-342)Online publication date: 1-Oct-2022
  • (2022)CATNIPProceedings of the 27th ACM Conference on on Innovation and Technology in Computer Science Education Vol. 110.1145/3502718.3524820(124-130)Online publication date: 7-Jul-2022
  • (2021)Guiding Next-Step Hint Generation Using Automated TestsProceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 110.1145/3430665.3456344(220-226)Online publication date: 26-Jun-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image DL Hosted proceedings
VLDB '05: Proceedings of the 31st international conference on Very large data bases
August 2005
1392 pages
ISBN:1595931546

Publisher

VLDB Endowment

Publication History

Published: 30 August 2005

Qualifiers

  • Article

Conference

ICMI05

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2022)SyncSignatureProceedings of the VLDB Endowment10.14778/3565816.356583316:2(330-342)Online publication date: 1-Oct-2022
  • (2022)CATNIPProceedings of the 27th ACM Conference on on Innovation and Technology in Computer Science Education Vol. 110.1145/3502718.3524820(124-130)Online publication date: 7-Jul-2022
  • (2021)Guiding Next-Step Hint Generation Using Automated TestsProceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 110.1145/3430665.3456344(220-226)Online publication date: 26-Jun-2021
  • (2019)A Scalable Index for Top-k Subtree Similarity QueriesProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3319892(1624-1641)Online publication date: 25-Jun-2019
  • (2015)An automated framework for recommending program elements to novicesProceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE.2015.54(283-288)Online publication date: 9-Nov-2015
  • (2014)Perceptual Similarity of Images Generated using Tree GrammarsProceedings of the Southern African Institute for Computer Scientist and Information Technologists Annual Conference 2014 on SAICSIT 2014 Empowered by Technology10.1145/2664591.2664606(286-296)Online publication date: 29-Sep-2014
  • (2014)A survey on tree edit distance lower bound estimation techniques for similarity join on XML dataACM SIGMOD Record10.1145/2590989.259099442:4(29-39)Online publication date: 28-Feb-2014
  • (2012)What is the IQ of your data transformation system?Proceedings of the 21st ACM international conference on Information and knowledge management10.1145/2396761.2396872(872-881)Online publication date: 29-Oct-2012
  • (2011)Similarity join on XML based on k-generation set distanceProceedings of the 2011 international conference on Web-Age Information Management10.1007/978-3-642-28635-3_11(124-135)Online publication date: 14-Sep-2011
  • (2010)Evolutionary taxonomy construction from dynamic tag spaceProceedings of the 11th international conference on Web information systems engineering10.5555/1991336.1991350(105-119)Online publication date: 12-Dec-2010
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media