Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3514221.3517850acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

JEDI: These aren't the JSON documents you're looking for...

Published: 11 June 2022 Publication History

Abstract

The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold τ, retrieve all documents that are within τ from the query document. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document which poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON. We develop QuickJEDI, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and an effective upper bound based on tree sorting. Our upper bound algorithm runs in O(nτ) time and O(n+τ log n) space, which substantially improves the previous best bound of O(n2) time and O(n log n) space (where n is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.

Supplemental Material

M4V File
Presentation video
PDF File
Read me
ZIP File
Source Code

References

[1]
Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, et al. 2014. AsterixDB: A Scalable, Open Source BDMS. Proceedings of the VLDB Endowment, Vol. 7, 14, 1905--1916.
[2]
Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, and Johann Gamper. 2012. Windowed pq-grams for approximate joins of data-centric XML. The VLDB Journal, Vol. 21, 4 (2012), 463--488.
[3]
Nikolaus Augsten and Michael H Böhlen. 2013. Similarity joins in relational database systems. Vol. 5. Morgan & Claypool Publishers.
[4]
Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. The VLDB Journal, Vol. 28, 4 (2019), 497--521.
[5]
Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2021. A JSON Schema Corpus. https://github.com/sdbs-uni-p/json-schema-corpus.
[6]
Dipti Borkar, Ravi Mayuram, Gerald Sangudi, and Michael Carey. 2016. Have your data and query it too: From key-value caching to big data management. In Proceedings of the 2016 International Conference on Management of Data. ACM, 239--251.
[7]
Pierre Bourhis, Juan L Reutter, Fernando Suárez, and Domagoj Vrgovc. 2017. JSON: data model, query languages and schema specification. In Proceedings of the 36th Symposium on Principles of Database Systems. 123--135.
[8]
Tim Bray. 2017. The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259. RFC Editor. https://www.rfc-editor.org/rfc/rfc8259.txt
[9]
Paul C Bryan and Mark Nottingham. 2013. JavaScript Object Notation (JSON) Patch. RFC 6902. RFC Editor. https://www.rfc-editor.org/rfc/rfc6902.txt
[10]
David Buttler. 2004. A short survey of document structure similarity algorithms. Proceedings of the International Conference on Internet Computing, Vol. 1, 3--9.
[11]
Hanyang Cao, Jean-Rémy Falleri, Xavier Blanc, and Li Zhang. 2016. JSON Patch for Turning a Pull REST API into a Push. In International Conference on Service-Oriented Computing. Springer, 435--449.
[12]
Sudarshan S Chawathe, Anand Rajaraman, Hector Garcia-Molina, and Jennifer Widom. 1996. Change detection in hierarchically structured information. In Proceedings of the 1996 International Conference on Management of Data. ACM, 493--504.
[13]
Tao Chen and Min-Yen Kan. 2013. Creating a live, public short message service corpus: the NUS SMS corpus. Language Resources and Evaluation, Vol. 47, 2 (2013), 299--335.
[14]
Mohamed L. Chouder, Stefano Rizzi, and Rachid Chalal. 2017. JSON Datasets for Exploratory OLAP. https://doi.org/10.17632/CT8F9SKV97.1
[15]
Circlecell. 2022. JSON Compare. https://jsoncompare.com/. Accessed: 2022-01--12.
[16]
Gregory Cobena, Serge Abiteboul, and Amelie Marian. 2002. Detecting changes in XML documents. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 41--52.
[17]
SQL Docs. 2021. Full Text Search. https://docs.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-ver15. Accessed: 2022-01--12.
[18]
Dominik Durner, Viktor Leis, and Thomas Neumann. 2021. JSON Tiles: Fast Analytics on Semi-Structured Data. In Proceedings of the 2021 International Conference on Management of Data. ACM, 445--458.
[19]
Vincent Emeakaroha, Philip Healy, Kaniz Fatema, and John Morrison. 2013. Analysis of Data Interchange Formats for Interoperable and Efficient Data Communication in Clouds. In IEEE/ACM 6th International Conference on Utility and Cloud Computing. 393--398.
[20]
EU. 2021. EU Open Data Portal. https://data.europa.eu. Accessed: 2022-01--12.
[21]
Jan P Finis, Martin Raiber, Nikolaus Augsten, Robert Brunel, Alfons Kemper, and Franz F"arber. 2013. Rws-diff: flexible and efficient change detection in hierarchical data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 339--348.
[22]
US Food and Drug Administration. 2017. FDA Enforcement Actions Dataset. https://www.kaggle.com/fda/fda-enforcement-actions Accessed: 2022-01--12.
[23]
Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. 2010. A survey of graph edit distance. Pattern Analysis and Applications, Vol. 13 (2010), 113--129.
[24]
Minos Garofalakis and Amit Kumar. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the 22nd Symposium on Principles of Database Systems. 143--154.
[25]
Zack Grossbart. 2021. JSON Diff. http://www.jsondiff.com. Accessed: 2022-01--12.
[26]
Zachary Halpern. 2022. Magic Cards Dataset. https://mtgjson.com/api/v5/AtomicCards.json. Accessed: 2022-01--12.
[27]
Daniel S. Hirschberg. 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM, Vol. 18, 6 (1975), 341--343.
[28]
Thomas Hütter. 2021. https://github.com/DatabaseGroup/jedi-experiments: SIGMOD 2022. https://doi.org/10.5281/zenodo.5807299
[29]
Thomas Hütter. 2022. https://github.com/DatabaseGroup/jedi-experiments: SIGMOD 2022. https://doi.org/10.5281/zenodo.5881864
[30]
Thomas Hütter, Nikolaus Augsten, Kirsch Christoph M, Carey Michael J, and Chen Li. 2022. JEDI: These arentextquotesinglet the JSON documents youtextquotesinglere looking for... (Extended Version*). arXiv preprint arXiv:2201.08099 (2022).
[31]
Thomas Hütter, Mateusz Pawlik, Robert Löschinger, and Nikolaus Augsten. 2019. Effective filters and linear time verification for tree similarity joins. In Proceedings of the 35th International Conference on Data Engineering. IEEE, 854--865.
[32]
Bumsuk Jang, SeongHun Park, and Young-guk Ha. 2017. A stream-based method to detect differences between XML documents. Journal of Information Science, Vol. 43, 1 (2017), 39--53.
[33]
Taewoo Kim, Wenhai Li, Alexander Behm, Inci Cetindil, Rares Vernica, Vinayak Borkar, Michael J Carey, and Chen Li. 2020. Similarity query support in big data management systems. Information Systems, Vol. 88 (2020), 101455.
[34]
Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema extraction and structural outlier detection for JSON-based NoSQL data stores. Datenbanksysteme für Business, Technologie und Web (2015), 425--444.
[35]
IBM Knowledge Center. 2022. Fuzzy search. https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.admin.ts.doc/doc/c0058557.html. Accessed: 2022-01--12.
[36]
Daniel Kocher and Nikolaus Augsten. 2019. A scalable index for top-k subtree similarity queries. In Proceedings of the 2019 International Conference on Management of Data. ACM, 1624--1641.
[37]
Erwin Leonardi and Sourav S Bhowmick. 2005. Detecting changes on unordered XML documents using relational databases: a schema-conscious approach. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management. 509--516.
[38]
Willi Mann, Nikolaus Augsten, and Panagiotis Bouros. 2016. An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment, Vol. 9, 9, 636--647.
[39]
MongoDB. 2020. White paper: MongoDB Architecture Guide: Overview. Technical Report. 12 pages. mongodb.com
[40]
Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 188--197.
[41]
Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, Vol. 56 (2016), 157--173.
[42]
PostgreSQL Documentation. 2021. Additional Supplied Modules. https://www.postgresql.org/docs/current/pgtrgm.html. Accessed: 2022-01--12.
[43]
Reddit. 2021. Reddit Dataset. https://www.reddit.com/r/science.json. Accessed: 2022-01--12.
[44]
Jo ao Setubal and Jo ao Meidanis. 1997. Introduction to Computational Biology .PWS Publishing Company.
[45]
Zeyuan Shang, Yaxiao Liu, Guoliang Li, and Jianhua Feng. 2017. K-Join: Knowledge-Aware Similarity Join. In Proceedings of the 33rd International Conference on Data Engineering. IEEE, 23--24.
[46]
Dharma Shukla, Shireesh Thota, Karthik Raman, Madhan Gajendran, Ankur Shah, Sergii Ziuzin, Krishnan Sundaram, et al. 2015. Schema-agnostic indexing with Azure DocumentDB. Proceedings of the VLDB Endowment, Vol. 8, 12, 1668--1679.
[47]
William Spoth, Ting Xie, Oliver Kennedy, Ying Yang, Beda Hammerschmidt, Zhen Hua Liu, and Dieter Gawlick. 2018. SchemaDrill: Interactive Semi-Structured Schema Design. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics. 1--7.
[48]
Kuo-Chung Tai. 1979. The tree-to-tree correction problem. J. ACM, Vol. 26, 3, 422--433.
[49]
Yu Tang, Yilun Cai, and Nikos Mamoulis. 2015. Scaling similarity joins over tree-structured data. Proceedings of the VLDB Endowment, Vol. 8, 11, 1130--1141.
[50]
Robert Endre Tarjan. 1983. Data structures and network algorithms .SIAM.
[51]
Twitter. 2022. Twitter Developer Platform. https://developer.twitter.com/en/docs.html Accessed: 2022-01--12.
[52]
Cornell University. 2022. arXiv Dataset. https://www.kaggle.com/Cornell-University/arxiv Accessed: 2022-01--12.
[53]
Stanford University. 2019. Stanford QA Dataset. https://www.kaggle.com/stanfordu/stanford-question-answering-dataset Accessed: 2022-01--12.
[54]
US. 2021. Open Data US. https://www.data.gov. Accessed: 2022-01--12.
[55]
Lusheng Wang and Kaizhong Zhang. 2008. Space efficient algorithms for ordered tree comparison. Algorithmica, Vol. 51, 3 (2008), 283--297.
[56]
Elyas Ben Hadj Yahia, Jean-Rémy Falleri, and Laurent Réveillère. 2017. Polly: A Language-Based Approach for Custom Change Detection of Web Service Data. In International Conference on Service-Oriented Computing. Springer, 430--444.
[57]
Rui Yang, Panos Kalnis, and Anthony KH Tung. 2005. Similarity evaluation on tree-structured data. In Proceedings of the 2005 International Conference on Management of Data. ACM, 754--765.
[58]
Minghe Yu, Jin Wang, Guoliang Li, Yong Zhang, Dong Deng, and Jianhua Feng. 2017. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal, Vol. 26, 2 (2017), 249--274.
[59]
Kaizhong Zhang. 1995. Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern recognition, Vol. 28, 3 (1995), 463--474.
[60]
Kaizhong Zhang. 1996. A constrained edit distance between unordered labeled trees. Algorithmica, Vol. 15, 3 (1996), 205--222.
[61]
Kaizhong Zhang, Rick Statman, and Dennis Shasha. 1992. On the editing distance between unordered labeled trees. Information processing letters, Vol. 42, 3 (1992), 133--139.

Cited By

View all
  • (2024)ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize FrameworkProceedings of the VLDB Endowment10.14778/3681954.368201917:11(3538-3550)Online publication date: 30-Aug-2024
  • (2024)Temporal JSON Keyword SearchProceedings of the ACM on Management of Data10.1145/36549802:3(1-27)Online publication date: 30-May-2024
  • (2024)FUDJ: Flexible User-Defined Distributed Joins2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00320(4194-4207)Online publication date: 13-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN:9781450392495
DOI:10.1145/3514221
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Check for updates

Badges

Author Tags

  1. document stores
  2. json edit distance
  3. similarity lookup queries

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '22
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)325
  • Downloads (Last 6 weeks)33
Reflects downloads up to 27 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize FrameworkProceedings of the VLDB Endowment10.14778/3681954.368201917:11(3538-3550)Online publication date: 30-Aug-2024
  • (2024)Temporal JSON Keyword SearchProceedings of the ACM on Management of Data10.1145/36549802:3(1-27)Online publication date: 30-May-2024
  • (2024)FUDJ: Flexible User-Defined Distributed Joins2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00320(4194-4207)Online publication date: 13-May-2024
  • (2024)Subtree Similarity Search Based on Structure and TextBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_6(72-87)Online publication date: 26-Aug-2024
  • (2023)Feedforward-Aided Course Designs for Similarity SearchProceedings of the 2nd International Workshop on Data Systems Education: Bridging education practice with education research10.1145/3596673.3596974(14-17)Online publication date: 23-Jun-2023
  • (2023)Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data2023 IEEE International Conference on Fuzzy Systems (FUZZ)10.1109/FUZZ52849.2023.10309759(01-06)Online publication date: 13-Aug-2023
  • (2023)Clustering Raw Sensor Data in Process Logs to Detect Data StreamsCooperative Information Systems10.1007/978-3-031-46846-9_25(438-447)Online publication date: 30-Oct-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media