research-article

Open access

JEDI: These aren't the JSON documents you're looking for...

Authors:

Thomas Hütter,

Nikolaus Augsten,

Christoph M. Kirsch,

Michael J. Carey,

Chen LiAuthors Info & Claims

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

Pages 1584 - 1597

https://doi.org/10.1145/3514221.3517850

Published: 11 June 2022 Publication History

Abstract

The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold τ, retrieve all documents that are within τ from the query document. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document which poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON. We develop QuickJEDI, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and an effective upper bound based on tree sorting. Our upper bound algorithm runs in O(nτ) time and O(n+τ log n) space, which substantially improves the previous best bound of O(n2) time and O(n log n) space (where n is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.

Supplemental Material

M4V File

Presentation video

Download
155.03 MB

PDF File

Read me

Download
121.20 KB

ZIP File

Source Code

Download
77.20 KB

References

[1]

Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, et al. 2014. AsterixDB: A Scalable, Open Source BDMS. Proceedings of the VLDB Endowment, Vol. 7, 14, 1905--1916.

Digital Library

[2]

Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, and Johann Gamper. 2012. Windowed pq-grams for approximate joins of data-centric XML. The VLDB Journal, Vol. 21, 4 (2012), 463--488.

Digital Library

[3]

Nikolaus Augsten and Michael H Böhlen. 2013. Similarity joins in relational database systems. Vol. 5. Morgan & Claypool Publishers.

[4]

Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. The VLDB Journal, Vol. 28, 4 (2019), 497--521.

Digital Library

[5]

Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2021. A JSON Schema Corpus. https://github.com/sdbs-uni-p/json-schema-corpus.

[6]

Dipti Borkar, Ravi Mayuram, Gerald Sangudi, and Michael Carey. 2016. Have your data and query it too: From key-value caching to big data management. In Proceedings of the 2016 International Conference on Management of Data. ACM, 239--251.

Digital Library

[7]

Pierre Bourhis, Juan L Reutter, Fernando Suárez, and Domagoj Vrgovc. 2017. JSON: data model, query languages and schema specification. In Proceedings of the 36th Symposium on Principles of Database Systems. 123--135.

Digital Library

[8]

Tim Bray. 2017. The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259. RFC Editor. https://www.rfc-editor.org/rfc/rfc8259.txt

[9]

Paul C Bryan and Mark Nottingham. 2013. JavaScript Object Notation (JSON) Patch. RFC 6902. RFC Editor. https://www.rfc-editor.org/rfc/rfc6902.txt

[10]

David Buttler. 2004. A short survey of document structure similarity algorithms. Proceedings of the International Conference on Internet Computing, Vol. 1, 3--9.

[11]

Hanyang Cao, Jean-Rémy Falleri, Xavier Blanc, and Li Zhang. 2016. JSON Patch for Turning a Pull REST API into a Push. In International Conference on Service-Oriented Computing. Springer, 435--449.

Digital Library

[12]

Sudarshan S Chawathe, Anand Rajaraman, Hector Garcia-Molina, and Jennifer Widom. 1996. Change detection in hierarchically structured information. In Proceedings of the 1996 International Conference on Management of Data. ACM, 493--504.

Digital Library

[13]

Tao Chen and Min-Yen Kan. 2013. Creating a live, public short message service corpus: the NUS SMS corpus. Language Resources and Evaluation, Vol. 47, 2 (2013), 299--335.

[14]

Mohamed L. Chouder, Stefano Rizzi, and Rachid Chalal. 2017. JSON Datasets for Exploratory OLAP. https://doi.org/10.17632/CT8F9SKV97.1

[15]

Circlecell. 2022. JSON Compare. https://jsoncompare.com/. Accessed: 2022-01--12.

[16]

Gregory Cobena, Serge Abiteboul, and Amelie Marian. 2002. Detecting changes in XML documents. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 41--52.

[17]

SQL Docs. 2021. Full Text Search. https://docs.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-ver15. Accessed: 2022-01--12.

[18]

Dominik Durner, Viktor Leis, and Thomas Neumann. 2021. JSON Tiles: Fast Analytics on Semi-Structured Data. In Proceedings of the 2021 International Conference on Management of Data. ACM, 445--458.

Digital Library

[19]

Vincent Emeakaroha, Philip Healy, Kaniz Fatema, and John Morrison. 2013. Analysis of Data Interchange Formats for Interoperable and Efficient Data Communication in Clouds. In IEEE/ACM 6th International Conference on Utility and Cloud Computing. 393--398.

[20]

EU. 2021. EU Open Data Portal. https://data.europa.eu. Accessed: 2022-01--12.

[21]

Jan P Finis, Martin Raiber, Nikolaus Augsten, Robert Brunel, Alfons Kemper, and Franz F"arber. 2013. Rws-diff: flexible and efficient change detection in hierarchical data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 339--348.

Digital Library

[22]

US Food and Drug Administration. 2017. FDA Enforcement Actions Dataset. https://www.kaggle.com/fda/fda-enforcement-actions Accessed: 2022-01--12.

[23]

Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. 2010. A survey of graph edit distance. Pattern Analysis and Applications, Vol. 13 (2010), 113--129.

[24]

Minos Garofalakis and Amit Kumar. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the 22nd Symposium on Principles of Database Systems. 143--154.

Digital Library

[25]

Zack Grossbart. 2021. JSON Diff. http://www.jsondiff.com. Accessed: 2022-01--12.

[26]

Zachary Halpern. 2022. Magic Cards Dataset. https://mtgjson.com/api/v5/AtomicCards.json. Accessed: 2022-01--12.

[27]

Daniel S. Hirschberg. 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM, Vol. 18, 6 (1975), 341--343.

Digital Library

[28]

Thomas Hütter. 2021. https://github.com/DatabaseGroup/jedi-experiments: SIGMOD 2022. https://doi.org/10.5281/zenodo.5807299

[29]

Thomas Hütter. 2022. https://github.com/DatabaseGroup/jedi-experiments: SIGMOD 2022. https://doi.org/10.5281/zenodo.5881864

[30]

Thomas Hütter, Nikolaus Augsten, Kirsch Christoph M, Carey Michael J, and Chen Li. 2022. JEDI: These arentextquotesinglet the JSON documents youtextquotesinglere looking for... (Extended Version*). arXiv preprint arXiv:2201.08099 (2022).

[31]

Thomas Hütter, Mateusz Pawlik, Robert Löschinger, and Nikolaus Augsten. 2019. Effective filters and linear time verification for tree similarity joins. In Proceedings of the 35th International Conference on Data Engineering. IEEE, 854--865.

[32]

Bumsuk Jang, SeongHun Park, and Young-guk Ha. 2017. A stream-based method to detect differences between XML documents. Journal of Information Science, Vol. 43, 1 (2017), 39--53.

Digital Library

[33]

Taewoo Kim, Wenhai Li, Alexander Behm, Inci Cetindil, Rares Vernica, Vinayak Borkar, Michael J Carey, and Chen Li. 2020. Similarity query support in big data management systems. Information Systems, Vol. 88 (2020), 101455.

Digital Library

[34]

Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema extraction and structural outlier detection for JSON-based NoSQL data stores. Datenbanksysteme für Business, Technologie und Web (2015), 425--444.

[35]

IBM Knowledge Center. 2022. Fuzzy search. https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.admin.ts.doc/doc/c0058557.html. Accessed: 2022-01--12.

[36]

Daniel Kocher and Nikolaus Augsten. 2019. A scalable index for top-k subtree similarity queries. In Proceedings of the 2019 International Conference on Management of Data. ACM, 1624--1641.

Digital Library

[37]

Erwin Leonardi and Sourav S Bhowmick. 2005. Detecting changes on unordered XML documents using relational databases: a schema-conscious approach. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management. 509--516.

Digital Library

[38]

Willi Mann, Nikolaus Augsten, and Panagiotis Bouros. 2016. An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment, Vol. 9, 9, 636--647.

Digital Library

[39]

MongoDB. 2020. White paper: MongoDB Architecture Guide: Overview. Technical Report. 12 pages. mongodb.com

[40]

Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 188--197.

[41]

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, Vol. 56 (2016), 157--173.

Digital Library

[42]

PostgreSQL Documentation. 2021. Additional Supplied Modules. https://www.postgresql.org/docs/current/pgtrgm.html. Accessed: 2022-01--12.

[43]

Reddit. 2021. Reddit Dataset. https://www.reddit.com/r/science.json. Accessed: 2022-01--12.

[44]

Jo ao Setubal and Jo ao Meidanis. 1997. Introduction to Computational Biology .PWS Publishing Company.

[45]

Zeyuan Shang, Yaxiao Liu, Guoliang Li, and Jianhua Feng. 2017. K-Join: Knowledge-Aware Similarity Join. In Proceedings of the 33rd International Conference on Data Engineering. IEEE, 23--24.

[46]

Dharma Shukla, Shireesh Thota, Karthik Raman, Madhan Gajendran, Ankur Shah, Sergii Ziuzin, Krishnan Sundaram, et al. 2015. Schema-agnostic indexing with Azure DocumentDB. Proceedings of the VLDB Endowment, Vol. 8, 12, 1668--1679.

Digital Library

[47]

William Spoth, Ting Xie, Oliver Kennedy, Ying Yang, Beda Hammerschmidt, Zhen Hua Liu, and Dieter Gawlick. 2018. SchemaDrill: Interactive Semi-Structured Schema Design. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics. 1--7.

Digital Library

[48]

Kuo-Chung Tai. 1979. The tree-to-tree correction problem. J. ACM, Vol. 26, 3, 422--433.

Digital Library

[49]

Yu Tang, Yilun Cai, and Nikos Mamoulis. 2015. Scaling similarity joins over tree-structured data. Proceedings of the VLDB Endowment, Vol. 8, 11, 1130--1141.

Digital Library

[50]

Robert Endre Tarjan. 1983. Data structures and network algorithms .SIAM.

[51]

Twitter. 2022. Twitter Developer Platform. https://developer.twitter.com/en/docs.html Accessed: 2022-01--12.

[52]

Cornell University. 2022. arXiv Dataset. https://www.kaggle.com/Cornell-University/arxiv Accessed: 2022-01--12.

[53]

Stanford University. 2019. Stanford QA Dataset. https://www.kaggle.com/stanfordu/stanford-question-answering-dataset Accessed: 2022-01--12.

[54]

US. 2021. Open Data US. https://www.data.gov. Accessed: 2022-01--12.

[55]

Lusheng Wang and Kaizhong Zhang. 2008. Space efficient algorithms for ordered tree comparison. Algorithmica, Vol. 51, 3 (2008), 283--297.

Digital Library

[56]

Elyas Ben Hadj Yahia, Jean-Rémy Falleri, and Laurent Réveillère. 2017. Polly: A Language-Based Approach for Custom Change Detection of Web Service Data. In International Conference on Service-Oriented Computing. Springer, 430--444.

[57]

Rui Yang, Panos Kalnis, and Anthony KH Tung. 2005. Similarity evaluation on tree-structured data. In Proceedings of the 2005 International Conference on Management of Data. ACM, 754--765.

Digital Library

[58]

Minghe Yu, Jin Wang, Guoliang Li, Yong Zhang, Dong Deng, and Jianhua Feng. 2017. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal, Vol. 26, 2 (2017), 249--274.

Digital Library

[59]

Kaizhong Zhang. 1995. Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern recognition, Vol. 28, 3 (1995), 463--474.

[60]

Kaizhong Zhang. 1996. A constrained edit distance between unordered labeled trees. Algorithmica, Vol. 15, 3 (1996), 205--222.

Digital Library

[61]

Kaizhong Zhang, Rick Statman, and Dennis Shasha. 1992. On the editing distance between unordered labeled trees. Information processing letters, Vol. 42, 3 (1992), 133--139.

Cited By

Yun JTak BHan W(2024)ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize FrameworkProceedings of the VLDB Endowment10.14778/3681954.368201917:11(3538-3550)Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.14778/3681954.3682019
Dyreson CShatnawi ABhowmick SSharma V(2024)Temporal JSON Keyword SearchProceedings of the ACM on Management of Data10.1145/36549802:3(1-27)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654980
Sevim AEldawy ACarman ECarey MTsotras V(2024)FUDJ: Flexible User-Defined Distributed Joins2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00320(4194-4207)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00320
Show More Cited By

Index Terms

JEDI: These aren't the JSON documents you're looking for...
1. Information systems
  1. Data management systems
  2. Information retrieval
    1. Retrieval models and ranking
      1. Similarity measures

Recommendations

JSON: Data model, Query languages and Schema specification
PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there is no agreement upon a theoretical framework for dealing with JSON. Therefore in this paper we ...
Foundations of JSON Schema
WWW '16: Proceedings of the 25th International Conference on World Wide Web

JSON -- the most popular data format for sending API requests and responses -- is still lacking a standardized schema or meta-data definition that allows the developers to specify the structure of JSON documents. JSON Schema is an attempt to provide a ...
Json: Main principals

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

June 2022

2597 pages

ISBN:9781450392495

DOI:10.1145/3514221

General Chair:
Zachary Ives
University of Pennsylvania (USA)
,
Program Chairs:
Angela Bonifati
Lyon 1 University (France)
,
Amr El Abbadi
University of California, Santa Barbara (USA)

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Check for updates

Badges

Author Tags

Qualifiers

Research-article

Funding Sources

Austrian Marshall Plan Foundation
Austrian Science Fund
Czech Ministry of Education, Youth and Sports from the Czech Operational Programme Research, Development, and Education
Horizon 2020 Framework Programme

Conference

SIGMOD/PODS '22

Sponsor:

SIGMOD

SIGMOD/PODS '22: International Conference on Management of Data

June 12 - 17, 2022

PA, Philadelphia, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
974
Total Downloads

Downloads (Last 12 months)325
Downloads (Last 6 weeks)33

Reflects downloads up to 27 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Yun JTak BHan W(2024)ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize FrameworkProceedings of the VLDB Endowment10.14778/3681954.368201917:11(3538-3550)Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.14778/3681954.3682019
Dyreson CShatnawi ABhowmick SSharma V(2024)Temporal JSON Keyword SearchProceedings of the ACM on Management of Data10.1145/36549802:3(1-27)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654980
Sevim AEldawy ACarman ECarey MTsotras V(2024)FUDJ: Flexible User-Defined Distributed Joins2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00320(4194-4207)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00320
Mizokami TBou SAmagasa T(2024)Subtree Similarity Search Based on Structure and TextBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_6(72-87)Online publication date: 26-Aug-2024
https://dl.acm.org/doi/10.1007/978-3-031-68323-7_6
Hütter TKocher DAivaloglou EFletcher GMiedema D(2023)Feedforward-Aided Course Designs for Similarity SearchProceedings of the 2nd International Workshop on Data Systems Education: Bridging education practice with education research10.1145/3596673.3596974(14-17)Online publication date: 23-Jun-2023
https://dl.acm.org/doi/10.1145/3596673.3596974
Uhartegaray RD'Orazio LDamigos MKalogeros E(2023)Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data2023 IEEE International Conference on Fuzzy Systems (FUZZ)10.1109/FUZZ52849.2023.10309759(01-06)Online publication date: 13-Aug-2023
https://doi.org/10.1109/FUZZ52849.2023.10309759
Ehrendorfer MMangler JRinderle-Ma S(2023)Clustering Raw Sensor Data in Process Logs to Detect Data StreamsCooperative Information Systems10.1007/978-3-031-46846-9_25(438-447)Online publication date: 30-Oct-2023
https://dl.acm.org/doi/10.1007/978-3-031-46846-9_25

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten