research-article

ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework

Published: 01 July 2024

Abstract

Schemalessness, one of the major advantages of the JSON representation format, carries high penalties in querying and operations because it rules out critical functions such as query optimization, indexing, and data verification. There have been continuous efforts to develop an algorithm that accurately discovers a JSON schema from a bag of JSON documents. Unfortunately, existing schema discovery techniques are top-down algorithms and therefore lack visibility into the child nodes of the JSON tree. Without information about lower-level JSON elements, top-down algorithms must rely on assumptions and heuristics to decide the schema type of each node. Such static decisions are often violated in datasets, which causes top-down algorithms to perform poorly. To overcome this, we propose ReCG, an algorithm that processes JSON documents in a bottom-up manner. It builds up schemas from the leaf elements upward in the JSON document tree and can thus make more informed decisions about schema node types. In addition, we systematically apply the MDL (Minimum Description Length) principle while building up the schemas, so that among candidate schemas we choose the most concise yet accurate one with well-balanced generality. Evaluations show that our technique improves the recall and precision of discovered schemas by as much as 47%, resulting in a 46% better F1 score, while also running 2.11× faster on average than the state of the art.
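
To make the MDL idea concrete, the following is a minimal sketch of two-part description-length selection between two candidate schemas for a bag of JSON objects. It is not the ReCG algorithm. The cost functions, the `BITS_PER_CHAR` constant, and the names `record_cost`, `map_cost`, and `choose_schema` are all assumptions made for illustration. Values are ignored and only keys are scored.

```python
BITS_PER_CHAR = 8  # assumed cost of spelling out one character of a key


def record_cost(docs):
    """Toy two-part cost of a record schema whose fields are all optional."""
    keys = set().union(*(d.keys() for d in docs))
    # Schema part: every field name is written once in the schema.
    schema_bits = sum(len(k) * BITS_PER_CHAR for k in keys)
    # Data part: each document needs one presence bit per declared field.
    data_bits = len(docs) * len(keys)
    return schema_bits + data_bits


def map_cost(docs):
    """Toy two-part cost of a map schema that allows arbitrary keys."""
    # Schema part: an assumed small constant, since no keys are named.
    schema_bits = BITS_PER_CHAR
    # Data part: every document must spell out each of its own keys.
    data_bits = sum(len(k) * BITS_PER_CHAR for d in docs for k in d)
    return schema_bits + data_bits


def choose_schema(docs):
    """Pick the candidate with the shorter total description length."""
    return "record" if record_cost(docs) <= map_cost(docs) else "map"


# Fifty documents sharing the same two keys: naming the keys once pays off.
users = [{"id": i, "name": "n"} for i in range(50)]
print(choose_schema(users))  # record

# Fifty documents, each with its own unique key: a map is cheaper.
counters = [{"user_%d" % i: i} for i in range(50)]
print(choose_schema(counters))  # map
```

Under these assumed costs, a record schema wins when the same keys recur across documents, and a map schema wins when keys behave like data. This is the kind of trade-off between conciseness and accuracy that the abstract attributes to MDL.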

Published In

Proceedings of the VLDB Endowment  Volume 17, Issue 11
July 2024
1039 pages

Publisher

VLDB Endowment
