MT-Teql: evaluating and augmenting neural NLIDB on real-world linguistic and schema variations

Published: 01 November 2021

Abstract

A Natural Language Interface to Database (NLIDB) translates human utterances into SQL queries, enabling non-expert users to interact with databases. Recently, neural network models have become a major approach to implementing NLIDBs. However, neural NLIDBs face challenges due to variations in natural language and database schema design: a single user intent or conceptual database model can be expressed in various forms. Existing benchmarks, which rely on hold-out datasets, cannot provide a thorough understanding of how well neural NLIDBs perform in real-world situations or how robust they are against such variations. A key difficulty is annotating SQL queries for inputs under real-world variations, which requires considerable manual effort and expert knowledge.
To systematically assess the robustness of neural NLIDBs without extensive manual effort, we propose MT-Teql, a unified framework to benchmark NLIDBs against real-world language and schema variations. Inspired by recent advances in DBMS metamorphic testing, MT-Teql implements semantics-preserving transformations on utterances and database schemas to generate their variants. NLIDBs can thus be examined for robustness using utterances and schemas together with their variants, without manual intervention.
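The metamorphic-testing idea described above can be illustrated with a minimal sketch. It is not MT-Teql's implementation: the two transformations (`add_prefix`, `reverse_columns`), the whitespace-and-case oracle in `normalize`, and the deliberately brittle `toy_model` are all hypothetical stand-ins chosen for illustration. The point is only that a semantics-preserving change to the utterance or schema should leave the predicted SQL unchanged, so any difference can be flagged as a defect without a hand-annotated gold query.

```python
from typing import Callable, Dict, List, Tuple

Schema = Dict[str, List[str]]          # table name -> column names
Nlidb = Callable[[str, Schema], str]   # (utterance, schema) -> SQL text


def add_prefix(utterance: str) -> str:
    """Hypothetical utterance variant: a polite prefix that keeps the intent."""
    return "Please tell me, " + utterance[0].lower() + utterance[1:]


def reverse_columns(schema: Schema) -> Schema:
    """Hypothetical schema variant: reorder columns; the conceptual model is unchanged."""
    return {table: list(reversed(cols)) for table, cols in schema.items()}


def normalize(sql: str) -> str:
    """Crude oracle (an assumption): compare SQL modulo case and whitespace only."""
    return " ".join(sql.lower().split())


def find_defects(model: Nlidb, utterance: str, schema: Schema) -> List[str]:
    """Return the names of variants whose predicted SQL differs from the baseline."""
    baseline = normalize(model(utterance, schema))
    variants: Dict[str, Tuple[str, Schema]] = {
        "utterance_prefix": (add_prefix(utterance), schema),
        "column_order": (utterance, reverse_columns(schema)),
    }
    return [
        name
        for name, (u, s) in variants.items()
        if normalize(model(u, s)) != baseline
    ]


def toy_model(utterance: str, schema: Schema) -> str:
    """Stand-in NLIDB, brittle on purpose: always selects the first listed column."""
    table = next(iter(schema))
    return f"SELECT {schema[table][0]} FROM {table}"


if __name__ == "__main__":
    schema = {"singer": ["name", "age"]}
    print(find_defects(toy_model, "What are the names of all singers?", schema))
```

Under these assumptions, the toy model passes the utterance variant but fails the column-order variant, because its output depends on how the schema happens to be listed rather than on the question. A real harness would replace `toy_model` with an actual NLIDB and would need a stronger equivalence check than string comparison.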
We benchmarked nine neural NLIDBs using 62,430 inputs and identified 15,433 defects. We analyzed potential root causes of the defects and conducted a user study to show how MT-Teql can help developers systematically assess NLIDBs. We further show that the transformed (error-triggering) inputs can be used to augment popular NLIDBs and eliminate 46.5% (±5.0%) of the errors they make without compromising their accuracy on standard benchmarks. We summarize lessons from this study that can provide insights for selecting and designing NLIDBs that fit particular usage scenarios.



Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 3
November 2021
364 pages
ISSN: 2150-8097

Publisher

VLDB Endowment



