research-article

From natural language processing to neural databases

Authors: Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, Alon Halevy

Proceedings of the VLDB Endowment, Volume 14, Issue 6

Pages 1033 - 1039

https://doi.org/10.14778/3447689.3447706

Published: 01 February 2021

Abstract

In recent years, neural networks have shown impressive performance gains on long-standing AI problems, such as answering queries from text and machine translation. These advances raise the question of whether neural nets can be used at the core of query processing to derive answers from facts, even when the facts are expressed in natural language. If so, it is conceivable that we could relax the fundamental assumption of database management, namely, that our data is represented as fields of a pre-defined schema. Furthermore, such technology would enable combining information from text, images, and structured data seamlessly.

This paper introduces neural databases, a class of systems that use NLP transformers as localized answer derivation engines. We ground the vision in NeuralDB, a system for querying facts represented as short natural language sentences. We demonstrate that recent natural language processing models, specifically transformers, can answer select-project-join queries if they are given a set of relevant facts. However, they cannot scale to non-trivial databases, nor can they answer set-based and aggregation queries. Based on these insights, we identify specific research challenges that must be addressed to build neural databases. Some of these challenges require drawing upon the rich literature in data management, while others pose new research opportunities for the NLP community. Finally, we show that with preliminary solutions, NeuralDB can already answer queries over thousands of sentences with very high accuracy.



Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 6

February 2021, 261 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher: VLDB Endowment
