DeepDive: declarative knowledge base construction

Authors:

Christopher Ré,

Michael Cafarella,

Christopher De Sa,

Sen Wu

Communications of the ACM, Volume 60, Issue 5

Pages 93 - 102

https://doi.org/10.1145/3060586

Published: 24 April 2017


Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks while not requiring users to write any probabilistic inference algorithms directly. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.
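To make the idea of "rules in, probabilities out" concrete, the following is a minimal Python sketch of declaratively specified rules feeding one statistical inference step. It is not DeepDive's rule language or API. The candidate facts, the sentences, the rule names, and the weights are all hypothetical, and the weights are fixed by hand purely for illustration. Exact enumeration over all possible worlds stands in for whatever inference procedure a real system would use, which is feasible only because the example has three variables.

```python
import itertools
import math

# Hypothetical candidate facts (Boolean random variables), each paired with
# the sentence it was extracted from. All names and sentences are invented.
CANDIDATES = {
    "spouse(Ann, Bob)": "Ann and her husband Bob hosted the dinner.",
    "spouse(Bob, Ann)": "Ann and her husband Bob hosted the dinner.",
    "spouse(Ann, Cat)": "Ann spoke with Cat after the talk.",
}

# Unary rules: (name, feature over the sentence, weight).
# The weights are assumptions chosen by hand for this sketch.
UNARY_RULES = [
    ("mentions_wife_or_husband", lambda s: "wife" in s or "husband" in s, 2.0),
    ("mentions_spoke_with", lambda s: "spoke with" in s, -1.0),
]

# Pairwise rule: the two directions of a symmetric relation should agree.
PAIRWISE_RULES = [("spouse(Ann, Bob)", "spouse(Bob, Ann)", 1.5)]


def log_weight(world):
    """Sum the weights of every rule that is satisfied in this world."""
    total = 0.0
    for fact, is_true in world.items():
        if is_true:
            for _name, feature, weight in UNARY_RULES:
                if feature(CANDIDATES[fact]):
                    total += weight
    for left, right, weight in PAIRWISE_RULES:
        if world[left] == world[right]:
            total += weight
    return total


def marginals():
    """Compute the marginal probability of each fact by exact enumeration."""
    facts = list(CANDIDATES)
    partition = 0.0
    mass = {fact: 0.0 for fact in facts}
    for values in itertools.product([False, True], repeat=len(facts)):
        world = dict(zip(facts, values))
        unnormalized = math.exp(log_weight(world))
        partition += unnormalized
        for fact in facts:
            if world[fact]:
                mass[fact] += unnormalized
    return {fact: mass[fact] / partition for fact in facts}


if __name__ == "__main__":
    for fact, probability in marginals().items():
        print(f"{fact}: {probability:.3f}")
```

In this sketch the person writing rules only edits `UNARY_RULES` and `PAIRWISE_RULES`; the inference code never changes. That separation is the point being illustrated, under the assumptions stated above.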

Index Terms

DeepDive: declarative knowledge base construction
1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features

Published In

Communications of the ACM Volume 60, Issue 5

May 2017

101 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/3084186

Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed

Funding Sources

Google
Toshiba
Gordon and Betty Moore Foundation
National Science Foundation
Defense Advanced Research Projects Agency
Office of Naval Research
Alfred P. Sloan Foundation
American Family Insurance
National Institutes of Health
