DeepDive: declarative knowledge base construction

Published: 24 April 2017

Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks, while not requiring users to write any probabilistic inference algorithms directly. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.
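To make the idea of "rules in, probabilities out" concrete, the following is a minimal sketch and not DeepDive's API or implementation. It assumes candidate facts are Boolean variables, that each user-supplied feature or rule is a weighted factor over them, and that marginal probabilities are estimated by Gibbs sampling. The variable names, rules, and weights are all hypothetical; a real system would learn the weights from data.

```python
import itertools
import math
import random

# Hypothetical candidate facts, modeled as Boolean random variables.
VARIABLES = ["spouse(A,B)", "spouse(B,A)", "sibling(A,B)"]

# Each factor is (weight, predicate over an assignment dict).
# The weights are made up for illustration only.
FACTORS = [
    # Feature: a phrase such as "and his wife" links A and B in the text.
    (1.5, lambda a: a["spouse(A,B)"]),
    # Rule: the spouse relation is symmetric.
    (2.0, lambda a: a["spouse(A,B)"] == a["spouse(B,A)"]),
    # Rule: spouses are not siblings.
    (3.0, lambda a: not (a["spouse(A,B)"] and a["sibling(A,B)"])),
]


def score(assignment):
    """Unnormalized log-probability: total weight of satisfied factors."""
    return sum(w for w, holds in FACTORS if holds(assignment))


def gibbs_marginals(num_samples=20000, burn_in=1000, seed=0):
    """Estimate P(variable = True) for every variable by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in VARIABLES}
    counts = {v: 0 for v in VARIABLES}
    for step in range(burn_in + num_samples):
        for v in VARIABLES:
            # Resample v conditioned on the current values of the others.
            state[v] = True
            s_true = score(state)
            state[v] = False
            s_false = score(state)
            p_true = 1.0 / (1.0 + math.exp(s_false - s_true))
            state[v] = rng.random() < p_true
        if step >= burn_in:
            for v in VARIABLES:
                counts[v] += state[v]
    return {v: counts[v] / num_samples for v in VARIABLES}


def exact_marginals():
    """Brute-force marginals; feasible only for a toy model like this one."""
    totals = {v: 0.0 for v in VARIABLES}
    z = 0.0
    for values in itertools.product([False, True], repeat=len(VARIABLES)):
        a = dict(zip(VARIABLES, values))
        p = math.exp(score(a))
        z += p
        for v in VARIABLES:
            if a[v]:
                totals[v] += p
    return {v: totals[v] / z for v in VARIABLES}


if __name__ == "__main__":
    sampled = gibbs_marginals()
    exact = exact_marginals()
    for v in VARIABLES:
        print(f"{v}: sampled={sampled[v]:.3f} exact={exact[v]:.3f}")
```

In this sketch the person writing the rules only edits `FACTORS`; the sampler is generic and never changes, which is the division of labor the abstract describes. The brute-force check is there only to show that the sampled marginals agree with exact inference on a model small enough to enumerate.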

Published In

Communications of the ACM, Volume 60, Issue 5
May 2017
101 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3084186
Editor: Moshe Y. Vardi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed
