Article
DOI: 10.1145/860435.860479

Table extraction using conditional random fields

Published: 28 July 2003

Abstract

The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content, however, presents difficulties for traditional language modeling techniques. This paper presents the use of conditional random fields (CRFs) for table extraction and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout and language features, and as a result they perform significantly better. We show experimental results on plain-text government statistical reports in which tables are located with 92% F1, and their constituent lines are classified into 12 table-related categories with 94% accuracy. We also discuss future work on undirected graphical models for segmenting columns, finding cells, and classifying them as data cells or label cells.
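
To make the idea of labeling lines with overlapping layout and content features concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified three-label set (the paper uses 12 categories), three hypothetical binary features, and hand-set weights standing in for parameters that a CRF would learn from data. Decoding uses the standard Viterbi algorithm for linear-chain models; all names (`line_features`, `label_lines`, `STATE_W`, `TRANS_W`) are illustrative.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical three-label set; the paper itself uses 12 line categories.
LABELS = ["NONTABLE", "HEADER", "DATAROW"]

# Hand-set illustrative weights (assumptions: not learned, not from the paper).
STATE_W: Dict[str, Dict[str, float]] = {
    "NONTABLE": {"mostly_alpha": 1.5, "has_gaps": -2.0, "mostly_digits": -1.0},
    "HEADER": {"mostly_alpha": 1.0, "has_gaps": 2.0, "mostly_digits": -2.0},
    "DATAROW": {"mostly_alpha": -1.0, "has_gaps": 2.0, "mostly_digits": 2.0},
}
TRANS_W: Dict[Tuple[str, str], float] = {
    ("NONTABLE", "NONTABLE"): 1.0,
    ("NONTABLE", "HEADER"): 0.5,
    ("HEADER", "DATAROW"): 1.0,
    ("DATAROW", "DATAROW"): 1.0,
    ("DATAROW", "NONTABLE"): 0.5,
    ("NONTABLE", "DATAROW"): -1.0,
    ("HEADER", "HEADER"): -1.0,
    ("HEADER", "NONTABLE"): -1.0,
    ("DATAROW", "HEADER"): -1.0,
}


def line_features(line: str) -> Dict[str, float]:
    """Binary layout and content features; several may fire on the same line."""
    chars = [c for c in line if not c.isspace()]
    n = max(len(chars), 1)
    feats: Dict[str, float] = {}
    # Layout cue: two or more runs of 2+ spaces suggest aligned columns.
    if len(re.findall(r" {2,}", line.strip())) >= 2:
        feats["has_gaps"] = 1.0
    # Content cues: share of digits and letters among non-space characters.
    if sum(c.isdigit() for c in chars) / n >= 0.3:
        feats["mostly_digits"] = 1.0
    if sum(c.isalpha() for c in chars) / n >= 0.8:
        feats["mostly_alpha"] = 1.0
    return feats


def state_score(label: str, feats: Dict[str, float]) -> float:
    """Weighted sum of the features that fire for this label."""
    return sum(STATE_W[label].get(f, 0.0) * v for f, v in feats.items())


def label_lines(lines: List[str]) -> List[str]:
    """Viterbi decoding: highest-scoring label sequence under the weights above."""
    if not lines:
        return []
    feats = [line_features(line) for line in lines]
    best = [{y: state_score(y, feats[0]) for y in LABELS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(lines)):
        cur: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS_W.get((p, y), 0.0))
            cur[y] = best[-1][prev] + TRANS_W.get((prev, y), 0.0) + state_score(y, feats[t])
            ptr[y] = prev
        best.append(cur)
        back.append(ptr)
    # Follow back-pointers from the best final label to recover the path.
    y = max(LABELS, key=lambda label: best[-1][label])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


if __name__ == "__main__":
    report = [
        "Crop production rose sharply in the northern region.",
        "State        Acres      Yield",
        "Iowa         1200       45.2",
        "Ohio          980       41.7",
        "These figures are preliminary estimates.",
    ]
    for label, text in zip(label_lines(report), report):
        print(f"{label:9s} {text}")
```

Note that the header line fires both `has_gaps` and `mostly_alpha` at once. Scoring a line with several non-independent features like this is straightforward in a conditional model, whereas a generative HMM would have to model how those features are jointly emitted.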

Published In

SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
July 2003
490 pages
ISBN: 1581136463
DOI: 10.1145/860435
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. conditional random fields
  2. hidden Markov models
  3. information extraction
  4. metadata
  5. question answering
  6. tables

Qualifiers

  • Article

Conference

SIGIR '03

Acceptance Rates

SIGIR '03 paper acceptance rate: 46 of 266 submissions (17%)
Overall acceptance rate: 792 of 3,983 submissions (20%)
