DOI: 10.5555/1182635.1164185

Efficiently linking text documents with relevant structured information

Published: 01 September 2006

Abstract

Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of interlinking critical business information distributed across structured and unstructured data sources. We present a novel system, called EROCS, for linking a given text document with relevant structured data. EROCS views the structured data as a predefined set of "entities" and identifies the entities that best match the given document. EROCS also embeds the identified entities in the document, effectively creating links between the structured data and segments within the document. Unlike prior approaches, EROCS identifies such links even when the relevant entity is not explicitly mentioned in the document. EROCS uses an efficient algorithm that performs this task while keeping the amount of information retrieved from the database to a minimum. Our evaluation shows that EROCS achieves high accuracy with reasonable overheads.
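
To make the idea of matching document segments against a predefined set of entities concrete, here is a minimal sketch. It is not the EROCS algorithm: the abstract does not specify how entities are scored or how database retrieval is minimized. The sketch assumes each entity is available as a bag of terms drawn from its database context, treats sentences as segments, and scores by a simple inverse-frequency term overlap. The entity ids, the sample data, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_weights(entities):
    """Weight each term by log(N / n_t), where n_t counts the entities containing it."""
    n = len(entities)
    counts = Counter(t for terms in entities.values() for t in set(terms))
    return {t: math.log(n / c) for t, c in counts.items()}


def best_entity(segment, entities, weights):
    """Return (entity_id, score) for the entity with the highest weighted term overlap."""
    seg_terms = set(tokenize(segment))
    best_id, best_score = None, 0.0
    for entity_id, terms in entities.items():
        score = sum(weights[t] for t in seg_terms & set(terms))
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id, best_score


def link_document(document, entities):
    """Split the document into sentences and attach the best-matching entity to each."""
    weights = term_weights(entities)
    segments = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [(s, *best_entity(s, entities, weights)) for s in segments]


# Hypothetical entities: each id maps to terms from its (assumed) database context.
entities = {
    "txn-1001": tokenize("Alice Smith laptop Austin store"),
    "txn-1002": tokenize("Bob Jones printer Boston store"),
}
document = "Alice called about the laptop she bought in Austin. The weather was nice."

for segment, entity_id, score in link_document(document, entities):
    print(entity_id, round(score, 3), "|", segment)
```

In this toy setup the first sentence is linked to an entity whose identifier never appears in the text, because the match comes from the entity's context terms; the second sentence shares no terms with any entity and is left unlinked. A term shared by every entity (here "store") gets zero weight, so it cannot drive a match.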

Published In

VLDB '06: Proceedings of the 32nd international conference on Very large data bases
September 2006
1269 pages

Sponsors

  • SIGMOD: ACM Special Interest Group on Management of Data
  • K.I.S.S. SIG on Databases
  • AJU Information Technology Co., Ltd
  • US Army ITC-PAC Asian Research Office
  • Google Inc.
  • The Database Society of Japan
  • Samsung SOS
  • Advanced Information Technology Research Center
  • Naver
  • Microsoft
  • Korea Info Sci Society: Korea Information Science Society
  • SK telecom
  • Systems Applications Products
  • ORACLE
  • International Business Management
  • Air Force Office of Scientific Research/Asian Office of Aerospace R&D
  • Kosef
  • Kaist
  • LG Electronics
  • CCF-DBS

Publisher

VLDB Endowment

Cited By

  • (2017) Textual aggregation approaches in OLAP context. International Journal of Information Management: The Journal for Information Professionals 37:6 (684-692). DOI: 10.1016/j.ijinfomgt.2017.06.005. Online publication date: 1-Dec-2017
  • (2013) Text-Driven Multi-structured Data Analytics for Enterprise Intelligence. Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 03 (213-220). DOI: 10.1109/WI-IAT.2013.186. Online publication date: 17-Nov-2013
  • (2012) Associating structured records to text documents. Proceedings of the 21st International Conference on World Wide Web (451-452). DOI: 10.1145/2187980.2188072. Online publication date: 16-Apr-2012
  • (2012) Targeted disambiguation of ad-hoc, homogeneous sets of named entities. Proceedings of the 21st international conference on World Wide Web (719-728). DOI: 10.1145/2187836.2187934. Online publication date: 16-Apr-2012
  • (2011) Matching unstructured product offers to structured product specifications. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (404-412). DOI: 10.1145/2020408.2020474. Online publication date: 21-Aug-2011
  • (2011) Toward total business intelligence incorporating structured and unstructured data. Proceedings of the 2nd International Workshop on Business intelligencE and the WEB (12-19). DOI: 10.1145/1966883.1966890. Online publication date: 25-Mar-2011
  • (2010) Evaluating evidences for keyword query disambiguation in entity centric database search. Proceedings of the 21st international conference on Database and expert systems applications: Part II (240-247). DOI: 10.5555/1887568.1887594. Online publication date: 30-Aug-2010
  • (2010) DivQ. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (331-338). DOI: 10.1145/1835449.1835506. Online publication date: 19-Jul-2010
  • (2010) Graph-based concept identification and disambiguation for enterprise search. Proceedings of the 19th international conference on World wide web (171-180). DOI: 10.1145/1772690.1772709. Online publication date: 26-Apr-2010
  • (2010) Text-to-query. Proceedings of the 2010 EDBT/ICDT Workshops (1-8). DOI: 10.1145/1754239.1754255. Online publication date: 22-Mar-2010

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media