DOI: 10.1145/2513549.2514739
keynote

Big data opportunities and challenges for IR, text mining and NLP

Published: 28 October 2013

Abstract

Big Data poses challenges for text analysis and natural language processing because of the volume, veracity, and velocity of the data. The sheer volume, measured in numbers of documents, strains traditional local repository and index systems built for large-scale analysis and mining. Computation, storage, and data representation must work together to provide rapid access to, search over, and mining of the deep knowledge in a large text collection. Text under copyright poses additional barriers to computational access, because analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data, particularly when data veracity is questionable because of the age of the original materials. Data velocity is the rate of change of the data, but it can also be the rate at which changes and corrections are made.
The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP, and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards a digital library of content from research libraries around the country. With close to 11 million volumes in the HathiTrust collection, HTRC aims to provide large-scale computational access to, and analytics over, these text resources.
With the goal of facilitating scholars' work, HTRC establishes a cyberinfrastructure of software, staff, and services to help researchers and developers process and mine large-scale textual data effectively and efficiently. The primary users of HTRC are digital humanists, informatics researchers, and librarians. They come from different research backgrounds and bring different expertise, so a variety of tools are made available to them.
In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected; a side benefit is that the paradigm frees scholars from having to manage a large corpus of data themselves.
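The idea of moving computation to the data can be illustrated with a minimal Python sketch. This is not HTRC's actual API: the `ProtectedCorpus` class, its `run` method, and the `word_counts` analysis are hypothetical names chosen for illustration. The sketch assumes the caller submits a function, the function executes next to the text, and only derived results (here, aggregate word counts) are returned.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


class ProtectedCorpus:
    """Hypothetical stand-in for a corpus that never leaves its host."""

    def __init__(self, volumes: Dict[str, str]) -> None:
        # Full text is held privately; callers get no accessor for it.
        self._volumes = volumes

    def run(self, analysis: Callable[[Iterable[str]], dict]) -> dict:
        """Run a caller-supplied analysis next to the data and
        return only what the analysis derives from the text."""
        return analysis(self._volumes.values())


def word_counts(texts: Iterable[str]) -> dict:
    """Example analysis: aggregate token counts across all volumes."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return dict(counts)


# Toy data standing in for protected volumes (an assumption, not real content).
corpus = ProtectedCorpus({
    "vol1": "the whale and the sea",
    "vol2": "the sea at night",
})
result = corpus.run(word_counts)
```

In this arrangement the researcher writes only the analysis function; storage and access to the volumes stay with the corpus host. A real service would also have to vet what an analysis is allowed to return, which this sketch does not attempt.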
The text analytics currently supported in HTRC come from the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag cloud, topic modeling, Naive Bayes, and Date Entities to Simile Timeline.
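As a rough illustration of what a tag cloud analysis computes, the following Python sketch weights terms by relative frequency. It is not the SEASR workflow: the `tag_cloud` function, the tiny `STOPWORDS` set, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stopword list (an assumption, not SEASR's list).
STOPWORDS = {"the", "a", "of", "and", "in", "to"}


def tag_cloud(text: str, top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n terms, each weighted relative to the most frequent term."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens).most_common(top_n)
    if not counts:
        return []
    peak = counts[0][1]  # frequency of the most common term
    return [(term, freq / peak) for term, freq in counts]


cloud = tag_cloud(
    "The library of the future: library data, library services, and data access."
)
```

The weights could then drive font sizes in a rendered cloud; rendering is left out of the sketch.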
In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a Big Data corpus and what they mean for data storage, access, and large-scale computation.
HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for NLP, text mining, and IR research: with so much textual data, many candidate algorithms, and support for researcher-contributed algorithms, many interesting research questions emerge and many interesting results are likely to follow.

Published In

UnstructureNLP '13: Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing
October 2013
74 pages
ISBN: 9781450324151
DOI: 10.1145/2513549
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data access
  2. hathitrust
  3. information retrieval
  4. nlp
  5. text mining and analysis

Qualifiers

  • Keynote

Conference

CIKM'13

Acceptance Rates

UnstructureNLP '13 Paper Acceptance Rate: 9 of 12 submissions, 75%
Overall Acceptance Rate: 9 of 12 submissions, 75%

Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 3
Reflects downloads up to 16 Nov 2024

Cited By

  • (2021) CADRE. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4283-4292. DOI: 10.1145/3459637.3481898. Online publication date: 26-Oct-2021
  • (2019) Big Data and Libraries: Identifying Themes in the Literature. Internet Reference Services Quarterly, pp. 1-26. DOI: 10.1080/10875301.2018.1524337. Online publication date: 12-Jan-2019
  • (2018) Big Data Quality Assessment Model for Unstructured Data. 2018 International Conference on Innovations in Information Technology (IIT), pp. 69-74. DOI: 10.1109/INNOVATIONS.2018.8605945. Online publication date: Nov-2018
  • (2016) We Have Good Information for You. Business Intelligence, pp. 160-179. DOI: 10.4018/978-1-4666-9562-7.ch008. Online publication date: 2016
  • (2015) We Have Good Information for You. The Evolution of the Internet in the Business Sector, pp. 191-210. DOI: 10.4018/978-1-4666-7262-8.ch009. Online publication date: 2015
  • (2014) Revolutionary entities: Turning data into knowledge to drive personalized exploration of The Irish rising of 1916. 2014 IEEE International Conference on Big Data (Big Data), pp. 32-38. DOI: 10.1109/BigData.2014.7004450. Online publication date: Oct-2014
