Abstract
Big Data poses challenges for text analysis and natural language processing because of the volume, veracity, and velocity of the data. The sheer volume, in terms of the number of documents, challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage, and data representation must work together to provide rapid access, search, and mining of the deep knowledge in a large text collection. Text under copyright poses additional barriers to computational access, where analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data, particularly when data veracity is questionable due to the age of the original materials. Data velocity is the rate of change of the data, but it can also be the rate at which changes and corrections are made.
The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP, and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards a digital library of content from research libraries around the country. With close to 11 million volumes in the HathiTrust collection, HTRC aims to provide large-scale computational access and analytics over these text resources.
With the goal of facilitating scholars' work, HTRC establishes a cyberinfrastructure of software, staff, and services that helps researchers and developers process and mine large-scale textual data more easily, effectively, and efficiently. The primary users of HTRC are digital humanities scholars, informatics researchers, and librarians. Because they come from different research backgrounds and bring different expertise, a variety of tools is made available to them.
In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected; a side benefit is that the paradigm frees scholars from worrying about managing a large corpus of data.
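To make the algorithm-to-data paradigm concrete, the following is a minimal Python sketch of a non-consumptive analysis function. The volume ID, the page reader, and the function names are hypothetical stand-ins, not part of any actual HTRC interface; the point is only that derived features, never the copyrighted page text, leave the secure environment.

    import re
    from collections import Counter

    def read_volume_pages(volume_id):
        # Stand-in for a secure, enclave-side page reader; a real
        # deployment would stream pages from protected storage.
        return ["Sample page text, used here for demonstration only."]

    def extract_token_counts(volume_id):
        # Aggregate token counts per volume. Only these counts are
        # returned to the researcher; the raw text stays behind.
        counts = Counter()
        for page in read_volume_pages(volume_id):
            counts.update(re.findall(r"[a-z]+", page.lower()))
        return dict(counts)

    print(extract_token_counts("demo.volume.001"))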
The text analytics currently supported in HTRC come from the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag clouds, topic modeling, Naive Bayes classification, and Date Entities to Simile Timeline.
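The sketch below is only an illustration of what one step from this list, topic modeling, might look like; it is written in Python with scikit-learn rather than as a SEASR workflow, and the tiny corpus and parameters are invented for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Invented four-document corpus standing in for a volume collection.
    docs = [
        "the whale hunt and the sea voyage",
        "parliament passed the trade act",
        "the ship sailed across the sea",
        "the act regulates colonial trade",
    ]

    # Build a term-count matrix, then fit a two-topic LDA model.
    vectorizer = CountVectorizer(stop_words="english")
    term_matrix = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(term_matrix)

    # Print the top-weighted terms for each discovered topic.
    terms = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[::-1][:4]]
        print("topic %d: %s" % (i, ", ".join(top)))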
In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a Big Data corpus and what those challenges mean for data storage, access, and large-scale computation.
HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for NLP, text mining, and IR research: with such a large amount of textual data, many candidate algorithms, and support for researcher-contributed algorithms, many interesting research questions emerge and many interesting results are sure to follow.