DOI: 10.1145/3219104.3219153
research-article

A Computational Notebook Approach to Large-scale Text Analysis: Balancing Accessibility with Scalability

Published: 22 July 2018

Abstract

Large-scale text analysis algorithms are important to many fields as they interrogate reams of textual data to extract evidence, correlations, and trends not readily discoverable by a human reader. Unfortunately, there is often an expertise mismatch between computational researchers who have the technical and programming skills necessary to develop workflows at scale and domain scholars who have knowledge of the literary, historical, scientific, or social factors that can affect data as it is manipulated. Our work focuses on the use of scalable computational notebooks as a model to bridge the accessibility gap for domain scholars, putting the power of HPC resources directly in the hands of the researchers who have scholarly questions. The computational notebook approach offers many benefits, including fine-grained control through modularized functions, interactive analysis that puts the "human in the loop", scalable analysis that leverages Spark-as-a-Service, and complexity-hiding interfaces that minimize the need for HPC expertise. In addition, the notebook approach makes it easy to share, reproduce, and sustain research workflows. We illustrate the applicability of our approach with usage scenarios on HPC systems as well as within a restricted computing environment for accessing sensitive, in-copyright data. We demonstrate the usefulness of the notebook approach with three examples from three different domains and data sources: historical topic trends in ten thousand scientific articles, sentiment analysis of tweets, and literary analysis of the copyrighted works of Kurt Vonnegut using non-consumptive techniques.


Published In

PEARC '18: Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity
July 2018
652 pages
ISBN:9781450364461
DOI:10.1145/3219104

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HPC
  2. Spark
  3. computational notebook
  4. interactive analysis
  5. scalability
  6. text analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '18

Acceptance Rates

PEARC '18 Paper Acceptance Rate 79 of 123 submissions, 64%;
Overall Acceptance Rate 133 of 202 submissions, 66%
