A Computational Notebook Approach to Large-scale Text Analysis: Balancing Accessibility with Scalability
Article No.: 35, Pages 1 - 8
Abstract
Large-scale text analysis algorithms are important to many fields as they interrogate reams of textual data to extract evidence, correlations, and trends not readily discoverable by a human reader. Unfortunately, there is often an expertise mismatch between computational researchers, who have the technical and programming skills necessary to develop workflows at scale, and domain scholars, who have knowledge of the literary, historical, scientific, or social factors that can affect data as it is manipulated. Our work focuses on the use of scalable computational notebooks as a model to bridge the accessibility gap for domain scholars, putting the power of HPC resources directly in the hands of the researchers who have scholarly questions. The computational notebook approach offers many benefits, including fine-grained control through modularized functions, interactive analysis that puts the "human in the loop", scalable analysis that leverages Spark-as-a-Service, and complexity-hiding interfaces that minimize the need for HPC expertise. In addition, the notebook approach makes it easy to share, reproduce, and sustain research workflows. We illustrate the applicability of our approach with usage scenarios on HPC systems as well as within a restricted computing environment for accessing sensitive, in-copyright data, and demonstrate its usefulness with three examples from three different domains and data sources: historical topic trends in ten thousand scientific articles, sentiment analysis of tweets, and literary analysis of the copyrighted works of Kurt Vonnegut using non-consumptive techniques.
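To make the idea of "modularized functions" in a notebook more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a workflow split into small steps (tokenize, count, merge) that a scholar could inspect or swap one cell at a time. All function names here are hypothetical, and the map/reduce structure is written in plain Python; in a Spark-backed notebook the same two steps would typically be handed to the cluster instead of run locally.

```python
from collections import Counter
from functools import reduce
import re


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_terms(tokens, stopwords=frozenset()):
    """Map step: term counts for one document, skipping stopwords."""
    return Counter(t for t in tokens if t not in stopwords)


def merge_counts(left, right):
    """Reduce step: combine two partial term counts."""
    return left + right


def corpus_term_counts(docs, stopwords=frozenset()):
    """Run the map and reduce steps over an iterable of documents."""
    per_doc = (count_terms(tokenize(d), stopwords) for d in docs)
    return reduce(merge_counts, per_doc, Counter())


# Hypothetical two-document corpus, used only for illustration.
docs = ["The cat sat.", "The cat ran; the dog sat."]
counts = corpus_term_counts(docs, stopwords=frozenset({"the"}))
print(counts.most_common(2))
```

Because only aggregate counts leave `corpus_term_counts`, a pipeline of this shape also hints at how derived statistics, rather than the underlying text, can be the output of an analysis over restricted material.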
Information
Published In
Copyright © 2018 ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 22 July 2018
Qualifiers
- Research-article
- Research
- Refereed limited
Conference
PEARC '18
PEARC '18: Practice and Experience in Advanced Research Computing
July 22 - 26, 2018
Pittsburgh, PA, USA
Acceptance Rates
PEARC '18 Paper Acceptance Rate 79 of 123 submissions, 64%;
Overall Acceptance Rate 133 of 202 submissions, 66%