DOI: 10.1145/3539781.3539795
research-article

Toward a big data analysis system for historical newspaper collections research

Published: 12 July 2022

Abstract

The availability and generation of digitized newspaper collections have provided researchers in several domains with a powerful tool to advance their research. More specifically, digitized historical newspapers give us a magnifying glass into the past. In this paper, we propose a scalable and customizable big data analysis system that enables researchers to study complex questions about our society as depicted in news media for the past few centuries by applying cutting-edge text analysis tools to large historical newspaper collections. We discuss our experience with building a preliminary version of such a system, including how we have addressed the following challenges: processing millions of digitized newspaper pages from various publications worldwide, which amount to hundreds of terabytes of data; applying article segmentation and Optical Character Recognition (OCR) to historical newspapers, which vary between and within publications over time; retrieving relevant information to answer research questions from such data collections by applying human-in-the-loop machine learning; and enabling users to analyze topic evolution and semantic dynamics with multiple compatible analysis operators. We also present some preliminary results of using the proposed system to study the social construction of juvenile delinquency in the United States and discuss important remaining challenges to be tackled in the future.
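As a rough illustration of the kind of topic-evolution operator the abstract mentions, the sketch below tracks how often a term appears per decade in a small set of article texts. It is a minimal sketch under stated assumptions, not the system described in the paper: the function name `term_share_by_decade`, the `(year, text)` input format, and the toy articles are all hypothetical.

```python
import re
from collections import defaultdict


def term_share_by_decade(articles, term):
    """Return {decade: fraction of tokens equal to `term`}.

    `articles` is an iterable of (year, text) pairs. This input format is
    an assumption made for illustration only.
    """
    term = term.lower()
    hits = defaultdict(int)    # occurrences of `term` per decade
    totals = defaultdict(int)  # total token count per decade
    for year, text in articles:
        decade = (year // 10) * 10
        # Naive tokenizer: lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(term)
    # Skip decades with no tokens to avoid dividing by zero.
    return {d: hits[d] / totals[d] for d in sorted(totals) if totals[d]}


# Hypothetical toy corpus, not data from the paper.
articles = [
    (1895, "The reformatory received three boys this week"),
    (1898, "Boys sent to the reformatory by the court"),
    (1921, "Juvenile court hears delinquency cases"),
    (1925, "Delinquency among city youth discussed at meeting"),
]
shares = term_share_by_decade(articles, "delinquency")
for decade, share in shares.items():
    print(decade, round(share, 3))
```

Normalizing by the total token count per decade, rather than reporting raw counts, keeps decades with more digitized pages from dominating the trend. A real collection would also need to account for OCR errors, which this sketch ignores.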

Cited By

  • (2024) Understanding the social construction of juvenile delinquency: insights from semantic analysis of big-data historical newspaper collections. Journal of Computational Social Science. DOI: 10.1007/s42001-024-00254-x. Online publication date: 11-May-2024.
  • (2024) Detection of Punjabi Newspaper Articles Using a Deep Learning Approach. Innovations in Electrical and Electronic Engineering, 409-418. DOI: 10.1007/978-981-99-8661-3_30. Online publication date: 16-Feb-2024.

Published In

PASC '22: Proceedings of the Platform for Advanced Scientific Computing Conference
June 2022
181 pages
ISBN: 9781450394109
DOI: 10.1145/3539781
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

  • CSCS: Swiss National Supercomputing Centre

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data analysis system
  2. data visualization
  3. historical newspapers
  4. image analysis
  5. information retrieval
  6. juvenile delinquency
  7. natural language processing
  8. newspaper article segmentation
  9. social construction
  10. social science research
  11. text analysis

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation

Conference

PASC '22

Acceptance Rates

PASC '22 Paper Acceptance Rate 17 of 22 submissions, 77%;
Overall Acceptance Rate 109 of 221 submissions, 49%

Article Metrics

  • Downloads (last 12 months): 74
  • Downloads (last 6 weeks): 7

Reflects downloads up to 26 Sep 2024
