Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features

Robert B. Allen &
Catherine Hall

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6102)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported only by basic search services. We are exploring interactive services for these collections that would be useful for supporting access, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but many clues can be used to improve its accuracy. Here, we describe observations of several historical newspapers to determine the characteristics of sections. We then explore how to automatically identify those sections and how to detect serialized feature articles that are repeated across days and weeks. The goal is not the introduction of new algorithms but the development of practical and robust techniques. For both analyses we find substantial success for some categories and articles, but others prove very difficult.

Author information

Authors and Affiliations

The iSchool at Drexel University, 3141 Chestnut Street, Philadelphia, PA, 19104, USA
Robert B. Allen & Catherine Hall

Editor information

Editors and Affiliations

University of Technology, Sydney, PO Box 123, 2007, Broadway, NSW, Australia
Gobinda Chowdhury
Nanyang Technological University, 31 Nanyang Link, 637718, Singapore
Chris Koo
The University of Queensland, Brisbane, QLD 4072, Australia
Jane Hunter

About this paper

Cite this paper

Allen, R.B., Hall, C. (2010). Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features. In: Chowdhury, G., Koo, C., Hunter, J. (eds) The Role of Digital Libraries in a Time of Global Change. ICADL 2010. Lecture Notes in Computer Science, vol 6102. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13654-2_11

DOI: https://doi.org/10.1007/978-3-642-13654-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13653-5
Online ISBN: 978-3-642-13654-2
eBook Packages: Computer Science; Computer Science (R0)
