arXiv:2401.07290 (cs)

[Submitted on 14 Jan 2024]

Title: Optimizing a Data Science System for Text Reuse Analysis

Authors: Ananth Mahadevan, Michael Mathioudakis, Eetu Mäkelä, Mikko Tolonen

Abstract: Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that re-appear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas. Large modern digitized corpora enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns that are impossible to detect with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks.
In this paper, we share insights from ReceptionReader, a system for analyzing text reuse in large historical corpora. The system is built upon billions of instances of text reuse from large digitized corpora of 18th-century texts. Its main functionality is to perform downstream text reuse analysis tasks, such as finding reuses that stem from a given article or identifying the most reused quotes from a set of documents, with each task expressed as a database query. We discuss the related design choices, including various database normalization levels and query execution frameworks: distributed data processing (Apache Spark), an indexed row store engine (MariaDB Aria), and a compressed column store engine (MariaDB Columnstore). Moreover, we present an extensive evaluation with various metrics of interest (latency, storage size, and computing costs) for varying workloads, and we offer insights from the trade-offs we observed and the choices that emerged as optimal in our setting. In summary, our results show that (1) for the workloads that are most relevant to text-reuse analysis, the MariaDB Aria framework emerges as the overall optimal choice, and (2) big data processing (Apache Spark) is irreplaceable for all processing stages of the system's pipeline.

Comments:	Early Draft
Subjects:	Databases (cs.DB)
Cite as:	arXiv:2401.07290 [cs.DB]
	(or arXiv:2401.07290v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2401.07290

Submission history

From: Ananth Mahadevan
[v1] Sun, 14 Jan 2024 13:51:37 UTC (20,864 KB)
