DOI: 10.1145/3589806.3600043

Fingerprinting and Building Large Reproducible Datasets

Published: 28 June 2023

Abstract

Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, limitations related to data sources that change over time (e.g., code bases) and the lack of documentation of extraction processes make it difficult to reproduce datasets over time. This threatens the quality and reproducibility of empirical studies.
In this paper, we propose a tool-supported approach that facilitates the creation of large, tailored datasets while ensuring their reproducibility. We leverage all the sources feeding the Software Heritage append-only archive, which are accessible through a unified programming interface, to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint that characterizes a dataset and that, when provided to the extraction process, ensures that the same dataset will be extracted.
We demonstrate the feasibility of our approach by implementing a prototype. We show how it can help reduce the limitations researchers face when creating or reproducing datasets.
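
The abstract does not spell out how such a fingerprint could be computed. The sketch below is a minimal, hypothetical illustration rather than the paper's actual mechanism: it assumes a dataset is fully characterized by a specification (selection criteria, extraction parameters, and an identifier pinning the state of the archive) and derives the fingerprint by hashing a canonical serialization of that specification. The DatasetSpec fields and the snapshot identifier are illustrative assumptions, not names from the paper or from the Software Heritage API.

```python
# Minimal sketch of dataset fingerprinting, assuming a dataset is fully
# described by a specification: what to select, with which parameters, and
# against which archive state. All field names here are hypothetical.
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class DatasetSpec:
    """Illustrative description of a dataset extraction request."""
    archive_snapshot: str  # identifier pinning the archive state (assumed)
    selection_query: str   # criteria used to select origins/artifacts (assumed)
    parameters: dict = field(default_factory=dict)  # extra extraction options


def fingerprint(spec: DatasetSpec) -> str:
    """Return a deterministic fingerprint of a dataset specification.

    Canonical JSON (sorted keys, fixed separators) guarantees that equal
    specifications always serialize, and therefore hash, identically.
    """
    canonical = json.dumps(asdict(spec), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    spec = DatasetSpec(
        archive_snapshot="2023-06-01",  # hypothetical archive-state identifier
        selection_query="origins hosted on GitHub with more than 100 commits",
        parameters={"include_forks": False},
    )
    # Re-running the extraction with the same fingerprinted specification
    # would, by construction, select the same inputs.
    print(fingerprint(spec))
```

Under these assumptions, an extraction process that accepts only fingerprinted specifications and resolves them against an append-only archive would always return the same dataset, which is the reproducibility property the abstract describes.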


Cited By

  • (2024) A Multi-Platform Specification Language and Dataset for the Analysis of DevOps Pipelines. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 264–274. https://doi.org/10.1145/3652620.3686247. Online publication date: 22-Sep-2024.

Published In

ACM REP '23: Proceedings of the 2023 ACM Conference on Reproducibility and Replicability
June 2023
127 pages
ISBN: 9798400701764
DOI: 10.1145/3589806
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset
  2. empirical studies
  3. open science
  4. reproducibility

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACM REP '23

