DOI: 10.1145/3589806.3600043

Fingerprinting and Building Large Reproducible Datasets

Published: 28 June 2023

Abstract

Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, limitations related to data sources that change over time (e.g., code bases) and the lack of documentation of extraction processes make it difficult to reproduce datasets over time. This threatens the quality and reproducibility of empirical studies.
In this paper, we propose a tool-supported approach that facilitates the creation of large, tailored datasets while ensuring their reproducibility. We leverage all the sources feeding the Software Heritage append-only archive, which are accessible through a unified programming interface, to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint that characterizes a dataset and that, when provided to the extraction process, ensures that the same dataset will be extracted.
We demonstrate the feasibility of our approach by implementing a prototype. We show how it can help reduce the limitations researchers face when creating or reproducing datasets.
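
The abstract does not spell out how such a fingerprint could be computed. The sketch below is a minimal, hypothetical illustration rather than the paper's actual mechanism: it assumes a dataset is fully characterized by a specification (selection criteria, extraction parameters, and an identifier pinning the state of the archive) and derives the fingerprint by hashing a canonical serialization of that specification. The DatasetSpec fields and the snapshot identifier are illustrative assumptions, not names from the paper or from the Software Heritage API.

```python
# Minimal sketch of dataset fingerprinting, assuming a dataset is fully
# described by a specification: what to select, with which parameters, and
# against which archive state. All field names here are hypothetical.
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class DatasetSpec:
    """Illustrative description of a dataset extraction request."""
    archive_snapshot: str  # identifier pinning the archive state (assumed)
    selection_query: str   # criteria used to select origins/artifacts (assumed)
    parameters: dict = field(default_factory=dict)  # extra extraction options


def fingerprint(spec: DatasetSpec) -> str:
    """Return a deterministic fingerprint of a dataset specification.

    Canonical JSON (sorted keys, fixed separators) guarantees that equal
    specifications always serialize, and therefore hash, identically.
    """
    canonical = json.dumps(asdict(spec), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    spec = DatasetSpec(
        archive_snapshot="2023-06-01",  # hypothetical archive-state identifier
        selection_query="origins hosted on GitHub with more than 100 commits",
        parameters={"include_forks": False},
    )
    # Re-running the extraction with the same fingerprinted specification
    # would, by construction, select the same inputs.
    print(fingerprint(spec))
```

Under these assumptions, an extraction process that accepts only fingerprinted specifications and resolves them against an append-only archive would always return the same dataset, which is the reproducibility property the abstract describes.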


Cited By

  • (2024) A Multi-Platform Specification Language and Dataset for the Analysis of DevOps Pipelines. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 264–274. https://doi.org/10.1145/3652620.3686247. Online publication date: 22-Sep-2024.

Published In

ACM REP '23: Proceedings of the 2023 ACM Conference on Reproducibility and Replicability
June 2023
127 pages
ISBN: 9798400701764
DOI: 10.1145/3589806
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset
  2. empirical studies
  3. open science
  4. reproducibility

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACM REP '23

