A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Published: 23 October 2020

Abstract

Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow coding standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects suffer from a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions from non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects.
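
To make the "excessive number of parameters and local variables" finding concrete, the following is a minimal sketch of how such functions could be flagged in Python source using only the standard-library `ast` module. It is not the study's tooling: the thresholds `MAX_ARGS = 5` and `MAX_LOCALS = 15` are assumptions chosen to mirror Pylint's default limits, and the helper names (`count_params`, `count_locals`, `flag_functions`) and the `SAMPLE` snippet are hypothetical. The local-variable count is a simplification that also picks up names assigned inside nested functions.

```python
import ast

# Illustrative thresholds (assumptions mirroring Pylint's defaults,
# not values taken from the study).
MAX_ARGS = 5
MAX_LOCALS = 15


def count_params(func):
    """Count every declared parameter of a function definition node."""
    a = func.args
    total = len(a.posonlyargs) + len(a.args) + len(a.kwonlyargs)
    total += 1 if a.vararg else 0   # *args
    total += 1 if a.kwarg else 0    # **kwargs
    return total


def count_locals(func):
    """Count distinct names assigned anywhere inside the function.

    Simplification: names assigned in nested functions are included.
    """
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    return len(names)


def flag_functions(source):
    """Return (name, n_params, n_locals) for functions over either limit."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n_params, n_locals = count_params(node), count_locals(node)
            if n_params > MAX_ARGS or n_locals > MAX_LOCALS:
                flagged.append((node.name, n_params, n_locals))
    return flagged


# Hypothetical input: a training-style function with six parameters.
SAMPLE = '''
def train(x, y, lr, epochs, batch_size, seed):
    loss = 0.0
    return loss

def add(a, b):
    return a + b
'''

print(flag_functions(SAMPLE))
```

Under these assumed thresholds only `train` is reported, because its six parameters exceed the limit of five; `add` passes both checks.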

    Published In

    ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)
    October 2020
    412 pages
    ISBN:9781450375801
    DOI:10.1145/3382494
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Open-source software
    2. code conventions
    3. code quality
    4. code smells
    5. code style
    6. data science
    7. machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ESEM '20

    Acceptance Rates

    ESEM '20 Paper Acceptance Rate 26 of 123 submissions, 21%;
    Overall Acceptance Rate 130 of 594 submissions, 22%
