
Speeding Up Data Manipulation Tasks with Alternative Implementations: An Exploratory Study

Published: 23 July 2021

Abstract

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this article, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved a >10x speedup over its slower alternative. We also characterize 68 recurring alternative API pairs from the extraction results to understand the types of APIs that can be used interchangeably. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. Of the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements, with up to orders-of-magnitude speedup. Finally, we discuss notable challenges of using alternative APIs to optimize data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.
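To make the notion of an alternative implementation pair concrete, the following is a minimal sketch, not taken from the article. It assumes a hypothetical pair chosen from the Python standard library (`sorted` with slicing versus `heapq.nsmallest`), checks input/output equivalence only on a handful of sampled inputs, and measures the relative speed with `timeit`. The helper names `equivalent_on` and `speedup` are illustrative assumptions, not part of the authors' tooling, and the measured ratio will vary by machine and input size.

```python
import heapq
import random
import timeit


def k_smallest_sort(values, k):
    """Alternative A (hypothetical example): sort everything, then slice."""
    return sorted(values)[:k]


def k_smallest_heap(values, k):
    """Alternative B (hypothetical example): heap-based selection."""
    return heapq.nsmallest(k, values)


def equivalent_on(inputs, impl_a, impl_b):
    """Check input/output equivalence on the sampled inputs only.

    This is task-specific evidence, not a proof of general equivalence.
    """
    return all(impl_a(*args) == impl_b(*args) for args in inputs)


def speedup(impl_a, impl_b, args, repeat=3, number=5):
    """Return time(impl_a) / time(impl_b), taking the best of several runs."""
    t_a = min(timeit.repeat(lambda: impl_a(*args), repeat=repeat, number=number))
    t_b = min(timeit.repeat(lambda: impl_b(*args), repeat=repeat, number=number))
    return t_a / t_b


random.seed(0)
data = [random.random() for _ in range(50_000)]
test_inputs = [(data, 10), ([3, 1, 2], 2), ([], 5)]

is_equivalent = equivalent_on(test_inputs, k_smallest_sort, k_smallest_heap)
ratio = speedup(k_smallest_sort, k_smallest_heap, (data, 10))
print(f"equivalent on sampled inputs: {is_equivalent}")
print(f"time(A) / time(B): {ratio:.2f}")
```

A ratio above 1 means alternative B was faster on this input. Rerunning with a different `k` or input size can change the ratio, which is one reason to evaluate equivalence and speed per task rather than in general.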


Cited By

  • (2023) Predicting satisfaction with democracy in Brazil considering data form an opinion survey. Revista Gestão da Produção Operações e Sistemas. 10.15675/gepros.296518. Online publication date: 5-Dec-2023
  • (2023) Inventiveness of Text Extraction with Inspiration of Cloud Computing and ML Using Python Logic. Intelligent Systems Design and Applications. 10.1007/978-3-031-35507-3_25 (248-256). Online publication date: 3-Jun-2023


    Published In

    ACM Transactions on Software Engineering and Methodology, Volume 30, Issue 4
    Continuous Special Section: AI and SE
    October 2021
    613 pages
    ISSN:1049-331X
    EISSN:1557-7392
    DOI:10.1145/3461694
    Editor: Mauro Pezzè
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 July 2021
    Accepted: 01 March 2021
    Revised: 01 March 2021
    Received: 01 September 2020
    Published in TOSEM Volume 30, Issue 4


    Author Tags

    1. API selection
    2. data manipulation
    3. empirical study
    4. mining software repository
    5. performance optimization

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Article Metrics

    • Downloads (last 12 months): 27
    • Downloads (last 6 weeks): 8
    Reflects downloads up to 24 Sep 2024.
