Speeding Up Data Manipulation Tasks with Alternative Implementations: An Exploratory Study

Authors:

Shengchao Qin

ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 30, Issue 4

Article No.: 49, Pages 1 - 28

https://doi.org/10.1145/3456873

Published: 23 July 2021

Abstract

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this article, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved >10x speedup over its slower alternative. We also characterize 68 recurring alternative API pairs from the extraction results to understand the types of APIs that can be used alternatively. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. In the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements with up to orders-of-magnitude speedup. Finally, we discuss notable challenges of using alternative APIs for optimizing data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.
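To make the idea of an "alternative implementation pair" concrete, the following is a minimal sketch and not the article's tooling. It uses only the Python standard library as a stand-in for the data manipulation APIs studied; the two counting functions, the helper names `equivalent_on` and `speedup`, and the sample inputs are all hypothetical choices made for illustration. It checks input/output equivalence on a few sample inputs and then estimates the speedup with `timeit`.

```python
import timeit
from collections import Counter


def count_with_list_count(values):
    """Slower alternative: one full scan of the list per distinct value."""
    return {v: values.count(v) for v in set(values)}


def count_with_counter(values):
    """Faster alternative: a single pass using collections.Counter."""
    return dict(Counter(values))


def equivalent_on(inputs, impl_a, impl_b):
    """Check input/output equivalence on the given sample inputs only."""
    return all(impl_a(x) == impl_b(x) for x in inputs)


def speedup(impl_slow, impl_fast, data, repeat=5, number=3):
    """Ratio of best-of-`repeat` wall-clock times (slow / fast)."""
    t_slow = min(timeit.repeat(lambda: impl_slow(data), repeat=repeat, number=number))
    t_fast = min(timeit.repeat(lambda: impl_fast(data), repeat=repeat, number=number))
    return t_slow / t_fast


# Hypothetical sample inputs; equivalence is only established for these cases.
sample_inputs = [[], [1], [1, 1, 2], list(range(50)) * 4]
data = [i % 200 for i in range(5_000)]

is_equivalent = equivalent_on(sample_inputs, count_with_list_count, count_with_counter)
ratio = speedup(count_with_list_count, count_with_counter, data)
print(f"equivalent on samples: {is_equivalent}, speedup: {ratio:.1f}x")
```

Equivalence checked this way holds only for the sampled inputs, which mirrors the task-specific notion of equivalence in the abstract rather than a general proof. The measured ratio depends on the machine and the input size, so no particular figure should be expected.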

Index Terms

Software and its engineering → Software notations and tools → Software libraries and repositories

Published In

ACM Transactions on Software Engineering and Methodology Volume 30, Issue 4

Continuous Special Section: AI and SE

October 2021

613 pages

ISSN:1049-331X

EISSN:1557-7392

DOI:10.1145/3461694

Editor:
Mauro Pezzè
Università della Svizzera italiana and Università di Milano-Bicocca, Switzerland

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2021

Accepted: 01 March 2021

Revised: 01 March 2021

Received: 01 September 2020

Published in TOSEM Volume 30, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Key Research and Development Program of China
National Natural Science Foundation of China
Guangdong Basic and Applied Basic Research Foundation
