DOI: 10.1145/3377812.3382145

BigTest: a symbolic execution based systematic test generation tool for Apache Spark

Published: 01 October 2020

Abstract

Data-intensive scalable computing (DISC) systems such as Google's MapReduce, Apache Hadoop, and Apache Spark are prevalent in many production services. Despite their popularity, the quality of DISC applications suffers from a lack of exhaustive, automated testing. Current testing practice is limited to running a small random sample of the input dataset, which rarely exposes program faults. Unlike testing SQL queries, testing DISC applications poses new challenges because they compose dataflow and relational operators with user-defined functions (UDFs) that can be arbitrarily long and complex.
To address this problem, we demonstrate a new white-box testing framework called BigTest that takes an Apache Spark program as input and automatically generates synthetic, concrete data for effective and efficient testing. BigTest combines symbolic execution of UDFs with logical specifications of dataflow and relational operators to explore all paths in a DISC application. Our experiments show that BigTest generates test data that reveals up to 2X more faults than the entire dataset with 194X less testing time. We implement BigTest as a Java-based command-line tool distributed with a precompiled binary JAR. It exposes a configuration file in which a user can edit preferences, including the path of the target program, the upper bound on loop exploration, and the choice of theorem solver. The demonstration video of BigTest is available at https://youtu.be/OeHhoKiDYso and BigTest is available at https://github.com/maligulzar/BigTest.
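The idea of combining UDF path conditions with operator specifications can be illustrated with a minimal Python sketch. It is not BigTest's implementation: the two-operator program (a `map` followed by a `filter`), the UDF, and all function names are hypothetical, and a brute-force search over a small integer domain stands in for the theorem solver. Each joint path pairs one branch of the map UDF with one outcome of the filter (record kept or dropped), and one concrete input is generated per feasible path.

```python
# Hypothetical program under test: data.map(udf).filter(is_even)
# where udf(x) = x * 2 if x > 10 else x + 1.

# Paths of the map UDF as (path condition, effect) pairs. A symbolic
# executor would derive these; here they are written out by hand.
MAP_PATHS = [
    (lambda x: x > 10, lambda x: x * 2),
    (lambda x: x <= 10, lambda x: x + 1),
]


def is_even(y):
    """Predicate of the downstream filter operator."""
    return y % 2 == 0


def enumerate_path_constraints():
    """Yield one constraint per joint path of map and filter.

    The logical specification of filter contributes two cases per
    upstream path: the record passes, or it is dropped.
    """
    for cond, effect in MAP_PATHS:
        for passes in (True, False):
            def constraint(x, cond=cond, effect=effect, passes=passes):
                return cond(x) and (is_even(effect(x)) == passes)
            yield constraint


def solve(constraint, domain=range(-20, 21)):
    """Stand-in for a theorem solver: first value in the domain that fits."""
    for x in domain:
        if constraint(x):
            return x
    return None  # no witness in the domain; treated as infeasible here


def generate_tests():
    """Return one concrete input per feasible joint path."""
    witnesses = (solve(c) for c in enumerate_path_constraints())
    return [w for w in witnesses if w is not None]


if __name__ == "__main__":
    print(generate_tests())
```

In this toy program the path "x > 10 and the doubled value is odd" has no solution, so it yields no test; the remaining three paths each receive one input. A real solver would decide feasibility over the full input domain rather than a bounded range.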



Published In

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings
June 2020
357 pages
ISBN: 9781450371223
DOI: 10.1145/3377812
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

In-Cooperation

  • KIISE: Korean Institute of Information Scientists and Engineers
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. data intensive scalable computing
  2. dataflow programs
  3. map reduce
  4. symbolic execution
  5. test generation

Qualifiers

  • Demonstration

Conference

ICSE '20
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
