Article

Tree2tree Structural Language Modeling for Compiler Fuzzing

Authors:

Peidai XieAuthors Info & Claims

Algorithms and Architectures for Parallel Processing: 20th International Conference, ICA3PP 2020, New York City, NY, USA, October 2–4, 2020, Proceedings, Part I

Pages 563 - 578

https://doi.org/10.1007/978-3-030-60245-1_38

Published: 02 October 2020 Publication History

Abstract

Compiler fuzzing requires well-formed test cases. Only syntactically correct programs can pass the parsing stage of a compiler. Recently, advanced compiler fuzzers produce test cases by learning a generative language model of regular programs. They treat programs as natural language texts without leveraging any syntactic structure, making them hard to produce syntactically correct programs when programs get long. In this paper, we propose a novel tree-to-tree (tree2tree) model to leverage the structural information for a robust test case generation. We adopt an encoder-decoder architecture to map a partial abstract syntax tree (AST) to a complete AST. We introduce a C compiler fuzzing framework, named TSmith. Specifically, TSmith employs a tree-based encoder to encode the input partial AST to capture the hierarchical structure information. It then adopts a tree decoder to generate a complete AST by expanding the non-terminals in a top-down manner. Finally, the output ASTs are converted into corresponding test programs. Experimental results show that our generation strategies have a maximum parsing pass rate of 83%, which is about 21% higher than sequential models. Besides, TSmith significantly improves the code coverage of the compiler. Benefiting from the high pass rate and broad code coverage, TSmith has found 14 new bugs in GCC.

References

[1]

Chakraborty, S., Allamanis, M., Ray, B.: Tree2tree neural translation model for learning source code changes. arXiv preprint arXiv:1810.00314 (2018)

[2]

Chen, P., Chen, H.: Angora: efficient fuzzing by principled search. In: 2018 IEEE Symposium on Security and Privacy (SP), pp. 711–725. IEEE (2018)

[3]

Chen, X., Liu, C., Song, D.: Tree-to-tree neural networks for program translation. In: Advances in Neural Information Processing Systems, pp. 2547–2557 (2018)

[4]

Cummins, C., Petoumenos, P., Murray, A., Leather, H.: Compiler fuzzing through deep learning. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 95–105. ACM (2018)

[5]

Dong, L., Lapata, M.: Language to logical form with neural attention. arXiv preprint arXiv:1601.01280 (2016)

[6]

Godefroid, P., Peleg, H., Singh, R.: Learn&fuzz: machine learning for input fuzzing. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 50–59. IEEE Press (2017)

[7]

Google: Honggfuzz (2016). https://github.com/google/honggfuzz

[8]

Hochreiter S and Schmidhuber J Long short-term memory Neural Comput. 1997 9 8 1735-1780

[9]

Karpathy, A., Johnson, J., Fei-Fei, L.: Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 (2015)

[10]

Liu, X., Li, X., Prajapati, R., Wu, D.: Deepfuzz: automatic generation of syntax valid c programs for fuzz testing. In: Proceedings of the AAAI Conference on Artificial Intelligence (2019)

[11]

LLVM: libfuzzer: a library for coverage-guided fuzz testing (2017). https://llvm.org/docs/LibFuzzer.html

[12]

Patra, J., Pradel, M.: Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. TU Darmstadt, Department of Computer Science, Technical report, TUD-CS-2016-14664 (2016)

[13]

Rabinovich, M., Stern, M., Klein, D.: Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535 (2017)

[14]

Rawat, S., Jain, V., Kumar, A., Cojocar, L., Giuffrida, C., Bos, H.: Vuzzer: application-aware evolutionary fuzzing. In: NDSS, vol. 17, pp. 1–14 (2017)

[15]

Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015)

[16]

Wang, J., Chen, B., Wei, L., Liu, Y.: Skyfire: data-driven seed generation for fuzzing. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 579–594. IEEE (2017)

[17]

Wei, H., Li, M.: Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: IJCAI, pp. 3034–3040 (2017)

[18]

White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98. IEEE (2016)

[19]

Yang, X., Chen, Y., Eide, E., Regehr, J.: Finding and understanding bugs in c compilers. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 283–294 (2011)

[20]

Yin, P., Neubig, G.: A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696 (2017)

[21]

Zalewski, M.: American fuzzy lop (2017). http://lcamtuf.coredump.cx/afl/

Cited By

Tian YXu ZDong YSun CCheung SElkind E(2023)Revisiting the evaluation of deep learning-based compiler testingProceedings of the Thirty-Second International Joint Conference on Artificial Intelligence10.24963/ijcai.2023/542(4873-4882)Online publication date: 19-Aug-2023
https://dl.acm.org/doi/10.24963/ijcai.2023/542

Recommendations

A Generative and Mutational Approach for Synthesizing Bug-Exposing Test Cases to Guide Compiler Fuzzing
ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Random test case generation, or fuzzing, is a viable means for uncovering compiler bugs. Unfortunately, compiler fuzzing can be time-consuming and inefficient with purely randomly generated test cases due to the complexity of modern compilers. We present ...
WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models

Compiler correctness is crucial, as miscompilation can falsify program behaviors, leading to serious consequences over the software supply chain. In the literature, fuzzing has been extensively studied to uncover compiler defects. However, compiler ...
Making no-fuss compiler fuzzing effective
CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction

Developing a bug-free compiler is difficult; modern optimizing compilers are among the most complex software systems humans build. Fuzzing is one way to identify subtle compiler bugs that are hard to find with human-constructed tests. Grammar-based ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

Algorithms and Architectures for Parallel Processing: 20th International Conference, ICA3PP 2020, New York City, NY, USA, October 2–4, 2020, Proceedings, Part I

Oct 2020

756 pages

ISBN:978-3-030-60244-4

DOI:10.1007/978-3-030-60245-1

Editor:
Meikang Qiu
Columbia University, New York, NY, USA

© Springer Nature Switzerland AG 2020.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 02 October 2020

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 08 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Tian YXu ZDong YSun CCheung SElkind E(2023)Revisiting the evaluation of deep learning-based compiler testingProceedings of the Thirty-Second International Joint Conference on Artificial Intelligence10.24963/ijcai.2023/542(4873-4882)Online publication date: 19-Aug-2023
https://dl.acm.org/doi/10.24963/ijcai.2023/542

View Options

View options

Figures

Tables

Media

View Table of Conten