Fuzzing deep learning compilers with HirGen
Proceedings of the 32nd ACM SIGSOFT International Symposium on Software …, 2023
Deep Learning (DL) compilers are widely adopted to optimize advanced DL models for efficient deployment on diverse hardware. Their quality has a profound effect on the quality of compiled DL models. A recent bug study shows that the optimization of high-level intermediate representations (IRs) is the most error-prone compilation stage, with bugs in this stage accounting for 44.92% of all collected bugs. However, existing testing techniques do not consider the features related to high-level optimization (e.g., the high-level IR) and are therefore weak at exposing bugs in this stage. To bridge this gap, we propose HirGen, an automated testing technique that effectively exposes coding mistakes in the optimization of high-level IRs. The design of HirGen includes (1) three coverage criteria to generate diverse and valid computational graphs; (2) the use of the high-level IR’s language features to generate diverse IRs; and (3) three test oracles, two of which are inspired by metamorphic testing and differential testing. HirGen has successfully detected 21 bugs in TVM, of which 17 have been confirmed and 12 fixed. Further, we construct four baselines using state-of-the-art DL compiler fuzzers that can cover the high-level optimization stage. Our experimental results show that HirGen can detect 10 crashes and inconsistencies that the baselines cannot detect within 48 hours. We also evaluate the usefulness of our proposed coverage criteria and test oracles.
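To make the idea of a differential-testing oracle concrete, the following is a minimal sketch in plain Python. It is not HirGen's implementation and does not use TVM: the toy IR (`Const`, `Var`, `BinOp`), the passes `fold_constants` and `buggy_fold`, and the helpers `differential_oracle` and `fuzz` are all hypothetical names introduced here for illustration. The sketch generates random expressions, runs them with and without an optimization pass, and reports a crash if the pass raises an exception or an inconsistency if the two results differ.

```python
import math
import random
from dataclasses import dataclass
from typing import Union

# Toy high-level IR (hypothetical): constants, variables, binary operators.
@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class BinOp:
    op: str  # "add" or "mul"
    lhs: "Expr"
    rhs: "Expr"

Expr = Union[Const, Var, BinOp]
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(expr, env):
    """Reference interpreter: evaluates the IR without any optimization."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]
    return OPS[expr.op](evaluate(expr.lhs, env), evaluate(expr.rhs, env))

def fold_constants(expr):
    """A correct optimization pass: constant folding."""
    if isinstance(expr, BinOp):
        lhs, rhs = fold_constants(expr.lhs), fold_constants(expr.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(OPS[expr.op](lhs.value, rhs.value))
        return BinOp(expr.op, lhs, rhs)
    return expr

def buggy_fold(expr):
    """Seeded bug for demonstration: folds every constant pair with addition."""
    if isinstance(expr, BinOp):
        lhs, rhs = buggy_fold(expr.lhs), buggy_fold(expr.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(lhs.value + rhs.value)  # wrong when op == "mul"
        return BinOp(expr.op, lhs, rhs)
    return expr

def random_expr(rng, depth):
    """Generate a random, always-valid expression over the variable x."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return Var("x")
        return Const(float(rng.randint(-3, 3)))
    op = rng.choice(sorted(OPS))
    return BinOp(op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def differential_oracle(expr, optimize, env):
    """Return a failure description, or None if both executions agree."""
    expected = evaluate(expr, env)
    try:
        actual = evaluate(optimize(expr), env)
    except Exception as exc:  # crash oracle: the pass itself failed
        return f"crash: {exc!r}"
    if not math.isclose(expected, actual, rel_tol=1e-9, abs_tol=1e-9):
        return f"inconsistency: expected {expected}, got {actual}"
    return None

def fuzz(optimize, trials=200, seed=0):
    """Run the oracle on random expressions and collect failing cases."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        expr = random_expr(rng, depth=3)
        env = {"x": float(rng.randint(-5, 5))}
        verdict = differential_oracle(expr, optimize, env)
        if verdict is not None:
            failures.append((expr, verdict))
    return failures
```

Calling `fuzz(fold_constants)` exercises the correct pass, while `fuzz(buggy_fold)` exercises the seeded bug; any returned pair holds the offending expression and the oracle's verdict. A real DL compiler fuzzer would compare tensor outputs under a numerical tolerance rather than scalar values, which is why the comparison here already goes through `math.isclose`.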