DOI:10.1145/3453933.3454016
research-article

Effective exploitation of SIMD resources in cross-ISA virtualization

Published: 07 April 2021

Abstract

System virtualization is a fundamental technology that enables many important applications. However, existing virtualization techniques make only limited use of host SIMD hardware resources, especially when a guest application lacks inherent fine-grained data-level parallelism. To bridge this utilization gap and unleash the full potential of host SIMD resources, this paper proposes an effective and unconventional SIMD exploitation technique. The technique takes advantage of ample host SIMD registers and powerful host SIMD instructions to generate more efficient host binary code for guest applications, even those without any fine-grained data-level parallelism. It also mitigates the shortage of general-purpose registers on the host platform and improves the efficiency of accessing guest registers. We have implemented the technique in QEMU, a widely used virtualization platform. Experimental results on a comprehensive list of benchmarks from PARSEC, SPEC CPU 2017, and the Google Octane JavaScript benchmark suite show an average speedup of 2.2× for AArch64 binaries on an x86-64 host machine. We believe the proposed technique provides a new perspective for the community to rethink the exploitation of SIMD hardware resources.
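The abstract does not spell out the mechanism, so the following is only an illustrative sketch of the general idea of holding guest register values in host SIMD registers rather than in memory; it is not the paper's implementation. The type `guest_reg_pair` and the functions `pair_make`, `pair_read`, `pair_write`, and `pair_selfcheck` are hypothetical names introduced here, and the two-guest-registers-per-128-bit-register layout is an assumption. The code uses only SSE2 intrinsics, which are part of the x86-64 baseline.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics; always available on x86-64 */

/* Assumed layout (hypothetical): two 64-bit guest registers share one
 * 128-bit host SIMD register. Lane 0 is the low 64 bits, lane 1 the high. */
typedef struct {
    __m128i pair;
} guest_reg_pair;

/* Build a pair from two guest register values. */
static inline guest_reg_pair pair_make(uint64_t lane0, uint64_t lane1) {
    guest_reg_pair p;
    /* _mm_set_epi64x takes the high element first, then the low one. */
    p.pair = _mm_set_epi64x((long long)lane1, (long long)lane0);
    return p;
}

/* Read one guest register (lane 0 or 1) into a host general-purpose value. */
static inline uint64_t pair_read(guest_reg_pair p, int lane) {
    /* For lane 1, move the high 64 bits down into the low position first. */
    __m128i v = lane ? _mm_unpackhi_epi64(p.pair, p.pair) : p.pair;
    return (uint64_t)_mm_cvtsi128_si64(v);
}

/* Write one guest register while preserving the other lane. */
static inline guest_reg_pair pair_write(guest_reg_pair p, int lane,
                                        uint64_t value) {
    __m128i v = _mm_cvtsi64_si128((long long)value); /* value in low 64 bits */
    if (lane == 0) {
        /* Result: low = new value, high = old high lane. */
        __m128i old_hi = _mm_unpackhi_epi64(p.pair, p.pair);
        p.pair = _mm_unpacklo_epi64(v, old_hi);
    } else {
        /* Result: low = old low lane, high = new value. */
        p.pair = _mm_unpacklo_epi64(p.pair, v);
    }
    return p;
}

/* Small usage example: returns 1 if reads and writes behave as described. */
static int pair_selfcheck(void) {
    guest_reg_pair p = pair_make(0x1111u, 0x2222u);
    assert(pair_read(p, 0) == 0x1111u);
    assert(pair_read(p, 1) == 0x2222u);
    p = pair_write(p, 0, 0xAAAAu); /* lane 1 must be unchanged */
    p = pair_write(p, 1, 0xBBBBu); /* lane 0 must be unchanged */
    return pair_read(p, 0) == 0xAAAAu && pair_read(p, 1) == 0xBBBBu;
}
```

In this sketch each guest register access is a register-to-register move instead of a load or store. A real translator would have to weigh that against the cost of the extra shuffle instructions, which is a design question the abstract leaves to the full paper.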



Published In

VEE 2021: Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments
April 2021
200 pages
ISBN:9781450383943
DOI:10.1145/3453933
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Cross-ISA virtualization
  2. Dynamic binary translation
  3. QEMU
  4. SIMD optimization

Qualifiers

  • Research-article

Conference

VEE '21

Acceptance Rates

Overall Acceptance Rate 80 of 235 submissions, 34%

Cited By

  • (2024) Performance Improvements via Peephole Optimization in Dynamic Binary Translation. Electronics 13:9 (1608). DOI:10.3390/electronics13091608. Online publication date: 23-Apr-2024
  • (2023) Towards Efficient Dynamic Binary Translation Optimizations Based on RISC Architectural Features. Journal of Circuits, Systems and Computers 33:06. DOI:10.1142/S0218126624501044. Online publication date: 26-Oct-2023
  • (2023) Efficient condition code emulation for dynamic binary translation systems. Third International Symposium on Computer Engineering and Intelligent Communications (ISCEIC 2022) (25). DOI:10.1117/12.2660798. Online publication date: 2-Feb-2023
  • (2022) FADATest. Proceedings of the 44th International Conference on Software Engineering (896-908). DOI:10.1145/3510003.3510169. Online publication date: 21-May-2022
  • (2022) WDBT. Journal of Systems and Software 187:C. DOI:10.1016/j.jss.2022.111247. Online publication date: 1-May-2022
  • (2021) WDBT: Wear Characterization, Reduction, and Leveling of DBT Systems for Non-Volatile Memory. Proceedings of the International Symposium on Memory Systems (1-13). DOI:10.1145/3488423.3519337. Online publication date: 27-Sep-2021
  • (2021) TCStream: Large-Scale Graph Triangle-Counting on a single Machine using GPUs. IEEE Transactions on Parallel and Distributed Systems (1-1). DOI:10.1109/TPDS.2021.3135329. Online publication date: 2021
