Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Asynchronous Automata Processing on GPUs

Published: 02 March 2023 Publication History

Abstract

Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1)~input stream level, 2)~automaton-level and 3)~state-level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources.
To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length.
When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 evaluated applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4× speedup.

References

[1]
M. Karim Abdalla et al. 2013. Scheduling and Execution of Compute Tasks. US 2013/0185728A1.
[2]
018)]% mnrl, K. Angstadt, J. Wadden, V. Dang, T. Xie, D. Kramp, W. Weimer, M. Stan, and K. Skadron. 2018. MNCaRT: An Open-Source, Multi-Architecture Automata-Processing Research and Execution Ecosystem. IEEE Computer Architecture Letters (CAL) (2018).
[3]
016)]% multi-stride-ton-2016, Matteo Avalle, Fulvio Risso, and Riccardo Sisto. 2016. Scalable Algorithms for NFA Multi-Striding and NFA-Based Deep Packet Inspection on GPUs. IEEE/ACM Transactions on Networking (ToN) (2016).
[4]
009)]% gpgpu-sim, A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt. 2009. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS.
[5]
, Michela Becchi and Patrick Crowley. 2007. An Improved Algorithm to Accelerate Regular Expression Evaluation. In Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS).
[6]
Michela Becchi and Patrick Crowley. 2008. Efficient Regular Expression Evaluation: Theory to Practice. In Proceedings of the Symposium on Architectures for Networking and Communications Systems (ANCS).
[7]
Michela Becchi, Mark Franklin, and Patrick Crowley. 2008. A Workload for Evaluating Deep Packet Inspection Architectures. In Proceedings of the International Symposium on Workload Characterization (IISWC).
[8]
Chunkun Bo, Vinh Dang, Elaheh Sadredini, and Kevin Skadron. 2018. Searching for Potential gRNA Off-Target Sites for CRISPR/Cas9 using Automata Processing across Different Platforms. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
[9]
Benjamin C. Brodie, David E. Taylor, and Ron K. Cytron. 2006. A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching. In Proceedings of the International Symposium on Computer Architecture (ISCA).
[10]
Niccolo' Cascarano, Pierluigi Rolando, Fulvio Risso, and Riccardo Sisto. 2010. iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Computer Communication Review (CCR) (2010).
[11]
Paul Dlugosch, Dave Brown, Paul Glendenning, Leventhal Leventhal, and Harold Noyes. 2014. An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing. IEEE Transactions on Parallel and Distributed Systems (TPDS) (2014).
[12]
Amr S. Elhelw and Sreepathi Pai. 2020. Horus: A Modular GPU Emulator Framework. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[13]
Yuanwei Fang, Tung T. Hoang, Michela Becchi, and Andrew A. Chien. 2015. Fast Support for Unstructured Data Processing: The Unified Automata Processor. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[14]
Adi Fuchs and David Wentzlaff. 2019. The Accelerator Wall: Limits of Chip Specialization. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
[15]
Victor Mikhaylovich Glushkov. 1961. The Abstract Theory of Automata. Russian Mathematical Surveys, Vol. 16, 5 (1961), 1.
[16]
Todd J. Green, Ashish Gupta, Gerome Miklau, Makoto Onizuka, and Dan Suciu. 2004. Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems (2004).
[17]
Bingsheng He, Qiong Luo, and Byron Choi. 2006. Cache-conscious Automata for XML filtering. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 18, 12 (2006), 1629--1644.
[18]
Peng Jiang and Gagan Agrawal. 2017. Combining SIMD and Many/Multi-Core Parallelism for Finite State Machines with Enumerative Speculation. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[19]
Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
[20]
Lingkun Kong, Qixuan Yu, Agnishom Chattopadhyay, Alexis Le Glaunec, Yi Huang, Konstantinos Mamouras, and Kaiyuan Yang. 2022. Software-Hardware Codesign for Efficient in-Memory Regular Pattern Matching. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI).
[21]
Richard E. Ladner and Michael J. Fischer. 1980. Parallel Prefix Computation. Journal of the ACM (JACM) (1980).
[22]
Hongyuan Liu, Mohamed Ibrahim, Onur Kayiran, Sreepathi Pai, and Adwait Jog. 2018. Architectural Support for Efficient Large-Scale Automata Processing. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[23]
Hongyuan Liu, Sreepathi Pai, and Adwait Jog. 2020. Why GPUs are Slow at Executing NFAs and How to Make Them Faster. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[24]
Duane Merrill, Michael Garland, and Andrew Grimshaw. 2012. Policy-based tuning for performance portability and library co-optimization. In Proceedings of the 2012 Innovative Parallel Computing (InPar). 1--10.
[25]
Todd Mytkowicz, Madanlal Musuvathi, and Wolfram Schulte. 2014. Data-parallel Finite-state Machines. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[26]
Marziyeh Nourian, Xiang Wang, Xiaodong Yu, Wu-chun Feng, and Michela Becchi. 2017. Demystifying Automata Processing: GPUs, FPGAs or Micron's AP?. In Proceedings of the International Conference on Supercomputing (ICS).
[27]
Marziyeh Nourian, Hancheng Wu, and Michela Becchi. 2018. A Compiler Framework for Fixed-Topology Non-Deterministic Finite Automata on SIMD Platforms. In Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS).
[28]
Marziyeh Nourian, Mostafa Eghbali Zarch, and Michela Becchi. 2020. Optimizing Complex OpenCL Code for FPGA: A Case Study on Finite Automata Traversal. In Proceedings of the IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS).
[29]
Junqiao Qiu, Lin Jiang, and Zhijia Zhao. 2020. Challenging Sequential Bitstream Processing via Principled Bitwise Speculation. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[30]
Junqiao Qiu, Xiaofan Sun, Amir Hossein Nodehi Sabet, and Zhijia Zhao. 2021. Scalable FSM Parallelization via Path Fusion and Higher-Order Speculation. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[31]
Junqiao Qiu, Zhijia Zhao, and Bin Ren. 2016. MicroSpec: Speculation-Centric Fine-Grained Parallelization for FSM Computations. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT).
[32]
Junqiao Qiu, Zhijia Zhao, Bo Wu, Abhinav Vishnu, and Shuaiwen Leon Song. 2017. Enabling Scalability-sensitive Speculative Parallelization for FSM Computations. In Proceedings of the International Conference on Supercomputing (ICS).
[33]
Reza Rahimi, Elaheh Sadredini, Mircea Stan, and Kevin Skadron. 2020. Grapefruit: An Open-Source, Full-Stack, and Customizable Automata Processing on FPGAs. In Proceedings of the IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).
[34]
Bin Ren, Tomi Poutanen, Todd Mytkowicz, Wolfram Schulte, Gagan Agrawal, and James R. Larus. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).
[35]
Martin Roesch. 1999. Snort - Lightweight Intrusion Detection for Networks. In Proceedings of the USENIX Conference on System Administration (LISA).
[36]
Elaheh Sadredini, Deyuan Guo, Chunkun Bo, Reza Rahimi, Kevin Skadron, and Hongning Wang. 2018. A Scalable Solution for Rule-Based Part-of-Speech Tagging on Novel Hardware Accelerators. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[37]
Elaheh Sadredini, Reza Rahimi, Marzieh Lenjani, Mircea Stan, and Kevin Skadron. 2020a. FlexAmata: A Universal and Efficient Adaption of Applications to Spatial Automata Processing Accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[38]
Elaheh Sadredini, Reza Rahimi, Lenjani Marzieh, Stan Mircea, and Skadron Kevin. 2020b. Impala: Algorithm/Architecture Co-Design for In-Memory Multi-Stride Pattern Matching. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA).
[39]
Elaheh Sadredini, Reza Rahimi, Vaibhav Verma, Mohsen Imani, and Kevin Skadron. 2021. Sunder: Enabling Low-Overhead and Scalable Near-Data Pattern Matching Acceleration. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[40]
Elaheh Sadredini, Reza Rahimi, Vaibhav Verma, Mircea Stan, and Kevin Skadron. 2019a. A Scalable and Efficient In-Memory Interconnect Architecture for Automata Processing. IEEE Computer Architecture Letters (CAL) (2019).
[41]
Elaheh Sadredini, Reza Rahimi, Vaibhav Verma, Mircea Stan, and Kevin Skadron. 2019b. eAP: A Scalable and Efficient in Memory Accelerator for Automata Processing. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[42]
Elaheh Sadredini, Reza Rahimi, Ke Wang, and Kevin Skadron. 2017. Frequent Subtree Mining on the Automata Processor: Challenges and Opportunities. In Proceedings of the International Conference on Supercomputing (ICS).
[43]
Randy Smith, Neelam Goyal, Justin Ormont, Karthikeyan Sankaralingam, and Cristian Estan. 2009. Evaluating GPUs for Network Packet Signature Matching. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[44]
Arun Subramaniyan and Reetuparna Das. 2017. Parallel Automata Processor. In Proceedings of the International Symposium on Computer Architecture (ISCA).
[45]
Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, and Reetuparna Das. 2017. Cache Automaton. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[46]
Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli. 2020. Summarizing CPU and GPU Design Trends with Product Data. arxiv: 1911.11313 [cs.DC]
[47]
Tommy Tracy, Yao Fu, Indranil Roy, Eric Jonas, and Paul Glendenning. 2016. Towards machine learning on the Automata Processor. In Proceedings of the International Conference on High Performance Computing (HiPC).
[48]
Tuan Tu Tran, Yongchao Liu, and Bertil Schmidt. 2016. Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi. Parallel Comput., Vol. 54 (2016), 128--138.
[49]
Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis. 2011. Parallelization and characterization of pattern matching using GPUs. In Proceedings of the International Symposium on Workload Characterization (IISWC).
[50]
Kien Chi Vu. 2020. Accelerating bit-based finite automaton on a GPGPU device. (2020).
[51]
Jack Wadden, Vinh Dang, Nathan Brunelle, Tom Tracy II, Deyuan Guo, Elaheh Sadredini, Ke Wang, Chunkun Bo, Gabriel Robins, Mircea Stan, and Kevin Skadron. 2016. ANMLZoo: A Benchmark Suite for Exploring Bottlenecks in Automata Processing Engines and Architectures. In Proceedings of the International Symposium on Workload Characterization (IISWC).
[52]
Jack Wadden and Kevin Skadron. 2016. VASim: An Open Virtual Automata Simulator for Automata Processing Application and Architecture Research. Technical Report CS2016-03. University of Virginia.
[53]
Jack Wadden, Tom Tracy II, Elaheh Sadredini, Lingzi Wu, Chunkun Bo, Jesse Du, Yizhou Wei, Matthew Wallace, Jeffrey Udall, Mircea Stan, and Kevin Skadron. 2018. AutomataZoo: A Modern Automata Processing Benchmark Suite. In Proceedings of the International Symposium on Workload Characterization (IISWC).
[54]
Ke Wang, Kevin Angstadt, Chunkun Bo, Nathan Brunelle, Elaheh Sadredini, Tommy Tracy, Jack Wadden, Mircea Stan, and Kevin Skadron. 2016a. An Overview of Micron's Automata Processor. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODESISSS).
[55]
Ke Wang, Elaheh Sadredini, and Kevin Skadron. 2016b. Sequential Pattern Mining with the Micron Automata Processor. In Proceedings of the International Conference on Computing Frontiers (CF).
[56]
Lei Wang, Shuhui Chen, Yong Tang, and Jinshu Su. 2011. Gregex: GPU Based High Speed Regular Expression Matching Engine. In Proceedings of the International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing.
[57]
Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park, Geoff Langdale, Jiayu Hu, and Heqing Zhu. 2019. Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
[58]
Yuguang Wang, Robbie Watling, Junqiao Qiu, and Zhenlin Wang. 2022. GSpecPal: Speculation-Centric Finite State Machine Parallelization on GPUs. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[59]
Yang Xia, Peng Jiang, and Gagan Agrawal. 2020. Scaling out Speculative Execution of Finite-State Machines with Parallel Merge. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[60]
Chengcheng Xu, Shuhui Chen, Jinshu Su, Siu Ming Yiu, and Lucas Chi Kwong Hui. 2016. A Survey on Regular Expression Matching for Deep Packet Inspection: Applications, Algorithms, and Hardware Platforms. IEEE Communications Surveys Tutorials (2016).
[61]
Yi-Hua Yang and Viktor Prasanna. 2012. High-Performance and Compact Architecture for Regular Expression Matching on FPGA. IEEE Trans. Comput. (2012).
[62]
Xiaodong Yu and Michela Becchi. 2013a. Exploring Different Automata Representations for Efficient Regular Expression Matching on GPUs. In Proceedings of the Principles and Practice of Parallel Programming (PPoPP).
[63]
Xiaodong Yu and Michela Becchi. 2013b. GPU Acceleration of Regular Expression Matching for Large Datasets: Exploring the Implementation Space. In Proceedings of the International Conference on Computing Frontiers (CF).
[64]
Xiaodong Yu, Wu-chun Feng, Danfeng Yao, and Michela Becchi. 2016. O3FA: A scalable finite automata-based pattern-matching engine for out-of-order deep packet inspection. In Proceedings of the 2016 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS).
[65]
Zhipeng Zhao, Hugo Sadok, Nirav Atre, James C. Hoe, Vyas Sekar, and Justine Sherry. 2020. Achieving 100Gbps Intrusion Prevention on a Single Server. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
[66]
Zhijia Zhao and Xipeng Shen. 2015. On-the-Fly Principled Speculation for FSM Parallelization. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[67]
Zhijia Zhao, Bo Wu, and Xipeng Shen. 2014. Challenging the "Embarrassingly Sequential": Parallelizing Finite State Machine-based Computations Through Principled Speculation. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[68]
Yuan Zu, Ming Yang, Zhonghu Xu, Lin Wang, Xin Tian, Kunyang Peng, and Qunfeng Dong. 2012. GPU-based NFA Implementation for Memory Efficient High Speed Regular Expression Matching. In Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP).

Cited By

View all
  • (2024)HybridSA: GPU Acceleration of Multi-pattern Regex Matching using Bit ParallelismProceedings of the ACM on Programming Languages10.1145/36897718:OOPSLA2(1699-1728)Online publication date: 8-Oct-2024
  • (2024)Regular Expressions on Modern GPGPUsProceedings of the 16th Workshop on General Purpose Processing Using GPU10.1145/3649411.3649416(26-32)Online publication date: 2-Mar-2024
  • (2024)Poster: Performance Analysis of TCP CUBIC and BBR over V2V Wi-FiProceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services10.1145/3643832.3661413(668-669)Online publication date: 3-Jun-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Measurement and Analysis of Computing Systems
Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 7, Issue 1
POMACS
March 2023
749 pages
EISSN:2476-1249
DOI:10.1145/3586099
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 March 2023
Published in POMACS Volume 7, Issue 1

Check for updates

Author Tags

  1. GPGPUs
  2. automata processing
  3. pattern matching

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)465
  • Downloads (Last 6 weeks)60
Reflects downloads up to 09 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)HybridSA: GPU Acceleration of Multi-pattern Regex Matching using Bit ParallelismProceedings of the ACM on Programming Languages10.1145/36897718:OOPSLA2(1699-1728)Online publication date: 8-Oct-2024
  • (2024)Regular Expressions on Modern GPGPUsProceedings of the 16th Workshop on General Purpose Processing Using GPU10.1145/3649411.3649416(26-32)Online publication date: 2-Mar-2024
  • (2024)Poster: Performance Analysis of TCP CUBIC and BBR over V2V Wi-FiProceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services10.1145/3643832.3661413(668-669)Online publication date: 3-Jun-2024
  • (2023)Enabling Multi-tenancy on SSDs with Accurate IO Interference ModelingProceedings of the 2023 ACM Symposium on Cloud Computing10.1145/3620678.3624657(216-232)Online publication date: 30-Oct-2023
  • (2023)Asynchronous Automata Processing on GPUsACM SIGMETRICS Performance Evaluation Review10.1145/3606376.359352451:1(23-24)Online publication date: 27-Jun-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media