research-article

Public Access

Combining SIMD and Many/Multi-core Parallelism for Finite State Machines with Enumerative Speculation

Authors:

Gagan AgrawalAuthors Info & Claims

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 179 - 191

https://doi.org/10.1145/3018743.3018760

Published: 26 January 2017 Publication History

Abstract

Finite State Machine (FSM) is the key kernel behind many popular applications, including regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is extremely difficult because of the strong dependencies and unpredictable memory accesses. Previous efforts have largely focused on multi-core parallelization, and used different approaches, including {\em speculative} and {\em enumerative} execution, both of which have been effective but also have limitations. With increasing width and improving flexibility in SIMD instruction sets, this paper focuses on combining SIMD and multi/many-core parallelism for FSMs. We have developed a novel strategy, called {\em enumerative speculation}. Instead of speculating on a single state as in speculative execution or enumerating all possible states as in enumerative execution, our strategy speculates transitions from several possible states, reducing the prediction overheads of speculation approach and the large amount of redundant work in the enumerative approach. A simple lookback approach produces a set of guessed states to achieve high speculation success rates in our enumerative speculation. We evaluate our method with four popular FSM applications: Huffman decoding, regular expression matching, HTML tokenization, and Div7. We obtain up to 2.5x speedup using SIMD on one core and up to 95x combining SIMD with 60 cores of an Intel Xeon Phi. On a single core, we outperform the best single-state speculative execution version by an average of 1.6x, and in combining SIMD and many-core parallelism, outperform enumerative execution by an average of 2x.

References

[1]

K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Commun. ACM, 52(10):56--67, Oct. 2009.

Digital Library

[2]

G. E. Blelloch. Prefix Sums and Their Applications. Technical Report Carnegie Mellon University-CS-90--190, School of Computer Science, Carnegie Mellon University, Nov. 1990.

[3]

L. Chen, P. Jiang, and G. Agrawal. Exploiting Recent SIMD Architectural Advances for Irregular Applications. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, 2016.

Digital Library

[4]

F. Franchetti and M. Puschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proceedings of the 2002 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2002), pages 20 --26. IEEE, 2002.

[5]

C. Garca, R. Lario, M. Prieto, L. Piuel, and F. Tirado. Vectorization of Multigrid Codes Using SIMD ISA Extensions. Proceedings of the 2003 IEEE Inernational Parallel and Distributed Processing Symposium (IPDPS 2003), 0:58a, 2003.

Digital Library

[6]

W. D. Hillis and G. L. Steele, Jr. Data Parallel Algorithms. Commun. ACM, 29(12), Dec. 1986.

Digital Library

[7]

J. Holub and S.vStekr. On Parallel Implementations of Deterministic Finite Automata. In Proceedings of the 14th International Conference on Implementation and Application of Automata, CIAA '09, 2009.

Digital Library

[8]

N. Ide, M. Hirano, Y. Endo, S. Yoshioka, H. Murakami, A. Kunimatsu, T. Sato, T. Kamei, T. Okada, and M. Suzuoki. 2.44-GFLOPS 300-MHz Floating-Point Vector-Processing Unit for High-Performance 3D Graphics Computing. IEEE Journal of Solid-State Circuits, 35(7):1025 --1033, Jul 2000.

[9]

P. Jiang, L. Chen, and G. Agrawal. Reusing data reorganization for efficient simd parallelization of adaptive irregular applications. In Proceedings of the 2016 International Conference on Supercomputing, ICS '16, 2016.

Digital Library

[10]

F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan. CuSha: Vertex-centric Graph Processing on GPUs. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, 2014.

Digital Library

[11]

S. Kim and H. Han. Efficient simd code generation for irregular kernels. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 55--64, New York, NY, USA, 2012. ACM.

Digital Library

[12]

S. T. Klein and Y. Wiseman. Parallel Huffman Decoding with Applications to JPEG Files. THE COMPUTER JOURNAL, 46:487--497, 2003.

[13]

R. E. Ladner and M. J. Fischer. Parallel prefix computation. J. ACM, 27(4), Oct. 1980.

Digital Library

[14]

X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey. Efficient Sparse Matrix-vector Multiplication on x86-based Many-core Processors. ICS '13, 2013.

Digital Library

[15]

D. Luchaup, R. Smith, C. Estan, and S. Jha. Speculative Parallel Pattern Matching. IEEE Transactions on Information Forensics and Security, 6(2):438--451, June 2011.

Digital Library

[16]

D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117--128, New York, NY, USA, 2012. ACM.

Digital Library

[17]

T. Mytkowicz, M. Musuvathi, and W. Schulte. Data-parallel finite-state machines. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 529--542, New York, NY, USA, 2014. ACM.

Digital Library

[18]

Y. Pan, Y. Zhang, and K. Chiu. Simultaneous Transducers for Data-Parallel XML Parsing. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1--12, 2008.

[19]

S. J. Pennycook, C. J. Hughes, M. Smelyanskiy, and S. A. Jarvis. Exploring simd for molecular dynamics, using intel xeon processors and intel xeon phi coprocessors. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS '13, 2013.

Digital Library

[20]

P. Prabhu, G. Ramalingam, and K. Vaswani. Safe Programmable Speculative Parallelism. In Proceedings of Programming Language Design and Implementation (PLDI). Association for Computing Machinery, Inc., June 2010.

Digital Library

[21]

J. Qiu, Z. Zhao, and B. Ren. Microspec: Speculation-centric fine-grained parallelization for fsm computations. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT '16, pages 221--233, New York, NY, USA, 2016. ACM.

Digital Library

[22]

T. Rognes and E. Seeberg. Six-Fold Speed-up of Smithwaterman Sequence Database Searches Using Parallel Processing on Common Microprocessors. 16(8):699--706, 2000.

[23]

E. Saule and U. V. Catalyurek. An early evaluation of the scalability of graph algorithms on the intel mic architecture. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, IPDPSW '12, pages 1629--1639, Washington, DC, USA, 2012. IEEE Computer Society.

Digital Library

[24]

W. T. Tang, R. Zhao, M. Lu, Y. Liang, H. P. Huynh, X. Li, and R. S. M. Goh. Optimizing and Auto-tuning Scale-free Sparse Matrix-vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '15, 2015.

[25]

Z. Zhao and X. Shen. On-the-Fly Principled Speculation for FSM Parallelization. SIGARCH Comput. Archit. News, 43(1), Mar. 2015.

[26]

Z. Zhao, B. Wu, and X. Shen. Challenging the "Embarrassingly Sequential": Parallelizing Finite State Machine-based Computations Through Principled Speculation. SIGPLAN Not., 2014.

Digital Library

Cited By

Ge TZhang TLiu HTsafrir DMUSUVATHI MGupta RAbu-Ghazaleh N(2024)ngAP: Non-blocking Large-scale Automata Processing on GPUsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624848(268-285)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3617232.3624848
Liu HPai SJog A(2023)Asynchronous Automata Processing on GPUsProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/35794537:1(1-27)Online publication date: 2-Mar-2023
https://dl.acm.org/doi/10.1145/3579453
Xu CYu KXu XBao XWu SZhao B(2022)Offset-FA: A Uniform Method to Handle Both Unbounded and Bounded Repetitions in Regular Expression MatchingSensors10.3390/s2220778122:20(7781)Online publication date: 13-Oct-2022
https://doi.org/10.3390/s22207781
Show More Cited By

Index Terms

Combining SIMD and Many/Multi-core Parallelism for Finite State Machines with Enumerative Speculation

Recommendations

Combining SIMD and Many/Multi-core Parallelism for Finite State Machines with Enumerative Speculation
PPoPP '17

Finite State Machine (FSM) is the key kernel behind many popular applications, including regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is extremely difficult because of the strong dependencies and unpredictable ...
Combining SIMD and Many/Multi-core Parallelism for Finite-state Machines with Enumerative Speculation
Special Issue on PPoPP 2017 (Part 2) and Regular Papers

Finite-state Machine (FSM) is the key kernel behind many popular applications, including regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is extremely difficult because of the strong dependencies and unpredictable ...
Boundary element quadrature schemes for multi- and many-core architectures

In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

January 2017

476 pages

ISBN:9781450344937

DOI:10.1145/3018743

General Chair:
Vivek Sarkar
Rice University, USA
,
Program Chair:
Lawrence Rauchwerger
Texas A&M University, USA

ACM SIGPLAN Notices Volume 52, Issue 8
PPoPP '17
August 2017
442 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3155284
Editor:
Matthew Fluet
Issue’s Table of Contents

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 January 2017

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

NSF

Conference

PPoPP '17

Sponsor:

SIGPLAN

PPoPP '17: 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 4 - 8, 2017

Texas, Austin, USA

Acceptance Rates

PPoPP '17 Paper Acceptance Rate 29 of 132 submissions, 22%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
1,124
Total Downloads

Downloads (Last 12 months)220
Downloads (Last 6 weeks)35

Reflects downloads up to 09 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Ge TZhang TLiu HTsafrir DMUSUVATHI MGupta RAbu-Ghazaleh N(2024)ngAP: Non-blocking Large-scale Automata Processing on GPUsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624848(268-285)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3617232.3624848
Liu HPai SJog A(2023)Asynchronous Automata Processing on GPUsProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/35794537:1(1-27)Online publication date: 2-Mar-2023
https://dl.acm.org/doi/10.1145/3579453
Xu CYu KXu XBao XWu SZhao B(2022)Offset-FA: A Uniform Method to Handle Both Unbounded and Bounded Repetitions in Regular Expression MatchingSensors10.3390/s2220778122:20(7781)Online publication date: 13-Oct-2022
https://doi.org/10.3390/s22207781
Stylianopoulos CAlmgren MLandsiedel OPapatriantafilou M(2022)Multiple pattern matching for network security applicationsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2019.10.011137:C(34-52)Online publication date: 21-Apr-2022
https://dl.acm.org/doi/10.1016/j.jpdc.2019.10.011
Qiu JSun XSabet AZhao ZSherwood TBerger EKozyrakis C(2021)Scalable FSM parallelization via path fusion and higher-order speculationProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3445814.3446705(887-901)Online publication date: 19-Apr-2021
https://dl.acm.org/doi/10.1145/3445814.3446705
Liu HPai SJog ALarus JCeze LStrauss K(2020)Why GPUs are Slow at Executing NFAs and How to Make them FasterProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3373376.3378471(251-265)Online publication date: 9-Mar-2020
https://dl.acm.org/doi/10.1145/3373376.3378471
Keiser JLemire D(2020)Validating UTF‐8 in less than one instruction per byteSoftware: Practice and Experience10.1002/spe.292051:5(950-964)Online publication date: 29-Oct-2020
https://doi.org/10.1002/spe.2920
Xia YJiang PAgrawal GAmaral JKulkarni M(2019)Enabling prefix sum parallelism pattern for recurrences with principled function reconstructionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307354(17-28)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3302516.3307354
Jiang PChen LAgrawal GEvripidou SStenström PO'Boyle M(2018)Revealing parallel scans and reductions in recurrences through function reconstructionProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243204(1-13)Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.1145/3243176.3243204
Hou KWang HFeng WVetter JLee S(2018)Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2018.00037(276-285)Online publication date: May-2018
https://doi.org/10.1109/IPDPS.2018.00037
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents