research-article

BMW Tree: Large-scale, High-throughput and Modular PIFO Implementation using Balanced Multi-Way Sorting Tree

Authors:

H. Jonathan ChaoAuthors Info & Claims

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference

Pages 208 - 219

https://doi.org/10.1145/3603269.3604862

Published: 01 September 2023 Publication History

Abstract

Push-In-First-Out (PIFO) queue has been extensively studied as a programmable scheduler. To achieve accurate, large-scale, and high-throughput PIFO implementation, we propose the Balanced Multi-way (BMW) Sorting Tree for real-time packet sorting. The tree is highly modularized, insertion-balanced and pipeline-friendly with autonomous nodes.

Based on it, we design two simple and efficient hardware designs. The first one is a register-based (R-BMW) scheme. With a pipeline, it features an impressively high and stable throughput without any frequency reduction theoretically even under more levels. We then propose Ranking Processing Units to drive the BMW-Tree (RPU-BMW) to improve the scalability, where nodes are stored in SRAMs and dynamically loaded into/off from RPUs. As the capacity of BMW-Tree grows exponentially, only a few RPUs are needed for a large scale.

The evaluation shows that when deployed on the Xilinx Alveo U200 card, R-BMW improves the throughput by 4.8x compared to the original PIFO implementation, while exhibiting a similar capacity. RPU-BMW is synthesized in GlobalFoundries 28nm process, costing a modest 0.522% (1.043mm²) chip area and 0.57MB off-chip memory to support 87k flows at 200Mpps. To our best knowledge, RPU-BMW is the first accurate PIFO implementation supporting over 80k flows at as fast as 200Mpps.

References

[1]

Alan Demers, Srinivasan Keshav, and Scott Shenker. 1989. Analysis and simulation of a fair queueing algorithm. ACM SIGCOMM Computer Communication Review (1989).

[2]

Madhavapeddi Shreedhar and George Varghese. 1996. Efficient fair queuing using deficit round-robin. IEEE/ACM Transactions on networking (ToN) (1996).

[3]

Linus E Schrage and Louis W Miller. 1966. The queue M/G/1 with the shortest remaining processing time discipline. Operations Research (1966).

[4]

Wikipedia. 2022. Token bucket. https://en.wikipedia.org/wiki/Token_bucket. (2022).

[5]

Mohammad Hedayati, Kai Shen, Michael L. Scott, and Mike Marty. 2019. Multi-Queue Fair Queuing. In USENIX Annual Technical Conference (ATC).

[6]

Zhuolong Yu, Jingfeng Wu, Vladimir Braverman, Ion Stoica, and Xin Jin. 2021. Twenty Years After: Hierarchical Core-Stateless Fair Queueing. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[7]

Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, and Nick McKeown. 2016. Programmable packet scheduling at line rate. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM).

Digital Library

[8]

Vishal Shrivastav. 2019. Fast, scalable, and programmable packet scheduler in hardware. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM).

Digital Library

[9]

Ahmed Saeed, Nandita Dukkipati, Vytautas Valancius, Vinh The Lam, Carlo Contavalli, and Amin Vahdat. 2017. Carousel: Scalable traffic shaping at end hosts. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM).

Digital Library

[10]

Sung-Whan Moon, Jennifer Rexford, and Kang G Shin. 2000. Scalable hardware priority queue architectures for high-speed packet switches. IEEE Transactions on computers (2000).

[11]

Muhuan Huang, Kevin Lim, and Jason Cong. 2014. A scalable, high-performance customized priority queue. In 24th International Conference on Field Programmable Logic and Applications (FPL).

[12]

Chetan Kumar Ng, Sudhanshu Vyas, Jonathan A Shidal, RonK Cytron, ChristopherD Gill, Joseph Zambreno, and Phillip H Jones. 2012. Improving system predictability and performance via hardware accelerated data structures. Procedia Computer Science (2012).

[13]

NG Chetan Kumar, Sudhanshu Vyas, Ron K Cytron, Christopher D Gill, Joseph Zambreno, and Phillip H Jones. 2014. Hardware-software architecture for priority queue management in real-time and embedded systems. International Journal of Embedded Systems (2014).

[14]

Aggelos Ioannou and Manolis GH Katevenis. 2007. Pipelined heap (priority queue) management for advanced scheduling in high-speed networks. IEEE/ACM Transactions on Networking (ToN) (2007).

Digital Library

[15]

Ranjita Bhagwan and Bill Lin. 2000. Fast and scalable priority queue architecture for high-speed network switches. In IEEE International Conference on Computer Communications (INFOCOM).

[16]

Ravikesh Chandra and Oliver Sinnen. 2010. Improving application performance with hardware data structures. In IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[17]

Gedare Bloom, Gabriel Parmer, Bhagirath Narahari, and Rahul Simha. 2012. Shared hardware data structures for hard real-time systems. In Proceedings of the 10th ACM international conference on Embedded software.

Digital Library

[18]

Pierre Lavoie, David Haccoun, and Yvon Savaria. 1994. A systolic architecture for fast stack sequential decoders. IEEE Transactions on Communications (1994).

[19]

Randy Brown. 1988. Calendar queues: a fast 0 (1) priority queue implementation for the simulation event set problem. Commun. ACM (1988).

[20]

Imad Benacer, François-Raymond Boyer, and Yvon Savaria. 2018. A Fast, Single-Instruction-Multiple-Data, Scalable Priority Queue. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2018).

[21]

Albert Gran Alcoz, Alexander Dietmüller, and Laurent Vanbever. 2020. SP-PIFO: approximating push-in first-out behaviors using strict-priority queues. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[22]

Zhuolong Yu, Chuheng Hu, Jingfeng Wu, Xiao Sun, Vladimir Braverman, Mosharaf Chowdhury, Zhenhua Liu, and Xin Jin. 2021. Programmable packet scheduling with a single queue. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM).

Digital Library

[23]

Naveen Kr Sharma, Ming Liu, Kishore Atreya, and Arvind Krishnamurthy. 2018. Approximating fair queueing on reconfigurable switches. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[24]

Jingling Liu, Jiawei Huang, Ning Jiang, Weihe Li, and Jianxin Wang. 2020. Achieving High Utilization for Approximate Fair Queueing in Data Center. In IEEE 40th International Conference on Distributed Computing Systems (ICDCS).

[25]

Naveen Kr Sharma, Chenxingyu Zhao, Ming Liu, Pravein G Kannan, Changhoon Kim, Arvind Krishnamurthy, and Anirudh Sivaraman. 2020. Programmable calendar queues for high-speed packet scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[26]

Peixuan Gao, Anthony Dalleggio, Yang Xu, and H. Jonathan Chao. 2022. Gearbox: A Hierarchical Packet Scheduler for Approximate Weighted Fair Queuing. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[27]

Ahmed Saeed, Yimeng Zhao, Nandita Dukkipati, Ellen Zegura, Mostafa Ammar, Khaled Harras, and Amin Vahdat. 2019. Eiffel: Efficient and Flexible Software Packet Scheduling. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[28]

Jun Xu and Richard J Lipton. 2002. On fundamental tradeoffs between delay bounds and computational complexity in packet scheduling algorithms. ACM SIGCOMM Computer Communication Review (2002).

[29]

Praveen Kumar, Nandita Dukkipati, Nathan Lewis, Yi Cui, Yaogong Wang, Chonggang Li, Valas Valancius, Jake Adriaens, Steve Gribble, Nate Foster, et al. 2019. PicNIC: predictable virtualized NIC. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM). 351--366.

Digital Library

[30]

Brent Stephens, Aditya Akella, and Michael M Swift. 2018. Your programmable NIC should be a programmable switch. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks (HotNets). 36--42.

Digital Library

[31]

Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Ishan Gaur, and Kunle Olukotun. 2022. Taurus: a data plane architecture for per-packet ML. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 1099--1114.

Digital Library

[32]

Ting Qu, Deke Guo, Jie Wu, Xiaolei Zhou, Xin Lu, and Zhong Liu. 2019. Efficient event scheduling of network update. IEEE Transactions on Network and Service Management (TNSM) 17, 1 (2019), 416--429.

Digital Library

[33]

Wei Quan, Wenwen Fu, Jinli Yan, and Zhigang Sun. 2020. OpenTSN: an open-source project for time-sensitive networking system development. CCF Transactions on Networking 3, 1 (2020), 51--65.

[34]

Christoph Gärtner, Amr Rizk, Boris Koldehofe, Rhaban Hark, René Guillaume, Ralf Kundel, and Ralf Steinmetz. 2021. POSTER: Leveraging PIFO Queues for Scheduling in Time-Sensitive Networks. In 2021 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). IEEE, 1--2.

[35]

Pawan Goyal, Harrick M Vin, and Haichen Cheng. 1997. Start-time fair queueing: A scheduling algorithm for integrated services packet switching networks. IEEE/ACM Transactions on networking 5, 5 (1997), 690--704.

[36]

Sivasankar Radhakrishnan, Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, George Porter, and Amin Vahdat. 2014. SENIC: Scalable NIC for end-host rate limiting. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI).

[37]

S Jamaloddin Golestani. 1990. A stop-and-go queueing framework for congestion management. In Proceedings of the ACM symposium on Communications architectures & protocols.

Digital Library

[38]

George F Riley and Thomas R Henderson. 2010. The ns-3 network simulator. In Modeling and tools for network simulation. Springer, 15--34.

[39]

Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pfabric: Minimal near-optimal data-center transport. ACM SIGCOMM Computer Communication Review (2013).

Digital Library

[40]

Radhika Mittal, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Universal packet scheduling. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 501--521.

Cited By

Zhang ZChen SYao RSun RMei HWang HChen ZFang GFan YShi WLiu SXu YSekar VYu MSeneviratne AVeitch D(2024)vPIFO: Virtualized Packet Scheduler for Programmable Hierarchical Scheduling in High-Speed NetworksProceedings of the ACM SIGCOMM 2024 Conference10.1145/3651890.3672270(983-999)Online publication date: 4-Aug-2024
https://dl.acm.org/doi/10.1145/3651890.3672270
Wang ZWan XLi LSun YXie PWei XNing QZhang JChen KSekar VYu MSeneviratne AVeitch D(2024)Fast, Scalable, and Accurate Rate Limiter for RDMA NICsProceedings of the ACM SIGCOMM 2024 Conference10.1145/3651890.3672215(568-580)Online publication date: 4-Aug-2024
https://dl.acm.org/doi/10.1145/3651890.3672215

Index Terms

BMW Tree: Large-scale, High-throughput and Modular PIFO Implementation using Balanced Multi-Way Sorting Tree
1. Hardware
  1. Communication hardware, interfaces and storage
    1. Networking hardware
2. Networks
  1. Network services
    1. Programmable networks

Recommendations

Scalable high-throughput architecture for large balanced tree structures on FPGA (abstract only)
FPGA '13: Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays

Architectures for tree structures on FPGAs as well as ASICs have been proposed over the years. The exponential growth in the memory size with respect to the tree levels restricts the scalability of these architectures due to limited on-chip memory. For ...
vPIFO: Virtualized Packet Scheduler for Programmable Hierarchical Scheduling in High-Speed Networks
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

Programmable packet scheduling enables the integration of scheduling algorithms into switches without the need for hardware redesign. The Push-In First-Out (PIFO) queue facilitates a programmable packet scheduler, supporting a single scheduling algorithm ...
High throughput and large capacity pipelined dynamic search tree on FPGA
FPGA '10: Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays

We propose a pipelined Dynamic Search Tree (pDST) on FPGA which offers high throughput for lookup, insert and delete operations as well as the capability to perform in-place incremental updates. Based on the pipelined 2-3 tree data structure, our pDST ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference

September 2023

1217 pages

ISBN:9798400702365

DOI:10.1145/3603269

Chairs:
Henning Schulzrinne,
Vishal Misra,
Program Chairs:
Eddie Kohler,
David Maltz

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2023

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Key-Area Research and Development Program of Guangdong Province
National Natural Science Foundation of China
Natural Science Foundation of Shanghai
the Major Key Project of PCL
Open Research Projects of Zhejiang Lab

Conference

ACM SIGCOMM '23

Sponsor:

SIGCOMM

ACM SIGCOMM '23: ACM SIGCOMM 2023 Conference

September 10, 2023

NY, New York, USA

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
852
Total Downloads

Downloads (Last 12 months)569
Downloads (Last 6 weeks)47

Reflects downloads up to 09 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang ZChen SYao RSun RMei HWang HChen ZFang GFan YShi WLiu SXu YSekar VYu MSeneviratne AVeitch D(2024)vPIFO: Virtualized Packet Scheduler for Programmable Hierarchical Scheduling in High-Speed NetworksProceedings of the ACM SIGCOMM 2024 Conference10.1145/3651890.3672270(983-999)Online publication date: 4-Aug-2024
https://dl.acm.org/doi/10.1145/3651890.3672270
Wang ZWan XLi LSun YXie PWei XNing QZhang JChen KSekar VYu MSeneviratne AVeitch D(2024)Fast, Scalable, and Accurate Rate Limiter for RDMA NICsProceedings of the ACM SIGCOMM 2024 Conference10.1145/3651890.3672215(568-580)Online publication date: 4-Aug-2024
https://dl.acm.org/doi/10.1145/3651890.3672215

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents