Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3132747.3132780acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article
Open access

ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks

Published: 14 October 2017 Publication History

Abstract

This paper focuses on the efficient scheduling on multicore systems of very fine-grain networked tasks, which are the typical building block of online data-intensive applications. The explicit goal is to deliver high throughput (millions of remote procedure calls per second) for tail latency service-level objectives that are a small multiple of the task size.
We present ZYGOS, a system optimized for μs-scale, in-memory computing on multicore servers. It implements a work-conserving scheduler within a specialized operating system designed for high request rates and a large number of network connections. ZYGOS uses a combination of shared-memory data structures, multi-queue NICs, and inter-processor interrupts to rebalance work across cores.
For an aggressive service-level objective expressed at the 99th percentile, ZYGOS achieves 75% of the maximum possible load determined by a theoretical, zero-overhead model (centralized queueing with FCFS) for 10μs tasks, and 88% for 25μs tasks.
We evaluate ZYGOS with a networked version of Silo, a state-of-the-art in-memory transactional database, running TPC-C. For a service-level objective of 1000μs latency at the 99th percentile, ZYGOS can deliver a 1.63x speedup over Linux (because of its dataplane architecture) and a 1.26x speedup over IX, a state-of-the-art dataplane (because of its work-conserving scheduler).

Supplementary Material

MP4 File (zygos.mp4)

References

[1]
Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., and Paleczny, M. Workload analysis of a large-scale key-value store. In Proceedings of the 2012 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2012), pp. 53--64.
[2]
Barroso, L. A., Clidaras, J., and Hölzle, U. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2013.
[3]
Barroso, L. A., Marty, M., Patterson, D., and Ranganathan, P. Attack of the killer microseconds. Commun. ACM 60, 4 (2017), 48--54.
[4]
Belay, A., Bittau, A., Mashtizadeh, A. J., Terei, D., Mazières, D., and Kozyrakis, C. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 10th Symposium on Operating System Design and Implementation (OSDI) (2012), pp. 335--348.
[5]
Belay, A., Prekas, G., Primorac, M., Klimovic, A., Grossman, S., Kozyrakis, C., and Bugnion, E. The IX Operating System: Combining Low Latency, High Throughput, and Efficiency in a Protected Dataplane. ACM Trans. Comput. Syst. 34, 4 (2017), 11:1--11:39.
[6]
Bronson, N., Amsden, Z., Cabrera, G., Chakka, P., Dimov, P., Ding, H., Ferris, J., Giardullo, A., Kulkarni, S., Li, H. C., Marchukov, M., Petrov, D., Puzar, L., Song, Y. J., and Venkataramani, V. TAO: Facebook's Distributed Data Store for the Social Graph. In Proceedings of the 2013 USENIX Annual Technical Conference (ATC) (2013), pp. 49--60.
[7]
10 thousand connections problem. http//:www.kegel.com/c10k.html, 1999.
[8]
Linux cfs scheduler. https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt.
[9]
Chase, D., and Lev, Y. Dynamic circular work-stealing deque. In Proceedings of the 17th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2005), pp. 21--28.
[10]
Clements, A. T., Kaashoek, M. F., Zeldovich, N., Morris, R. T., and Kohler, E. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. ACM Trans. Comput. Syst. 32, 4 (2015), 10:1--10:47.
[11]
Contreras, G., and Martonosi, M. Characterizing and improving the performance of Intel Threading Building Blocks. In Proceedings of the 2008 IEEE International Symposium on Workload Characterization (IISWC) (2008), pp. 57--66.
[12]
Dean, J., and Barroso, L. A. The tail at scale. Commun. ACM 56, 2 (2013), 74--80.
[13]
Dinan, J., Larkins, D. B., Sadayappan, P., Krishnamoorthy, S., and Nieplocha, J. Scalable work stealing. In Proceedings of the 2009 ACM/IEEE Conference on Supercomputing (SC) (2009).
[14]
Data plane development kit. http//:www.dpdk.org/.
[15]
Dragojevic, A., Narayanan, D., Castro, M., and Hodson, O. FaRM: Fast Remote Memory. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 401--414.
[16]
Duda, K. J., and Cheriton, D. R. Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose schedular. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP) (1999), pp. 261--276.
[17]
Dunkels, A. Design and Implementation of the lwIP TCP/IP Stack. Swedish Institute of Computer Science 2 (2001), 77.
[18]
Eisenbud, D. E., Yi, C., Contavalli, C., Smith, C., Kononov, R., Mann-Hielscher, E., Cilingiroglu, A., Cheyney, B., Shang, W., and Hosein, J. D. Maglev: A Fast and Reliable Software Network Load Balancer. In Proceedings of the 13th Symposium on Networked Systems Design and Implementation (NSDI) (2016), pp. 523--535.
[19]
Epollexclusive kernel patch. https://lwn.net/Articles/667087/, 2015.
[20]
Färber, F., Cha, S. K., Primsch, J., Bornhövd, C., Sigg, S., and Lehner, W. SAP HANA database: data management for modern business applications. SIGMOD Record 40, 4 (2011), 45--51.
[21]
Gaud, F., Geneves, S., Lachaize, R., Lepers, B., Mottet, F., Muller, G., and Quéma, V. Efficient Workstealing for Multi-core Event-Driven Systems. In Proceedings of the 30th IEEE International Conference on Distributed Computing Systems (ICDCS) (2010), pp. 516--525.
[22]
Gordon, A., Amit, N., Har'El, N., Ben-Yehuda, M., Landau, A., Schuster, A., and Tsafrir, D. ELI: bare-metal performance for I/O virtualization. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVII) (2012), pp. 411--422.
[23]
gRPC. http//:www.grpc.io/.
[24]
Hanford, N., Ahuja, V., Balman, M., Farrens, M. K., Ghosal, D., Pouyoul, E., and Tierney, B. Characterizing the impact of end-system affinities on the end-to-end performance of highspeed flows. In Proceedings of the Third International Workshop on Network-Aware Data Management (2013), pp. 1:1--1:10.
[25]
Haque, M. E., Eom, Y. H., He, Y., Elnikety, S., Bianchini, R., and McKinley, K. S. Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XX) (2015), pp. 161--175.
[26]
Jeong, E., Woo, S., Jamshed, M. A., Jeong, H., Ihm, S., Han, D., and Park, K. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 489--502.
[27]
Kapoor, R., Porter, G., Tewari, M., Voelker, G. M., and Vahdat, A. Chronos: predictable low latency for data center applications. In Proceedings of the 2012 ACM Symposium on Cloud Computing (SOCC) (2012), p. 9.
[28]
Kernel connection multiplexer. https://lwn.net/Articles/657999/, 2015.
[29]
Kernel connection multiplexer patch. https://lwn.net/Articles/657970/, 2015.
[30]
Kivity, A., Kamay, Y., Laor, D., Lublin, U., and Liguori, A. kvm: the Linux virtual machine monitor. In Proceedings of the Linux symposium (2007), vol. 1, pp. 225--230.
[31]
Klimovic, A., Litz, H., and Kozyrakis, C. ReFlex: Remote Flash ≈ Local Flash. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXII) (2017), pp. 345--359.
[32]
Laadan, O., Nieh, J., and Viennot, N. Structured linux kernel projects for teaching operating systems concepts. In Proceedings of the 42nd ACM Technical Symposium on Computer Science Education (SIGCSE) (2011), pp. 287--292.
[33]
Leung, K.-C., Li, V. O. K., and Yang, D. An Overview of Packet Reordering in Transmission Control Protocol (TCP): Problems, Solutions, and Challenges. IEEE Trans. Parallel Distrib. Syst. 18, 4 (2007), 522--535.
[34]
Leverich, J., and Kozyrakis, C. Reconciling high server utilization and sub-millisecond quality-of-service. In Proceedings of the 2014 EuroSys Conference (2014), pp. 4:1--4:14.
[35]
Li, J., Agrawal, K., Elnikety, S., He, Y., Lee, I.-T. A., Lu, C., and McKinley, K. S. Work stealing for interactive services to meet target latency. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) (2016), pp. 14:1--14:13.
[36]
Li, J., Sharma, N. K., Ports, D. R. K., and Gribble, S. D. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the 2014 ACM Symposium on Cloud Computing (SOCC) (2014), pp. 9:1--9:14.
[37]
libevent. http://libevent.org/.
[38]
libuv. http://libuv.org/.
[39]
Lim, H., Han, D., Andersen, D. G., and Kaminsky, M. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 429--444.
[40]
Lu, Y., Xie, Q., Kliot, G., Geller, A., Larus, J. R., and Greenberg, A. G. Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services. Perform. Eval. 68, 11 (2011), 1056--1071.
[41]
Apache lucene. https://lucene.apache.org/.
[42]
Marinos, I., Watson, R. N. M., and Handley, M. Network stack specialization for performance. In Proceedings of the ACM SIGCOMM 2014 Conference (2014), pp. 175--186.
[43]
Memcached. https://memcached.org/.
[44]
In-memory mongodb. https://docs.mongodb.com/manual/core/inmemory/.
[45]
Nanavati, M., Wires, J., and Warfield, A. Decibel: Isolation and Sharing in Disaggregated Rack-Scale Storage. In Proceedings of the 14th Symposium on Networked Systems Design and Implementation (NSDI) (2017), pp. 17--33.
[46]
Nginx thread pool usage. https://www.nginx.com/blog/thread-pools-boost-performance-9x/.
[47]
Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Proceedings of the 10th Symposium on Networked Systems Design and Implementation (NSDI) (2013), pp. 385--398.
[48]
Ousterhout, K., Wendell, P., Zaharia, M., and Stoica, I. Sparrow: distributed, low latency scheduling. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP) (2013), pp. 69--84.
[49]
Pesterev, A., Strauss, J., Zeldovich, N., and Morris, R. T. Improving network connection locality on multicore systems. In Proceedings of the 2012 EuroSys Conference (2012), pp. 337--350.
[50]
Peter, S., Li, J., Zhang, I., Ports, D. R. K., Woos, D., Krishnamurthy, A., Anderson, T. E., and Roscoe, T. Arrakis: The Operating System Is the Control Plane. ACM Trans. Comput. Syst. 33, 4 (2016), 11:1--11:30.
[51]
Pf_ring. http//:www.ntop.org/products/packet-capture/pf_ring/.
[52]
Prekas, G., Belay, A., Primorac, M., Klimovic, A., Grossman, S., Kogias, M., Gütermann, B., Kozyrakis, C., and Bugnion, E. IX Open-source version 1.0 -- Deployment and Evaluation Guide. Tech. rep., EPFL Technical Report 218568, 2016.
[53]
Prekas, G., Primorac, M., Belay, A., Kozyrakis, C., and Bugnion, E. Energy proportionality and workload consolidation for latency-critical applications. In Proceedings of the 2015 ACM Symposium on Cloud Computing (SOCC) (2015), pp. 342--355.
[54]
Redis. https://redis.io/.
[55]
Rizzo, L. netmap: A Novel Framework for Fast Packet I/O. In Proceedings of the 2012 USENIX Annual Technical Conference (ATC) (2012), pp. 101--112.
[56]
Microsoft corp. receive side scaling. http://msdn.microsoft.com/library/windows/hardware/ff556942.aspx.
[57]
Schroeder, B., Wierman, A., and Harchol-Balter, M. Open Versus Closed: A Cautionary Tale. In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI) (2006).
[58]
ScillaDB Project. Seastar -- high-performance service-application framework. https://github.com/scylladb/seastar/.
[59]
Silo: Multicore in-memory storage engine. https://github.com/stephentu/silo.
[60]
Soares, L., and Stumm, M. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Proceedings of the 9th Symposium on Operating System Design and Implementation (OSDI) (2010), pp. 33--46.
[61]
Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos, S., Hachem, N., and Helland, P. The End of an Architectural Era (It's Time for a Complete Rewrite). In Proceedings of the 33rd International Conference on Very Large DataBases (VLDB) (2007), pp. 1150--1160.
[62]
Thekkath, C. A., Nguyen, T. D., Moy, E., and Lazowska, E. D. Implementing Network Protocols at User Level. In Proceedings of the ACM SIGCOMM 1993 Conference (1993), pp. 64--73.
[63]
TPC-C Benchmark. http//:www.tpc.org/tpcc/, 2010.
[64]
Tu, S., Zheng, W., Kohler, E., Liskov, B., and Madden, S. Speedy transactions in multicore in-memory databases. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP) (2013), pp. 18--32.
[65]
Uhlig, R., Neiger, G., Rodgers, D., Santoni, A. L., Martins, F. C. M., Anderson, A. V., Bennett, S. M., Kägi, A., Leung, F. H., and Smith, L. Intel Virtualization Technology. IEEE Computer 38, 5 (2005), 48--56.
[66]
Voltdb. https://www.voltdb.com/.
[67]
Wei, X., Shi, J., Chen, Y., Chen, R., and Chen, H. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP) (2015), pp. 87--104.
[68]
What'sapp 2m connections. https://https://blog.whatsapp.com/196/1-million-is-so-2011.
[69]
Wierman, A., and Zwart, B. Is Tail-Optimal Scheduling Possible? Operations Research 60, 5 (2012), 1249--1257.
[70]
Yang, X., Blackburn, S. M., and McKinley, K. S. Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading. In Proceedings of the 2016 USENIX Annual Technical Conference (ATC) (2016), pp. 309--322.
[71]
Yasukata, K., Honda, M., Santry, D., and Eggert, L. StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs. In Proceedings of the 2016 USENIX Annual Technical Conference (ATC) (2016), pp. 43--56.
[72]
Zeldovich, N., Yip, A., Dabek, F., Morris, R., Mazières, D., and Kaashoek, M. F. Multiprocessor Support for Event-Driven Programs. In USENIX Annual Technical Conference (2003), pp. 239--252.
[73]
Zhang, H., Dong, M., and Chen, H. Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication. In Proceedings of the 14th USENIX Conference on File and Storage Technologie (FAST) (2016), pp. 167--180.
[74]
Zygos kernel. https://github.com/ix-project/zygos, 2017.

Cited By

View all
  • (2025)Customizable function scheduling for serverless computingSCIENTIA SINICA Informationis10.1360/SSI-2024-033955:3(481)Online publication date: 27-Feb-2025
  • (2025)DORADD: Deterministic Parallel Execution in the Era of Microsecond-Scale ComputingProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710872(282-296)Online publication date: 28-Feb-2025
  • (2025)Rapid Data Ingestion through DB-OS Co-designProceedings of the ACM on Management of Data10.1145/37097183:1(1-28)Online publication date: 11-Feb-2025
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
October 2017
677 pages
ISBN:9781450350853
DOI:10.1145/3132747
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 October 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Microsecond-scale computing
  2. Tail latency

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOSP '17
Sponsor:

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%

Upcoming Conference

SOSP '25
ACM SIGOPS 31st Symposium on Operating Systems Principles
October 13 - 16, 2025
Seoul , Republic of Korea

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)785
  • Downloads (Last 6 weeks)129
Reflects downloads up to 08 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Customizable function scheduling for serverless computingSCIENTIA SINICA Informationis10.1360/SSI-2024-033955:3(481)Online publication date: 27-Feb-2025
  • (2025)DORADD: Deterministic Parallel Execution in the Era of Microsecond-Scale ComputingProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710872(282-296)Online publication date: 28-Feb-2025
  • (2025)Rapid Data Ingestion through DB-OS Co-designProceedings of the ACM on Management of Data10.1145/37097183:1(1-28)Online publication date: 11-Feb-2025
  • (2025)COREC: Concurrent non-blocking single-queue receive driver for low latency networkingComputer Networks10.1016/j.comnet.2024.110982258(110982)Online publication date: Feb-2025
  • (2024)OSMOSISProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3692007(247-263)Online publication date: 10-Jul-2024
  • (2024)Power-aware deep learning model serving with µ-serveProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3691997(75-93)Online publication date: 10-Jul-2024
  • (2024)ALPSProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3691994(19-36)Online publication date: 10-Jul-2024
  • (2024)Burstable cloud block storage with data processing unitsProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation10.5555/3691938.3691980(783-799)Online publication date: 10-Jul-2024
  • (2024)DSigProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation10.5555/3691938.3691974(667-685)Online publication date: 10-Jul-2024
  • (2024)Data-flow availabilityProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation10.5555/3691938.3691962(445-463)Online publication date: 10-Jul-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media