Article

FaRM: fast remote memory

Authors: Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro

NSDI'14: Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Pages 401 - 414

Published: 02 April 2014

Abstract

We describe the design and implementation of FaRM, a new main memory distributed computing platform that exploits RDMA to improve both latency and throughput by an order of magnitude relative to state-of-the-art main memory systems that use TCP/IP. FaRM exposes the memory of machines in the cluster as a shared address space. Applications can use transactions to allocate, read, write, and free objects in the address space with location transparency. We expect this simple programming model to be sufficient for most application code. FaRM provides two mechanisms to improve performance where required: lock-free reads over RDMA, and support for collocating objects and function shipping to enable the use of efficient single-machine transactions. FaRM uses RDMA both to directly access data in the shared address space and for fast messaging, and it is carefully tuned for the best RDMA performance. We used FaRM to build a key-value store and a graph store similar to Facebook's. Both perform well: for example, a 20-machine cluster can perform 167 million key-value lookups per second with a latency of 31µs.
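The programming model in the abstract (transactions that allocate, read, write, and free objects in a shared address space, with location transparency) can be illustrated with a toy, single-process sketch. Everything below is hypothetical: the names `Cluster`, `Transaction`, and `Conflict`, the round-robin placement, and the version-check commit rule are assumptions made for illustration. None of it is FaRM's API or protocol, and no RDMA is involved.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical address: (machine id, slot id). Application code treats it as
# opaque, which is what "location transparency" means in this sketch.
Addr = Tuple[int, int]


class Conflict(Exception):
    """Raised at commit when an object read by the transaction has changed."""


class Cluster:
    """Toy stand-in for the memory of several machines (all in one process)."""

    def __init__(self, num_machines: int) -> None:
        self.num_machines = num_machines
        # machine id -> slot id -> (version, value)
        self.memory: Dict[int, Dict[int, Tuple[int, Any]]] = {
            m: {} for m in range(num_machines)
        }
        self.next_slot = 0

    def begin(self) -> "Transaction":
        return Transaction(self)


class Transaction:
    """Buffers writes and frees; validates read versions at commit.

    The optimistic version check is an assumption of this sketch.
    """

    def __init__(self, cluster: Cluster) -> None:
        self.cluster = cluster
        self.read_versions: Dict[Addr, int] = {}
        self.writes: Dict[Addr, Any] = {}
        self.frees: Set[Addr] = set()

    def alloc(self, value: Any) -> Addr:
        slot = self.cluster.next_slot
        self.cluster.next_slot += 1
        # Round-robin placement across machines (assumed policy).
        addr = (slot % self.cluster.num_machines, slot)
        self.writes[addr] = value
        return addr

    def read(self, addr: Addr) -> Any:
        if addr in self.frees:
            raise KeyError(addr)
        if addr in self.writes:
            return self.writes[addr]  # read-your-own-writes
        machine, slot = addr
        version, value = self.cluster.memory[machine][slot]
        self.read_versions.setdefault(addr, version)
        return value

    def write(self, addr: Addr, value: Any) -> None:
        if addr in self.frees:
            raise KeyError(addr)
        self.writes[addr] = value

    def free(self, addr: Addr) -> None:
        self.writes.pop(addr, None)
        self.frees.add(addr)

    def commit(self) -> None:
        # Validate: every object read must still be at the version we saw.
        for (machine, slot), seen in self.read_versions.items():
            current = self.cluster.memory[machine].get(slot)
            if current is None or current[0] != seen:
                raise Conflict((machine, slot))
        # Apply buffered writes, bumping each object's version.
        for (machine, slot), value in self.writes.items():
            old = self.cluster.memory[machine].get(slot)
            version = old[0] + 1 if old is not None else 1
            self.cluster.memory[machine][slot] = (version, value)
        # Apply frees.
        for machine, slot in self.frees:
            self.cluster.memory[machine].pop(slot, None)


# Usage: two transactions race on one object; the second to commit aborts.
cluster = Cluster(num_machines=3)
setup = cluster.begin()
account = setup.alloc(10)
setup.commit()

t1, t2 = cluster.begin(), cluster.begin()
t1.write(account, t1.read(account) + 1)
t2.write(account, t2.read(account) + 2)
t1.commit()
try:
    t2.commit()
    t2_aborted = False
except Conflict:
    t2_aborted = True

final_value = cluster.begin().read(account)
```

In this sketch the caller never inspects which machine holds `account`; the address is only passed back to `read`, `write`, and `free`. A real system would also need atomic commit across machines and failure handling, which are deliberately left out here.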



Published In

NSDI'14: Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

April 2014

546 pages

ISBN: 9781931971096

Program Chairs: Ratul Mahajan (Microsoft Research), Ion Stoica (University of California at Berkeley)

Sponsors: USENIX Assoc

Publisher: USENIX Association, United States
