SRAM as Main Memory


SRAM as Main Memory (2018-03)

Everyone knows that SRAM has incredible performance, but also considers it too expensive for main memory. Modern computing is diverse in its characteristics, and no single element will have a large and broad impact. For this reason, we should look at specific situations in which SRAM can have a large, high-value impact. The obvious candidate is database transaction processing, which is largely an exercise in pointer-chasing and is expected to benefit greatly from low latency memory. It is also expected to have hot spots, such that some intermediate amount of SRAM is sufficient to achieve a significant reduction in average memory latency, with the usual huge DRAM configuration handling the less critical needs. From

But there is more. All processors now have integrated memory controllers (IMC). This allows a single processor system to have low latency to memory. The other aspect is that a multi-processor (MP) system inherently has non-uniform memory access (NUMA). Average memory latency of the MP-NUMA configuration can still be better than in previous generation processor-systems with external memory controllers, though by a smaller margin than at the single processor level. The implication is that the first step in MP scaling from single to dual processors incurs a significant penalty on average memory latency.
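The dual-processor penalty can be illustrated with a simple weighted average. This is a minimal sketch, assuming memory accesses are spread evenly across all sockets (no NUMA-aware data placement). The latency values `LOCAL_NS` and `REMOTE_NS` are hypothetical placeholders chosen for round numbers, not measurements.

```python
def numa_average_latency(local_ns, remote_ns, sockets):
    """Average memory latency when accesses are spread evenly over
    the memory attached to every socket (no NUMA-aware placement)."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    local_fraction = 1.0 / sockets
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, for illustration only.
LOCAL_NS = 75.0    # assumed latency to memory on the local socket
REMOTE_NS = 125.0  # assumed latency to memory on a remote socket

single = numa_average_latency(LOCAL_NS, REMOTE_NS, sockets=1)
dual = numa_average_latency(LOCAL_NS, REMOTE_NS, sockets=2)

print(f"1 socket : {single:.1f} ns average")
print(f"2 sockets: {dual:.1f} ns average ({dual / single - 1:.0%} higher)")
```

With these assumed values, the step from one to two sockets raises average latency by a third, because half of all accesses now go to the remote node.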

When processors first added the IMC, the greater aggregate compute and memory capacity of MP systems was needed regardless of the advantages in performance-efficiency at the single processor level. Even in the MP configuration, the overall advantages of an integrated memory controller were strongly positive. Now that the individual processor is as powerful as it is, with more than twenty cores, it is questionable whether the default baseline system should be MP, even though this continues to be standard practice, as if it were an involuntary muscle action.

The new factor to consider is that a single processor system with some combination of SRAM (or other low latency technology) and conventional DRAM as main memory can match the throughput performance of a multi-processor system. In this scenario, the value of a 35-40% increase in performance via SRAM is not 35-40% but that of a second processor, as this is the expected performance gain of a 2-way system. The value of doubling performance is comparable to that of a 4-way MP system.

For 2017-18, the baseline processor is the Xeon SP with 28 cores, because the most cost-effective strategy when per-core software licensing applies is to first scale up the number of cores in a single processor system. This processor carries a price tag of $8,700 or $10,009 depending on the specific model. The SRAM solution also has better thread level performance, which has its own substantial value, and avoids severe issues that could occur on the more complex NUMA system architecture. In this scenario, the cost of SRAM is not at all prohibitive. Rather, it is likely to be very attractive.

Organizations spend tens to hundreds of millions of dollars or even more to develop and implement their line-of-business systems. These are applications that are often difficult to scale out. When a problem is encountered on the production server, almost any price would be paid for a hardware solution that can be dropped in to make the issue go away. Past experience with big-iron servers of 16-plus processors has been that complex systems are more likely to introduce new problems than solve existing ones. SRAM as memory leads to the simpler system with the ability to scale performance.

SRAM as Main Memory Objectives

The new baseline system is a single processor system. Where scale-up is concerned, the starting point is a high core count model. The current generation Xeon SP has 28 cores in the top model, with 2 memory controllers and 6 DDR4 SDRAM channels. In the new strategy, the system would have both SRAM and DRAM as main memory.

[Figure: SRAM_scaling]

The implementation might be similar to Intel's Knights Landing, aka Xeon Phi x200 series, in which the in-package MCDRAM can act as: 1) a cache, 2) a separate node in a flat memory model, or 3) a hybrid of the two. The difference here, aside from the Skylake versus Atom core, is that we are interested in very low latency instead of just high bandwidth.

[Figure: knights_landing_mesh_rs]

The general characteristic for the frequency of access versus incremental memory is expected to be something like the following. The initial amount is accessed in almost every operation, and may include system tables and index root level pages. There is a middle range in which data is accessed frequently, possibly representing index intermediate level pages and other hot data. Any further incremental memory is only for data accessed infrequently by DRAM standards (<1M per sec).

[Figure: mem_freq1]
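The effect of such a skewed access distribution on average latency can be sketched as a weighted sum over memory tiers. This is an illustration under stated assumptions only: the latencies `SRAM_NS` and `DRAM_NS` and the 75/25 access split are hypothetical values, not figures from any measured workload.

```python
def average_latency(tiers):
    """Weighted average latency over memory tiers.

    tiers: list of (fraction_of_accesses, latency_ns) pairs whose
    fractions must sum to 1.
    """
    total = sum(fraction for fraction, _ in tiers)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(fraction * latency for fraction, latency in tiers)


# Hypothetical values, for illustration only.
SRAM_NS = 40.0  # assumed system-level latency to the SRAM tier
DRAM_NS = 80.0  # assumed system-level latency to conventional DRAM

baseline = average_latency([(1.0, DRAM_NS)])
hybrid = average_latency([(0.75, SRAM_NS), (0.25, DRAM_NS)])
reduction = 1.0 - hybrid / baseline

print(f"DRAM only  : {baseline:.1f} ns")
print(f"SRAM + DRAM: {hybrid:.1f} ns ({reduction:.1%} lower)")
```

The point of the sketch is that the reduction depends on the fraction of accesses captured by the fast tier, not on the fraction of capacity it represents.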

Depending on the actual memory access characteristics, the right solution might involve some combination of SRAM, RL-DRAM/eDRAM, conventional DRAM, and one of the non-volatile options like 3D XPoint. RL-DRAM has an SRAM-like interface (no multiplexed row and column addresses), and presumably Intel will have some form of 3D XPoint with a DRAM interface. There should not be more than two types of memory interface on the processor. If conventional DRAM is not necessary, then the processor might have just the SRAM-type interface.

The objective is to reduce average memory latency by 25-50%. It is not necessary to achieve this exact range. For pointer-chasing code, in which run time is dominated by memory latency, a 25% reduction in average memory access latency corresponds to a 33% performance gain, and a 33% reduction corresponds to a 50% gain. This is the range that would give the new system about the same performance as a 2-way system with just conventional DRAM memory. In turn, it would then be easy to justify valuing the combined single processor plus SRAM complex at double that of the base processor.
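The relationship between latency reduction and performance follows from simple arithmetic. The sketch below assumes the idealized case in which execution time is entirely proportional to average memory latency (pure pointer-chasing), so that performance scales as the inverse of latency. The function name is illustrative.

```python
def speedup_from_latency_reduction(reduction):
    """Performance ratio (new / old) when average memory latency drops
    by `reduction` (0.25 means 25% lower) and run time is assumed to be
    proportional to latency."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


for reduction in (0.25, 1.0 / 3.0, 0.50):
    gain = speedup_from_latency_reduction(reduction) - 1.0
    print(f"{reduction:.0%} lower latency -> {gain:.0%} more throughput")
```

Real workloads spend some time outside of memory stalls, so the actual gain would be somewhat lower than this upper bound.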

Software licensing is factored into this. Per-core licensing for 28 cores could be on the order of $100K. Recall that the advent of multi-core processors led software vendors to change from a per-processor to a per-core licensing model. This is why it would be important if the SRAM-as-cache mode is effective, as it would then be possible to employ this as a drop-in solution for existing software. Long-term, the flat mode, in which the software understands that each of the memory nodes has different characteristics and knows what to put where, could be a better solution.

Should it be possible to achieve a 50% or slightly better reduction of average memory latency, the performance and value would be about equal to a 4-way system. Depending on whether only the hardware equivalent is counted or software licensing is also included, this could be $40K or $300K. The value could also be one million dollars if the simpler system resolves a problem in the production system that the conventional 4-way system could not. This is the nature of the corporate IT world.

SRAM Density

The proposal here is only that SRAM cost is not an issue given its value in the database transaction processing workload. Still, the curious would like to speculate with some parameters. First is SRAM density.

Intel has 3 types of SRAM at 14nm: High-Perf, High-Density and Low-voltage. The last two have SRAM bit cell sizes of 0.0499µm² and 0.0588µm² respectively. See Intel 14nm process and Wikichip 14nm lithography. From this, 1MB with ECC is 1024 × 1024 × (8+1) cells, or about 0.555 sq. mm for the LV version (0.529 sq. mm if 1MB is taken as 10^6 bytes).
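The bit cell array area can be checked with a few lines of arithmetic. This sketch uses the two cell sizes quoted above and 9 bits per byte (8 data + 1 ECC). It counts only the bit cells, with no allowance for sense amplifiers, decoders or tags. Note that with 1MB taken as 1024 × 1024 bytes the LV figure comes to about 0.555 sq. mm, while 10^6 bytes gives 0.529 sq. mm.

```python
# Intel 14nm SRAM bit cell sizes in square micrometers, as quoted in the text.
CELL_AREA_UM2 = {"HD": 0.0499, "LV": 0.0588}


def array_area_mm2(bytes_stored, cell_um2, bits_per_byte=9):
    """Area of the raw bit cell array in sq. mm.

    bits_per_byte=9 models 8 data bits plus 1 ECC bit per byte.
    Peripheral circuitry is not included.
    """
    cells = bytes_stored * bits_per_byte
    return cells * cell_um2 / 1e6  # 1 sq. mm = 1e6 sq. um


MIB = 1024 * 1024
for name, cell in CELL_AREA_UM2.items():
    binary = array_area_mm2(MIB, cell)
    decimal = array_area_mm2(10**6, cell)
    print(f"{name}: {binary:.3f} sq. mm per MiB, {decimal:.3f} sq. mm per 10^6 bytes")
```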

A visual inspection of the Intel Skylake die shown on Wikichip suggests that 1MB of L3, data and ECC, is 1.2 sq. mm. It is assumed that the structures above and below the ring interconnect are the L3 tags.

[Figure: Skylake_2core_group2]

The figure below, from Wikichip Intel’s 10nm ..., original source IEDM 2017 + ISSCC 2018, shows that there are elements in addition to the bit cell array.

[Figure: Intel 10nm]

The visual estimate of 1.2 sq. mm is then reasonable based on Intel's published SRAM bit cell size. A 512MB SRAM would then be 614 sq. mm. The question of whether we want a few jumbo dies or many small dies is not discussed here.
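The scaling from the 1.2 sq. mm per MB visual estimate to a full die is a straight multiplication, sketched below. The sketch assumes area scales linearly with capacity, with no extra allowance for die-level I/O or interconnect, which is a simplification.

```python
L3_MM2_PER_MB = 1.2  # visual estimate from the Skylake die, data plus ECC


def sram_area_mm2(capacity_mb, mm2_per_mb=L3_MM2_PER_MB):
    """Estimated silicon area for a given SRAM capacity, assuming area
    scales linearly with capacity."""
    return capacity_mb * mm2_per_mb


for capacity_mb in (64, 128, 256, 512):
    print(f"{capacity_mb:4d} MB -> {sram_area_mm2(capacity_mb):6.1f} sq. mm")
```

The 512MB case lands close to the reticle-sized dies seen in large server processors, which is why the choice between a few large dies and many small ones matters.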

The IBM z14 has an SC chip, 14nm process, 696 sq. mm with 672MB eDRAM as L4 cache. Hot Chips 29.

In A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches by S. Mittal, J. Vetter and D. Li, the SRAM cell size is 120-200F² and eDRAM is 60-100F², where F is the feature/process dimension. DRAM is around 6-8F². So, the SRAM bit cell is 20-30 times larger than DRAM? And eDRAM could have twice the density of SRAM?

The F² comparison between DRAM and SRAM is not strictly like-for-like. Processors (logic) and their accompanying SRAM are currently manufactured on a 14nm process (Intel), while DRAM is perhaps at 28nm. FYI, the Intel 14nm HD SRAM cell at 0.0499µm² corresponds to 255F².
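The conversion from an absolute cell size to multiples of F² is shown below as a small sketch. It takes the nominal node name (14nm) as F, which is an assumption, since node names no longer map directly to a physical feature dimension.

```python
def cell_size_in_f2(cell_um2, feature_nm):
    """Express a bit cell area as a multiple of F squared, where F is
    the nominal process feature size in nanometers."""
    cell_nm2 = cell_um2 * 1e6  # 1 sq. um = 1e6 sq. nm
    return cell_nm2 / feature_nm**2


hd_f2 = cell_size_in_f2(0.0499, 14)  # Intel 14nm High-Density cell
lv_f2 = cell_size_in_f2(0.0588, 14)  # Intel 14nm Low-voltage cell

print(f"HD cell: {hd_f2:.0f} F^2")
print(f"LV cell: {lv_f2:.0f} F^2")
```

Both values fall above the 120-200F² range cited from the survey, which is consistent with the caveat that F² figures are not directly comparable across processes.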

SRAM Cost

Motley Fool estimates the cost structure of Intel’s 14nm process at $9,100 per 300mm wafer. The Silicon Edge die-per-wafer estimator says 86 die of dimensions 17.6 x 35mm fit on a 300mm wafer. Assuming the SRAM is made with spare banks, the expectation is that yield is high even for the large 614 sq. mm die.

As a rough approximation, the end-user cost of SRAM is assumed to be in the $1000-1600/GB range. The current end-user cost of ECC DRAM is about $16/GB (Crucial, 2018 Jan). DRAM Exchange has the cost of the DDR4 8Gb chip at around $9. In 2015, DRAM Prices Down, it was $4.30/GB and even lower in 2012?
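For a sense of scale, the wafer figures above can be turned into a raw silicon cost per GB and compared with the assumed end-user prices. This is a back-of-envelope sketch: it assumes 512MB per 614 sq. mm die and, as a placeholder, 100% yield. Packaging, test and margin are not modeled, so the result is a floor on cost, not a price estimate.

```python
WAFER_COST_USD = 9100.0  # Motley Fool estimate for Intel 14nm, 300mm wafer
DIES_PER_WAFER = 86      # 17.6 x 35 mm die, per the die-per-wafer estimator
GB_PER_DIE = 0.5         # 512MB per die, from the density estimate


def silicon_cost_per_gb(wafer_cost, dies_per_wafer, gb_per_die, yield_fraction=1.0):
    """Raw wafer cost per GB of SRAM. yield_fraction=1.0 is a placeholder
    assumption; packaging, test and margin are excluded."""
    good_dies = dies_per_wafer * yield_fraction
    return wafer_cost / good_dies / gb_per_die


raw_cost = silicon_cost_per_gb(WAFER_COST_USD, DIES_PER_WAFER, GB_PER_DIE)
print(f"Raw silicon cost: ${raw_cost:.0f}/GB")

DRAM_USD_PER_GB = 16.0  # end-user ECC DRAM price quoted in the text
for sram_usd_per_gb in (1000.0, 1600.0):
    ratio = sram_usd_per_gb / DRAM_USD_PER_GB
    print(f"SRAM at ${sram_usd_per_gb:.0f}/GB is {ratio:.1f}x the DRAM price")
```

Under these assumptions the assumed $1000-1600/GB end-user range is roughly 5 to 8 times the raw wafer cost per GB.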

SRAM On-package?

In acknowledgment that SRAM is expensive, the strategy needs to achieve maximum effect. Putting SRAM in the processor package might reduce transmission delays? In Knights Landing, the MCDRAM was placed in the package to enable maximum bandwidth.

[Figure: SRAM3]

Intel's Embedded Multi-Die Interconnect Bridge (EMIB) has 55µm bumps versus 130µm bumps for signals going off package, allowing for higher signal density. It is unclear if EMIB also lowers transmission delays, as MCDRAM was not targeting latency. Intel expects to decrease EMIB bump size in following generations, possibly to as small as 10µm. Perhaps a future generation of EMIB will help lower transmission delays.

Depending on actual SRAM density, we might be able to put 5-10GB of SRAM in close proximity to a large die processor. If SRAM can be stacked, then perhaps 2-4X more is possible?

Summary

Many of the old rules in system performance have long since become obsolete. What is important now is memory latency. A reasonably achievable objective of 25% to 50% lower average latency would allow a single processor system to equal the throughput performance of 2-way and 4-way multi-processor systems respectively. The hardware-only equivalent value of this is on the order of $10-30K. From this point of view, SRAM is not expensive, and may even be a bargain. When per-core software licensing is factored in, the value could be astronomical.

There are alternatives to recover the throughput performance capability of modern processor cores. But these all involve greater complexity and do nothing for thread level performance. The SRAM as main memory solution directly improves thread level performance which has its own value, and is the solution with the simplest hardware architecture.

Addendum

The Intel 1Gbit eDRAM is cited as 3ns tRC? SRAM may be better, but this is of little relevance because the system level latency in the path from core to memory controllers to a large SRAM or eDRAM array will be much higher than the cycle time of the device itself. It might be that the distinction between the two is that SRAM does not require a refresh? What is the impact, and does this justify a 2X difference?

Electroiq.com the-most-expensive-sram-in-the-world