Big Memory Solution: The End of IO


Design Goals

Bringing a promising new cutting-edge technology from the lab to market 

We envision a product conceptually similar to Intel’s Xeon Phi or NVIDIA’s GPU products, but one that accelerates CPU work rather than off-board parallel processing.

Enhancing an existing product by infusing one or more new technologies

Our solution will dovetail seamlessly with existing designs and practices.

Proposing a next-generation architecture or assessing threats to the dominant design

We believe we can make a product that fits this problem well and optimizes the trade space (pushing it toward the Pareto frontier).

Investigating a troubling or dysfunctional product and suggesting significant improvements

This is a great improvement over earlier attempts in this space. We can make a Maximum Viable Product for this market.


Extension / explanation of Big Memory Cache topics


The fastest way to get data to a computer’s CPU is via cache. L1, L2, and L3 cache improvements are largely the purview of the big chip companies like AMD and Intel. Quite recently, however, a path has appeared for open innovators, whether individuals or university teams, to develop L4 cache solutions that would greatly increase the performance and efficiency of existing compute models as well as facilitate new ones.


Who are the incumbents in this space?


OEMs like Dell and HPE (insofar as they integrate AMD processors). Composable-computing companies like GigaIO, which has an interesting approach to this. Also, Intel Optane integrators like MemVerge use some of the same rhetoric, although they are actually developing the opposite tech: using boards on the memory bus as non-volatile memory rather than using NVMe as “RAM”.


There is no one in the OpenSuperComputing space, probably because this tech either competes with an existing product of theirs or the perceived market is too small.


The case for improving L1 and L2 cache has been around for a long time, and the OEMs, ODMs, and ISVs have done wonders in that space. Recently they have started making the case for improving L3 and the L2/L3 interaction. We make the case for adding L4.


IBM has experimented with enlarging and speeding up L3 and L4. We propose to expand on that concept.

https://www.youtube.com/watch?v=z6u_oNIXFuU#t=10m7s

Part 1 of this video is actually a good tutorial on cache hierarchies.



Why us (the local research consortium)?


The center of excellence on this tech is clearly at MIT/BU. See these presentations:


https://math.mit.edu/crib/CRIBJun.pdf

https://ieee-hpec.org/prelimagenda2021.html#1-P



Why does this opportunity exist now and not earlier?


There have been explorations in the past: Intel inserted an L4 cache into its Broadwell line,

https://en.wikipedia.org/wiki/CPU_cache#MULTILEVEL

but for many technical and business reasons it did not catch on.


We now have at least one commodity vendor of big L3 caches that can take advantage of this. The AMD Milan-X is not only an enabling technology for this approach but also the first such product to use huge stacked 3D V-Cache dies.

https://www.youtube.com/watch?v=ZEDKNtt-erk#t=2m10s


Two of the production codes we use heavily at MIT, OpenFOAM and WRF, will improve by about 20% if we only improve the L3 cache, and it should double that gain if we install an L4 cache.

https://www.nextplatform.com/2021/11/08/microsoft-azure-brings-the-cache-with-amd-milan-x/


Also, PCIe 5.0-compliant products will start appearing in servers in Q1 2022. Now, for the first time, that data bus is rated at higher bandwidth than the memory bus. So a memory technology on that bus can serve data faster than the equivalent memory-bus (DDR4/DDR5) technology.
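As a rough sanity check on that bandwidth claim, here is a back-of-the-envelope sketch in C comparing a PCIe 5.0 x16 link against a single DDR4-3200 or DDR5-4800 channel. The exact figures depend on lane count, channel count, and encoding overhead, and multi-channel memory configurations will of course scale the DDR numbers up.

    #include <stdio.h>

    /* Back-of-the-envelope bandwidth comparison, per link / per channel.
     * PCIe 5.0: 32 GT/s per lane with 128b/130b encoding, x16 link.
     * DDR4/DDR5: transfer rate (MT/s) times the 8-byte bus width of one channel. */
    int main(void) {
        double pcie5_lane_gbit = 32.0 * (128.0 / 130.0);        /* ~31.5 Gbit/s usable per lane */
        double pcie5_x16_gbyte = pcie5_lane_gbit * 16.0 / 8.0;  /* ~63 GB/s per direction */
        double ddr4_3200_gbyte = 3200.0 * 8.0 / 1000.0;         /* 25.6 GB/s per channel */
        double ddr5_4800_gbyte = 4800.0 * 8.0 / 1000.0;         /* 38.4 GB/s per channel */

        printf("PCIe 5.0 x16: %.1f GB/s per direction\n", pcie5_x16_gbyte);
        printf("DDR4-3200:    %.1f GB/s per channel\n", ddr4_3200_gbyte);
        printf("DDR5-4800:    %.1f GB/s per channel\n", ddr5_4800_gbyte);
        return 0;
    }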


The latency argument (that the DDR memory bus has lower latency than the PCIe bus) is addressed through speculative loading and execution. The metric of value in all computing is how much work you can get done per clock cycle; this is the justification for modern caches. If the data can be found within a small number of cycles, that is preferable to having the processor go out to RAM to find it.
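To illustrate the speculative-loading idea, here is a small C sketch (not our design) of the underlying pattern: request data several iterations before it is needed so the fetch overlaps with useful work. The prefetch distance is an arbitrary placeholder, and __builtin_prefetch is a GCC/Clang hint; hardware prefetchers and speculative execution do the analogous thing for a cache tier without any source changes.

    #include <stddef.h>

    /* Sketch: hide access latency by requesting data ahead of its use so the
     * fetch overlaps with work on earlier elements. */
    double sum_with_prefetch(const double *data, size_t n) {
        const size_t DIST = 16;   /* placeholder prefetch distance, not a tuned value */
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&data[i + DIST], 0, 1);  /* read, low temporal locality */
            total += data[i];     /* useful work proceeds while the later line is in flight */
        }
        return total;
    }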


Here are the relative cache latencies (a simple access-time model using these numbers follows the list):

L1: 3-5 cycles (fastest, but smallest)

L2: 11-19 cycles (faster, but smaller)

L3: 40-50 cycles (fast, but small)

L4: TBD (fast and large), but we know it will be a lot less than the alternative, which is

RAM: 150-300 cycles (slow, and small relative to the proposed L4)

Fixed disk IO: historically the slowest and highest latency
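To see how an L4 tier between L3 and RAM pulls the average down, here is a simple average-memory-access-time (AMAT) model in C. The cycle counts are midpoints of the ranges above; the miss rates and the 80-cycle L4 latency are illustrative assumptions, not measurements.

    #include <stdio.h>

    /* Average memory access time, walking down the hierarchy:
     *   AMAT = L1 + miss(L1) * (L2 + miss(L2) * (L3 + miss(L3) * next_tier))
     * Latencies are midpoints of the ranges listed above; miss rates and the
     * L4 latency are illustrative assumptions only. */
    int main(void) {
        double l1 = 4, l2 = 15, l3 = 45, l4 = 80, ram = 225;  /* cycles */
        double m1 = 0.10, m2 = 0.30, m3 = 0.40, m4 = 0.25;    /* assumed miss rates */

        double without_l4 = l1 + m1 * (l2 + m2 * (l3 + m3 * ram));
        double with_l4    = l1 + m1 * (l2 + m2 * (l3 + m3 * (l4 + m4 * ram)));

        printf("AMAT without L4: %.2f cycles\n", without_l4);
        printf("AMAT with L4:    %.2f cycles\n", with_l4);
        return 0;
    }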


Going to RAM is suboptimal, but hitting the hard drive, whether through paging, swap, or some other virtual-memory mechanism, is devastating to performance. That is why we aspirationally call this project “The End of IO,” which, if achieved, would usher in a new age of computing and put the industry back on a Moore’s Law trajectory.