Dynamic random access memory (DRAM) is a major component of nearly all computing systems and is widely regarded as their largest bottleneck as data working set sizes continue to increase exponentially across application domains such as machine learning and genomics. The natural solution would be to scale main memory capacity, energy, cost, and performance in an efficient manner across technology generations. Unsurprisingly, doing so is very difficult, and thus the bottleneck continues to worsen with time. For example, the memory capacities required by large machine learning models increased by more than 10,000 times in the past five years. Unfortunately, DRAM technology scaling is becoming increasingly challenging: it is increasingly difficult to enlarge DRAM chip capacity at low cost while also maintaining DRAM performance, energy efficiency, and reliability. Thus, fulfilling the increasing memory needs of modern workloads is becoming increasingly costly and difficult. The first key concern is the difficulty of scaling DRAM capacity (i.e., density or cost per bit), bandwidth, and latency at the same time. While the processor core count doubles every two years, DRAM capacity doubles only every three years, and the latter trend is slowing down. This trend causes memory capacity per core to drop by approximately 30% every two years. The trend is even worse for memory bandwidth per core: in the approximately two decades between 1999 and 2017, DRAM chip storage capacity (for the most commonly-used DDRx chip of the time) improved by approximately 128x, while DRAM bandwidth improved by only approximately 20x.
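The widening capacity/bandwidth gap can be quantified directly from the figures above. The short sketch below uses only the approximate 1999-2017 improvement factors quoted in the text (128x capacity, 20x bandwidth) to show how much the bandwidth available per stored bit has shrunk:

```python
# Quantifying the DRAM capacity/bandwidth gap using the approximate
# 1999-2017 improvement factors quoted in the text.

capacity_gain = 128.0   # DRAM chip storage capacity: ~128x
bandwidth_gain = 20.0   # DRAM chip bandwidth: ~20x

# Bandwidth available per bit of capacity shrank by this factor:
bw_per_bit_drop = capacity_gain / bandwidth_gain
print(f"bandwidth per bit of capacity fell ~{bw_per_bit_drop:.1f}x")
# -> fell ~6.4x: each stored bit can be streamed out far less often
#    than two decades earlier, even before latency is considered.
```

In other words, even though chips store far more data, the rate at which that data can be moved off-chip has not kept pace, which is one concrete way to see the bottleneck described above.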
In the same period of about two decades, DRAM latency (as measured by the row cycling time) has remained almost constant (i.e., it has been reduced by only about 30%), making it a significant performance bottleneck for many modern workloads, including in-memory databases, graph processing, data analytics, data center workloads, neural networks, large language models, and consumer workloads. As low-latency computing becomes ever more important due to the ever-increasing need to process large amounts of data in real time, and as predictable performance remains a critical concern in the design of modern computing systems, it is increasingly important to design low-latency main memory chips.
A major reason for the main memory bottleneck is the high energy and latency cost associated with data movement. In modern computing systems, to perform any operation on data that resides in main memory, the processor must retrieve the data from main memory. This requires the memory controller to issue commands to the DRAM module across a relatively slow and power-hungry off-chip bus (known as the memory channel). The DRAM module sends the requested data across the memory channel, after which the data is placed into caches and registers. The CPU can perform computation on the data only once the data is in its registers. Data movement from DRAM to the CPU incurs long latency and consumes a significant amount of energy. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU or accelerators, providing little benefit in return for the high latency and energy cost.
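A rough energy model makes this imbalance concrete. The sketch below compares the energy of moving two operands from DRAM against the energy of the single arithmetic operation performed on them; the picojoule values are illustrative order-of-magnitude assumptions (roughly in line with commonly cited figures), not measurements of any particular system:

```python
# Rough model of where energy goes when the CPU adds two values
# that reside in main memory. The picojoule figures below are
# illustrative order-of-magnitude assumptions, not measurements.

E_DRAM_ACCESS_PJ = 640.0  # fetch one word over the memory channel
E_CACHE_FILL_PJ = 10.0    # place the fetched word into an on-chip cache
E_ADD_PJ = 0.1            # one integer add in the ALU

# Two operands moved from DRAM, one add performed on them:
movement = 2 * (E_DRAM_ACCESS_PJ + E_CACHE_FILL_PJ)
compute = E_ADD_PJ

print(f"data movement: {movement:.1f} pJ, compute: {compute:.1f} pJ")
print(f"movement is ~{movement / compute:.0f}x the compute cost")
```

Under these assumed costs, moving the data dominates the actual computation by roughly four orders of magnitude, which is the inefficiency that motivates reducing or eliminating data movement altogether.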
This cost of data movement is a fundamental issue with the processor-centric nature of contemporary computing systems. The CPU is considered to be the master of the system, and computation is performed only in the processor (and accelerators). In contrast, data storage and communication units, including the main memory, are treated as unintelligent workers that are incapable of computation. As a result of this processor-centric design paradigm, data moves extensively through the system (back and forth between the computation units and the communication/storage units) so that computation can be performed on it. Given the increasingly data-centric nature of contemporary and emerging applications, the processor-centric design paradigm leads to great inefficiency in performance, energy, and cost. For example, most of the real estate within a single compute node is already dedicated to handling data movement and storage (e.g., large caches, memory controllers, interconnects, communication interfaces and associated circuitry, and main memory).
These large overheads of data movement in modern systems, along with technology advances that enable better integration of memory and logic, have recently prompted the re-examination of an old idea that we will broadly call processing-in-memory (PIM). The key idea is to place computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, in the memory controllers, inside large caches, inside storage units, or inside sensing units), so that data movement between where the computation is done and where the data is stored is reduced or eliminated, compared to contemporary processor-centric systems. PIM enables the ability to perform operations and execute software tasks either by exploiting the operational properties of the memory circuitry itself or by adding processing logic in or near the memory.
PIM has been around for more than half a century; however, past efforts were never widely adopted.
As a result of advances in modern memory architectures, such as the integration of logic and memory in a 3D-stacked manner, various recent works explore a range of PIM architectures for multiple different use cases.
The first approach exploits the analog operational properties of memory circuitry to perform simple yet powerful common operations that the chip is inherently efficient at, or could be made efficient at, performing. This approach has the potential to provide large performance and energy gains with minimal changes to memory chips and circuitry. Some solutions that fall under this approach take advantage of the existing DRAM design to cleverly and efficiently perform bulk operations (i.e., operations on an entire row of DRAM cells), such as bulk copy, data initialization, bitwise Boolean operations, arithmetic operations, and lookup-table-based operations. Other solutions take advantage of the analog operational principles of SRAM, NAND flash, and emerging non-volatile memory technologies (e.g., phase-change memory (PCM), spin-transfer torque magnetic RAM (STT-MRAM), and metal-oxide resistive RAM (ReRAM)) to perform similar bulk operations or other specialized computations, such as convolutions and matrix multiplications.
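As a concrete illustration of this class of solutions, in-DRAM bulk bitwise schemes in the literature activate three DRAM rows simultaneously so that each bitline settles to the majority value of the three connected cells; with a control row of all zeros this majority yields bitwise AND, and with all ones it yields bitwise OR. The following is a minimal functional simulation of that idea (the row contents and the 8-bit row width are illustrative, not tied to any real device):

```python
# Functional simulation of majority-based in-DRAM bulk bitwise
# operations: activating three rows at once makes each bitline
# settle to MAJ(a, b, c), from which AND and OR follow.
# The row contents and 8-bit row width are illustrative only.

def maj3(a, b, c):
    """Bitwise majority of three equally sized rows (held as ints)."""
    return (a & b) | (b & c) | (a & c)

row_a = 0b10110010
row_b = 0b11010110

# MAJ with an all-zeros control row computes bulk AND:
assert maj3(row_a, row_b, 0b00000000) == (row_a & row_b)
# MAJ with an all-ones control row computes bulk OR:
assert maj3(row_a, row_b, 0b11111111) == (row_a | row_b)

print(f"AND: {maj3(row_a, row_b, 0x00):08b}")  # -> 10010010
print(f"OR:  {maj3(row_a, row_b, 0xFF):08b}")  # -> 11110110
```

Because every bit position of a row is processed in parallel by its own bitline, a single such activation performs an entire row-wide Boolean operation, which is the source of the large throughput and energy gains claimed for this approach.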
The second approach is potentially more general-purpose and flexible: it adds computation capability to conventional memory controllers, memory chips, memory modules, or the logic layer(s) of relatively new 3D-stacked memory technologies. This approach is especially catalyzed by recent advancements in 3D-stacked memory technologies that include a logic processing layer underneath the memory layers, and by recent prototypes that place computing capability inside DRAM chips and DRAM modules. In order to stack multiple layers of memory, 3D-stacked chips use vertical through-silicon vias (TSVs) to connect the layers to each other and to the I/O drivers of the chip. The TSVs provide much greater internal bandwidth within the 3D stack than is available externally on the memory channel. Several such 3D-stacked memory architectures, such as the Hybrid Memory Cube (HMC) and High-Bandwidth Memory (HBM), include a logic layer where designers can add processing logic (e.g., accelerators, simple cores, reconfigurable logic) to take advantage of this high internal bandwidth. Emerging die-stacking and packaging technologies, such as hybrid bonding and monolithic 3D integration, can amplify the benefits of this approach by greatly improving internal bandwidth across layers and potentially adding logic layers between memory layers.
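The appeal of computing in the logic layer can be sketched with a simple bandwidth-limited estimate: a streaming (memory-bound) kernel finishes in time proportional to data size divided by available bandwidth. The bandwidth and data-size figures below are illustrative assumptions for the sketch, not the specification of any particular 3D-stacked device:

```python
# Simple estimate of the speedup available to a bandwidth-bound
# kernel when it runs on the logic layer of a 3D stack instead of
# on the CPU. All figures are illustrative assumptions, not the
# specifications of any real device.

EXTERNAL_BW_GBS = 25.6   # conventional off-chip memory channel
INTERNAL_BW_GBS = 320.0  # aggregate TSV bandwidth inside the stack

data_gb = 8.0  # data streamed by the kernel (illustrative)

t_cpu = data_gb / EXTERNAL_BW_GBS  # limited by the memory channel
t_pim = data_gb / INTERNAL_BW_GBS  # limited by internal TSV bandwidth

print(f"CPU: {t_cpu * 1e3:.1f} ms, logic layer: {t_pim * 1e3:.1f} ms, "
      f"speedup: {t_cpu / t_pim:.1f}x")
```

Under these assumptions, the kernel's runtime shrinks in direct proportion to the bandwidth ratio (here 12.5x), which is why bandwidth-bound workloads are the natural first targets for logic-layer PIM.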
09/24/2024