One of the main ways to increase system performance is to minimise how far down the memory hierarchy one has to go to manipulate data. It's not just system-level programmers who need to be aware of these issues, as most systems have a time/cost requirement, be it how fast your web application responds, or how many racks you need in your data center.

In this era of multiple CPUs per system, things are further complicated for programmers by memory contention between CPUs, and virtualization adds complications of its own. Consider the following diagram, which shows the current memory hierarchy in a 4 socket by 4 core system, which Ulrich Drepper, in his excellent paper on computer memory, mentions is going to be a common configuration.

current memory hierarchy of a 4 socket 4 core system
[Update Sep 2010: Note the organisation of cache levels in multi-core CPUs can vary quite a bit]
[Update Oct 2010: hwloc is a handy tool for automatically generating diagrams like these]
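
As an illustration, the same topology information can also be queried programmatically via hwloc's C API. The following is a minimal sketch (my own, not taken from the hwloc docs) that just counts cores and hardware threads; sockets and cache levels can be walked similarly. Link with -lhwloc.

    /* topo.c - summarise the CPU topology with hwloc.
       Build (assuming hwloc is installed): cc topo.c -lhwloc -o topo */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);   /* allocate a topology context */
        hwloc_topology_load(topo);    /* detect the current machine */

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        int npus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
        printf("cores: %d, hardware threads: %d\n", ncores, npus);

        hwloc_topology_destroy(topo);
        return 0;
    }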

Until recently we've just had incremental improvements to the performance (not size) of RAM and mechanical hard disks, and CPU performance has diverged from them considerably. So changes to the memory hierarchy would both speed systems up a lot and simplify software running on the CPU. It's these exciting changes, happening now and over the next few years, that I'm focusing on here.

Solid State Disks

Consider for example how SSDs affect processing of a large file on a multi-core system. Because random seeks incur no extra cost on SSDs, unlike on mechanical disks, it's sensible for multiple cores to process separate portions of a file directly. With mechanical disks each core would just be fighting over the disk head, and would slow down a lot compared to a single core processing the file. In other words, data partitioning to take advantage of multiple cores is much more complicated for mechanical disks than for SSDs, requiring more complex logic and arrays of disks to achieve parallelization. Note for certain operations like sorting, one has to take RAM size into account, so the cores should process chunks of the file in parallel where each chunk is ((ram size/num cpus) - a bit). For other operations, like searching, RAM size is not a factor, and one can just split the file into chunks of (file size/num cpus), as sketched below. [Update Dec 2012: Given the widening disparity between traditional disks and SSDs, they're separating out into distinct layers in the memory hierarchy. To take advantage of this, hybrid drives are becoming available, as is software to transparently combine separate drives, like SRT or Linux solutions like bcache.]
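
To make the searching case concrete, here's a minimal sketch in C with pthreads, assuming a text file and glossing over matches that straddle chunk boundaries: the file is split into (file size/num cpus) chunks and each thread scans its own chunk with pread(), so there's no shared file offset to contend on. A real implementation would stream each chunk in smaller blocks rather than reading it into RAM in one go.

    /* psearch.c - sketch of searching a file in parallel, one chunk per core.
       Build: cc psearch.c -o psearch -lpthread
       Usage: ./psearch <file> <word>
       Matches spanning a chunk boundary aren't handled, and error checking
       is minimal -- this just illustrates the data partitioning. */
    #define _XOPEN_SOURCE 700
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    #define NTHREADS 4                      /* e.g. one thread per core */

    struct chunk { int fd; off_t start, len; const char *word; long hits; };

    static void *search_chunk(void *arg)
    {
        struct chunk *c = arg;
        char *buf = malloc(c->len + 1);
        /* pread() reads at an explicit offset, so threads don't
           contend on a shared file position */
        ssize_t n = pread(c->fd, buf, c->len, c->start);
        if (n > 0) {
            buf[n] = '\0';
            size_t wlen = strlen(c->word);
            for (char *p = buf; (p = strstr(p, c->word)); p += wlen)
                c->hits++;
        }
        free(buf);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file word\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        pthread_t tid[NTHREADS];
        struct chunk c[NTHREADS];
        off_t chunk = st.st_size / NTHREADS;    /* (file size/num cpus) */

        for (int i = 0; i < NTHREADS; i++) {
            c[i].fd = fd;
            c[i].start = i * chunk;
            /* the last chunk picks up any remainder */
            c[i].len = (i == NTHREADS - 1) ? st.st_size - c[i].start : chunk;
            c[i].word = argv[2];
            c[i].hits = 0;
            pthread_create(&tid[i], NULL, search_chunk, &c[i]);
        }

        long total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += c[i].hits;
        }
        printf("%ld occurrences of \"%s\"\n", total, argv[2]);
        return 0;
    }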

2 Transistor DRAM

2T DRAM, currently being developed by Intel, has the potential to enhance at least the caches in CPUs. You can see in the diagram above that the level 2 cache can be used both to speed up access to the relatively slow RAM and to speed up communication between cores in a single processor. When this memory wall is lowered it again gives the opportunity to use different algorithms, especially on multi-core systems. Tian Tian of Intel has written a good article on how shared caches enhance a multi-core system and how programmers can take further advantage of them. There is also another good ACM article on optimizing application performance in the presence of caches, and this excellent presentation on lock-free algorithms that takes the current memory hierarchy into consideration. [Update Dec 2008: I noticed an IEEE reference to a Sandia National Laboratories simulation, which showed that for many applications the memory wall with current architectures causes performance to decline with more than 8 processors, so it looks like technology like 2T DRAM will be required in the near future.]
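
One concrete example of the kind of cache-aware programming these articles cover is avoiding false sharing, where independent per-core data happens to sit in the same cache line and every write forces that line to bounce between cores. A rough sketch of the fix (GCC-specific alignment attribute, 64-byte cache line assumed; build with cc false_sharing.c -lpthread):

    /* false_sharing.c - pad per-thread counters so each sits in its
       own cache line.  If they shared a line, every increment on one
       core would invalidate that line in the other cores' caches. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define CACHE_LINE 64                     /* typical x86 line size */
    #define ITERS      10000000L

    struct counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* keep neighbours off this line */
    } __attribute__((aligned(CACHE_LINE)));

    static struct counter counters[NTHREADS];

    static void *work(void *arg)
    {
        struct counter *c = arg;
        for (long i = 0; i < ITERS; i++)
            c->value++;                       /* hot write, to a private line */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, work, &counters[i]);

        long total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += counters[i].value;
        }
        printf("total = %ld\n", total);
        return 0;
    }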

MRAM and Memristors

These technologies have the potential to be the biggest game changers. They're essentially very fast non-volatile memory, and so will affect both current RAM and flash technologies.

MRAM has been in development for a while, and while it's about twice as fast as current RAM technologies, it's much more expensive. However, researchers in Germany have recently figured out how to make it 10 times faster again!

Memristors have recently been created by HP Labs, and again they have the potential to be a fast, dense, cheap, non-volatile memory. The memristor was first theorized in 1971 by Leon Chua as a fourth fundamental circuit element, having properties that cannot be achieved by any combination of the other three elements (resistor, inductor, capacitor). [Update Sep 2010: Memristors will be available by 2014 apparently.] [Update Nov 2011: You can apparently make home-made memristors :)] [Update Jun 2014: Informative memristor info and roadmap from HP] Interesting times...

© Aug 19 2008