>Based on available information, CPUs use least-recently used as a cache replacement algorithm. This is excessively deterministic and leads to numerous side channel attacks. I propose a new cache replacement policy that would fix and prevent some categories of AES software timing attacks noticed since 2005.
>All cache lines in the L1 cache will have a bitfield dedicated to each thread. When a cache line is loaded or read, a bit is set for all hardware threads except for the one requesting the data. When replacing a cache line, the first cache line with an unset bit is replaced. This is done by incrementing through each cache line and inspecting the bitfield, if the bit is set, it will be unset, and then the next cache line is inspected.
>This will prioritize replacing cache lines used by the current hardware thread over cache lines used by other threads.
>In the worst case, a thread will end up loading data to the L2 cache before it can load data into the L1 cache.
>I call it the Canadian replacement algorithm because they are known for excessive politeness. This can be considered an inversion of the second-chance algorithm.
>I won't touch upon the issue of context switches and software threads.
In the mid-1990's I thought of a way that ordinary DRAM could be combined with a set of cached columns, internal to the DRAM chip to implement the caches in the DRAMs themselves. To simplify a bit, ordinary DRAMs of the mid-1990's era (say, a 4M x 1) might have had 2048 rows by 2048 columns. A row address is present, RAS falls, and a row of data is read from the storage cells. The data is put in the column storage, and one bit is read (for read cycles) or written (for write cycles).
My idea was that instead of having only a single column of storage, the chip would have 4 or 8 or even more of storage. Access to main memory in a computer tends not to be truly 'random', but is localized: Program instructions, stack(s), data area(s), etc. A given column of storage would be relinquished by means of some LRU (least recently used) algorithm. I think most programs, when running, would statistically limit themselves to 8 data areas over periods of time of a few thousands of accesses.
The main reason this should be superior is that the data path between the DRAM array and the cache columns would be internal to the DRAM chip, say, 2048 bits wide (in the example above; probably much wider in modern 4Gbit+ DRAMs. )
I don't know whether they've done anything like this.
Jim Bell