Better cache replacement policy for CPU caches
Based on available information, CPUs use a least-recently-used (LRU) cache replacement algorithm. This is excessively deterministic and leads to numerous side-channel attacks. I propose a new cache replacement policy that would fix and prevent some categories of the AES software timing attacks observed since 2005.

All cache lines in the L1 cache will have a bitfield with one bit per hardware thread. When a cache line is loaded or read, the bit is set for every hardware thread except the one requesting the data. When replacing a cache line, the first cache line with an unset bit is replaced. This is done by stepping through the cache lines and inspecting the bitfield: if the requesting thread's bit is set, it is cleared and the next cache line is inspected.

This will prioritize replacing cache lines used by the current hardware thread over cache lines used by other threads. In the worst case, a thread will end up loading data into the L2 cache before it can load data into the L1 cache. I call it the Canadian replacement algorithm, because Canadians are known for excessive politeness. It can be considered an inversion of the second-chance algorithm.

I won't touch on the issue of context switches and software threads.
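For concreteness, here is a minimal C sketch of the policy as described above: one bit per hardware thread on each line of a set, scanned clock-style at replacement time. The names (`cache_set`, `touch_line`, `choose_victim`) and the 8-way / 4-thread sizing are illustrative assumptions, not part of the proposal.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 8          /* associativity of one cache set (illustrative) */
#define HW_THREADS 4    /* hardware threads sharing the cache (illustrative) */

struct cache_line {
    int      valid;
    uint64_t tag;
    uint8_t  thread_bits;   /* one bit per hardware thread */
};

struct cache_set {
    struct cache_line line[WAYS];
    unsigned hand;          /* clock hand for the replacement scan */
};

/* On a load or read by `thread`, set the bit for every OTHER thread. */
static void touch_line(struct cache_line *l, unsigned thread)
{
    l->thread_bits |= (uint8_t)(((1u << HW_THREADS) - 1u) & ~(1u << thread));
}

/* Pick a victim for `thread`: scan the set; a line whose bit for this thread
 * is set gets that bit cleared and is skipped (the inverted second chance);
 * the first line whose bit is clear -- i.e. a line this thread itself has
 * been using -- is evicted. */
static unsigned choose_victim(struct cache_set *s, unsigned thread)
{
    for (;;) {
        struct cache_line *l = &s->line[s->hand];
        unsigned idx = s->hand;
        s->hand = (s->hand + 1) % WAYS;

        if (!l->valid)
            return idx;                                  /* free way */
        if (l->thread_bits & (1u << thread)) {
            l->thread_bits &= (uint8_t)~(1u << thread);  /* clear and keep going */
            continue;
        }
        return idx;                                      /* evict own line */
    }
}

int main(void)
{
    struct cache_set s = {0};

    /* Thread 0 fills ways 0-3, thread 1 fills ways 4-7. */
    for (unsigned w = 0; w < WAYS; w++) {
        unsigned t = (w < 4) ? 0 : 1;
        s.line[w].valid = 1;
        s.line[w].tag = w;
        touch_line(&s.line[w], t);
    }

    /* Thread 0 now needs a victim: it evicts one of its own lines
     * rather than disturbing thread 1's lines. */
    printf("thread 0 evicts way %u\n", choose_victim(&s, 0));
    return 0;
}
```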
In the mid-1990s I thought of a way that ordinary DRAM could be combined with a set of cached columns, internal to the DRAM chip, to implement the caches in the DRAMs themselves. To simplify a bit, ordinary DRAMs of the mid-1990s era (say, a 4M x 1) might have had 2048 rows by 2048 columns. A row address is presented, RAS falls, and a row of data is read from the storage cells. The data is put in the column storage, and one bit is read (for read cycles) or written (for write cycles).

My idea was that instead of having only a single column of storage, the chip would have 4 or 8 or even more columns of storage. Access to main memory in a computer tends not to be truly 'random', but is localized: program instructions, stack(s), data area(s), etc. A given column of storage would be relinquished by means of some LRU (least recently used) algorithm. I think most programs, when running, would statistically limit themselves to 8 data areas over periods of a few thousand accesses.

The main reason this should be superior is that the data path between the DRAM array and the cache columns would be internal to the DRAM chip: say, 2048 bits wide in the example above, and probably much wider in modern 4Gbit+ DRAMs.

I don't know whether they've done anything like this.

Jim Bell
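A toy model of the idea, assuming the "cached columns" behave like several open row buffers inside the chip managed by LRU. The structure names, the 8-buffer count, and the access pattern below are mine, chosen only to illustrate the locality argument, not taken from the mail.

```c
#include <stdint.h>
#include <stdio.h>

#define ROW_BUFFERS 8   /* cached "columns" (row buffers) inside the chip */

struct dram_chip {
    int      row[ROW_BUFFERS];       /* DRAM row each buffer holds (-1 = empty) */
    uint64_t last_used[ROW_BUFFERS]; /* LRU bookkeeping */
    uint64_t clock;                  /* access counter */
};

/* Returns 1 on a hit in an already-open buffer (served over the wide
 * internal path), 0 on a miss (the least-recently-used buffer is reloaded
 * from the DRAM array via a RAS cycle). */
static int dram_access(struct dram_chip *d, int row)
{
    int lru = 0;
    d->clock++;
    for (int i = 0; i < ROW_BUFFERS; i++) {
        if (d->row[i] == row) {
            d->last_used[i] = d->clock;
            return 1;
        }
        if (d->last_used[i] < d->last_used[lru])
            lru = i;
    }
    d->row[lru] = row;
    d->last_used[lru] = d->clock;
    return 0;
}

int main(void)
{
    struct dram_chip d;
    for (int i = 0; i < ROW_BUFFERS; i++) { d.row[i] = -1; d.last_used[i] = 0; }
    d.clock = 0;

    /* A localized pattern: code row, stack row, a couple of data rows. */
    int pattern[] = { 10, 10, 500, 10, 501, 500, 10, 777, 500, 10 };
    int hits = 0, n = (int)(sizeof pattern / sizeof pattern[0]);
    for (int i = 0; i < n; i++)
        hits += dram_access(&d, pattern[i]);
    printf("%d hits out of %d accesses\n", hits, n);
    return 0;
}
```

With a handful of buffers, most accesses in a localized stream land in an already-open row, which is the benefit being claimed.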
Great question. I defer to your opinions due to your known legal troubles. Perhaps the fact that you even dare to email me increases your worthiness.

I wonder if it is for the same reason that ECC checking is done in RAM instead of in the processor. It seems incredibly unlikely that a sufficient number of altered bits would occur over the lifetime of the chip to result in incorrect data being returned. After all, non-ECC is fine (unless you use ZFS), and the probability of receiving slightly different data over HTTP used to be pretty high, according to "When the CRC and TCP Checksum Disagree".

So RAM is not the most likely source of errors, and being able to correct errors at refresh is sort of strange given the cost in die area.
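The paper referenced above is about errors that slip past TCP's 16-bit checksum. As a minimal illustration (mine, not taken from the paper): the Internet checksum is a one's-complement sum of 16-bit words, so two payloads that differ only in the order of their 16-bit words checksum identically and TCP cannot tell them apart.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* RFC 1071 style Internet checksum: one's-complement sum of 16-bit words. */
static uint16_t inet_checksum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += words[i];
        sum = (sum & 0xFFFFu) + (sum >> 16);   /* fold carries back in */
    }
    return (uint16_t)~sum;
}

int main(void)
{
    /* Same words in a different order: the data differs, the checksum doesn't. */
    uint16_t a[] = { 0x1111, 0x2222, 0x3333, 0x4444 };
    uint16_t b[] = { 0x1111, 0x3333, 0x2222, 0x4444 };

    printf("checksum(a)=0x%04X checksum(b)=0x%04X\n",
           (unsigned)inet_checksum(a, 4), (unsigned)inet_checksum(b, 4));
    return 0;
}
```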
Well, you may remember nearly 40 years ago when Tim May discovered that alpha particles (nuclei of helium atoms) were the big, main source of SEUs (single-event upsets) in that generation of DRAMs. This was a huge discovery, although later going to CMOS (from NMOS) technology made alpha-induced soft errors go away. (Putting the chips in alpha-free plastic would, and did, fix it as well.)

I am very disappointed that the current generation of microprocessor-based computers (oops! these days, there's not much else!) has gone away from using at least parity bits in DRAM. How does a computer detect single-bit errors without at least parity bits?

Worse, I think about 2-3 years ago we were discussing "Rowhammer", a tricky programmatic way of 'randomly' changing bits on a given row by hitting (reading) the rows to the right and left of it a number of times. Do that enough, and that "target" ("victim") row gets upset. I would think that at least having parity bits would quickly detect such treatment.

At that point, I searched and discovered that Intel was proactively developing a technology in microprocessors to detect "rowhammer", and to more frequently refresh "victim" rows. "Great!" I think. I take back half the things I said about Intel.

Full disclosure: Shortly, I will be trying to sell to Intel my isotope technology to improve the dielectric constant of semiconductor dielectrics.

Jim Bell
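To make the parity point concrete, a minimal sketch (mine, not from the mail) of how one parity bit stored alongside each word lets the memory system detect, though not locate or correct, any single-bit flip:

```c
#include <stdint.h>
#include <stdio.h>

/* Even parity over a 64-bit word: returns 1 if the number of set bits is odd,
 * so that the word plus its parity bit always carry an even number of ones. */
static unsigned parity64(uint64_t w)
{
    w ^= w >> 32; w ^= w >> 16; w ^= w >> 8;
    w ^= w >> 4;  w ^= w >> 2;  w ^= w >> 1;
    return (unsigned)(w & 1u);
}

int main(void)
{
    uint64_t stored = 0x0123456789ABCDEFull;
    unsigned stored_parity = parity64(stored);   /* written with the data */

    uint64_t read_back = stored ^ (1ull << 17);  /* simulate a single-bit upset */

    if (parity64(read_back) != stored_parity)
        puts("single-bit error detected (parity mismatch)");
    else
        puts("no error detected");
    return 0;
}
```

Any odd number of flipped bits trips the check; an even number slips through, which is why ECC (e.g. SECDED codes) goes further than plain parity.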
On Sat, 15 Jun 2019 22:17:11 +0000 (UTC) jim bell <jdb10987@yahoo.com> wrote:
At that point, I searched and discovered that Intel was proactively developing a technology in microprocessors to detect "rowhammer", and to more frequently refresh "victim" rows. "Great!" I think. I take back half the things I said about Intel.
Yes. Intel is an amazing firm. Intel is the perfect example showing how Selfishness in the Free Market can create the Best Tools the NSA can wish for. All True Individualists should Praise Intel.
Full disclosure: Shortly, I will be trying to sell to Intel my isotope technology to improve the dielectric constant of semiconductor dielectrics.
Yes. While you visit Intel, don't forget to mention to them the fact that all Intel CEOs should be at the very top of the Assassination Politics KILL LIST. I'm sure they will appreciate your libertarian ideas as much as you appreciate theirs.
non-ECC is fine (unless you use ZFS)
Such bullshit arguments out there. ECC is mandatory for those who give a shit about their data, instructions, uptime, etc. There is no valid argument about ECC being needed for this or that app or not... all the apps are running on the same CPU, mobo, and disk... so either you care or you don't.

People do have a case for a statistical comparative odds analysis of all the possible sources of corruption in whatever data and instruction streams they pass and operate on, leading to particular design and purchase decisions effective at avoiding those odds. i.e.: if Ethernet+TCP isn't enough, use TLS to get 2^strong alerts and fails, compute in triplicate with a comparator, etc. However, those analysis articles are few, and they are not consulted by users and enterprises.

Deploy 100k nodes worth of ECC around the world and watch the error logs roll in. Whether those faults have monetary or other value to you or not is up to you. Same for defending against EMP, GRB, theft, war, govt, fire, flood, quake, adversaries, etc.
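"Compute in triplicate with a comparator" is classic triple modular redundancy. A minimal software sketch (names and values mine) of the bitwise majority vote that outvotes any single corrupted copy:

```c
#include <stdint.h>
#include <stdio.h>

/* Bitwise majority vote of three independently computed/stored copies:
 * for each bit, the value held by at least two copies wins, so a fault
 * confined to one copy cannot change the result. */
static uint64_t majority3(uint64_t a, uint64_t b, uint64_t c)
{
    return (a & b) | (a & c) | (b & c);
}

int main(void)
{
    uint64_t good = 0xDEADBEEFCAFEF00Dull;
    uint64_t a = good;
    uint64_t b = good ^ (1ull << 42);   /* one copy suffers a bit flip */
    uint64_t c = good;

    uint64_t voted = majority3(a, b, c);
    printf("recovered value %s the original\n",
           voted == good ? "matches" : "does NOT match");
    return 0;
}
```

A disagreement between the copies can also be logged, which is exactly the kind of error-rate telemetry the 100k-node remark is about.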
participants (4)
- grarpamp
- jim bell
- Punk
- Ryan Carboni