In the mid-1990s I thought of a way that ordinary DRAM could be combined with a set of cached columns, internal to the DRAM chip, to implement the caches in the DRAMs themselves. To simplify a bit, ordinary DRAMs of the mid-1990s era (say, a 4M x 1) might have had 2048 rows by 2048 columns. A row address is presented, RAS falls, and a row of data is read from the storage cells. The data is put in the column storage, and one bit is read (for read cycles) or written (for write cycles).
My idea was that instead of having only a single column of storage, the chip would have 4 or 8 or even more such columns of storage. Access to main memory in a computer tends not to be truly 'random', but is localized: program instructions, stack(s), data area(s), etc. A given column of storage would be relinquished by means of some LRU (least recently used) algorithm. I think most programs, when running, would statistically limit themselves to about 8 data areas over stretches of a few thousand accesses.
The main reason this should be superior is that the data path between the DRAM array and the cache columns would be internal to the DRAM chip: 2048 bits wide in the example above (and probably much wider in modern 4Gbit+ DRAMs).
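To make the idea concrete, here is a toy simulation of the scheme: a DRAM bank with several row-wide buffers ("columns of storage"), replaced by LRU. This is purely illustrative; the class name, buffer count, and hit/miss bookkeeping are my own inventions, not anything from an actual DRAM design.

```python
from collections import OrderedDict

class MultiBufferDRAM:
    """Toy model of one DRAM bank with several internal row buffers
    (hypothetical), managed with LRU replacement as described above."""

    def __init__(self, num_buffers=8):
        self.num_buffers = num_buffers
        # Maps cached row address -> True, ordered from least to most
        # recently used (OrderedDict preserves insertion/access order).
        self.buffers = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, row_addr):
        """Touch one bit in the given row; return 'hit' or 'miss'."""
        if row_addr in self.buffers:
            # Row is already latched internally: fast access, no array read.
            self.buffers.move_to_end(row_addr)  # mark most recently used
            self.hits += 1
            return "hit"
        # Row not cached: must read the full row from the storage cells.
        self.misses += 1
        if len(self.buffers) >= self.num_buffers:
            # Relinquish the least recently used column of storage.
            self.buffers.popitem(last=False)
        self.buffers[row_addr] = True
        return "miss"
```

With, say, 8 buffers and a program that cycles among a handful of localized regions (code, stack, a few data areas), nearly every access after the first touch of each region is a hit, which is the statistical behavior the argument above relies on.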
I don't know whether they've done anything like this.
Jim Bell