The data comes out right now until it's consolidated at the end of the softmax. I stepped through it carefully, and it turns out the attention values are being generated in a truncated manner: there are only 20 in the efficient_attention code, whereas there are 96 in the working code. So I still got something wrong.

I'm guessing my test passed more easily because it had the same feature size for all of queries, keys, and values. That is not true in the perceiver_loader test I'm pursuing; it looks as if the values have a feature size of 20 whereas the keys have a feature size of 96. Gotta review again to get it making 96 attention scores instead of 20, I guess. Unsure.

Here are the notes with the einsum letters fleshed out, unchecked:

chunked:
  queries: ...qhd
  keys:    ...khd
  values:  ...vhd
  scores:  ...qhk
  mask:    ...hqk -> ...qhk

unchunked:
  queries: .hqd
  keys:    .hkd
  values:  .hvd
  scores:  .hqk -> 1,8,256,96
  mask:    .hqk -> 1,1,1,96 -> needs extension to num_heads, num_queries

commit eb16dc63d2c617bfe708881f8bd5ba96be8b9f50 (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 11:43:45 2022 +0000

    wip efficient attention: dimensions pass but data is truncated
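
A minimal NumPy sketch of the shape bookkeeping in the notes above, not the efficient_attention code itself. It assumes the 96 in the scores shape is the key axis k, and it uses hypothetical per-head feature sizes (32 for queries and keys, 20 for values) purely to show that a values feature size different from the key count need not change the number of attention scores. All variable names here are made up for illustration.

```python
import numpy as np

# Sizes taken from the notes: scores are 1,8,256,96 and the mask is 1,1,1,96.
batch, heads, num_q, num_k = 1, 8, 256, 96
# Hypothetical per-head feature sizes (assumptions, not from the real test).
qk_dim, v_dim = 32, 20

rng = np.random.default_rng(0)
q = rng.standard_normal((batch, heads, num_q, qk_dim))  # .hqd
k = rng.standard_normal((batch, heads, num_k, qk_dim))  # .hkd
v = rng.standard_normal((batch, heads, num_k, v_dim))   # .hvd (value positions pair with keys)

# Unchunked layout: one score per (head, query, key).
scores = np.einsum('bhqd,bhkd->bhqk', q, k)             # .hqk -> 1,8,256,96

# The mask arrives as 1,1,1,96 and needs extension to num_heads, num_queries.
mask = np.zeros((batch, 1, 1, num_k))
mask_full = np.broadcast_to(mask, scores.shape)         # 1,8,256,96

# Chunked layout: heads move behind the query/key axis.
q_c = np.moveaxis(q, -3, -2)                            # ...qhd
k_c = np.moveaxis(k, -3, -2)                            # ...khd
scores_c = np.einsum('...qhd,...khd->...qhk', q_c, k_c) # ...qhk
mask_c = np.moveaxis(mask_full, -3, -2)                 # ...hqk -> ...qhk

# Softmax over the key axis, then weight the values.
shifted = scores + mask_full
shifted = shifted - shifted.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights = weights / weights.sum(axis=-1, keepdims=True)
out = np.einsum('bhqk,bhkv->bhqv', weights, v)          # last axis is v_dim

print(scores.shape, scores_c.shape, mask_c.shape, out.shape)
```

In this sketch the softmax still runs over 96 entries per query even though the output feature axis is 20, so comparing the size of the last axis of the scores right before the softmax in both code paths may help locate where the truncation happens.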