
It totally worked: it completed "Once upon a" -> " time" with only a top_k of 6.
Actually, it didn't do that; I think I prompted it with that token. It made totally wrong tokens: the next text was "interconnected.goBack_SECURE", which is more what I expected :/
Well, I changed models to Athena, which I've logged, so I could compare and check that the math was correct. It turns out I was making the tensors contiguous incorrectly and my numbers were all wrong. I added an extra loop to separate out all the strides, but it then becomes so slow to simply iterate the indices that I haven't seen a single loop complete (although I was doing all 8192 indices in Athena to test).

I was thinking I'd step away, because there are a few obvious improvements (rough sketches below): a better algorithm informed by page size could likely need less index iteration; simply using the tensor strides instead of calling a wrapped indexing function for every scalar in the matrix; merging the fetch regions, which in this case would collapse the extensive incomplete loop into a single fetch; or just trying a yet smaller model (I've got llama 1B logged for tiny tests).

But I spent all day and got so close. It was really cool to forward llama 405b streaming over the internet in just a few seconds, even though it wasn't selecting the correct data due to misinterpretation of the sparse strides. An upshot is that it's quite possible those faulty results would look much better with my indexing error fixed. This might also be a situation where preprocessing the data to change its storage layout would make some sense; that could include storing extrema of the rows and columns of each matrix, which would make top_k more effective.

Man, it's so close.
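For the strides idea, here's roughly what I mean, as a sketch rather than the real code: with the tensor's strides in hand, the index-to-offset conversion is one vectorized multiply-and-sum instead of a wrapped indexing call per scalar. The function name and signature here are placeholders, and it ignores any storage offset.

```python
import numpy as np

def element_offsets(indices: np.ndarray, strides: tuple[int, ...], element_size: int) -> np.ndarray:
    """Convert an (n, ndim) array of integer indices into flat byte offsets
    into the tensor's storage, using its strides directly.

    `strides` are element strides (what torch's Tensor.stride() reports);
    multiplying by `element_size` (Tensor.element_size()) turns them into
    byte offsets. One matmul replaces a wrapped indexing call per scalar.
    (Storage offset is ignored here for simplicity.)
    """
    return (indices @ np.asarray(strides, dtype=np.int64)) * element_size
```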
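And merging the fetch regions would look something like this: sort the byte ranges and coalesce anything that touches or sits within a small gap (a page, say), so a long loop of tiny fetches collapses into a handful of big ones. Again just a sketch; `max_gap` and the half-open range format are assumptions.

```python
def merge_ranges(ranges: list[tuple[int, int]], max_gap: int = 4096) -> list[tuple[int, int]]:
    """Coalesce half-open (start, end) byte ranges that touch or sit within
    `max_gap` bytes of each other, so many tiny fetches become a few large ones."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new fetch.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# e.g. merge_ranges([(0, 4096), (4096, 8192), (1_000_000, 1_000_004)])
#      -> [(0, 8192), (1_000_000, 1_000_004)]
```

In the dense-contiguous case this degenerates to exactly one fetch, which is the point.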
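The extrema idea, also in sketch form: precompute per-row max-abs values once during preprocessing, then at fetch time you can prove which rows can't possibly contribute to the matrix's top_k and skip fetching them entirely. `row_maxabs` is the hypothetical precomputed column of extrema, not something that exists yet.

```python
import torch

def candidate_rows(row_maxabs: torch.Tensor, k: int) -> torch.Tensor:
    """Given precomputed per-row max |value| for one weight matrix, return the
    row indices that could still hold one of the matrix's top-k entries.

    The k largest per-row maxima are themselves k real entries, so the k-th
    largest of them lower-bounds the global k-th largest entry; any row whose
    maximum falls below that bound can be skipped without fetching it."""
    kth_bound = torch.topk(row_maxabs, k).values[-1]
    return (row_maxabs >= kth_bound).nonzero(as_tuple=True)[0]
```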