so recently zuckerbergcorp released another llama, llama 3.1. they're really trying to make it big, although i think they've talked it up bigger than it is, so far. but they did release a private-research-scale model, over 400G parameters, which they said took thousands of gpus to train. blergh.

anyway, i still have a 2GB gpu, so i can run a 500M parameter model at standard floating point precision (4 bytes per weight), or 4G at 4bit. it's always fun to try to downscale the models, which means navigating puzzle-like constraints and such. we were thinking: how to run this llama?

there's also an 8G parameter model, so i could fit part of it, but not all of it, in the gpu. a fun idea seemed to be top-k. there was some criticism around it, but it might work! the idea would be to keep all parameters on disk but to filter them somehow, such that only the ones that are relevant to the data get loaded in.

on further thought it's an algorithm challeng-- [AARRRGG
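
roughly the kind of thing i was picturing, as a toy numpy sketch -- the file names, sizes, and the way neurons get scored here are all made up for illustration, not anything llama 3.1 actually does:

```python
import numpy as np

D_MODEL, D_FF, K = 1024, 4096, 128     # toy sizes; K = how many neurons we keep per token

# in real life these would be the dumped llama weights opened read-only (mode="r");
# "w+" just creates zero-filled stand-ins so the sketch runs on its own
w_up   = np.memmap("w_up.bin",   dtype=np.float16, mode="w+", shape=(D_FF, D_MODEL))
w_down = np.memmap("w_down.bin", dtype=np.float16, mode="w+", shape=(D_MODEL, D_FF))

def sparse_mlp(x):
    """one mlp block where only the top-k most active neurons get used."""
    pre = w_up @ x                       # score every neuron by its pre-activation
    top = np.argpartition(pre, -K)[-K:]  # indices of the k largest scores
    act = np.maximum(pre[top], 0)        # relu on just the survivors
    # only these k columns of w_down ever need to be pulled into (gpu) memory
    return w_down[:, top] @ act

x = np.random.randn(D_MODEL).astype(np.float16)
print(sparse_mlp(x).shape)               # (1024,)
```

the catch, and presumably where the algorithm challenge lives, is that scoring the neurons like this still reads all of w_up off disk, so a real version would need some cheap way to guess the top-k without touching everything.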