On Sun, Aug 4, 2024 at 11:51 PM Undescribed Horrific Abuse, One Victim & Survivor of Many <gmkarl@gmail.com> wrote:
so recently zuckerbergcorp released another llama, llama 3.1. they're really trying to make it big, although i think they talked it up bigger than it is so far, but they did release a private-research-scale model, over 400G parameters, and said it took thousands of gpus to train. blergh. anyway, i still have a 2GB gpu, so i can run about a 500M parameter model at standard floating point precision (or around 4G at 4-bit).
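the back-of-envelope math there is just bytes on the card divided by bytes per parameter; a quick check, weights only, ignoring activations and kv cache and overhead:

gpu_bytes = 2 * 1024**3                       # a 2GB card
for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    params = gpu_bytes / (bits / 8)           # bytes per parameter = bits / 8
    print(f"{name}: ~{params / 1e9:.2f}G parameters")
# prints roughly 0.54G at fp32, 1.07G at fp16, 4.29G at 4-bit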
it's always fun to try to downscale the models, which means navigating puzzle inhibitions and such
we were thinking: how to run this llama? there's also an 8G parameter model, so i could fit part of it, but not all of it, in the gpu.
a fun idea seemed to be top-k weight selection. there was some criticism around it, but it might work!
the idea would be to keep all parameters on disk, but to filter them somehow such that only the ones relevant to the data actually get loaded in.
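as a minimal sketch of what i mean by filtering, assuming the weight matrix is already sitting on disk as a raw float32 memmap, and just keeping the k input dimensions with the biggest activations (the names and layout here are made up, not llama's actual format):

import numpy as np

def topk_matvec(weight_path, shape, x, k):
    # approximate x @ W, reading only the rows of W for the k largest |x| entries
    W = np.memmap(weight_path, dtype=np.float32, mode="r", shape=shape)
    idx = np.argsort(np.abs(x))[-k:]    # the k input dims that matter most
    return x[idx] @ W[idx, :]           # fancy indexing only pulls those rows off disk

it's only an approximation of the full matvec, since small activations still add up; presumably that accuracy question is where the criticism of top-k comes from.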
thinking further, and taking the algorithm idea gently, some possible parts that might be worth considering:
- the first layer's input is fully known, because it comes straight from the embeddings, which are easy to partially load [basically a dict of vectors; there's a tiny sketch of that partial load at the bottom of this mail]
- for top-k, you care about which weights will impact the result: you can see this inside the attention kernel (really each module will need its own treatment of some sort; there are like 4 or 5 different kinds of weights)
- the weights are multiplied by the inputs in a -- [aughhhhhhhhl
m challeng-- [AARRRGG
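here's the promised tiny sketch of the "embeddings are basically a dict of vectors" point, assuming the embedding table has been dumped to a raw float32 file; the path, sizes, and token ids are placeholders, just for illustration:

import numpy as np

vocab_size, hidden = 128256, 4096     # llama-3-ish sizes, not the real file layout
emb = np.memmap("embed_tokens.bin", dtype=np.float32, mode="r",
                shape=(vocab_size, hidden))

token_ids = [101, 202, 303]           # whatever the prompt tokenizes to
first_layer_input = emb[token_ids]    # only these rows get pulled off disk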