People have been quantizing models with https://github.com/qwopqwop200/GPTQ-for-LLaMa and uploading the results to huggingface.co.

Those models can be pruned smaller with SparseGPT, which has a few llama forks, but some dev work is needed to make the pruning actually useful: at present the pruned weights are just set to zero. It would make sense to alter the algorithm so that entire matrix rows and columns can be excised (see https://github.com/EIDOSLAB/simplify for ideas; rough sketch below), or to use a purpose-selected calibration dataset and push the sparsity much higher (per https://scholar.google.com/scholar?as_ylo=2023&q=lottery+tickets, pruning with even random data may actually be more effective than normal methods of training models).

The newer fad is https://github.com/FMInference/FlexGen, which I don't believe has been ported to llama yet, though it isn't complex. Notably it applies 10% sparsity in attention, but I don't believe it prunes anything (sketch of the idea below).

Also, the latest version of PyTorch has some built-in accelerated, memory-reduced attention kernels that could likely be a near drop-in replacement for huggingface's manual attention, mostly useful when training with longer contexts: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_pro...
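Some rough sketches of the above, in the order mentioned. First, quantization: this is just plain round-to-nearest with per-group scales, the naive baseline that GPTQ improves on with its Hessian-based error correction, so don't read it as GPTQ itself; the 4-bit width and the group size of 128 are common choices, not fixed parts of the method.

```python
# Minimal sketch of 4-bit group-wise weight quantization (round-to-nearest,
# NOT GPTQ's error-compensated version). Assumes in_features is divisible
# by group_size.
import torch

def quantize_rtn_4bit(weight: torch.Tensor, group_size: int = 128):
    """Quantize a (out_features, in_features) weight matrix to 4-bit codes
    with one scale/zero pair per group of `group_size` input columns."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)  # 4 bits -> 16 levels
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero), 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_rtn_4bit(q, scale, zero, shape):
    return ((q.float() - zero) * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, scale, zero = quantize_rtn_4bit(w)
w_hat = dequantize_rtn_4bit(q, scale, zero, w.shape)
print((w - w_hat).abs().mean())  # mean quantization error
```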
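On the row/column excision idea: here's roughly what I mean by actually shrinking the matrices instead of leaving zeros in place, in the spirit of the simplify repo. It assumes two consecutive nn.Linear layers and an activation that maps zero to zero (ReLU/GELU/SiLU all do); getting SparseGPT to zero out whole channels in the first place is exactly the dev work mentioned above, since out of the box it does unstructured sparsity.

```python
# Sketch: physically remove output channels of fc1 that are entirely zero
# after pruning, and drop the matching input columns of fc2. The result is
# smaller dense matrices rather than dense matrices full of zeros.
import torch
import torch.nn as nn

def excise_dead_channels(fc1: nn.Linear, fc2: nn.Linear):
    # Channels of fc1 whose outgoing weights (and bias) are all zero
    # contribute nothing downstream, so they can be dropped entirely.
    dead = (fc1.weight == 0).all(dim=1)
    if fc1.bias is not None:
        dead &= fc1.bias == 0
    keep = (~dead).nonzero(as_tuple=True)[0]

    new_fc1 = nn.Linear(fc1.in_features, keep.numel(), bias=fc1.bias is not None)
    new_fc1.weight.data = fc1.weight.data[keep].clone()
    if fc1.bias is not None:
        new_fc1.bias.data = fc1.bias.data[keep].clone()

    new_fc2 = nn.Linear(keep.numel(), fc2.out_features, bias=fc2.bias is not None)
    new_fc2.weight.data = fc2.weight.data[:, keep].clone()
    if fc2.bias is not None:
        new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2
```

For llama specifically the MLP is gated (gate_proj * up_proj feeding down_proj), so a channel is only safe to excise when it is dead in both projections that feed the elementwise product; that bookkeeping is part of why this isn't a one-line change.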
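On the FlexGen "10% sparsity in attention" point, this is my reading of the idea rather than a faithful re-implementation: score every key, but only keep the top 10% of positions per query and only read those value vectors, renormalizing over the kept set. The function name and the single-head shapes here are mine, and the 10% ratio is just a knob.

```python
# Sketch of top-k sparse attention for a single head: compute all scores,
# keep the top `keep_ratio` fraction of key positions per query, and only
# gather the corresponding value rows.
import math
import torch

def topk_sparse_attention(q, k, v, keep_ratio: float = 0.10):
    # q: (tokens_q, d), k/v: (tokens_kv, d)
    scores = q @ k.T / math.sqrt(q.shape[-1])            # (tokens_q, tokens_kv)
    k_keep = max(1, int(keep_ratio * k.shape[0]))
    top_scores, top_idx = scores.topk(k_keep, dim=-1)    # per-query top positions
    probs = torch.softmax(top_scores, dim=-1)            # renormalize over kept keys
    return torch.einsum("qk,qkd->qd", probs, v[top_idx])  # weighted sum of kept values

q = torch.randn(4, 64)
k = torch.randn(512, 64)
v = torch.randn(512, 64)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([4, 64])
```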
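And on the PyTorch fused attention: here's the kind of swap I have in mind. The "manual" half mirrors what the eager huggingface code effectively does; the (batch, heads, seq, head_dim) shapes match what the llama attention modules pass around, but the function and variable names are mine, and wiring it into the actual module (masks, dropout, KV cache) is where the real work is.

```python
# Sketch: manual softmax(QK^T/sqrt(d))V vs. PyTorch 2.x's fused kernel,
# which dispatches to flash / memory-efficient implementations when it can.
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v, attn_mask=None):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if attn_mask is not None:
        scores = scores + attn_mask
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v, attn_mask=None):
    # near drop-in replacement for the above; the memory savings at long
    # context come from never materializing the full scores matrix
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)
print(torch.allclose(manual_attention(q, k, v), fused_attention(q, k, v), atol=1e-5))
```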