[ml] langchain runs local model officially

efc at swisscows.email
Thu Apr 6 14:42:54 PDT 2023


Sadly that's way beyond my capabilities, but I take away from this that 
development is continuing and that surely there will be better-quality 
models for smaller computers available in the coming months. =)

On Thu, 6 Apr 2023, Undescribed Horrific Abuse, One Victim & Survivor of Many wrote:

> people have been quantizing models using
> https://github.com/qwopqwop200/GPTQ-for-LLaMa and uploading the models
> to huggingface.co
>
> they can be pruned smaller using sparsegpt, which has some forks for
> llama, but a little dev work is needed to do the pruning in a way that
> is useful; presently the lost weights are just set to zero. it would
> make sense to alter the algorithm so that entire matrix columns and
> rows can be excised (see https://github.com/EIDOSLAB/simplify for
> ideas), or to use a purpose-selected dataset and severely increase the
> sparsification (per
> https://scholar.google.com/scholar?as_ylo=2023&q=lottery+tickets ,
> pruning on even random data may actually be more effective than
> normal methods for training models)
>
> the newer fad is https://github.com/FMInference/FlexGen which i don't
> believe has been ported to llama yet but is not complex; notably it
> applies 10% sparsity in attention, though i don't believe it prunes
>
> and the latest version of pytorch has some hardcoded accelerated,
> memory-reduced attention algorithms that could likely be almost drop-in
> replacements for huggingface's manual attention, mostly useful when
> training with longer contexts:
> https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
>
>
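For anyone curious what those quantized checkpoints actually contain, here is a
rough sketch of 4-bit groupwise quantization in plain pytorch. This is not GPTQ
itself (GPTQ chooses the rounding using approximate second-order information);
it is just naive round-to-nearest, to show the idea of storing int4 codes plus
a per-group scale instead of fp16 weights.

    # Not GPTQ itself; plain round-to-nearest 4-bit groupwise quantization,
    # only to show what a quantized checkpoint stores: integer codes plus
    # per-group scales instead of fp16 weights.
    import torch

    def quantize_4bit(weight, group_size=128):
        out_f, in_f = weight.shape
        w = weight.reshape(out_f, in_f // group_size, group_size)
        scale = w.abs().amax(dim=-1, keepdim=True) / 7.0   # symmetric int4 range [-8, 7]
        codes = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        return codes, scale

    def dequantize_4bit(codes, scale):
        return (codes.float() * scale).reshape(codes.shape[0], -1)

    w = torch.randn(4096, 4096) * 0.02
    codes, scale = quantize_4bit(w)
    print("mean abs error:", (w - dequantize_4bit(codes, scale)).abs().mean().item())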

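On the point about sparsegpt only zeroing weights: below is a minimal sketch of
what structural excision could look like in the simplest case, a pair of
stacked nn.Linear layers. Hidden units whose entire outgoing row is zero are
dropped from the first layer, and the matching input columns are dropped from
the second, so the model actually shrinks. Llama's gated MLPs and attention
heads need more careful bookkeeping, which is presumably the dev work
mentioned above.

    # Excise hidden units whose entire row in `up` was zeroed by a pruner, and
    # the matching input columns of `down`.  Exact equivalence assumes the
    # pruned units' biases are zero too (they are in this demo); otherwise the
    # constant bias contribution would have to be folded into down's bias.
    import torch
    import torch.nn as nn

    def excise_zero_hidden_units(up, down):
        keep = up.weight.abs().sum(dim=1) != 0          # units with any nonzero weight
        n = int(keep.sum())
        new_up, new_down = nn.Linear(up.in_features, n), nn.Linear(n, down.out_features)
        with torch.no_grad():
            new_up.weight.copy_(up.weight[keep])        # drop zeroed rows
            new_up.bias.copy_(up.bias[keep])
            new_down.weight.copy_(down.weight[:, keep]) # drop the matching input columns
            new_down.bias.copy_(down.bias)
        return new_up, new_down

    up, down = nn.Linear(4096, 11008), nn.Linear(11008, 4096)
    with torch.no_grad():
        up.weight[::2] = 0                              # pretend a pruner zeroed half the units
        up.bias[::2] = 0
    up2, down2 = excise_zero_hidden_units(up, down)
    x = torch.randn(1, 4096)
    print((down(torch.relu(up(x))) - down2(torch.relu(up2(x)))).abs().max())  # ~0, smaller model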

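Regarding the 10% attention sparsity in FlexGen: this is not FlexGen's code
(its main contribution is offloading weights and KV cache across GPU, CPU and
disk), just an illustration of the kind of runtime sparsity meant, keeping
only the highest-scoring ~10% of keys per query and renormalizing, with no
weights pruned at all.

    # Illustrative top-k sparse attention: per query, keep the top ~10% of
    # scores, mask the rest to -inf, then softmax as usual.  Nothing is pruned
    # from the model; the sparsity exists only at runtime.
    import torch

    def topk_sparse_attention(q, k, v, keep_frac=0.10):
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        top = scores.topk(max(1, int(scores.shape[-1] * keep_frac)), dim=-1)
        sparse = torch.full_like(scores, float("-inf"))
        sparse.scatter_(-1, top.indices, top.values)
        return torch.softmax(sparse, dim=-1) @ v

    q = k = v = torch.randn(1, 8, 256, 64)   # (batch, heads, seq, head_dim)
    print(topk_sparse_attention(q, k, v).shape)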

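And on the pytorch attention functions: a small sketch of what the drop-in
looks like, comparing the usual hand-written softmax(QK^T/sqrt(d))V against
torch.nn.functional.scaled_dot_product_attention, which can dispatch to fused
flash / memory-efficient kernels and avoids materializing the full
seq-by-seq score matrix, which is where the savings at longer context come
from.

    # Manual attention versus the fused pytorch >= 2.0 call; both compute the
    # same thing, but the fused path can use flash / memory-efficient kernels.
    import math
    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 8, 512, 64)   # (batch, heads, seq, head_dim)
    k = torch.randn(1, 8, 512, 64)
    v = torch.randn(1, 8, 512, 64)

    # hand-written causal attention
    mask = torch.full((512, 512), float("-inf")).triu(1)
    manual = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(64) + mask, dim=-1) @ v

    # fused equivalent
    fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    print(torch.allclose(manual, fused, atol=1e-4))   # True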