[ml] langchain runs local models officially
https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4...

note:
- llama.cpp is cpu-only; the huggingface backend can run the same model on gpu, though it is not always faster
- llama.cpp is experiencing political targeting and a little upheaval in its optimization code
- there are newer and more powerful models, such as vicuna, that can be loaded just like gpt4all

langchain is a powerful yet uncomplicated language-model frontend library, supporting openai and other backends, that lets you code autonomous agents which use tools and access datastores. see also llama-index.
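for example, a rough sketch of pointing langchain at a local ggml model (not taken from the docs; the model path is a placeholder, it assumes llama-cpp-python is installed, and import paths reflect the langchain versions current at the time):

    from langchain.llms import LlamaCpp
    from langchain import LLMChain, PromptTemplate

    # placeholder path to a locally downloaded 4-bit ggml model file
    llm = LlamaCpp(model_path="./models/ggml-gpt4all-q4_0.bin")

    prompt = PromptTemplate(template="Question: {question}\nAnswer:",
                            input_variables=["question"])
    chain = LLMChain(prompt=prompt, llm=llm)
    print(chain.run("what can langchain agents do?"))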
I run alpaca.cpp on a laptop with 8 GB ram and the 7B model. Works pretty well. I would love to find a project that would enable me to go to the 13B model, but have not yet found one that enables me to run that on only 8 GB ram.

On Wed, 5 Apr 2023, Undescribed Horrific Abuse, One Victim & Survivor of Many wrote:
https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4...
note:
- llama.cpp is cpu-only; the huggingface backend can run the same model on gpu, though it is not always faster
- llama.cpp is experiencing political targeting and a little upheaval in its optimization code
- there are newer and more powerful models, such as vicuna, that can be loaded just like gpt4all
langchain is a powerful yet uncomplicated language-model frontend library, supporting openai and other backends, that lets you code autonomous agents which use tools and access datastores. see also llama-index.
On 4/6/23, efc@swisscows.email <efc@swisscows.email> wrote:
I run alpaca.cpp on a laptop with 8 GB ram and the 7B model. Works pretty well.
I would love to find a project that would enable me to go to the 13B model, but have not yet found one that enables me to run that on only 8 GB ram.
4-bit quantization and mmap?

i’m not using them myself yet, but people are doing these things.
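roughly, the llama-cpp-python route looks like this (a sketch only; the path is a placeholder, and i haven't verified that 13B fits comfortably in 8 GB):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # a q4_0 13B ggml file is roughly 7-8 GB on disk and is memory-mapped,
    # so only the pages actually touched need to be resident in ram
    llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin", n_ctx=512)

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])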
Thank you for the pointer. I will look it up! =)

On Thu, 6 Apr 2023, Undescribed Horrific Abuse, One Victim & Survivor of Many wrote:
On 4/6/23, efc@swisscows.email <efc@swisscows.email> wrote:
I run alpaca.cpp on a laptop with 8 GB ram and the 7B model. Works pretty well.
I would love to find a project that would enable me to go to the 13B model, but have not yet found one that enables me to run that on only 8 GB ram.
4 bit quantization and mmap?
i’m not using them myself yet but people are doing these things
also, llama.cpp is better in many ways, but in python, with huggingface accelerate and the transformers package, a model will spread between gpu and cpu ram (giving more total ram) if you pass device_map='auto', and it will load quickly via mmap if you use a safetensors checkpoint. note that huggingface's libs do tend to be somewhat crippled, user-focused things, which is maybe why i know them.
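the device_map bit looks roughly like this in code (the checkpoint directory is a placeholder; needs accelerate installed alongside transformers):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "./llama-7b-hf"  # placeholder: any local or hub llama checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,  # fp16 halves memory vs fp32
        device_map="auto",          # accelerate spreads layers across gpu and cpu ram
    )  # loads via mmap when the checkpoint ships .safetensors files

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))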
people have been quantizing models using https://github.com/qwopqwop200/GPTQ-for-LLaMa and uploading the models to huggingface.co

they can be pruned smaller using sparsegpt, which has some forks for llama, but a little dev work is needed to do the pruning in a way that is useful: presently the lost weights are just set to zero. it would make sense to alter the algorithm so that entire matrix columns and rows can be excised (see https://github.com/EIDOSLAB/simplify for ideas; a small sketch of the idea follows below), or to use a purpose-selected dataset and severely increase the sparsification (per https://scholar.google.com/scholar?as_ylo=2023&q=lottery+tickets , pruning on even random data may actually be more effective than normal methods for training models)

the newer fad is https://github.com/FMInference/FlexGen , which i don't believe has been ported to llama yet but is not complex; notably it applies 10% sparsity in attention, but i don't believe it prunes

and the latest version of pytorch has some hardcoded, accelerated, memory-reduced attention algorithms that could likely be almost drop-in replacements for huggingface's manual attention, mostly useful when training longer contexts: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_pro...
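to make the "excise whole columns" idea concrete, here is a rough, hypothetical sketch of my own (not sparsegpt or simplify code) that shrinks a single nn.Linear whose pruned input columns are already all zero; the matching outputs of the previous layer would need the same treatment for the shapes to line up, which is the dev work mentioned above:

    import torch
    import torch.nn as nn

    def excise_zero_columns(layer: nn.Linear) -> nn.Linear:
        # keep only the input columns that still carry nonzero weight
        keep = layer.weight.abs().sum(dim=0) != 0
        slimmed = nn.Linear(int(keep.sum()), layer.out_features,
                            bias=layer.bias is not None)
        with torch.no_grad():
            slimmed.weight.copy_(layer.weight[:, keep])
            if layer.bias is not None:
                slimmed.bias.copy_(layer.bias)
        return slimmed

and the pytorch item refers to torch.nn.functional.scaled_dot_product_attention in pytorch 2.0; a tiny sketch with toy shapes (assumes a gpu; drop the cuda/fp16 bits to try it on cpu, though the fused kernels mostly pay off on gpu):

    import torch
    import torch.nn.functional as F

    # (batch, heads, sequence, head_dim)
    q = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # dispatches to a fused flash / memory-efficient kernel when available,
    # instead of materialising the full 1024x1024 attention matrix
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([1, 8, 1024, 64])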
Sadly that's way beyond my capabilities, but I take away from this that development is continuing and that surely there will be better-quality models for smaller computers available in the coming months. =)

On Thu, 6 Apr 2023, Undescribed Horrific Abuse, One Victim & Survivor of Many wrote:
people have been quantizing models using https://github.com/qwopqwop200/GPTQ-for-LLaMa and uploading the models to huggingface.co
they can be pruned smaller using sparsegpt, which has some forks for llama, but a little dev work is needed to do the pruning in a way that is useful: presently the lost weights are just set to zero. it would make sense to alter the algorithm so that entire matrix columns and rows can be excised (see https://github.com/EIDOSLAB/simplify for ideas), or to use a purpose-selected dataset and severely increase the sparsification (per https://scholar.google.com/scholar?as_ylo=2023&q=lottery+tickets , pruning on even random data may actually be more effective than normal methods for training models)
the newer fad is https://github.com/FMInference/FlexGen which i don't believe has been ported to llama yet but is not complex, notably applies 10% sparsity in attention but i don't believe it prunes
and the latest version of pytorch has some hardcoded accelerated memory reduced attention algorithms that could likely be almost drop in replacements for huggingface's manual attention, mostly useful when training longer contexts, https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_pro...
participants (2)
- efc@swisscows.email
- Undescribed Horrific Abuse, One Victim & Survivor of Many