[ml] langchain runs local models officially

Thu Apr 6 12:33:32 PDT 2023

also llama.cpp is better in many ways

but in python, with huggingface accelerate and the transformers
package, a model will be spread across gpu and cpu ram (giving more
total memory to work with) if you pass device_map='auto', and loading
will use fast mmap if the model is in safetensors format
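
a rough sketch of what that looks like, not a definitive recipe. it
assumes transformers, accelerate and safetensors are installed and that
the model repo actually ships safetensors weights. load_kwargs and
load_local_model are made-up helper names, and use_safetensors is an
extra from_pretrained argument not mentioned above (it asks for
safetensors weights explicitly instead of relying on the default).

```python
# sketch only: assumes `pip install transformers accelerate safetensors`

def load_kwargs():
    """keyword arguments for from_pretrained (hypothetical helper)."""
    return {
        # let accelerate place layers on the gpu first, spilling to cpu ram
        "device_map": "auto",
        # ask for .safetensors weights rather than pickle-based .bin files
        "use_safetensors": True,
    }


def load_local_model(model_id):
    """load tokenizer and model; model_id is a hub name or local path."""
    # imported lazily so the heavy dependencies are only needed on use
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs())
    return tokenizer, model


# usage (downloads weights, so not run here):
#   tokenizer, model = load_local_model("some-org/some-model")
#   print(model.hf_device_map)  # shows which layers landed on which device
```

printing hf_device_map after loading is a quick way to check whether
anything actually got placed in cpu ram.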

note that huggingface's libs do tend to be somewhat crippled,
user-focused things, which may be why i know them