
karl3@writeme.com wrote:
karl3@writeme.com wrote:
there was some energy around making a network-based inference engine, maybe by modifying deepseek.cpp (don't quite recall why not staying in python; some concern arose). the task got weak. found cinatra as a benchmark leader among c++ web engines (although pico.v was the top! (surprised all the c++ http engines were beaten by java O_O very curious about this, wondering if it's a high-end test system) never heard of the V language before, but it's interesting that it won a leaderboard). the inhibition ended up surfacing a concern somewhat like ... on this 4GB ram system it might take 15-33GB of network transfer for each forward pass of the model ... [multi-token passes ^^

karl3@writeme.com wrote:
the concern resonates with the difficulty of making the implementation, and some form of inhibition or concern around using python. notably, i've made offloading python hooks a lot and they never last, due to the underlying interfaces changing (although those interfaces have stabilized much more now that hf made accelerate their official implementation) (also i think the issue is more severe dissociative associations than the interface, if one considers personal maintenance and use rather than usability for others). don't immediately recall the python concern

seem to be off task or taking a break, but it would make sense to do disk caching too. there is also the option of quantizing. basically, LLMs and AI in general place r&d effort between the user and ease, smallness, cheapness, power, etc
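a rough back-of-envelope for where a band like 15-33GB per pass comes from: if the weights can't stay resident in 4GB of ram, every forward pass has to re-stream roughly the whole weight set. a minimal sketch, assuming a ~16B-parameter model (the parameter count is my assumed example, not a figure from the thread):

    # back-of-envelope: bytes streamed per forward pass when no weights stay resident
    def bytes_per_pass(n_params: float, bytes_per_param: float) -> float:
        return n_params * bytes_per_param

    n = 16e9  # assumed ~16B-parameter model
    for label, width in [("fp16/bf16", 2), ("int8", 1)]:
        print(f"{label}: ~{bytes_per_pass(n, width) / 1e9:.0f} GB per pass")
    # fp16/bf16: ~32 GB per pass
    # int8: ~16 GB per pass

that lands inside the quoted band depending on dtype. batching multiple tokens per pass streams the same bytes once for all of them, which is presumably the multi-token point above.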
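for reference on the stabilized interfaces: accelerate's init_empty_weights + load_checkpoint_and_dispatch now cover the layer offloading that used to need hand-written hooks. a minimal sketch; the model id and paths are placeholders:

    # sketch of offloading via hf accelerate; model id and paths are placeholders
    from accelerate import init_empty_weights, load_checkpoint_and_dispatch
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("some/model")  # placeholder id
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)  # allocates no ram yet

    # layers that don't fit in ram spill to disk under offload_folder
    model = load_checkpoint_and_dispatch(
        model,
        checkpoint="path/to/checkpoint",  # placeholder local dir
        device_map="auto",
        offload_folder="offload",
    )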
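on disk caching: a thin cache in front of whatever does the network fetch turns the per-pass transfer into a one-time cost, disk space permitting. a minimal sketch, with a hypothetical fetch_shard() standing in for the real network read:

    # cache network-fetched weight shards on disk; fetch_shard() is hypothetical
    import os

    CACHE_DIR = "weight_cache"

    def fetch_shard(name: str) -> bytes:
        raise NotImplementedError("stand-in for the actual network fetch")

    def get_shard(name: str) -> bytes:
        path = os.path.join(CACHE_DIR, name)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()  # cache hit: local disk, no network
        data = fetch_shard(name)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        return data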
I poked at python again. The existing implementations of the 8-bit quantization used by the model all require an NVIDIA GPU, which I do not presently have. It is fun to imagine making it work; maybe I can upcast it to float32 or something >)
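the upcast idea is workable on cpu if the quantization scheme is simple. a minimal sketch assuming per-row absmax scales (w_int8 * scale); the model's actual 8-bit format may differ:

    # cpu-only dequantization: upcast int8 weights to float32
    # assumes per-row absmax scales; the real scheme may differ
    import numpy as np

    def dequantize(w_int8: np.ndarray, scales: np.ndarray) -> np.ndarray:
        return w_int8.astype(np.float32) * scales  # broadcasts (rows, 1) scales

    rng = np.random.default_rng(0)
    w = rng.integers(-127, 128, size=(4, 8), dtype=np.int8)
    s = np.full((4, 1), 0.01, dtype=np.float32)  # made-up scales
    print(dequantize(w, s).dtype)  # float32, no gpu needed

it costs 4x the int8 memory, but it runs anywhere numpy does.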