
26 Apr 2025, 7:02 p.m.
karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > so http://github.com/karl3wm/httptransformer now has a resuming class
> >
> > i can load llama 405b and process 0.00002% of it on my tiny system, then reboot the system and start right off again at 0.00002% and process up to 0.00004% ! :D
> >
> > you can try this with `python test_nettensor.py` but I _do not recommend doing it unless you are spending the time working on it_ because it is incredibly inefficient. i do expect it to correctly complete prompts if left running for weeks.
> >
> > of course the intent of this project was to implement models that provide for amortizable completion in much smaller time, like deepseek or mistral, or any ad-hoc setup providing for multi-token completion such as an assistant model [1]
> >
> > using it for llama 405b is just for fun
> >
> > 1: https://huggingface.co/docs/transformers/main/en/main_classes/text_generatio... note that you can probably implement this much more effectively than huggingface did by engaging logits and attention internals
> >
> > it doesn't do any paralleliz--
> >
> > i tried it on google colab, but it actually ran at about the same speed, because of the synchronous streaming nature --
>
> so there's some interest in focusing on speeding up the network portion, the biggest bottleneck
>
> ok! so how can we make it download faster :s
> ideas might include:
> - prefetching in the background (really baseline)
> - downloading compressed data and decompressing it [this opens up an interesting space of weight compression]
> - downloading only top-k items; this could be very powerful, and works better if model weights are shuffled so that similar things land in similar rows. alternatively this might work well with lora too.
> - opening multiple connections at once, possibly accessing multiple sources

:s there's a small space here of how prefetching could be calculated via lazy evaluation. torch hides the operator graph tho, ummmm, so it might be easiest to forward with virtual/meta tensors. awkward tho, huh! i wonder how prefetching could work best?

prefetching would be easy inside single operators: the code breaks data into chunks and calculates each operator over only part of it at once, so it knows clearly what's coming up next.

another space would be module children: when a module is forwarded, we can probably guess all its children could be important. similarly layers. ... the umm current resume code doesn't engage the opera--
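
here's a minimal sketch of the "multiple connections" idea from the list above, assuming the host honors HTTP Range requests; `fetch_parallel` and its parameters are illustrative, not httptransformer's actual API:

```python
# sketch: the "multiple connections" idea as parallel HTTP Range requests.
# assumes the host honors Range headers (huggingface's CDN generally does).
# fetch_parallel and its parameters are illustrative, not httptransformer's API.
import concurrent.futures

import requests

def fetch_range(url, start, end):
    # one worker per byte range; a real version would retry and resume
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    r.raise_for_status()
    return start, r.content

def fetch_parallel(url, offset, length, chunk=1 << 20, workers=8):
    stop = offset + length
    ranges = [(o, min(o + chunk, stop) - 1) for o in range(offset, stop, chunk)]
    buf = bytearray(length)
    with concurrent.futures.ThreadPoolExecutor(workers) as pool:
        futures = [pool.submit(fetch_range, url, s, e) for s, e in ranges]
        for fut in concurrent.futures.as_completed(futures):
            start, data = fut.result()
            buf[start - offset : start - offset + len(data)] = data
    return bytes(buf)
```

the same pool doubles as the baseline background prefetcher: submit ranges you expect to need soon, collect them when the compute catches up.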
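
on the virtual/meta tensor thought, here's a sketch of learning the module firing order by forwarding on pytorch's meta device, where shapes propagate but no real data moves; `module_order` is a made-up helper, and the toy model stands in for a real network-backed one:

```python
# sketch: a "virtual" forward on pytorch's meta device to learn the order
# modules (and hence their weights) will be needed, without moving real data.
# module_order is a made-up helper; the toy model stands in for a real one.
import torch
import torch.nn as nn

def module_order(model, input_shape):
    order = []
    hooks = [
        m.register_forward_pre_hook(lambda mod, args, name=name: order.append(name))
        for name, m in model.named_modules()
        if name  # skip the root module itself
    ]
    with torch.device("meta"):
        model(torch.empty(*input_shape))  # shapes propagate, no data moves
    for h in hooks:
        h.remove()
    return order

# toy example, built on meta so no weights are ever materialized
with torch.device("meta"):
    toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
print(module_order(toy, (1, 16)))  # ['0', '1', '2']
```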
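
and on prefetching inside single operators: since a chunked operator knows which slice comes next, it can overlap download with compute. a sketch, with `fetch_rows` as a hypothetical stand-in for pulling a row range of a weight matrix over the network:

```python
# sketch: chunked matrix-vector product with one-chunk lookahead.
# fetch_rows(start, end) is a hypothetical stand-in for pulling a row range
# of the weight matrix over the network; it returns an (end - start, d) tensor.
from concurrent.futures import ThreadPoolExecutor

import torch

def chunked_matvec(fetch_rows, n_rows, chunk, x):
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(fetch_rows, 0, min(chunk, n_rows))
        for start in range(0, n_rows, chunk):
            rows = nxt.result()  # wait for the chunk we need now
            end = start + rows.shape[0]
            if end < n_rows:
                # start downloading the next chunk while we compute this one
                nxt = pool.submit(fetch_rows, end, min(end + chunk, n_rows))
            out.append(rows @ x)
    return torch.cat(out)
```

the single-worker pool acts as a one-deep prefetch queue: the next chunk downloads while the current one multiplies.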
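
and for the module-children guess, a sketch using forward pre-hooks: when layer i starts forwarding, downloads for layer i+1's parameters are kicked off. `fetch(name)` is hypothetical; a real version would deduplicate requests and block until the weights are actually ready:

```python
# sketch: guessing from module structure. when layer i starts forwarding,
# kick off downloads for layer i+1's parameters. fetch(name) is hypothetical;
# a real version would deduplicate requests and block until ready.
import threading

import torch.nn as nn

def install_lookahead(layers: nn.ModuleList, fetch):
    def make_hook(next_layer):
        def hook(module, args):
            for name, _ in next_layer.named_parameters():
                threading.Thread(target=fetch, args=(name,), daemon=True).start()
        return hook
    for cur, nxt in zip(layers[:-1], layers[1:]):
        cur.register_forward_pre_hook(make_hook(nxt))
```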