
karl3@writeme.com wrote:
https://github.com/karl3wm/httptransformer (or maybe a c++ rewrite or something). deepseek is designed with only about 5% of its weights active per evaluation and pretrained speculative decode, so the next step i left off at was subsharding large weights. i have a potential bump today so i wanted to mention that subsharding looks pretty easy. one approach is to use torch's __torch_function__ functionality: torch will treat any object as a tensor if it has a __torch_function__ function (the examples show a class method, but member functions may work too), and it calls this function (if present) for operations rather than the torch implementations. this is very good for the embedding layer -- a LazyTensor could store the url and offset and fetch and fill only the sparse columns needed for the tokens passed, saving network and memory significantly.
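here's a minimal sketch of that __torch_function__ idea for an embedding weight. the class name, the fetch_rows callable, and the shapes are made up for illustration (fetch_rows stands in for something like an http range request against the remote file); this isn't code from httptransformer:

    import torch
    import torch.nn.functional as F

    class LazyEmbeddingWeight:
        # stand-in for a weight living at some url+offset; fetch_rows is a
        # hypothetical callable that downloads only the requested rows
        def __init__(self, shape, fetch_rows):
            self.shape = shape
            self._fetch_rows = fetch_rows

        @classmethod
        def __torch_function__(cls, func, types, args=(), kwargs=None):
            kwargs = kwargs or {}
            if func is F.embedding:
                tokens, weight = args[0], args[1]  # padding_idx etc. ignored in this sketch
                # only the unique token ids in this batch get fetched
                uniq, inverse = torch.unique(tokens, return_inverse=True)
                rows = weight._fetch_rows(uniq.tolist())  # (len(uniq), dim)
                return rows[inverse]                      # tokens.shape + (dim,)
            raise NotImplementedError(f"{func} not handled in this sketch")

    # toy fetch that returns zeros instead of hitting the network
    def fake_fetch(row_ids):
        return torch.zeros(len(row_ids), 8)

    weight = LazyEmbeddingWeight((100, 8), fake_fetch)
    out = F.embedding(torch.tensor([[3, 7, 7]]), weight)  # dispatches to __torch_function__
    print(out.shape)  # torch.Size([1, 3, 8])

F.embedding dispatches here because the weight argument defines __torch_function__, so only the rows for the tokens actually passed ever get materialized.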
i spent a lot of hours playing with things around this, although most of it was spent generating model tracing data to validate inference code. there's a second, more organized implementation, a little old now but still useful:
- https://github.com/karl3wm/httptransformer/blob/main/netsafetensors.py loads the huggingface safetensors format remotely from their git-lfs with a variable-sized cache, so it only downloads the tensors actually used in evaluation; it memory-maps them all to disk if there is space, and otherwise downloads them from the network on use (rough sketch of the remote read below)
- https://github.com/karl3wm/httptransformer/blob/main/nettensors.py wraps netsafetensors with pytorch tensors and presents a lazy tensor and a lazy state dict object so that entire models can be run off the network; it uses similar code to netsafetensors to provide a variable-sized cache in RAM
- https://github.com/karl3wm/httptransformer/blob/main/test_nettensors.py runs language models off the network and validates that the logits they produce compare well with recorded logits i made at https://huggingface.co/datasets/baffo32/llm_logits (comparison sketched below) -- but of course only a couple models are recorded there now
deepseek doesn't run correctly. it's designed to only work on the H100, a high-end datacenter GPU, and the huggingface quantization initialization code only runs under certain conditions. i haven't figured out whether there is a correct way to run it under other conditions or not. it _looks_ like it would run fine on any cpu or gpu if the code to do so were enabled nowadays, but i could be wrong. as-is it loads but doesn't init its quantization block scalars and produces very wrong outputs. meanwhile, things have gotten a little funny online around it ... i might step back from trying this again with this model and let them sort out ...
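for reference, the remote read in netsafetensors boils down to roughly this (my sketch, not the repo's code; no caching or memory-mapping here, and the url/tensor name at the bottom are placeholders). safetensors files start with an 8-byte little-endian header length, then a json header giving each tensor's dtype, shape, and data offsets, then the raw data, so individual tensors can be pulled with http range requests:

    import json
    import struct
    import numpy as np
    import requests

    # numpy has no bfloat16, so BF16 comes back as raw uint16 in this sketch
    DTYPES = {"F32": np.float32, "F16": np.float16, "BF16": np.uint16,
              "I64": np.int64, "I32": np.int32, "U8": np.uint8}

    def _range(url, start, end):
        # http range requests use inclusive byte offsets
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        r.raise_for_status()
        return r.content

    def fetch_tensor(url, name):
        (hlen,) = struct.unpack("<Q", _range(url, 0, 7))   # header length
        header = json.loads(_range(url, 8, 8 + hlen - 1))  # per-tensor metadata
        meta = header[name]
        begin, end = meta["data_offsets"]                  # relative to end of header
        raw = _range(url, 8 + hlen + begin, 8 + hlen + end - 1)
        return np.frombuffer(raw, dtype=DTYPES[meta["dtype"]]).reshape(meta["shape"])

    # hypothetical usage against a hugging face git-lfs file:
    # arr = fetch_tensor("https://huggingface.co/<repo>/resolve/main/model.safetensors",
    #                    "model.embed_tokens.weight")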
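and the "compare well" check is conceptually just something like the below (again my sketch; the real script reads the recorded logits out of the llm_logits dataset, whose layout i'm not reproducing here):

    import torch

    def compare_logits(computed, recorded, atol=1e-2):
        # computed/recorded: (seq_len, vocab) float tensors for the same prompt
        diff = (computed - recorded).abs()
        return {
            "max_abs_diff": diff.max().item(),
            "top1_agreement": (computed.argmax(-1) == recorded.argmax(-1)).float().mean().item(),
            "close": torch.allclose(computed, recorded, atol=atol),
        }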