
https://github.com/karl3wm/httptransformer (or maybe C++ or something). DeepSeek is designed with a ~5% evaluation size (it's a mixture-of-experts model, so only a small fraction of the weights are touched per token) and pretrained speculative decoding.
So the next step I left off at was subsharding large weights. I have a potential bump today, so I wanted to mention that subsharding looks pretty easy. One approach is to use torch's __torch_function__ functionality: torch can treat any object as a tensor if it has a __torch_function__ function (the examples show a class method, but member functions may work too), and it calls this function (if present) for operations instead of the torch implementations. This is very good for the embedding layer; a LazyTensor could store the url and offset and calculate and fill only the sparse columns needed for the tokens passed, saving network and memory significantly.
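A minimal sketch of that idea, assuming the embedding weight is laid out as (vocab, hidden) like torch.nn.Embedding (so the "sparse columns" become rows indexed by token id). The fetch_rows() method, the url/offset/shape arguments, and the example sizes are all hypothetical stand-ins, not httptransformer's actual API; the point is only that torch dispatches F.embedding to the object's __torch_function__ instead of its own implementation:

```python
import torch
import torch.nn.functional as F

class LazyTensor:
    """Stand-in for a large remote embedding weight; fetches only needed rows."""

    def __init__(self, url, offset, shape, dtype=torch.float32):
        self.url, self.offset = url, offset
        self.shape, self.dtype = torch.Size(shape), dtype

    def fetch_rows(self, row_ids):
        # Placeholder: a real version would issue HTTP range requests computed
        # from url + offset + row stride, covering only these rows.
        return torch.zeros(len(row_ids), self.shape[1], dtype=self.dtype)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Intercept the embedding lookup and fill only the rows for the tokens passed.
        if func is F.embedding:
            input_ids, weight = args[0], args[1]
            unique, inverse = torch.unique(input_ids, return_inverse=True)
            rows = weight.fetch_rows(unique.tolist())
            return rows[inverse]  # shape: (*input_ids.shape, hidden)
        raise NotImplementedError(f"{func} is not handled lazily")

# torch sees that the weight argument defines __torch_function__ and dispatches to it.
weight = LazyTensor("https://example.com/model.safetensors", offset=0,
                    shape=(129_280, 7_168))  # vocab x hidden, sizes illustrative
tokens = torch.tensor([[1, 5, 5, 42]])
out = F.embedding(tokens, weight)
print(out.shape)  # torch.Size([1, 4, 7168])
```

Only the unique token ids get fetched, so network traffic and memory scale with the batch's vocabulary rather than the full embedding matrix.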