
karl3@writeme.com wrote:
so i looked through https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main a little bit. the information is in the two readmes, the config file, the safetensors index json, and the modelling python class, all in that repository.

it looks like they are including a fully trained extra layer, with a full set of experts, for speculatively decoding more than 1 token (different from the architecture i've spammed about, which does all tokens as an equal class; this instead uses a single extra layer dedicated to guessing further tokens), but it does not look like this layer is used in the modelling class. in the safetensors json there are some properties unique to the mtp (multi-token prediction) layer, such as a tensor name "shared_head", which i don't find in the source that wires up the architecture.

i have a google cloud vm going while i'm on this windows system, but it could be hard to reference all the layers to test the model as-is with python. if i did -- usually when there are weights in a safetensors file that don't have a place in the model architecture, loading emits a warning, and i could look into that to see whether they just left that warning or whether there's more information on how to use the weights.

noting that overtly they describe the multi-token weights as mostly being there to improve training performance. they do also briefly express support for speculative decoding with them. the normal way people were doing speculative decoding looks like running a second model in parallel with the first, so it's possible the extra weights just need to be wired up as a parallel model. it could be fun to make it run and test outputs to figure out the right wiring by trial. but most of my time/energy would likely be spent figuring out how to map that many huge safetensors files on a system without the ram to load them; i have a w.i.p. project for that, but it has a bump. or i could look for more media or a paper regarding this model, or a repository that was used for training it; that would show the wiring of the weights for training, which would be basically the same for inference.

basically i'm curious whether it can infer more than 1 extra token at once. i'm guessing it would likely do that sequentially, like normal generation. but it might also rely on the outputs of the larger model and only make 1 token. or it might possibly be trained to make 4 tokens in one go or something, though this seems unlikely since that's not normative. the deepseek-v3 paper is at https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf and it describes the architecture wiring. it sounds like nobody has implemented the inference wiring for this yet.
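a quick check of that (rough sketch, untested; the only deepseek-specific thing i'm relying on is the "shared_head" name above) would be to pull just model.safetensors.index.json and list which tensor names look mtp-only, without downloading any weight shards:

import json
from huggingface_hub import hf_hub_download

# fetch only the index file, not the weight shards
index_path = hf_hub_download(
    repo_id="deepseek-ai/DeepSeek-V3",
    filename="model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]  # tensor name -> shard filename

# names containing "shared_head" belong to the mtp layer; anything the
# modelling class never constructs would show up as an "unexpected key"
# warning when loading the checkpoint
mtp_like = sorted(name for name in weight_map if "shared_head" in name)
for name in mtp_like:
    print(name, "->", weight_map[name])
print(len(mtp_like), "shared_head tensors")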
i could look into implementing it, but of course it's a challenge to reference all those layers just to test whether it can predict a token.
it seems a little interesting to think about working through the safetensors files one-by-one and sharding them differently, intentionally breaking them into different files based on experts, so that only a portion of them needs to be mapped to forward the model.
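something like this, maybe (untested sketch, assuming pytorch tensors, enough memory to hold one original shard at a time, and that expert tensors can be recognized by the ".mlp.experts.<n>." pattern in their names):

import re
from collections import defaultdict
from safetensors import safe_open
from safetensors.torch import save_file

EXPERT_RE = re.compile(r"\.mlp\.experts\.(\d+)\.")

def reshard_by_expert(shard_path):
    # group the tensors in one original shard by expert index (None = shared)
    groups = defaultdict(dict)
    with safe_open(shard_path, framework="pt") as f:
        for name in f.keys():
            m = EXPERT_RE.search(name)
            key = int(m.group(1)) if m is not None else None
            groups[key][name] = f.get_tensor(name)
    # write one file per expert plus one for everything shared
    for key, tensors in groups.items():
        suffix = f"expert-{key:03d}" if key is not None else "shared"
        save_file(tensors, f"{shard_path}.{suffix}.safetensors")

# hypothetical usage on a single downloaded shard:
# reshard_by_expert("path/to/one-shard.safetensors")

the idea being that a forward pass would then only need to map the shared files plus whichever expert files the router actually picks.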
i guess i'm thinking of that as kind of a subtask that might be a thing on its own ... if i did succeed at this, i wonder what i would do with the remapped weights, and what kind of auxiliary code might be needed to make use of them. maybe none! maybe it's just interesting to remap them. unsure :s
given that would only facilitate use of my buggy network mounter and isn't needed by most people, it might make sense to write one's own weight loader, by simply finding the urls to all the parts and loading them by hand.

1712 .. the other problem [ok i'll just write the network offloader]

1719 i've written code to enumerate the filename map over the network. i'm planning to write the safetensors parsing code by hand, because my experience is that the format is very simple, and that huggingface does not keep internal or extra apis stable enough to repurpose their library.

1720 oh huh, the loader is in rust! the format documentation is at https://github.com/huggingface/safetensors/tree/main/safetensors#format

1757 i've written code that produces metadata, including the remote url and the numbers from which byteranges can be computed, for every tensor. got less energy around this after that milestone; watched the output a little:

$ python3 deepseekhttp.py https://huggingface.co/deepseek-ai/DeepSeek-V3/raw/main/model.safetensors.in...
{'total_size': 1369062772000}
urls: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 163/163 [00:11<00:00, 14.21it/s]
model.layers.60.mlp.experts.184.up_proj.weight_scale_inv {'dtype': 'F32', 'shape': [16, 56], 'data_offsets': [0, 3584], 'url': 'https://huggingface.co/deepseek-ai/DeepSeek-V3/resolve/main/model-00160-of-0...', 'N': 58442}

each tensor has a dtype, shape, and data_offsets from safetensors, and i add the url and N fields. the actual byterange is N + 8 + data_offsets; N is the name the spec gives the header length. so. the accelerate library has a way to instantiate models without loading the weights. i guess i'd use that ... and somehow hook the accessing of the data to load it via http. i'll try to github this code.

... it was called deepseekhttp, but it's more normal to call it something like httptransformer and make it work with any model. i've uploaded this simple metadata enumerator to https://github.com/karl3wm/httptransformer/blob/main/httptransformer.py .
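for reference, the byterange arithmetic above boils down to something like this (rough sketch, not the code in the repo; assumes the server honors http range requests, which the huggingface resolve urls do; the shard url and tensor name at the bottom are placeholders):

import json
import struct
import requests

def read_header(url):
    # first 8 bytes of a .safetensors file: little-endian u64 N = length of the json header
    r = requests.get(url, headers={"Range": "bytes=0-7"})
    r.raise_for_status()
    (n,) = struct.unpack("<Q", r.content)
    # next N bytes: json mapping tensor names to {dtype, shape, data_offsets}
    r = requests.get(url, headers={"Range": f"bytes=8-{8 + n - 1}"})
    r.raise_for_status()
    return n, json.loads(r.content)

def fetch_tensor_bytes(url, n, entry):
    # data_offsets are relative to the end of the header, so the absolute
    # byterange within the file is 8 + N + data_offsets
    begin, end = entry["data_offsets"]
    r = requests.get(url, headers={"Range": f"bytes={8 + n + begin}-{8 + n + end - 1}"})
    r.raise_for_status()
    return r.content  # raw bytes; reinterpret per entry["dtype"] and entry["shape"]

# placeholder usage:
# url = "https://huggingface.co/deepseek-ai/DeepSeek-V3/resolve/main/<some shard>.safetensors"
# n, header = read_header(url)
# raw = fetch_tensor_bytes(url, n, header["model.layers.60.mlp.experts.184.up_proj.weight_scale_inv"])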