
cypherpunks list not loading for me. also many emails from it are missing from my inbox at the moment. here is the last spam i tried to send during the connectivity issues:

troubleshooting deepseek inference failure [on remote hardware]

transformers/modeling_utils.py line 4788:
- `p` is a layer-61 weight, "model.layers.61.self_attn.q_a_proj.weight"
- `param_device_map[p]` does not exist
- `p` is enumerated from `weight_map`

transformers modeling_utils.py line 4785:
- `weight_map` has the layer-61 weights and `param_device_map` does not
- an example layer-61 weight is "model.layers.61.self_attn.q_a_proj.weight"
- this is in PreTrainedModel._load_pretrained_model

0352 what conditions cause this block to execute? where do weight_map and param_device_map come from? `weight_map` is constructed in the previous block, in an `else` branch at line 4783 (indent depth 3). which weight map gets constructed? going up the file: indent depth 3 is the weight-map condition and indent depth 2 is the offload-code condition. the weight-map condition is `if sharded_metadata is None`:

    (Pdb) p sharded_metadata is None
    False

so we have `weight_map = {p: os.path.join(folder, f) for p, f in sharded_metadata["weight_map"].items()}` -> the weight map is constructed from `sharded_metadata`. if sharded_metadata were None, it would be constructed from `original_loaded_keys` and would still contain the layer-61 weights. it looks like a good avenue would be either to figure out why `param_device_map` does not have layer-61 keys, or why the larger block is being executed at all.

0358 line 4773, indent depth 2: `if device_map is not None and is_safetensors`. so basically this block is only run if there is both a device map and is_safetensors is set. i think i'm manually setting is_safetensors; maybe i'll try disabling it and see if i can generate the data then. 0359

0400 ok, while that is loading let's see if we can figure out where param_device_map comes from.

0402: removing `use_safetensors` did not resolve the crash. param_device_map is set on line 4774:

    4773 if device_map is not None and is_safetensors:
    4774     param_device_map = expand_device_map(device_map, original_loaded_keys, start_prefix)

basically, `device_map` is keyed by `model.layers.[i]` but does not have an entry for layer 61, the extra layer, so when it is expanded to per-parameter keys it doesn't have any of the weights in that layer. this probably happens when the device map is autogenerated, which happened outside this function.

0405 but rather in the calling function, .from_pretrained(), likely line 4259, device_map = infer_auto_device_map(...). right now:

    (Pdb) p device_map_kwargs
    {'no_split_module_classes': ['DeepseekV3DecoderLayer'], 'special_dtypes': {'lm_head.weight': torch.bfloat16}, 'max_memory': {'cpu': 85212960085}}

0407 so basically it sounds like these weights are not present in the model enumeration but are present on disk. i have run the model before, as have many others, so there's some way to make it work. it looks like the easiest way is to disable device_map, which may mean fitting the entire model on one device, or it may mean manually calling offload code after construction. i could maybe put it on cpu, then set the dtype and offloading after. or maybe i can set the offloading for the whole model without using a device map somehow .... maybe not.
- set a breakpoint on infer_auto_device_map ? (i confirmed the layer is not in the model)
- look at the model source code again to see if the layer can be enabled for this step
- try calling without a device map
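to make that expansion mismatch concrete, here is a toy sketch. it is not the transformers code; the expand() helper and the key names are just mine for illustration. a module-level device map inferred from the *instantiated* model knows nothing about layer 61, so expanding it over the *checkpoint's* parameter names leaves those names unmapped, and a plain dict lookup like the one at line 4788 raises KeyError.

    # toy illustration only, not transformers' expand_device_map
    device_map = {f"model.layers.{i}": "cpu" for i in range(61)}  # modules 0..60, as inferred from the model
    checkpoint_keys = [
        "model.layers.60.self_attn.q_a_proj.weight",
        "model.layers.61.self_attn.q_a_proj.weight",  # on disk only, no module behind it
    ]

    def expand(device_map, keys):
        """naive longest-prefix expansion from module names to parameter names"""
        expanded = {}
        for key in keys:
            prefixes = [m for m in device_map if key == m or key.startswith(m + ".")]
            if prefixes:  # keys with no matching module are silently skipped
                expanded[key] = device_map[max(prefixes, key=len)]
        return expanded

    param_device_map = expand(device_map, checkpoint_keys)
    try:
        _ = param_device_map["model.layers.61.self_attn.q_a_proj.weight"]
    except KeyError:
        print("unmapped checkpoint key -> same shape of failure as modeling_utils.py line 4788")

a defensive patch would be to use param_device_map.get(p) and skip unmapped keys, which is roughly the key-presence check i try further down.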
some confusion: it looks like the checkpoint has _62_ layers, whereas .... uhhh ...

so num_hidden_layers is 61 and num_nextn_predict_layers is 1. the ModuleList .layers is constructed from num_hidden_layers and its names range from 0 to 60. so the layer named "61" is the extra MTP ("nextn") layer, and it's the 62nd layer. confusing, because there are 61 hidden layers and this seemed like the kind of community that might use 1-based numbering, but nope: layer 61 is the 62nd layer, the MTP layer, and it's not in the list of layers. so i don't see any way for layer 61 to be instantiated here :/ which is strange because i thought i'd seen it eval'd. maybe i can look at my logits and see what happened ! 0417

0424 no, the log doesn't show layer 61 ever used. it does show expert 61 used a lot, maybe i misread that. ok, hmm, so the huggingface device_map code assumes that what's on disk matches what's in the model ... but i know elsewhere in the code they often handle that mismatch, so maybe something just needs to be set for the mismatch to be tolerated ...? 0425

0427 looks like the mismatched-key code might come after this code; the present assumption might be that sharded, device-mapped models are already tuned for use. hmm, there's an unused function _load_pretrained_model_low_mem that looks intended for people like me to try out; its keys come from the state_dict parameter. so i could either look into the function for loading that, or preload a custom state dict, or not use a device map. it looks like it might work to call transformers.modeling_utils.load_state_dict in advance and filter the unused keys. oh no, that function is only used if the checkpoint is not sharded; the key list comes from get_checkpoint_shard_files. hrm >(

ok, options:
- likely a way by passing a custom state dict
- likely a way by not using a device map
- likely a way by engaging internals, one option is get_checkpoint_shard_files
- likely a way by modifying the model to add the unused layers in

that last option might be _easiest and quickest_ here, even though it's kind of a unique quirk just for generating test data. i'd just list all the keys in the weights that are on layer 61 and patch them in, i guess.

when i run without a device map, the warning says i'm supposed to use device_map='cuda'. it seems happy to load on cpu. hmm, device_map='cuda' seems to work. why is this? ok, i'll try on an H100 again. last time i checked i had $6 on vast.ai; an H100 is maybe $2.50/hr. 0516

ok, device_map='cuda' works fine but then i run out of gpu memory ... 0526

so i stepped into device_map='cuda' and i'm around line 4586, and it did actually enumerate missing_keys and unexpected_keys way back on line 4582 ... there is also a list of unexpected keys to accept:

    4620 # Some models may have keys that are not in the state by design, removing them before needlessly warning
    4621 # the user.
    4622 -> if cls._keys_to_ignore_on_load_missing is not None:
    4623     for pat in cls._keys_to_ignore_on_load_missing:
    4624         missing_keys = [k for k in missing_keys if re.search(pat, k) is None]
    4625
    4626 if cls._keys_to_ignore_on_load_unexpected is not None:
    4627     for pat in cls._keys_to_ignore_on_load_unexpected:
    4628         unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
    4629 if hf_quantizer is not None:
    4630     missing_keys = hf_quantizer.update_missing_keys(model, missing_keys, prefix)

however, layer 61 is still in loaded_keys afterward, despite being detected as unexpected.
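as a side note on that "list all the keys on layer 61" idea: the sharded checkpoint ships an index file whose weight_map maps every parameter name to its shard file, so the extra keys can be enumerated without loading any weights. a small sketch (the directory path is hypothetical, and i'm assuming the usual "model.safetensors.index.json" filename):

    import json, os

    checkpoint_dir = "/path/to/deepseek-checkpoint"  # hypothetical local snapshot path

    # the index maps parameter names -> shard filenames for sharded safetensors checkpoints
    with open(os.path.join(checkpoint_dir, "model.safetensors.index.json")) as f:
        weight_map = json.load(f)["weight_map"]

    # every on-disk weight belonging to the uninstantiated layer 61
    mtp_keys = sorted(k for k in weight_map if k.startswith("model.layers.61."))
    for k in mtp_keys:
        print(k, "->", weight_map[k])
    print(len(mtp_keys), "checkpoint keys have no module behind them")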
ok, so on line 4773 is_safetensors is _false_ and the failing block isn't executed. that's basically why it worked. so why is is_safetensors false?

looks like on line 4534 is_safetensors is only set if device_map contains "disk". so it sounds like deepseek will run if i offload to cpu and not to disk. maybe if i can get a VM running i can use swap. i haven't gotten VMs working on vast.ai, it won't let me connect to them. hrm. maybe i'll just patch those lines to run the model! i can add a check for the key to be present. lemme see how that works. line 4788 of modeling_utils.py 0535

0556 well, now i get an error in get_disk_only_shard_files. i might want to just capture some weights manually at this point.

- initially config.quantization_config = {'activation_scheme': 'dynamic', 'fmt': 'e4m3', 'quant_method': 'fp8', 'weight_block_size': [128, 128]}
- then config.quantization_config = AutoHfQuantizer.merge_quantization_configs(config.quantization_config, quantization_config=None) = FineGrainedFP8Config(quant_method=<QuantizationMethod.FP8: 'fp8'>)
- then:

    3691 ->  hf_quantizer = AutoHfQuantizer.from_config(
    3692         config.quantization_config,
    3693         pre_quantized=pre_quantized,  # = True
    3694     )

    3699     hf_quantizer.validate_environment(
    3700         torch_dtype=torch_dtype,
    3701         from_tf=from_tf,
    3702 ->      from_flax=from_flax,
    3703         device_map=device_map,
    3704         weights_only=weights_only,
    3705     )
    3706     torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
    3707     device_map = hf_quantizer.update_device_map(device_map)

  (... the model is constructed with empty weights ...)

    4200 ->  hf_quantizer.preprocess_model(
    4201         model=model, device_map=device_map, keep_in_fp32_modules=keep_in_fp32_modules
    4202     )

it looks like preprocess_model is replacing Linear modules with FP8Linear modules, before weights are loaded. so that's likely a really important step my code was missing ... now it's doing the weight loading code i engaged with so much ... [hey, one thing i could do is forward with saving weights, but only save them for e.g.
the first layer] it looked like some of the param quantization initialization could have been in _load_state_dict_into_meta_model or somesuch so here's this, but it doesn't look properly initialized: (Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight.cpu() tensor([[ -22.0000, -72.0000, 88.0000, ..., -9.0000, -208.0000, -28.0000], [ 128.0000, 14.0000, 16.0000, ..., 104.0000, -64.0000, 26.0000], [ 72.0000, -36.0000, 64.0000, ..., -120.0000, 80.0000, -72.0000], ..., [-144.0000, 80.0000, 48.0000, ..., -72.0000, -96.0000, 72.0000], [ -80.0000, 120.0000, 72.0000, ..., -44.0000, 112.0000, 112.0000], [ 224.0000, 4.5000, -56.0000, ..., 160.0000, -64.0000, 36.0000]], dtype=torch.float8_e4m3fn) these are much higher magnitude numbers than i'd expect, i don't think they've been scaled here ok it's in weight_scale_inv: (Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight_scale_inv.cpu() tensor([[0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0004, 0.0002, 0.0002], [0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0001], [0.0003, 0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0001, 0.0002, 0.0002, 0.0004, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0001, 0.0001, 0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002], [0.0004, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0003, 0.0002, 0.0003, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0003, 0.0004, 0.0003, 0.0001, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0002, 0.0003, 0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0005, 0.0002, 0.0002, 0.0001, 0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004, 0.0002, 0.0001], [0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0002, 0.0003, 0.0002, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0001, 0.0003, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0002], [0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 
0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001], [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0004, 0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0001, 0.0002, 0.0002, 0.0001, 0.0003, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002], [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, [3/1806] 0.0002, 0.0003, 0.0002, 0.0002, 0.0001, 0.0004, 0.0003, 0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0001, 0.0003, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002], [0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002], [0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002], [0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001], [0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0001, 0.0001, 0.0002]]) and of course i could have made mistakes copying that by hand from pdb
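if i'm reading the fp8 block scheme right (weight_block_size [128, 128] from the quantization_config above), the float8_e4m3fn tensor holds raw codes and weight_scale_inv holds one scale per 128x128 block, so dequantizing is just multiplying each block by its scale. a quick sketch of that, with a helper name of my own; this is not FP8Linear's actual code:

    import torch

    def dequant_block_fp8(weight: torch.Tensor, scale_inv: torch.Tensor, block: int = 128) -> torch.Tensor:
        """multiply each 128x128 block of the fp8 weight by its per-block scale."""
        w = weight.to(torch.float32)
        # expand the per-block scales up to the weight's shape (slicing handles ragged edges)
        scale = scale_inv.repeat_interleave(block, dim=0)[: w.shape[0]]
        scale = scale.repeat_interleave(block, dim=1)[:, : w.shape[1]]
        return w * scale

    # e.g. with the tensors inspected above (q_a_proj should be 1536x7168, so scale_inv is 12x56,
    # which matches the 56-wide rows copied by hand):
    # w = model_to_load.model.layers[0].self_attn.q_a_proj.weight.cpu()
    # s = model_to_load.model.layers[0].self_attn.q_a_proj.weight_scale_inv.cpu()
    # dequant_block_fp8(w, s)  # values land back in a sane range, roughly |code| * 2e-4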