
karl3@writeme.com wrote:
karl3@writeme.com wrote:
karl3@writeme.com wrote:

there was some energy around making a network-based inference engine, maybe by modifying deepseek.cpp (don't quite recall why not staying in python, some concern arose). task got weak. found cinatra as a benchmark leader for c++ web engines (although pico.v was the top! surprised all c++ http engines were beaten by java O_O, very curious about this, wondering if it's a high-end test system. never heard of the V language but it's interesting that it won a leaderboard). inhibition ended up discovering a concern somewhat like ... on this 4GB ram system it might take 15-33GB of network transfer for each forward pass of the model ... [multi-token passes ^^]

karl3@writeme.com wrote:

the concern resonates with difficulty making the implementation, and some form of inhibition or concern around using python. notably, i've made offloading python hooks a lot and they never last due to the underlying interfaces changing (although those interfaces have stabilized much more now that hf made accelerate their official implementation) (also i think the issue is more severe dissociative associations than the interface, if one considers the possibility of personal maintenance and use rather than usability for others). i don't immediately recall the python concern; i seem to be off task or taking a break, but it would make sense to do disk caching too. there is also the option of quantizing. basically, LLMs and AI in general place r&d effort between the user and ease, smallness, cheapness, power, etc.

I poked at python again. The existing implementations of the 8 bit quantization used by the model all require an NVidia GPU, which I do not presently have. It is fun to imagine making it work, maybe I can upcast it to float32 or something
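
as a sanity check on that 15-33GB-per-pass figure, here's a rough back-of-envelope. the parameter counts are deepseek-v3's published figures as i remember them (671B total, 37B activated per token), not anything measured from the checkpoint, so treat them as assumptions:

total_params    = 671e9  # all experts
active_params   = 37e9   # routed + shared parameters activated per token
bytes_per_param = 1      # the checkpoint is distributed in 8-bit fp8
print(active_params * bytes_per_param / 1e9, 'GB')  # ~37 GB if every activated weight is refetched

with only a few GB of ram to cache the dense/shared layers, and different experts being hit on different tokens, something in the quoted 15-33GB range per pass seems plausible.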
so i have implemented a little of this. it creates a model state_dict that lazily loads things off the network when accessed. it's quite fast! some of the layers are larger than my available ram; i have deferred that with a plan to resolve it by either storing them in mmap'd files or further sharding the modules that use them (ideally both). i did not implement cpu operators for 8 bit quantization, but prevented the check from firing so i could keep testing for now.

the vm i have has 3-4 GB of ram, so it's easy to exhaust with this task, at which point it freezes and i have to remotely power it off and on, which it does slowly because it is out of ram and has no swap. kind of a downer when it happens.

the deepseek-v3 model implementation allocates a separate buffer for the RoPE position encodings in each layer. i haven't looked closely at the various implementations the architecture can configure, but in llama these buffers were algorithmically defined and constant; they're not available over the network but could be written to a file. right now they are exhausting my ram and causing the vm thrash-freeze: the model has 60+1 layers and each position encoding uses a few hundred megabytes :/ their size grows with the max input context length, so a quick fix could be to reconfigure the model to expect a much smaller context. i was trying to patch them to cache their construction so as to share the buffer across layers, but my attempt failed and i froze the vm again :)

late. sleep times. 2326

2400 ok i got to the next milestone. it constructs a model, puts it in a pipeline, and all the weights download from the network when used. for it to do anything on this tiny system, more code is needed to mmap or subshard large weights (a rough sketch of the mmap route is at the end of this message). current code is like this:

import inspect, json, psutil
import accelerate, requests, torch, tqdm, transformers

class Quirks:
    if not torch.cuda.is_available():
        # this avoids FP8 assertions on cpu during placement testing
        torch.cuda.is_available = lambda: True
        torch.cuda.get_device_capability = lambda: [9,0]

    #def unify_rope(model_or_class):
    #    # deepseek generates separate rope buffers for each layer which can use significant memory
    #    deepseek = inspect.getmodule(model_or_class)
    #    cache = {}
    #    def wrap_rope(rope):
    #        def wrapper(*params, **kwparams):
    #            key = tuple(params) + tuple(kwparams)
    #            if key in cache:
    #                return cache[key]
    #            else:
    #                val = rope(*params, **kwparams)
    #                cache[key] = val
    #                return val
    #        return wrapper
    #    for key, val in deepseek.__dict__.items():
    #        if 'Rotary' in key and isinstance(val, torch.nn.Module):
    #            setattr(deepseek, key, wrap_rope(val))

class LazyStateDict(dict):
    def __init__(self, tensor_request_by_name, device):
        super().__init__(tensor_request_by_name)
        self.session = requests.Session()
        self.device = device

    def get_meta_tensor(self, weight):
        return super().__getitem__(weight)[0]

    def __getitem__(self, weight):
        tensor, request = super().__getitem__(weight)
        chunk_size = 1024*128
        with tqdm.tqdm(desc=weight, leave=False, total=tensor.nbytes) as pbar:
            #if tensor.nbytes > psutil.virtual_memory().available / 2:
            #    print(weight, 'is more than half available vram, mmapping a file ...')
            #else:
            # we could also further shard the embeddings and lm_head
            # since embeddings are sparsely used
            assert tensor.nbytes < psutil.virtual_memory().available / 2
            buffer = memoryview(bytearray(tensor.nbytes))
            with self.session.send(request, stream=True) as response:
                while pbar.n < pbar.total:
                    pbar.update(
                        response.raw.readinto(
                            buffer[ pbar.n : pbar.n + chunk_size ]
                        )
                    )
            result = torch.frombuffer(
                buffer,
                dtype = tensor.dtype,
                count = tensor.numel(),
                requires_grad = False,
                device = self.device,
            ).reshape(tensor.shape)
            return result

    def items(self):
        for key in self:
            yield [key, self[key]]

    def values(self):
        for key in self:
            yield self[key]

    def largest(self):
        return max([
            [key, tensor]
            for key, [tensor, _] in super().items()
        ], key = lambda keytensor: keytensor[1].nbytes)[1]

    @staticmethod
    def tensor_request_from_json(url, N, data):
        dtype = data['dtype']
        dtype = dict(
            F32 = torch.float32,
            F8_E4M3 = torch.float8_e4m3fn,
            BF16 = torch.bfloat16,
        )[dtype]
        shape = data['shape']
        tensor = torch.empty(shape, dtype=dtype, device='meta')
        start, end = data['data_offsets']
        # offsets are relative to the end of the json header; the Range header is inclusive, hence the -1
        start += N + 8
        end += N + 8 - 1
        request = requests.Request('GET', url, dict(Range='bytes='+str(start)+'-'+str(end)))
        request = request.prepare()
        return [tensor, request]

    @classmethod
    def from_user_repo_branch(cls, user, repo, branch, device):
        base_url = f'https://huggingface.co/{user}/{repo}/raw/{branch}/'
        lfs_base_url = f'https://huggingface.co/{user}/{repo}/resolve/{branch}/'
        safetensors_index_url = base_url + 'model.safetensors.index.json'
        print(safetensors_index_url)
        with requests.get(safetensors_index_url, stream=True) as response:
            safetensors_index = json.load(response.raw)
        print(safetensors_index['metadata'])
        fn_by_weight = safetensors_index['weight_map']
        urls = [lfs_base_url + fn for fn in set(fn_by_weight.values())]
        #url_range_dict = {}
        #data_by_weight = {}
        tensor_request_by_name = {}
        with tqdm.tqdm(urls, desc='constructing tensor urls') as pbar:
            for url in pbar:
                # we could potentially also check the git-lfs sha256 from the base url and merklify the data too, this would mean downloading it all
                #[b'version https://git-lfs.github.com/spec/v1', b'oid sha256:e94d32e8649e1a5b03cc0a343c59ca5a6d80d03cd46161b482fd3bb2484adb7d', b'size 4302350824']
                #lfs = dict([ line.decode().split(' ', 1) for line in response.iter_lines() ])
                with requests.get(url, stream=True) as response:
                    # safetensors files start with an 8-byte little-endian header length, then the json header
                    N = int.from_bytes(response.raw.read(8), 'little')
                    header = json.loads(response.raw.read(N))
                for weight, data in header.items():
                    if weight == '__metadata__':
                        continue
                    #dtype = data['dtype']
                    #shape = data['shape']
                    #start, end = data['data_offsets']
                    #start += headersize + 8
                    #end += headersize + 8
                    #data_by_weight[weight] = data | {'url':url,'N':headersize}
                    tensor_request_by_name[weight] = cls.tensor_request_from_json(url, N, data)
        return cls(tensor_request_by_name, device)

def construct(model_id, device, config_patches = {}, attr_patches = {}):
    user, repo = model_id.split('/',1)
    branch = 'main'
    print(user, repo, branch)
    config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    for key, val in config_patches.items():
        setattr(config, key, val)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    with accelerate.init_empty_weights(), transformers.modeling_utils.no_init_weights():
        model = transformers.AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    for key, val in attr_patches.items():
        setattr(model, key, val)
    lazy_state_dict = LazyStateDict.from_user_repo_branch(user, repo, branch, device=device)
    # misuse cpu offloading by providing lazy_state_dict
    model = accelerate.cpu_offload(model, device, state_dict = lazy_state_dict)
    model.hf_device_map = { '': device }
    return transformers.pipeline('text-generation', model=model, config=config, tokenizer=tokenizer)

pipe = construct(
    'deepseek-ai/DeepSeek-V3',
    device = 'cpu',
    config_patches = dict(
        max_position_embeddings = 64, # drop ctx len from 163840 to 64
    ),
    attr_patches = dict(
        _supports_cache_class = False, # might be a bug that this isn't in the model
    ),
)
pipe('Once upon a time,')
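
for the "mmap or subshard large weights" part, here is the rough sketch i mentioned of the mmap route. it is untested and assumes torch.frombuffer will wrap a numpy memmap like any other buffer (it shares memory with whatever it is given, so the tensor should stay file-backed and the OS can page it in and out instead of holding it in ram). download_to_mmap and its path parameter for a scratch file are names i made up for the sketch; something like it could replace the assert in LazyStateDict.__getitem__:

import numpy, torch

def download_to_mmap(session, request, tensor, path, chunk_size = 1024*128):
    # stream an oversized weight into a file-backed buffer instead of ram.
    # 'tensor' is the meta tensor describing dtype/shape, 'request' the prepared range request,
    # 'path' a scratch file to back the mapping. untested sketch.
    mm = numpy.memmap(path, dtype=numpy.uint8, mode='w+', shape=(tensor.nbytes,))
    view = memoryview(mm)
    offset = 0
    with session.send(request, stream=True) as response:
        while offset < tensor.nbytes:
            offset += response.raw.readinto(view[offset : offset + chunk_size])
    mm.flush()
    # frombuffer shares memory with the memmap, so nothing is copied into ram up front
    return torch.frombuffer(mm, dtype=tensor.dtype, count=tensor.numel()).reshape(tensor.shape)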
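
and for the "upcast it to float32" idea from the quoted email, a sketch of what dequantizing one weight on cpu might look like. it assumes the checkpoint pairs each fp8 tensor with a blockwise weight_scale_inv tensor (one scale per 128x128 tile, applied by multiplication), which is what i remember of the model card but have not verified, and it assumes the installed torch can cast float8_e4m3fn on cpu:

import torch

def upcast_fp8_block_weight(weight_fp8, scale_inv, block = 128):
    # dequantize a blockwise-quantized float8_e4m3fn weight to float32 on cpu.
    # scale_inv is assumed to hold one scale per (block x block) tile of the weight.
    w = weight_fp8.to(torch.float32)
    rows, cols = w.shape
    # expand each per-tile scale to cover its tile, then trim to the weight's shape
    scale = scale_inv.to(torch.float32)
    scale = scale.repeat_interleave(block, dim=0)[:rows]
    scale = scale.repeat_interleave(block, dim=1)[:, :cols]
    return w * scale

it quadruples the memory of whatever it touches, so it probably only makes sense combined with the mmap route above.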