
karl3@writeme.com wrote:
there's some interest in 'downloading only top k items'. this involves looking at the layer algebra and coming up with ways to identify low-contributing values. we have solved this before, possibly/optionally including preprocessing to categorize things, but top k is more fun! it seems niftier.

so we've got some input logits. these are probably getting multiplied by a _huge_ matrix. we could technically take a naive-ish approach of discarding the parts that get multiplied by values near zero. (we could actually consider that each dot product has large values and small values, and skip all values that are smaller than some fraction of the largest ones.)

- this works much better if we find a way to clump the mask based on locality :/ since http servers like to send contiguous byte ranges, not sparse masks
- this is really cool if we make something like a bayesian or error-labeled datatype, so instead of 3.4 it's more like 3.4 +- 0.31; that would give much more useful information at the end

but yeah, it seems interesting to just try the mask! it involves some simple torch kernel algebra (see the sketch below).

there's a small space here where one can get the _same exact output_ by predicting that some products would be smaller than the precision of the sum ... this might at least need information on the magnitude of the weights, unsure ... but there are likely heuristics one could apply that would be accurate, because of the rote nature of the training process, and because of the lack of useful+accurate information one would expect from an overtiny number multiplied by an overlarge one ...
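a minimal sketch of the mask idea in plain torch (the name topk_matmul_with_error and the error-bound formula are mine, not anything that exists in httptransformer): keep only the k largest-magnitude entries of each input row, and carry a worst-case bound on what the dropped terms could have contributed. the bound is the discarded |x| mass times each output row's max |w|, which is exactly the place where "information on the magnitude of the weights" comes in.

    import torch

    def topk_matmul_with_error(x, weight, k):
        # x: (..., d), weight: (out, d) -> approx and err are both (..., out)
        # keep only the k largest-magnitude entries of x along the last dim
        idx = x.abs().topk(k, dim=-1).indices
        kept = torch.zeros_like(x).scatter_(-1, idx, x.gather(-1, idx))
        approx = kept @ weight.T
        # worst-case contribution of the dropped terms, per output row j:
        #   |sum over dropped i of x_i * w_ji| <= (sum of dropped |x_i|) * max_i |w_ji|
        dropped = (x - kept).abs().sum(dim=-1, keepdim=True)   # (..., 1)
        err = dropped * weight.abs().amax(dim=-1)              # broadcasts to (..., out)
        return approx, err

for the http side you'd then coalesce the kept column indices into a few contiguous byte ranges (fetching a little extra between nearby indices is cheaper than one request per index), but that part is just bookkeeping.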
that's kind of more in line with the intent of httptransformer and llm_logits (being able to work on things like that on your cellphone), but i didn't make llm_logits for this model
ummm i guess i'll look a little at matmul
     97                 number_passes = math.ceil(weight.mem_usage_frac())
     98                 if number_passes == 1:
     99                     product = torch.matmul(input, weight.fetch(progress=name, validate_usage=False).T)
    100                 else:
    101                     rows_at_once = math.ceil(weight.shape[0] / number_passes)
    102  ->                 product = torch.cat([
    103                         torch.matmul(
    104                             input,
    105                             weight[offset : offset+rows_at_once].fetch(progress=f'row{offset}-{offset+rows_at_once}/{weight.shape[0]}', validate_usage=False).T
    106                         )
    107                         for offset in tqdm.tqdm(range(
    (Pdb) p input.shape
    torch.Size([1, 6, 16384])
    (Pdb) p input.abs().max(dim=-1)
    torch.return_types.max(
    values=tensor([[1.5359, 0.2287, 0.1609, 0.1848, 0.1869, 0.2321]], dtype=torch.float64),
    indices=tensor([[ 6303,  6303, 14427, 14427, 14427,   205]]))
    (Pdb) p input.abs().min(dim=-1)
    torch.return_types.min(
    values=tensor([[1.0807e-08, 1.3109e-07, 3.4837e-07, 7.3625e-08, 1.0437e-06, 7.8622e-08]], dtype=torch.float64),
    indices=tensor([[   78,  6285, 10787,  5347, 10964, 15229]]))
    (Pdb) p input.dtype
    torch.float64

so yes it's float64, but 1e-6 is still a lot smaller than 0.1. aren't you curious what would happen if we only multiplied the largest 16 values? or the largest 1024?
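here's a hedged sketch of that exact experiment, reusing the topk_matmul_with_error helper from above. the tensors are random stand-ins for the real activations and weights (real activations are presumably heavier-tailed than gaussian noise, so random data should be pessimistic about small k), and the 4096 output rows are an arbitrary choice:

    import torch

    x = torch.randn(1, 6, 16384, dtype=torch.float64)   # same shape/dtype as the pdb session
    W = torch.randn(4096, 16384, dtype=torch.float64)   # stand-in weight matrix
    full = x @ W.T
    for k in (16, 1024, 16384):
        approx, err = topk_matmul_with_error(x, W, k)
        actual = (approx - full).abs().amax().item()
        bound = err.amax().item()
        print(f'k={k:5d}  max abs err {actual:.3e}  worst-case bound {bound:.3e}')

at k=16384 nothing is dropped, so the error should be zero up to float64 rounding, which makes a nice sanity check on the kernel before trying k=16.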