
karl3@writeme.com wrote:
there's some interest in 'downloading only top k items'. this involves looking at the layer algebra and coming up with ways to identify low-contributing values. we have solved this before, possibly/optionally including preprocessing to categorize things, but top k is more fun! it seems niftier.

so we've got some input logits. these are probably getting multiplied by a _huge_ matrix. we could technically take a naive-ish approach of discarding the parts that get multiplied by values near zero. (we could actually consider that each dot product has large values and small values, and skip all values that are smaller than some fraction of the largest ones.)

- this works much better if we find a way to clump the mask based on locality :/ since http servers like to send contiguous byte ranges, not sparse masks
- this is really cool if we make something like a bayesian or error-labeled datatype, so instead of 3.4 it's more like 3.4 +- 0.31; that would give much more useful information at the end

but yeah, it seems interesting to just try the mask! it involves some simple torch kernel algebra (see the sketch below).

there's a small space here where one can get the _same exact output_ by predicting that some products would be smaller than the precision of the sum ... this might at least need information on the magnitude of the weights, unsure ... but there are likely heuristics one could apply that would be accurate, because of the rote nature of the training process, and because of the lack of useful+accurate information one would expect from an overtiny number multiplied by an overlarge one ...
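a minimal sketch of the mask idea in plain torch (the name topk_matmul_with_error and the error-bound formula are mine, not anything that exists in httptransformer): keep only the k largest-magnitude entries of each input row, and carry a worst-case bound on what the dropped terms could have contributed. the bound is the discarded |x| mass times each output row's max |w|, which is exactly the place where "information on the magnitude of the weights" comes in.

    import torch

    def topk_matmul_with_error(x, weight, k):
        # x: (..., d), weight: (out, d) -> approx and err are both (..., out)
        # keep only the k largest-magnitude entries of x along the last dim
        idx = x.abs().topk(k, dim=-1).indices
        kept = torch.zeros_like(x).scatter_(-1, idx, x.gather(-1, idx))
        approx = kept @ weight.T
        # worst-case contribution of the dropped terms, per output row j:
        #   |sum over dropped i of x_i * w_ji| <= (sum of dropped |x_i|) * max_i |w_ji|
        dropped = (x - kept).abs().sum(dim=-1, keepdim=True)   # (..., 1)
        err = dropped * weight.abs().amax(dim=-1)              # broadcasts to (..., out)
        return approx, err

for the http side you'd then coalesce the kept column indices into a few contiguous byte ranges (fetching a little extra between nearby indices is cheaper than one request per index), but that part is just bookkeeping.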
that's kind of more in line with the intent of httptransformer and llm_logits (being able to work on things like that on your cellphone), but i didn't make llm_logits for this model
ummm i guess i'll look a little at matmul
     97                 number_passes = math.ceil(weight.mem_usage_frac())
     98                 if number_passes == 1:
     99                     product = torch.matmul(input, weight.fetch(progress=name, validate_usage=False).T)
    100                 else:
    101                     rows_at_once = math.ceil(weight.shape[0] / number_passes)
    102  ->                 product = torch.cat([
    103                         torch.matmul(
    104                             input,
    105                             weight[offset : offset+rows_at_once].fetch(progress=f'row{offset}-{offset+rows_at_once}/{weight.shape[0]}', validate_usage=False).T
    106                         )
    107                         for offset in tqdm.tqdm(range(
    (Pdb) p input.shape
    torch.Size([1, 6, 16384])
    (Pdb) p input.abs().max(dim=-1)
    torch.return_types.max(
    values=tensor([[1.5359, 0.2287, 0.1609, 0.1848, 0.1869, 0.2321]], dtype=torch.float64),
    indices=tensor([[ 6303,  6303, 14427, 14427, 14427,   205]]))
    (Pdb) p input.abs().min(dim=-1)
    torch.return_types.min(
    values=tensor([[1.0807e-08, 1.3109e-07, 3.4837e-07, 7.3625e-08, 1.0437e-06, 7.8622e-08]], dtype=torch.float64),
    indices=tensor([[   78,  6285, 10787,  5347, 10964, 15229]]))
    (Pdb) p input.dtype
    torch.float64

so yes it's float64, but 1e-6 is still a lot smaller than 0.1. aren't you curious what would happen if we only multiplied the largest 16 values? or the largest 1024?
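here's a hedged sketch of that exact experiment, reusing the topk_matmul_with_error helper from above. the tensors are random stand-ins for the real activations and weights (real activations are presumably heavier-tailed than gaussian noise, so random data should be pessimistic about small k), and the 4096 output rows are an arbitrary choice:

    import torch

    x = torch.randn(1, 6, 16384, dtype=torch.float64)   # same shape/dtype as the pdb session
    W = torch.randn(4096, 16384, dtype=torch.float64)   # stand-in weight matrix
    full = x @ W.T
    for k in (16, 1024, 16384):
        approx, err = topk_matmul_with_error(x, W, k)
        actual = (approx - full).abs().amax().item()
        bound = err.amax().item()
        print(f'k={k:5d}  max abs err {actual:.3e}  worst-case bound {bound:.3e}')

at k=16384 nothing is dropped, so the error should be zero up to float64 rounding, which makes a nice sanity check on the kernel before trying k=16.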