# [ot][crazy][spam] Notes: Matmul Acceleration

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Sat Jun 25 04:26:04 PDT 2022

4.1 new encoding g(a)

- new trainable hash functions
- balanced binary regression trees, each tree a hash bucket

leaf(x):
traverse the tree from the root, at each node selecting a child based
on whether x_j exceeds a node-specific threshold v.
for simd, the tree is limited to 16 leaves, and j is depth-specific
(one split index shared across each level). vectorization details in
Appendix B

this is called Algorithm 1, the MaddnessHash
input: vector x, split indices j^1..j^4, split thresholds v^1..v^4
  i <- 1                         // node index within level of tree
  for t <- 1 to 4 do
    v <- v^t_i                   // lookup split threshold for node i at level t
    b <- x_(j^t) >= v ? 1 : 0    // above split threshold?
    i <- 2i - 1 + b              // assign to left or right child
  end for
  return i
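the pseudocode above can be sketched in runnable form. this is a
minimal sketch, assuming a 4-level balanced tree (16 leaves) and
0-based indexing (so the child step is 2i + b rather than the paper's
1-indexed 2i - 1 + b); the names split_idxs and thresholds are mine:

```python
def maddness_hash(x, split_idxs, thresholds):
    """Map vector x to one of 16 leaves (sketch of Algorithm 1).

    split_idxs: list of 4 feature indices, one per tree level.
    thresholds: list of 4 lists; thresholds[t][i] is the split
        threshold for node i at level t (2**t nodes per level).
    """
    i = 0  # node index within the current level (0-based here)
    for t in range(4):
        v = thresholds[t][i]                   # threshold for node i at level t
        b = 1 if x[split_idxs[t]] >= v else 0  # above split threshold?
        i = 2 * i + b                          # chosen child's index at next level
    return i  # leaf index in [0, 16)
```

with all-zero thresholds this just reads off the sign pattern of the
four selected coordinates as a 4-bit leaf index.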

-------------------
i'm guessing i can learn more effectively what i might need from
source code than most of this. skipping a little to see if theory is
mentioned.

- hash function parameters constructed in greedy manner
- algorithm for forming in Algorithm 2
- experimentally, selecting more than 4 split indices gave little
improvement in loss

Algorithm 2 is 22 lines long. section 4.2 describes its parts.
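a heavily simplified sketch of what i take the greedy construction to
be: at each level, pick the one feature index whose bucket-wise median
split most reduces within-bucket sum of squared errors. the real
Algorithm 2 chooses optimal (not median) thresholds and uses candidate
heuristics i'm not reproducing here; all names are mine:

```python
import numpy as np

def bucket_sse(X):
    # sum of squared distances of rows to the bucket mean
    if len(X) == 0:
        return 0.0
    return float(((X - X.mean(axis=0)) ** 2).sum())

def learn_splits(X, n_levels=4):
    """Greedily learn one (split index, per-bucket thresholds) per level."""
    buckets = [X]
    splits = []
    for _ in range(n_levels):
        best = None
        for j in range(X.shape[1]):
            # split every current bucket at its median along feature j
            thresholds = [np.median(B[:, j]) if len(B) else 0.0 for B in buckets]
            loss = 0.0
            for B, v in zip(buckets, thresholds):
                loss += bucket_sse(B[B[:, j] < v]) + bucket_sse(B[B[:, j] >= v])
            if best is None or loss < best[0]:
                best = (loss, j, thresholds)
        _, j, thresholds = best
        splits.append((j, thresholds))
        new_buckets = []
        for B, v in zip(buckets, thresholds):
            new_buckets.append(B[B[:, j] < v])
            new_buckets.append(B[B[:, j] >= v])
        buckets = new_buckets
    return splits
```

after 4 levels the threshold lists have lengths 1, 2, 4, 8 -- one per
node per level, matching the v^t_i lookup in Algorithm 1.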

4.3
new method for optimizing pq prototypes
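as i understand 4.3, the prototype optimization amounts to ridge
regression against the training matrix: given the encoding matrix G
produced by hashing, pick prototypes P minimizing
||A~ - G P||^2 + lambda * ||P||^2. a minimal sketch under that
assumption (names are mine):

```python
import numpy as np

def optimize_prototypes(A_tilde, G, lam=1.0):
    """Ridge-regression prototype update (sketch).

    A_tilde: (n, d) training matrix.
    G: (n, k) one-hot (or concatenated one-hot) encoding of which
       prototype each training row hashed to, across all codebooks.
    lam: the regularization parameter lambda from section 4.3.
    Returns P of shape (k, d), the closed-form ridge solution
    (G^T G + lam I)^-1 G^T A_tilde.
    """
    k = G.shape[1]
    return np.linalg.solve(G.T @ G + lam * np.eye(k), G.T @ A_tilde)
```

with lam = 0 and one-hot G this reduces to per-bucket means, which is
the ordinary product-quantization prototype.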

4.4 fast 8-bit aggregation f(,)
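my reading of f(,) is that it stays within 8 bits by averaging lookup
results pairwise in a tree instead of summing them (summing C uint8
values would overflow). a rough sketch of that averaging tree,
assuming a power-of-two number of inputs; the paper's exact,
bias-corrected scheme is in its appendix:

```python
import numpy as np

def avg_aggregate_u8(vals):
    """Aggregate uint8 lookup results by rounded pairwise averaging.

    The result stays in [0, 255] and approximates sum(vals) / len(vals);
    hardware does this in-register (e.g. x86 vpavgb), here we widen to
    uint16 only so the +1 rounding step cannot overflow.
    """
    v = np.asarray(vals, dtype=np.uint16)
    while len(v) > 1:
        v = (v[0::2] + v[1::2] + 1) >> 1  # rounded average of adjacent pairs
    return int(v[0])
```

the averaging introduces a known scale factor (and a small rounding
bias), which is folded out when converting back to floats.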

4.5 theoretical guarantees
proof and further discussion in Appendix F

A~ is drawn from probability distribution D and has maximum value
sigma_A. C is the # of codebooks, and lambda > 0 is the regularization
parameter used in 4.3 .

the extraction garbled this equation (an unbalanced parenthesis and a
dangling "log2"); reconstructed, the difference constraint is

    C * sigma_A * ||b||_2 / (2*sqrt(lambda)) *
        [1/256 + (8 + sqrt(nu(C,d,del))) / sqrt(2n)]

    where nu(C,d,del) =def= C * (4*ceil(log2(d)) + 256) * log(2) - log(del)

here d is the number of columns of A, n the number of training rows,
and del the failure probability delta. the "log2" sitting after the
parenthesized expression is just the constant log 2 multiplying it.

this is an upper bound on how far the approximation error on new data
can exceed the training error, so lower values of the expression mean
a tighter accuracy guarantee. i think i'd get more return looking
through the appendix, for quality guarantees.

5 Experiments
note: post-softmax, this algorithm gave the same results as matmul.
- the algorithm is the fastest, but would be over 4x faster if
hardware lookup-accumulates were optimized as much as
multiply-accumulates.

6 Discussion, Conclusion
- theoretical guarantees incomplete, quantization errors not yet described
- tested only on cpus, not gpus. expecting different parameters would
be optimal.
- not tested a parallel implementation using threads
- thinking next exploiting weight reuse in convolutional layers to
further optimize
- accelerating an entire network is regarded as an engineering
challenge due to the boilerplate work involved
- it is unknown when it is appropriate to use this acceleration
- no work done to differentiate across the hash function at all
- mostly promising on inference rather than training
- strong reduction in electricity needs due to implementability with
multiplexers only. very strong locality of data access.
- speed increase over previous AMM methods is up to 10x
- future methods similar to this also hold promise. this kind of work
unbottlenecks many things.