4.1 new encoding g(a)
- new trainable hash functions
- balanced binary regression trees; each leaf is a hash bucket
- leaf(x): traverse the tree from the root, selecting a child at each node based on whether x_j is above a node-specific threshold v
- for simd the tree is limited to 16 leaves, and the split index j is depth-specific. vectorization details in Appendix B
- this is Algorithm 1, MaddnessHash (a small python sketch of it is at the end of these notes):

    input: vector x, split indices j^1..j^4, split thresholds v^1..v^4
    i <- 1                          // node index within the current level of the tree
    for t <- 1 to 4 do
        v <- v^t_i                  // lookup split threshold for node i at level t
        b <- x_(j^t) >= v ? 1 : 0   // above the split threshold?
        i <- 2i - 1 + b             // assign to left or right child
    end for
    return i

-------------------
i'm guessing i can learn what i need more effectively from the source code than from most of this. skipping ahead a little to see if theory is mentioned.

- the hash function parameters are constructed greedily
- the construction is Algorithm 2 (22 lines long); section 4.2 describes its parts. a rough sketch of the greedy idea is at the end of these notes
- experimentally, using more than 4 split indices gave little further reduction in loss

4.3 new method for optimizing the pq prototypes (sketch at the end of these notes)

4.4 fast 8-bit aggregation f(·,·) (sketch at the end of these notes)

4.5 theoretical guarantees
- proof and further discussion in Appendix F
- Ã is drawn from a probability distribution D and has maximum value sigma_A. C is the number of codebooks, and lambda > 0 is the regularization parameter used in 4.3
- i'm not certain exactly which error term the equation bounds, but it bounds a difference by

    C * sigma_A * ||b||_2 / (2 * sqrt(lambda)) * [ 1/256 + (8 + sqrt(nu(C, d, delta))) / sqrt(2n) ]

  where nu(C, d, delta) is defined as C * (4 * ceil(log2(d)) + 256) * log(2) - log(delta)
  (d is presumably the width of Ã and n the number of training rows)
- the trailing log2 seems to just be a multiplicative factor of log 2, converting a count of bits describing the hash functions into nats, not a logarithm applied to the preceding expression
- it reads like an upper bound on the expected error of the approximate product for a column b, so smaller values of the expression should mean a tighter guarantee. i think i'd get more return looking through the appendix for the quality guarantees

5 Experiments
- note: post-softmax, this algorithm gave the same results as exact matmul
- the algorithm is the fastest, but would be over 4x faster still if hardware lookup-accumulates were optimized as heavily as multiply-accumulates

6 Discussion, Conclusion
- theoretical guarantees are incomplete; quantization errors are not yet described
- tested only on cpus, not gpus; different parameters would probably be optimal there
- no multithreaded parallel implementation has been tested
- next idea: exploit weight reuse in convolutional layers to optimize further
- accelerating an entire network is regarded as an engineering challenge, mostly boilerplate work
- it is unknown when it is appropriate to use this acceleration
- no work has been done on differentiating through the hash function
- mostly promising for inference rather than training
- strong reduction in electricity needs, since the encoding can be implemented with multiplexers only; very strong locality of data access
- speedup over previous AMM methods is up to 10x
- future methods along these lines also hold promise; this kind of work unbottlenecks many things
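appendix to these notes: a few python sketches, all mine, to check my understanding. first, Algorithm 1 / MaddnessHash, reconstructed from the pseudocode above (0-indexed where the paper is 1-indexed; this is not the paper's vectorized simd code, and the example parameters are made up):

    import numpy as np

    def maddness_hash(x, split_idxs, split_thresholds):
        # map a vector x to one of 16 buckets (0..15)
        # split_idxs: 4 feature indices, one per tree level
        # split_thresholds: 4 arrays with 1, 2, 4, 8 thresholds (one per node at that level)
        i = 0                                        # node index within the current level
        for t in range(4):
            v = split_thresholds[t][i]               # threshold for node i at level t
            b = 1 if x[split_idxs[t]] >= v else 0    # above the split threshold?
            i = 2 * i + b                            # left/right child index at the next level
        return i

    # tiny usage example with made-up parameters
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)
    idxs = [3, 17, 8, 40]                            # hypothetical split indices
    thresholds = [rng.standard_normal(2 ** t) for t in range(4)]
    print(maddness_hash(x, idxs, thresholds))        # prints a bucket id in 0..15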
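second, my rough reading of the greedy construction in Algorithm 2, per level: try candidate split features, split every current bucket at some threshold, and keep the feature with the lowest resulting within-bucket squared error. the paper picks optimal per-bucket thresholds and shortlists candidate features with a heuristic; i just use the median and brute force here, so treat this as the shape of the idea, not the actual algorithm:

    import numpy as np

    def greedy_learn_tree(X, n_levels=4):
        # X: (N, d) training rows for this codebook's subspace
        buckets = [np.arange(len(X))]                # row indices per current bucket
        split_idxs, split_thresholds = [], []
        for _ in range(n_levels):
            best = None
            for j in range(X.shape[1]):              # candidate split feature
                loss, thresholds = 0.0, []
                for rows in buckets:
                    if len(rows) == 0:
                        thresholds.append(0.0)
                        continue
                    vals = X[rows, j]
                    v = np.median(vals)              # stand-in for the paper's optimal threshold
                    thresholds.append(float(v))
                    for mask in (vals < v, vals >= v):
                        if mask.any():
                            sub = X[rows][mask]
                            loss += ((sub - sub.mean(axis=0)) ** 2).sum()
                if best is None or loss < best[0]:
                    best = (loss, j, thresholds)
            _, j, thresholds = best
            split_idxs.append(j)
            split_thresholds.append(np.array(thresholds))
            new_buckets = []
            for rows, v in zip(buckets, thresholds):
                new_buckets.append(rows[X[rows, j] < v])   # left child
                new_buckets.append(rows[X[rows, j] >= v])  # right child
            buckets = new_buckets
        return split_idxs, split_thresholds, buckets

the returned split_idxs / split_thresholds plug directly into the maddness_hash sketch above.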
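third, a guess at the prototype optimization in 4.3. since the lambda from 4.3 reappears in the 4.5 bound, i'm assuming it's the penalty of a ridge regression that fits all prototypes at once so they reconstruct the training matrix from the bucket assignments. names and shapes here are my own:

    import numpy as np

    def optimize_prototypes(A_train, assignments, n_codebooks, n_buckets=16, lam=1.0):
        # A_train: (N, D) training matrix; assignments: (N, C) bucket ids from the hash
        N, D = A_train.shape
        G = np.zeros((N, n_codebooks * n_buckets))   # one-hot block of 16 columns per codebook
        for c in range(n_codebooks):
            G[np.arange(N), c * n_buckets + assignments[:, c]] = 1.0
        # ridge regression: prototypes P such that G @ P approximately reconstructs A_train
        P = np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ A_train)
        return P.reshape(n_codebooks, n_buckets, D)  # one D-dim prototype per (codebook, bucket)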
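last, the shape of the aggregation f(·,·) from 4.4 as i understand it: per column b of B, build one small 8-bit lookup table per codebook (quantized prototype-dot-b values), then each output entry is just C table lookups plus adds, indexed by the row's codes. the paper keeps the accumulator in 8 bits via averaging; i widen it here to keep the sketch simple, and the final scale/offset that undoes the quantization is omitted:

    import numpy as np

    def aggregate_luts(codes, luts):
        # codes: (N, C) bucket ids for each row of A
        # luts:  (C, 16) 8-bit table for one column b of B
        #        (entry [c, k] ~ quantized dot product of prototype k of codebook c with b)
        N, C = codes.shape
        acc = np.zeros(N, dtype=np.int32)            # paper keeps this in 8 bits via averaging
        for c in range(C):
            acc += luts[c, codes[:, c]].astype(np.int32)
        return acc                                   # still needs the dequantization scale/offset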