Goal: Take implementation-oriented notes on https://arxiv.org/pdf/2106.10860.pdf . My understanding of the approach: it lossily compresses the data into a smaller representation that preserves most of the significant information, combines the compressed operands in a way that involves no multiply-adds, then decompresses to produce an approximate batch matmul result. The goal is to understand the derivation and implementation well enough to daydream about implementing a convolution operation rather than matrix multiplication. [multiple people are working on convolution; i don't know whether i will find their results]

Planning to skip the abstract, then start through the intro.

1 Introduction
- Approximate Matrix Multiplication (AMM)
- assumptions: matrices are tall, dense, and held in a single address space
- A is large data, B is a linear operator such as model weights
- nonlinear preprocessing reduces the problem to table lookups
- no multiply-adds needed if B is constant
- a family of fast quantizers is used rather than a single expensive one
- 3 parts:
  1. high-speed quantization functions
  2. high-speed low-bitwidth stable summation algorithm
  3. matmul algorithm using 1 & 2, with theoretical quality guarantees

1.1 Problem Formulation
A: NxD, B: DxM, with N >> D >= M
tau: computation time bound
Choose g, h, f (and scalar constants alpha, beta) to minimize the error for the computation time bound tau:

  minimize || alpha * f(g(A), h(B)) + beta - AB ||_F  subject to computation time <= tau

[interruption]
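To make "reduces the problem to table lookups" concrete for myself, here is a product-quantization-style sketch of the g/h/f split: g(A) encodes each row of A as a few prototype indices, h(B) precomputes each prototype's dot products with B, and f sums table entries, so no multiply-adds touch A at query time. This is a simplification with random prototypes, not the paper's learned quantization functions; all names and the C/K split are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, M = 1000, 16, 4   # tall A, small B (N >> D >= M)
C, K = 4, 16            # C subspaces, K prototypes each (my choice, not the paper's)
S = D // C              # width of each subspace

A = rng.standard_normal((N, D))
B = rng.standard_normal((D, M))

# Prototypes: K randomly sampled rows of A per subspace. The paper learns
# far better quantization functions (and avoids this argmin at encode time).
protos = [A[rng.choice(N, K, replace=False), c*S:(c+1)*S] for c in range(C)]

# g(A): encode each row of A as C prototype indices (nearest prototype per subspace).
codes = np.stack([
    np.argmin(((A[:, c*S:(c+1)*S, None] - protos[c].T[None]) ** 2).sum(1), axis=1)
    for c in range(C)
], axis=1)                                                   # shape (N, C)

# h(B): lookup tables of each prototype's dot products with B's rows.
tables = [protos[c] @ B[c*S:(c+1)*S, :] for c in range(C)]   # each (K, M)

# f(g(A), h(B)): gather and sum table entries -- pure lookups and adds w.r.t. A.
approx = sum(tables[c][codes[:, c]] for c in range(C))       # (N, M)

exact = A @ B
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

The point of the sketch is the cost structure: once B is fixed, the per-row work is C table gathers and C-1 additions per output element, independent of D.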
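A note-to-self on the alpha and beta in the objective: for any fixed f, g, h they are scalar calibration constants, and the best pair is just an ordinary least-squares fit of the exact product onto the approximation. A toy sanity check, using a noisy stand-in for f(g(A), h(B)) rather than the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 200, 8, 3
A = rng.standard_normal((N, D))
B = rng.standard_normal((D, M))
exact = A @ B

# Stand-in for f(g(A), h(B)): the exact product plus noise, as if it came
# back from a lossy compress/combine/decompress pipeline (an assumption).
f_out = exact + 0.3 * rng.standard_normal((N, M))

# Minimizing || alpha * f_out + beta - AB ||_F over the two scalars is a
# 1-D least-squares fit of exact onto f_out, both flattened.
alpha, beta = np.polyfit(f_out.ravel(), exact.ravel(), 1)

err_raw = np.linalg.norm(f_out - exact)                 # alpha=1, beta=0
err_cal = np.linalg.norm(alpha * f_out + beta - exact)  # fitted scalars
```

Since alpha=1, beta=0 is one candidate line, the fitted error can never be worse than the raw one.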