[ot][crazy][spam] Notes: Matmul Acceleration

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Sat Jun 25 02:59:14 PDT 2022


Goal: Take implementation-oriented notes on
https://arxiv.org/pdf/2106.10860.pdf .

My understanding of the approach is that it lossily compresses the
data into a smaller representation that keeps the most significant
part of the information, combines the compressed operands in a way
that involves no multiply-adds, then decompresses the result to
produce an approximate batch matmul.
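To make that concrete for myself, here is a minimal numpy sketch of
that mental model. It uses plain product quantization with k-means
prototypes as a stand-in for the paper's learned hash functions (the
paper's encoders avoid multiplies even during encoding); all names
here are mine, not the paper's.

import numpy as np

def train_encoder(A_train, n_codebooks=4, n_prototypes=16, iters=10):
    # Learn per-subspace prototypes with a crude k-means. (The paper
    # instead learns cheap hash functions, so even this step would be
    # multiply-free; k-means is just the simplest stand-in.)
    N, D = A_train.shape
    splits = np.array_split(np.arange(D), n_codebooks)
    protos = []
    for idx in splits:
        sub = A_train[:, idx]
        P = sub[np.random.choice(N, n_prototypes, replace=False)].copy()
        for _ in range(iters):
            assign = ((sub[:, None, :] - P[None]) ** 2).sum(-1).argmin(1)
            for k in range(n_prototypes):
                members = sub[assign == k]
                if len(members):
                    P[k] = members.mean(0)
        protos.append((idx, P))
    return protos

def encode(A, protos):
    # g(A): the lossy compression -- each row of A becomes one small
    # integer (a prototype id) per codebook.
    cols = [((A[:, idx][:, None, :] - P[None]) ** 2).sum(-1).argmin(1)
            for idx, P in protos]
    return np.stack(cols, axis=1)

def make_tables(B, protos):
    # h(B): since B is constant, precompute every prototype-times-B
    # subproduct once, up front.
    return [P @ B[idx] for idx, P in protos]

def lut_matmul(codes, tables):
    # f(g(A), h(B)): approximate AB using only table lookups and sums;
    # no multiply-adds touch A here.
    out = np.zeros((codes.shape[0], tables[0].shape[1]))
    for c, T in enumerate(tables):
        out += T[codes[:, c]]
    return out

Used roughly like:

A = np.random.randn(10000, 64); B = np.random.randn(64, 16)
protos = train_encoder(A)
AB_hat = lut_matmul(encode(A, protos), make_tables(B, protos))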

The goal is to understand the derivation and implementation enough to
daydream about implementing a convolution operation rather than matrix
multiplication. [multiple people are working on convolution; i don't
know whether i will find their results]

Planning to skip the abstract, then start through the intro.

1 Introduction

- Approximate Matrix Multiplication (AMM)
- assumptions: tall, dense, held in a single address space
- A is large data, B is linear operator such as model weights
- nonlinear preprocessing reduces problem to table lookups
- no multiply-adds if B is constant
- a family of quantizers is used rather than a single expensive one
- 3 parts:
1. high-speed quantization functions
2. high-speed low-bitwidth stable summation algorithm (see the sketch after this list)
3. matmul algorithm using 1&2 with theoretical quality guarantees
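My hedged guess at what part 2's "low-bitwidth stable summation"
means: average uint8 lookup values pairwise so the working values
never outgrow 8 bits, then rescale at the end. The function below is
my own guess at the shape of it, not the paper's code:

import numpy as np

def average_uint8(values):
    # Reduce a power-of-two-length run of uint8 lookup-table entries
    # by repeated rounded pairwise means, halving the length each
    # step; the running values never leave [0, 255].
    assert len(values) and (len(values) & (len(values) - 1)) == 0
    v = values.astype(np.uint16)          # headroom for one addition
    while len(v) > 1:
        v = (v[0::2] + v[1::2] + 1) >> 1  # rounded mean, still <= 255
    return int(v[0])

lut = np.array([12, 250, 7, 99], dtype=np.uint8)
approx_sum = average_uint8(lut) * len(lut)  # ~ sum(lut), small rounding bias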

1.1 Problem Formulation

A: NxD
B: DxM

N >> D >= M

tau: computation time bound

Problem formulation, to minimize error epsilon for computation time
bound tau:

  || alpha * f(g(A), h(B)) + beta - A B ||_F
      < epsilon(tau) * ||A||_F * ||B||_F

where g and h preprocess A and B, f combines the preprocessed forms,
and alpha, beta are scalar corrections.
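To pin down the objective, a tiny checker for my reading of that
error term (with alpha and beta fit by least squares; the exact
normalization is my reconstruction, so treat it as approximate):

import numpy as np

def epsilon(A, B, approx):
    # ||alpha*approx + beta - AB||_F / (||A||_F * ||B||_F)
    exact = A @ B
    alpha, beta = np.polyfit(approx.ravel(), exact.ravel(), 1)
    return (np.linalg.norm(alpha * approx + beta - exact)
            / (np.linalg.norm(A) * np.linalg.norm(B)))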

[interruption]

