[ot][crazy][spam] Notes: Matmul Acceleration
Undiscussed Horrific Abuse, One Victim of Many
gmkarl at gmail.com
Sat Jun 25 02:59:14 PDT 2022
Goal: Take implementation-oriented notes on
https://arxiv.org/pdf/2106.10860.pdf .
My understanding of the approach is that it lossily compresses the
data into a smaller representation that preserves most of the
significant information, combines the compressed operands in a way
that involves no multiply-adds, then decompresses the result to
produce an approximate batch matmul output.
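The compress / lookup / decompress idea above can be sketched with a very simplified product-quantization-style toy (this is an illustration of the general shape, not the paper's actual MADDNESS encoder, which uses learned hash-based quantizers):

```python
# Toy sketch: quantize each row of A to the nearest of K prototypes,
# precompute prototypes @ B once, then "multiply" by table lookup.
# NOT the paper's algorithm -- just the compress-then-lookup shape.
import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 1000, 16, 4, 32      # K = codebook size (lossy compression)

A = rng.normal(size=(N, D))       # large data matrix
B = rng.normal(size=(D, M))       # constant linear operator

# g(A): pick prototypes (here, random rows of A) and encode rows as IDs
prototypes = A[rng.choice(N, K, replace=False)]              # (K, D)
dists = ((A[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)                                 # (N,) int codes

# h(B): precompute the lookup table of prototype @ B, once
table = prototypes @ B                                       # (K, M)

# f(g(A), h(B)): at query time, no multiplies -- just table lookups
approx = table[codes]                                        # (N, M)

exact = A @ B
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

With only K whole-row prototypes the error here is large; the paper instead quantizes many subspaces independently and sums the per-subspace lookups, which is what makes the approximation usable.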
The goal is to understand the derivation and implementation enough to
daydream about implementing a convolution operation rather than matrix
multiplication. [multiple people are working on convolution; i don't
know whether i will find their results]
Planning to skip the abstract, then start through the intro.
1 Introduction
- Approximate Matrix Multiplication (AMM)
- assumptions: tall, dense, held in a single address space
- A is large data, B is linear operator such as model weights
- nonlinear preprocessing reduces problem to table lookups
- no multiply-adds if B is constant
- a family of quantizers used rather than a single expensive one
- 3 parts:
1. high-speed quantization functions
2. high-speed low-bitwidth stable summation algorithm
3. matmul algorithm using 1&2 with theoretical quality guarantees
1.1 Problem Formulation
A: NxD
B: DxM
N >> D >= M
tau: computation time bound
Problem formulation, minimizing error for computation time bound tau:
    minimize  || alpha * f(g(A), h(B)) + beta - A B ||_F
    subject to computation time <= tau
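A quick sanity check on the objective above (my illustration, not from the paper): with g and h the identity, f an exact matmul, alpha = 1 and beta = 0, the Frobenius error is zero, so the formulation contains exact matmul as the zero-error, full-cost corner case.

```python
# Degenerate instance of the objective: identity encoders, exact f.
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 100, 8, 4          # respects the assumption N >> D >= M
A = rng.normal(size=(N, D))
B = rng.normal(size=(D, M))

g = h = lambda X: X          # identity "compression" functions
f = lambda Ga, Hb: Ga @ Hb   # exact multiply as the combiner
alpha, beta = 1.0, 0.0

err = np.linalg.norm(alpha * f(g(A), h(B)) + beta - A @ B)
# err is ~0 up to floating-point rounding
```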
[interruption]