I was looking at this wonky algorithm I found online and pursued a little on Discord. Its biggest bottleneck is a call to conv1d(). The author CUDA-accelerated it by hand; they're new to CUDA.

I've been thinking about that conv1d via this Mithral/MADDNESS stuff, dunno. In a matrix multiply, one combines columns of A with rows of B, or something like that. So the algorithm organizes a bunch of those columns and rows, and I think it just precalculates their dot products, and then does something equivalent to interpolating the precalculations to handle future products. If that's true, there's likely a basic analogy with conv1d if I can move the concepts through my mind; if it's not true, those concepts will still do some of the process work for whatever the case actually is. I'm not sure what happened to the multiplication step of the interpolation, if that's what it is, but it seems to have been sorted out (there's a rough sketch of the lookup idea at the end of this post).

With convolution of 1D data, basically a bunch of dot products get taken between the kernel and chunks of the data at shifting positions. [ 1 2 3 4 ... 49 ] convolved with a kernel with indices [ 1 2 3 ... 7 ] produces many dot products, all of length 7, along the length of the data (quick sketch below). Given the conv1d is performed inside a trained model, example data can indeed be collected from the training.

Basically, the kernel acts like a matrix with all the rows the same. I'm guessing you could reuse the Mithral/MADDNESS approach with convolution by turning the convolution kernel into a matrix and extracting all the sequential chunks of the data to process (see the im2col-style sketch below). It's a naive approach, since the extracted data is so repetitive, but it would work well enough to consider. To consider it better, one might think of how each data item is multiplied by every item in the kernel and summed with its neighbors' products.

It's interesting to me that in the research they found they could look up faster than multiply. Maybe that relates to how near the data is in memory; I've probably often worked with fragmented memory.
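Quick sketch of that "many length-7 dot products along the data" picture, using the [ 1 .. 49 ] and [ 1 .. 7 ] example above (my own toy code, not from the project I was reading):

```python
import numpy as np

data = np.arange(1, 50, dtype=np.float64)    # [ 1 2 3 ... 49 ]
kernel = np.arange(1, 8, dtype=np.float64)   # [ 1 2 3 ... 7 ]

out_len = len(data) - len(kernel) + 1        # 43 output positions
out = np.empty(out_len)
for i in range(out_len):
    # each output element is one length-7 dot product at a shifted position
    out[i] = np.dot(data[i:i + len(kernel)], kernel)

# most DL frameworks' conv1d is really cross-correlation (no kernel flip),
# so numpy's correlate is the matching reference
assert np.allclose(out, np.correlate(data, kernel, mode="valid"))
```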
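And the "turn it into a matrix" idea: pull every sequential chunk of the data into the rows of a matrix (the usual im2col trick), so the whole conv1d collapses into one matrix-vector product. The helper name here is mine, just for illustration:

```python
import numpy as np

def im2col_1d(data, k):
    """Stack every length-k window of data as a row of an (n - k + 1, k) matrix."""
    n = len(data)
    return np.stack([data[i:i + k] for i in range(n - k + 1)])

data = np.arange(1, 50, dtype=np.float64)
kernel = np.arange(1, 8, dtype=np.float64)

A = im2col_1d(data, len(kernel))   # shape (43, 7); rows overlap heavily, hence the repetitiveness
out = A @ kernel                   # all 43 dot products at once
assert np.allclose(out, np.correlate(data, kernel, mode="valid"))
```

The rows overlap almost entirely, which is why this feels naive; that redundancy is exactly what a less naive scheme would exploit.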
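As for where the multiplication step went, here's my rough read of the precompute-then-lookup idea, sketched very loosely. The kernel is fixed after training, so you can split each window into a few subspaces, pick prototypes per subspace from training-like data, precompute prototype-dot-kernel-chunk values into small tables, and at query time encode each window and add up table entries instead of multiplying. Real MADDNESS learns a cheap hashing scheme for the encoding step rather than doing a nearest-prototype search; this toy uses random prototypes, brute-force nearest-prototype encoding, and an assumed kernel length of 8 so it splits evenly, so take it only as an illustration of the lookup step:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 8          # assumed kernel length (picked so it splits evenly)
C = 4          # number of subspaces each window is split into
D = K // C     # dims per subspace
P = 16         # prototypes per subspace

kernel = rng.standard_normal(K)

# "training": collect windows like the ones the layer actually sees, then pick
# prototypes per subspace (random picks here, standing in for learned ones)
train_windows = rng.standard_normal((1000, K))
prototypes = np.stack([
    train_windows[rng.choice(1000, P, replace=False), c * D:(c + 1) * D]
    for c in range(C)
])                                            # shape (C, P, D)

# precompute one small table per subspace: prototype dotted with its kernel chunk
tables = np.einsum("cpd,cd->cp", prototypes, kernel.reshape(C, D))  # shape (C, P)

def approx_dot(window):
    """Approximate window . kernel: encode each subspace, then sum table entries."""
    total = 0.0
    for c in range(C):
        chunk = window[c * D:(c + 1) * D]
        # encode: nearest prototype in this subspace; this brute-force search still
        # multiplies, which is the part MADDNESS replaces with cheap learned hashing
        idx = np.argmin(np.sum((prototypes[c] - chunk) ** 2, axis=1))
        total += tables[c, idx]
    return total

w = rng.standard_normal(K)
print(approx_dot(w), "vs exact", np.dot(w, kernel))
```

The tables are tiny and contiguous, which I suspect is part of why lookup can beat multiply: everything hot stays close together in cache, unlike the fragmented memory I've probably often worked with.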