I'd like to do it _without_ transformer models, but since I can barely control my body, I'll plan to use them.
Here's the plan:
- we find two different speech-to-text models
- we upload them and the audio data to e.g. Google Colab
- loop (a minimal sketch follows this list):
  - forward pass with both on the data
  - the difference in their outputs is used as their loss
  - backward pass with both
the challenge is whether (a) implementation mistakes and (b) possible design errors can both be resolved before the correct transcription reaches explainxkcd.com or the GitHub repo linked in the other post.
then, if we do it in time, we see who else has done it and how much faster they were than we were.