I'm thinking about how the machine-generated output likely has highly dense regions of precisely matching audio, and about how to identify them. For example, if we ran an autocorrelation over the recording, those regions should stand out in some way (sketched in code below). Once we have similar audio, we could compare it with the output of any off-the-shelf speech2text model. That could form an association between stretches of audio and letters.

---

We could possibly take a break from pursuing the xkcd-related goal here, and try to build a graph that describes some of the content of the language model and its tokenizers (a small sketch of that is below too).

---

Would it be hard to build a compressor out of a speech2text model? The speech2text model operates on very low-density data, so its transcript is a drastic, lossy compression of the speech.

---

What could be built is an open-source app that transcribes things live. There's likely one of those already, but I haven't stumbled on it myself, and it's probably not very good.

---

The tether here is the xkcd task, and we do want some measure of success. Using the model and having it produce a correct example transcription from the new recording is of course success. What is the next success?

---

This pursuit was a big struggle. Not sure how it might continue.

[what was the reason for the goal to change? just the length of the struggle, and lots of engagement ... what was the reason to _do_ the goal? ... opportunity, maybe ...]

[so not a strong validity like other goals, unfortunately :S]

[a part was representing this goal. it's probably for AI-like shapes. we're trying to make public hyperintelligence, roughly. this goal likely helps with that.]

[okay, so it's good to pursue this in a way that shares with others the ability to do fancy stuff. that supports the addition of graphs and whatnot. it makes the approach more reusable.]
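---

To make the autocorrelation idea concrete, here's a minimal sketch: compute the full autocorrelation of the signal via FFT and flag lags with unusually high correlation as candidate repeat offsets. The file name, the mono/16 kHz assumption, and the 0.5 threshold are all illustrative guesses, not tuned values.

```python
# Sketch: use autocorrelation to surface self-similar (repeated) regions
# in a machine-generated recording. Thresholds are illustrative guesses.
import numpy as np
import soundfile as sf  # pip install soundfile

audio, sr = sf.read("generated.wav")  # hypothetical input file
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # fold to mono
audio = audio.astype(np.float32)
audio -= audio.mean()

# Linear autocorrelation via FFT (O(n log n) instead of O(n^2)),
# using zero-padding to 2n to avoid circular wrap-around.
n = len(audio)
spectrum = np.fft.rfft(audio, n=2 * n)
autocorr = np.fft.irfft(spectrum * np.conj(spectrum))[:n]
autocorr /= autocorr[0]  # normalize so lag 0 == 1.0

# Lags with unusually high correlation hint at repeated material.
# Skip lags under 0.1 s to ignore trivial short-range similarity.
min_lag = sr // 10
candidate_lags = np.flatnonzero(autocorr[min_lag:] > 0.5) + min_lag
print("candidate repeat offsets (s):", candidate_lags[:10] / sr)
```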
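---

For the audio-to-letters association, one concrete route is to run an existing speech2text model that reports per-segment timestamps, so a matched audio span found above can be paired with the text it carries. This sketch assumes openai-whisper is installed; the model size and file name are placeholders.

```python
# Sketch: transcribe with segment timestamps so audio spans can be
# mapped onto the letters they carry.
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("generated.wav")

# Each segment gives (start, end, text): enough to map an audio span
# found by autocorrelation onto its transcription.
for seg in result["segments"]:
    print(f"{seg['start']:7.2f} -> {seg['end']:7.2f}  {seg['text']}")
```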
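---

As one small, concrete version of the "graph of the tokenizer's content" idea: take a tokenizer's vocabulary as nodes and link each token to any longer token that extends it by one character, giving a prefix tree over the vocabulary. The choice of the GPT-2 tokenizer and the prefix-edge definition are assumptions for illustration; other edge definitions (merge rules, embedding similarity) would work too.

```python
# Sketch: a prefix graph over a tokenizer's vocabulary.
import networkx as nx                    # pip install networkx
from transformers import AutoTokenizer   # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = set(tokenizer.get_vocab().keys())

graph = nx.DiGraph()
graph.add_nodes_from(vocab)
for token in vocab:
    # Edge from a token to its one-character extension, if both exist.
    if token[:-1] in vocab:
        graph.add_edge(token[:-1], token)

print(graph.number_of_nodes(), "tokens,",
      graph.number_of_edges(), "prefix edges")
```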
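---

One reading of the speech2text-as-compressor question, sketched under heavy assumptions: treat the transcript itself as the (very lossy) compressed form, with any text-to-speech system acting as the decompressor. The file name is a placeholder, and the TTS half is left as a comment; only the size comparison is concrete.

```python
# Sketch of the "compressor out of a speech2text model" idea: keep only
# the transcript. What survives is the words, not the voice.
import os
import whisper  # pip install openai-whisper

def compress(wav_path: str) -> bytes:
    """Lossy 'compression': reduce the audio to its transcript."""
    model = whisper.load_model("base")
    return model.transcribe(wav_path)["text"].encode("utf-8")

compressed = compress("speech.wav")  # placeholder file name
print("audio bytes:", os.path.getsize("speech.wav"))
print("text bytes: ", len(compressed))
# Decompression would be any TTS system reading the transcript back out.
```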
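---

And a minimal sketch of the live-transcription app: record fixed-size chunks from the microphone and run each through whisper. A real streaming app would overlap windows and stitch the text; the chunk length and model size here are guesses.

```python
# Sketch: chunked live transcription from the microphone.
import sounddevice as sd  # pip install sounddevice
import whisper            # pip install openai-whisper

SAMPLE_RATE = 16000       # whisper expects 16 kHz mono float32
CHUNK_SECONDS = 5         # arbitrary chunk length

model = whisper.load_model("base")
print("listening; Ctrl-C to stop")
try:
    while True:
        chunk = sd.rec(CHUNK_SECONDS * SAMPLE_RATE,
                       samplerate=SAMPLE_RATE, channels=1,
                       dtype="float32")
        sd.wait()  # block until the chunk is fully recorded
        text = model.transcribe(chunk.flatten(), fp16=False)["text"]
        print(text.strip())
except KeyboardInterrupt:
    pass
```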