[crazy][hobby][spam] Automated Reverse Engineering

Wed Dec 29 13:15:09 PST 2021

note: the huggingface demo passes information to the model as token ids.
token ids are just indices into a vocabulary of character sequences that
occur together frequently (the tokenizer counts and decides these when it
is trained)
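
a toy sketch of that mapping, to make it concrete.  the vocabulary here is
made up, and greedy longest-match is a simplification i'm assuming for
illustration; real tokenizers (e.g. byte-pair encoding) apply learned merge
rules instead.  the names VOCAB, encode and decode are hypothetical.

```python
# hypothetical toy vocabulary: frequent character sequences -> integer id
VOCAB = {"mov": 0, "eax": 1, " ": 2, ",": 3, "e": 4, "a": 5,
         "x": 6, "m": 7, "o": 8, "v": 9, "b": 10}


def encode(text, vocab=VOCAB):
    """greedy longest-match: at each position take the longest vocab entry."""
    ids = []
    i = 0
    max_len = max(len(token) for token in vocab)
    while i < len(text):
        # try the longest candidate first, then shorter ones
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def decode(ids, vocab=VOCAB):
    """invert the vocabulary and join the pieces back together."""
    inverse = {token_id: token for token, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)
```

the ids carry no numeric meaning of their own: "mov" being 0 and "eax"
being 1 is only an index into the table.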

since the model is based on math and is going to be learning using
linear algebra, i'm wondering if it might make sense to retain the
numeric value of the inputs.  this could mean bypassing the integer
input ids.

the integer input ids are converted into high-dimensional vectors
using an 'embedding' matrix at the very start of the model: each id
selects one row.  this matrix could be hand-altered by finding the
attribute that holds it in the model, or removed/skipped entirely.  a thought.
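
a minimal sketch of both ideas in plain python, with made-up sizes.  it
shows that the lookup is the same thing as multiplying a one-hot vector by
the matrix, and one hypothetical way to skip it: put the scaled numeric
value straight into the vector.  numeric_inputs and its layout (value in
dimension 0, zeros elsewhere) are my assumption, not something the demo
does.  in huggingface transformers, many models accept an inputs_embeds
argument in place of input_ids, which would be the place to feed vectors
like these, though i haven't checked the demo's model.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 5, 3

# hypothetical embedding matrix: one row of DIM floats per token id
embedding = [[random.gauss(0, 1) for _ in range(DIM)]
             for _ in range(VOCAB_SIZE)]


def embed(ids):
    """the usual path: each integer id selects one row of the matrix."""
    return [embedding[i] for i in ids]


def embed_onehot(ids):
    """same result written as linear algebra: one-hot vector times matrix."""
    out = []
    for i in ids:
        onehot = [1.0 if j == i else 0.0 for j in range(VOCAB_SIZE)]
        out.append([sum(onehot[j] * embedding[j][d] for j in range(VOCAB_SIZE))
                    for d in range(DIM)])
    return out


def numeric_inputs(values, scale=255.0):
    """skip the table: scaled value in dimension 0, zeros elsewhere."""
    return [[v / scale] + [0.0] * (DIM - 1) for v in values]
```

with the lookup, ids 3 and 4 get unrelated rows; with numeric_inputs,
nearby values start out as nearby vectors.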

the coefficients in the embedding matrix are trained via
backpropagated gradients, and that is what determines the vectors.  i
think they end up looking roughly random, with some of their dimensions
clustering similar data near each other and such.
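
a toy sketch of what "trained via gradients" means for one row.  the loss
here (squared distance to a fixed target vector) is a stand-in i made up; in
a real model the gradient arrives from the rest of the network via
backpropagation.  sgd_step is a hypothetical name.

```python
def sgd_step(embedding, token_id, target, lr=0.1):
    """one gradient-descent step on loss = 0.5 * |row - target|^2."""
    row = embedding[token_id]
    # gradient of the loss with respect to the row is (row - target)
    grad = [r - t for r, t in zip(row, target)]
    # only the row for the token that was actually used gets updated
    embedding[token_id] = [r - lr * g for r, g in zip(row, grad)]
    return grad


table = [[0.0, 0.0], [1.0, 1.0]]
for _ in range(200):
    sgd_step(table, 0, [2.0, -2.0])
```

row 0 drifts toward the target while row 1 is never touched, since only
ids that appear in the input receive a gradient.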