[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Sat Jan 22 01:40:10 PST 2022


- their example gpu code is built around an attention() function on
line 42 that takes the query, key, and value as parameters, along
with a chunk size.

- this engages the concept of 'heads'.  a 'head' is not a chunk of
the input sequence: it's one of several parallel learned projections
of the input, each attending over its own slice of the feature
dimension.
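to make the distinction concrete, here's a minimal sketch (all shapes hypothetical, plain numpy, not their code): multi-head attention splits the *feature* dimension into heads, while the chunking below splits the *sequence* dimension.

```python
import numpy as np

# hypothetical sizes for illustration only
seq_len, d_model, num_heads = 8, 16, 4
head_dim = d_model // num_heads

x = np.random.randn(seq_len, d_model)

# heads split the feature dimension, not the sequence:
heads = x.reshape(seq_len, num_heads, head_dim)
print(heads.shape)  # (8, 4, 4): 4 heads, each seeing 4 features per position
```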

- their attention() function breaks the query into chunks of the
passed size, associates each chunk with all keys and all values, and
passes each one to _query_chunk_attention() ...
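the query-chunking structure described above can be sketched like this (a plain-softmax sketch in numpy, not their actual code: their _query_chunk_attention() also chunks the keys/values and uses a numerically stable incremental softmax, which i'm omitting here to show only the outer loop):

```python
import numpy as np

def softmax(x, axis=-1):
    # shift by the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def _query_chunk_attention(q_chunk, k, v):
    # one query chunk attends over ALL keys and ALL values
    scores = q_chunk @ k.T / np.sqrt(q_chunk.shape[-1])
    return softmax(scores) @ v

def attention(q, k, v, chunk_size):
    # split only the queries into chunks; keys/values stay whole
    out = [_query_chunk_attention(q[i:i + chunk_size], k, v)
           for i in range(0, q.shape[0], chunk_size)]
    return np.concatenate(out, axis=0)
```

chunked and unchunked results should agree, since each query row still sees every key and value; only the memory pattern changes.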


More information about the cypherpunks mailing list