22 Jan
2022
6:40 p.m.
- their example gpu code is built around an attention() function on line 42 that takes the query, key, and value as parameters, plus a chunk size
- this engages the concept of 'heads'. i _think_ a 'head' is basically a chunk of the input data already; not sure
- their attention() function breaks the query into chunks of the passed size, associates each chunk with all keys and all values, and passes each one to _query_chunk_attention() ...
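a rough sketch of how i read the chunking so far. the names attention() and _query_chunk_attention() come from their code, but the bodies below are my own guess in plain NumPy, not their GPU/JAX implementation -- just the "split queries into chunks, each chunk attends over all keys and values" idea:

```python
import numpy as np

def _query_chunk_attention(q_chunk, key, value):
    # plain scaled dot-product attention for one chunk of queries
    # against ALL keys and values (my guess at what this helper does)
    d = q_chunk.shape[-1]
    scores = q_chunk @ key.T / np.sqrt(d)          # (chunk, seq)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ value                         # (chunk, d_v)

def attention(query, key, value, chunk_size):
    # break the query into chunks of chunk_size; each chunk is paired
    # with all keys and all values, then the results are concatenated
    chunks = [
        _query_chunk_attention(query[i:i + chunk_size], key, value)
        for i in range(0, query.shape[0], chunk_size)
    ]
    return np.concatenate(chunks, axis=0)
```

since each query row's softmax only depends on that row's scores, chunking over queries should give exactly the same output as unchunked attention, just with a smaller peak score matrix in memory.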