[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Sat Jan 22 01:40:10 PST 2022


- their example gpu code is built around an attention() function on
line 42 that takes the query, key, and value as parameters, along
with a chunk size.

- this engages the concept of 'heads'.  a 'head' is not a chunk of
the input sequence: it's one of several parallel learned projections
of the input, each attending over its own slice of the feature
dimension.
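to make the distinction concrete, here's a minimal sketch (all shapes hypothetical, plain numpy, not their code): multi-head attention splits the *feature* dimension into heads, while the chunking below splits the *sequence* dimension.

```python
import numpy as np

# hypothetical sizes for illustration only
seq_len, d_model, num_heads = 8, 16, 4
head_dim = d_model // num_heads

x = np.random.randn(seq_len, d_model)

# heads split the feature dimension, not the sequence:
heads = x.reshape(seq_len, num_heads, head_dim)
print(heads.shape)  # (8, 4, 4): 4 heads, each seeing 4 features per position
```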

- their attention() function breaks the query into chunks of the
passed size, associates each chunk with all keys and all values, and
passes each one to _query_chunk_attention() ...
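the query-chunking structure described above can be sketched like this (a plain-softmax sketch in numpy, not their actual code: their _query_chunk_attention() also chunks the keys/values and uses a numerically stable incremental softmax, which i'm omitting here to show only the outer loop):

```python
import numpy as np

def softmax(x, axis=-1):
    # shift by the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def _query_chunk_attention(q_chunk, k, v):
    # one query chunk attends over ALL keys and ALL values
    scores = q_chunk @ k.T / np.sqrt(q_chunk.shape[-1])
    return softmax(scores) @ v

def attention(q, k, v, chunk_size):
    # split only the queries into chunks; keys/values stay whole
    out = [_query_chunk_attention(q[i:i + chunk_size], k, v)
           for i in range(0, q.shape[0], chunk_size)]
    return np.concatenate(out, axis=0)
```

chunked and unchunked results should agree, since each query row still sees every key and value; only the memory pattern changes.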


More information about the cypherpunks mailing list