i spent most of today exploring the concept of beam search. the reason is that i ran into difficulty using prompting for data augmentation (that is, generating more training data than you actually have). long story short, the huggingface beam search implementation only produces the top-likelihood generations if you ask for a sufficiently large number of beams, and the way it is written it loads all the beams onto the gpu in parallel, so it seems to only give you the highest-probability results if you have enough memory to score every candidate at the same time. i was confused about this for a while. i had previously implemented my own search for the highest-probability results using a priority queue, and i ended up spending some time redoing part of that work, though i never resolved all the bugs. in the end i switched to sampling rather than collecting the strictly highest-probability outputs, which seems to work well enough (and it's already built into the huggingface library).
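
here's a minimal sketch of the kind of call i was fighting with, using the standard transformers generate() api; gpt2, the prompt, and the specific beam settings are just placeholders, not what i actually ran:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# any causal lm works here; gpt2 is only a placeholder
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("the movie was", return_tensors="pt")

# beam search expands every beam in one batched forward pass,
# so memory use grows with num_beams
outputs = model.generate(
    **inputs,
    num_beams=20,
    num_return_sequences=5,
    max_new_tokens=30,
    early_stopping=True,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```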
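
for comparison, the priority-queue idea i had tried looks roughly like this. next_token_logprobs() is a made-up stand-in for a real model call and the tiny vocabulary is invented, so treat it as a sketch of the approach rather than my actual code:

```python
import heapq
import math

EOS = "<eos>"

def next_token_logprobs(prefix):
    # hypothetical stand-in: a real version would run the model on the
    # prefix and return log-probabilities over the vocabulary
    probs = {"a": 0.5, "b": 0.3, EOS: 0.2}
    return {tok: math.log(p) for tok, p in probs.items()}

def best_first_search(n_results=3, max_len=5):
    # each heap entry is (negative cumulative log-prob, partial sequence);
    # popping the smallest negative value gives the most probable prefix
    heap = [(0.0, [])]
    results = []
    while heap and len(results) < n_results:
        neg_logp, seq = heapq.heappop(heap)
        # treat sequences that end in EOS (or hit the length cap) as done
        if (seq and seq[-1] == EOS) or len(seq) >= max_len:
            results.append((-neg_logp, seq))
            continue
        for tok, lp in next_token_logprobs(seq).items():
            heapq.heappush(heap, (neg_logp - lp, seq + [tok]))
    return results

for logp, seq in best_first_search():
    print(f"{logp:.3f}", " ".join(seq))
```

since the per-token costs (-log p) are non-negative, the first complete sequences popped really are the most probable ones, which is the property i wanted and couldn't easily get out of the batched beam search.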
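
and the sampling fallback i settled on is just another mode of the same generate() api; the top_p and temperature values here are arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("the movie was", return_tensors="pt")

# nucleus sampling instead of an exact highest-probability search
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    num_return_sequences=5,
    max_new_tokens=30,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```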