spm.SentencePieceTrainer.train can take a sentence_iterator keyword
   parameter (kwarg)
   this is likely an iterator over the sentences to train on.
   it is unknown how linebreaks are handled; the sentences may need
   special treatment to easily include linebreaks.
   so some sample data could help.
   we are taxed. this is a reason to use data manually transcribed by others.
   but the real reason is that it is just the tokenizer, so the extra
   information doesn't flow through the model in a certain way.
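
a rough sketch of how the sentence_iterator kwarg might be used, to make the notes above concrete. the helper names iter_sentences and train_tokenizer are made up for illustration, and so are the vocab_size default and the choice to strip linebreaks and skip blank lines; whether stripping is actually needed is still unknown. the sentence_iterator and model_writer kwargs follow the usage shown in the sentencepiece python README.

```python
import io


def iter_sentences(lines):
    """Yield one stripped, non-empty sentence per input line.

    Hypothetical helper: stripping linebreaks is an assumption,
    not something the trainer is known to require.
    """
    for line in lines:
        sentence = line.strip()
        if sentence:
            yield sentence


def train_tokenizer(lines, vocab_size=100):
    """Train a tokenizer from an iterable of lines; return the model bytes.

    Hypothetical wrapper. vocab_size=100 is a placeholder; training can
    fail if it is too large or too small for the amount of data given.
    """
    # imported here so iter_sentences can be tried without sentencepiece
    import sentencepiece as spm

    model = io.BytesIO()
    spm.SentencePieceTrainer.train(
        sentence_iterator=iter_sentences(lines),
        model_writer=model,  # serialized model is written to this buffer
        vocab_size=vocab_size,
    )
    return model.getvalue()
```

train_tokenizer could be passed an open text file of transcribed data, since iterating a file yields its lines. the returned bytes should be loadable with spm.SentencePieceProcessor(model_proto=...), though that has not been tried here.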