spm.SentencePieceTrainer.train can take a sentence_iterator keyword argument (kwarg), which likely iterates over the sentences to train on. the sentences may need to contain linebreaks for the tokenizer to easily include linebreaks; this is unknown. so some data could help. we are taxed, which is one reason to use data manually transcribed by others. but the real reason is that this is just the tokenizer, so the extra information does not flow through the model in the way training data would.
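
a rough sketch of what feeding an iterator to the trainer might look like. the helper names `iter_sentences` and `train_tokenizer` are made up for illustration, and `vocab_size` is whatever fits the data. `sentence_iterator` and `model_writer` are the kwargs shown in the sentencepiece Python README. dropping the trailing linebreak and skipping blank lines is an assumption here, since how linebreaks should be handled is still an open question.

```python
import io
from typing import Iterable, Iterator


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield one training sentence per input line (hypothetical helper).

    Assumption: surrounding whitespace, including the trailing linebreak,
    is dropped and blank lines are skipped.
    """
    for line in lines:
        sentence = line.strip()
        if sentence:
            yield sentence


def train_tokenizer(sentences: Iterator[str], vocab_size: int) -> bytes:
    """Train a SentencePiece model from an iterator and return its bytes."""
    # Imported here so iter_sentences can be used without sentencepiece installed.
    import sentencepiece as spm

    model = io.BytesIO()
    spm.SentencePieceTrainer.train(
        sentence_iterator=sentences,  # any Python iterator of sentences
        model_writer=model,           # serialized model is written here
        vocab_size=vocab_size,
    )
    return model.getvalue()
```

usage would be something like `train_tokenizer(iter_sentences(open("transcripts.txt")), vocab_size=1000)`, where the file name and vocab size are placeholders. the corpus has to be large enough for the requested vocab size, otherwise the trainer is expected to complain.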