So, on to the detokenization approach, which is expected to reproduce the data it was built from and fail on other data, because it was trained with words missing! And I guess that's roughly fine in small cases.
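To make that failure mode concrete, here's a minimal sketch of a word-level tokenize/detokenize round trip (all names here are hypothetical illustrations, not from the original): it reproduces exactly the text its vocabulary was built from, but any word it never saw falls out as an `<unk>` placeholder.

```python
# Minimal sketch of a word-level tokenize/detokenize round trip.
# All names are hypothetical illustrations, not from the original post.

UNK = "<unk>"

def build_vocab(text: str) -> dict[str, int]:
    """Assign an id to every word seen in the source text."""
    vocab = {UNK: 0}
    for word in text.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map words to ids; anything missing from the vocab becomes <unk>."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

def detokenize(ids: list[int], vocab: dict[str, int]) -> str:
    """Map ids back to words and rejoin them with spaces."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

if __name__ == "__main__":
    train = "the cat sat on the mat"
    vocab = build_vocab(train)

    # Round-trips the data it was built from...
    assert detokenize(tokenize(train, vocab), vocab) == train

    # ...but fails on other data: unseen words come back as <unk>.
    print(detokenize(tokenize("the dog sat", vocab), vocab))
    # -> "the <unk> sat"
```

So the round trip is lossless on the original data and lossy everywhere else, which is exactly the trade-off described above.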