idea: a model could be trained to guess the source layout by
   sequentially producing filepaths and selecting areas of the source code
   to consider, like an agent.
   this is similar to language generation, except the output words/phrases
   are unordered: a set of filepaths.
   it might be interesting to try training a model on the basis that there
   are multiple correct answers rather than just one. this seems likely to
   work well, but could run into issues.
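
   rough sketch of what "multiple correct answers" could mean as a loss,
   assuming the model gives a probability per candidate filepath at each
   step. the function name, the dict-per-step input format, and the rule
   for picking which path counts as "emitted" are all made up here for
   illustration, not a known recipe: at each step, any correct filepath
   that has not been produced yet counts as right, so the loss is taken on
   the total probability mass over the remaining correct paths.

```python
import math


def set_target_loss(step_probs, correct_paths):
    """Negative log-likelihood where any not-yet-emitted correct path is right.

    step_probs: list of dicts, one per generation step, mapping a candidate
                filepath to the model's probability for it (hypothetical format).
    correct_paths: set of filepaths making up the true source layout.
    """
    remaining = set(correct_paths)
    total = 0.0
    for probs in step_probs:
        if not remaining:
            break
        # Mass the model placed on *any* still-missing correct path.
        mass = sum(probs.get(path, 0.0) for path in remaining)
        total += -math.log(max(mass, 1e-12))  # clamp to avoid log(0)
        # Assumption: treat the most likely remaining correct path as emitted.
        emitted = max(remaining, key=lambda path: probs.get(path, 0.0))
        remaining.discard(emitted)
    return total


steps = [
    {"src/main.py": 0.5, "README.md": 0.25, "tests/x.py": 0.25},
    {"README.md": 0.5, "src/main.py": 0.25, "tests/x.py": 0.25},
]
loss = set_target_loss(steps, {"src/main.py", "README.md"})
```

   in this toy case the loss is the same if the two steps are swapped, which
   is the point of treating the target as a set rather than a sequence.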