idea: a model could be trained to guess the source layout by sequentially producing filepaths and selecting areas of the source code to consider, like an agent

that's similar to language generation except the output words/phrases are unordered: a set of filepaths.

might be interesting to try training a model based on there being multiple correct answers rather than just one. it likely works great but could run into an issue.