
but now i get to step into the git xdiff code and see !!
here's the call to the xdiff implementation: 887 mustbe0(xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0, dd2.nrec, 888 kvdf, kvdb, (xp.flags & XDF_NEED_MINIMAL) != 0, 889 &xenv)); git added other implementations (patience, histogram) and patched different infrastructure on for those. but i think this is the core xdiff implementation that uses the first-class infrastructure in the xdiff structures. all the variables from xdf1 and xdf2 have been copied into dd1 and dd2: 878 dd1.nrec = xe->xdf1.nreff; 879 dd1.ha = xe->xdf1.ha; 880 dd1.rchg = xe->xdf1.rchg; 881 dd1.rindex = xe->xdf1.rindex; 882 dd2.nrec = xe->xdf2.nreff; 883 dd2.ha = xe->xdf2.ha; 884 dd2.rchg = xe->xdf2.rchg; 885 dd2.rindex = xe->xdf2.rindex; So basically they are presenting the data to xdiff as if all the preprocessed lines never existed. 878 dd1.nrec = xe->xdf1.nreff; // this is the number of lines in file 1 879 dd1.ha = xe->xdf1.ha; // this is the "hash" of every line. it's actually an index into a hash list, same thing 880 dd1.rchg = xe->xdf1.rchg; // this is the change vector -- the algorithm sets 1 to indicate the line does not match 881 dd1.rindex = xe->xdf1.rindex; // this is the vector of pointers to line data including the actual content The goal of the algorithm is to correctly fill rchg, which is its output. In this buggy case, rchg already has some 1s in it (hummm i could just set it to 0s and see if there's a difference ... well if i do that i'd better not stop my debugging session in case it doesn't help!) maybe try that in a bit unsure (gdb) s xdl_recs_cmp (dd1=0x7ffff4b08880, off1=0, lim1=2, dd2=0x7ffff4b088c0, off2=0, lim2=1, kvdf=0x50c000000290, kvdb=0x50c0000002c0, need_min=1, xenv=0x7ffff4b08840) at /home/karl3/projects/zinc/third_party/xdiff/xdiffi.c:259 259 unsigned long const *ha1 = dd1->ha, *ha2 = dd2->ha; (gdb) list 254 * (marking changed lines) is done in the two boundary reaching checks. 255 */ 256 int xdl_recs_cmp(diffdata_t *dd1, long off1, long lim1, 257 diffdata_t *dd2, long off2, long lim2, 258 long *kvdf, long *kvdb, int need_min, xdalgoenv_t *xenv) { 259 unsigned long const *ha1 = dd1->ha, *ha2 = dd2->ha; 260 261 /* 262 * Shrink the box by walking through each diagonal snake (SW and NE). 263 */ This function is supported by xdiffi.c it's kind of the focus of the xdiff library. (previously i was working with xprepare.c which sets up all those structures). i kind of don't expect it to be that long and complex, but i do expect to be highly confused by it. let's list some more lines 261 /* 262 * Shrink the box by walking through each diagonal snake (SW and NE). 263 */ (gdb) list 264 for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2]; off1++, off2++); 265 for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2 - 1]; lim1--, lim2--); 266 267 /* 268 * If one dimension is empty, then all records on the other one must 269 * be obviously changed. 270 */ 271 if (off1 == lim1) { 272 char *rchg2 = dd2->rchg; 273 long *rindex2 = dd2->rindex; (gdb) list 274 275 for (; off2 < lim2; off2++) 276 rchg2[rindex2[off2]] = 1; 277 } else if (off2 == lim2) { 278 char *rchg1 = dd1->rchg; 279 long *rindex1 = dd1->rindex; 280 281 for (; off1 < lim1; off1++) 282 rchg1[rindex1[off1]] = 1; 283 } else { (gdb) list 284 xdpsplit_t spl; 285 spl.i1 = spl.i2 = 0; 286 287 /* 288 * Divide ... 289 */ 290 if (xdl_split(ha1, off1, lim1, ha2, off2, lim2, kvdf, kvdb, 291 need_min, &spl, xenv) < 0) { 292 293 return -1; (gdb) list 294 } 295 296 /* 297 * ... et Impera. 298 */ 299 if (xdl_recs_cmp(dd1, off1, spl.i1, dd2, off2, spl.i2, 300 kvdf, kvdb, spl.min_lo, xenv) < 0 || 301 xdl_recs_cmp(dd1, spl.i1, lim1, dd2, spl.i2, lim2, 302 kvdf, kvdb, spl.min_hi, xenv) < 0) { 303 (gdb) list 304 return -1; 305 } 306 } 307 308 return 0; 309 } see? not very long or complex-looking. it looks like its recursive! I'm surprised to encounter a recursive function here, I remember when I wrote recursive functions before I knew about state stacks ... didn't learn too much new after that ... but i used to get a lot of stack overflows, surprised now to see recursive functions 261 /* 262 * Shrink the box by walking through each diagonal snake (SW and NE). 263 */ (gdb) list 264 for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2]; off1++, off2++); 265 for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2 - 1]; lim1--, lim2--); "snakes" are a term from myers first paper on LCS. i don't immediately recall what it refers to, and it's quite important for understanding the terminology of these algorithms (and this is likely that first one as it is still very popular) ... i think a snake is a set of matching lines that can be thought of as a diagonal path through a grid of potential line matchings, one file's line numbers on one axis, and one file's line numbers on another. so in 264 and 265 above, it's looking for starting and ending lines that match, and increasing off and decreasing lim to narrow the problem domain after these lines it then runs some quick checks for unchangedness, if either pair of offs and lims end up touching. we won't have that condition as the window is only partly filled. the ending lines will mismatch; the condition ha1[lim1 - 1] == ha2[lim2 - 1] will be false. i guess i'd better step into that and verify i'm right!