Re: text analysis
A good data structure is a tree: -----> 2nd letter / ------> 3rd letter / / 1st letter -------> 2nd letter -------> 3rd letter \ \ \ ------> 3rd letter -----> 2nd letter \ -----> 3rd letter Start with an array indexed by the 26 letters. Each array entry contains a count of that letter's frequence and a pointer to a subsidary array. The pointer starts out null. When you find a new digraph (like the first time you find a letter after 'g') you allocate a new structure to represent the letters following 'g'. This will again consist of 26 entries like the first one. Likewise when you find a new trigraph, you allocate a new third-level array of 26 entries. This will be a compromise between space and speed. Most digraphs don't exist so you will save considerable space over the brute force way of allocating an array of 26^n spaces to count n-graphs. However it is a bit wasteful in that it allocates a full 26 entries each time even though only a small fraction will be used. The last layer does not need to have the pointers allocated per letter since you won't be looking beyond that (e.g. if you are only interested in counting trigraphs, the third layer above can consist of just an array of counts). struct entry { int count; struct entry *next; } top; Handle letter, given array of previous letters. prev[0] is current letter, prev[1] is 1st previous letter, prev[2] is 2nd previous letter, etc. Size of prev is n entries. Letters are normalized to range 0-25. Untested code, hopefully it will give you the idea. handle (int prev[], int n) { struct entry *cur; cur = top; for (i=0; i<n; ++i) { cur[prev[i]].count += 1; if (i==n-1) break; if (cur[prev[i]].next == NULL) cur[prev[i]].next = calloc (26, sizeof(struct entry)); cur = cur[prev[i]].next; } } This actually counts the entries backwards, so that "the" is entered under top['e'-'a'+1]->next['h'-'a'+1]->next['t'-'a'+1]. You can correct for this when you print them out. Or you can store the letters in the prev array in the opposite order, putting the current letter in prev[n-1], then they will be in the correct order.
participants (1)
-
Anonymous