Re: text analysis

17 Dec 2003

      A good data structure is a tree:

                   ----->  2nd letter
                  /                    ------> 3rd letter
                 /                    /
      1st letter ------->  2nd letter -------> 3rd letter
                 \                    \
                  \                    ------> 3rd letter
                   ----->  2nd letter
                                      \
                                       -----> 3rd letter

Start with an array indexed by the 26 letters.  Each array entry contains
a count of that letter's frequence and a pointer to a subsidary array.
The pointer starts out null.

When you find a new digraph (like the first time you find a letter after
'g') you allocate a new structure to represent the letters following 'g'.
This will again consist of 26 entries like the first one.

Likewise when you find a new trigraph, you allocate a new third-level
array of 26 entries.

This will be a compromise between space and speed.  Most digraphs don't
exist so you will save considerable space over the brute force way of
allocating an array of 26^n spaces to count n-graphs.  However it is
a bit wasteful in that it allocates a full 26 entries each time even
though only a small fraction will be used.

The last layer does not need to have the pointers allocated per letter
since you won't be looking beyond that (e.g. if you are only interested
in counting trigraphs, the third layer above can consist of just an
array of counts).

struct entry {
	int count;
	struct entry *next;
} top;

Handle letter, given array of previous letters.  prev[0] is current
letter, prev[1] is 1st previous letter, prev[2] is 2nd previous letter,
etc.  Size of prev is n entries.  Letters are normalized to range 0-25.
Untested code, hopefully it will give you the idea.

handle (int prev[], int n)
{
	struct entry *cur;

	cur = top;
	for (i=0; i<n; ++i) {
		cur[prev[i]].count += 1;
		if (i==n-1)
			break;
		if (cur[prev[i]].next == NULL)
			cur[prev[i]].next = calloc (26, sizeof(struct entry));
		cur = cur[prev[i]].next;
	}
}

This actually counts the entries backwards, so that "the" is entered under
top['e'-'a'+1]->next['h'-'a'+1]->next['t'-'a'+1].  You can correct for
this when you print them out.  Or you can store the letters in the prev
array in the opposite order, putting the current letter in prev[n-1],
then they will be in the correct order.

Anonymous

tags

participants (1)