Re: text analysis
CyberPsychotic writes:
So I was playing with some cryptanalysis algorithms, and some of their steps involve finding the frequency of a character or a set of characters. I wonder what the optimal way (speed, resources, etc.) of coding this would be.

For single-character frequencies, I guess the best way is an array of 256 unsigned ints (I refer to C coding): each time I find a certain character, I just increment the array element at that character's offset. This seems very neat to me (except that some characters may never appear in the text). Anyway, when it comes to sets of 2 characters, I'd need a 65,536-element array (256^2), and so on, which seems quite unreasonable: allocating memory for elements which will probably never be found in the text... I was thinking of another solution and came up with two-way connected lists (correct term?), i.e. I have some structure like:
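A minimal sketch in C of the 256-counter scheme described above; the sample text and variable names here are illustrative, not from the original post:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "ATTACKATDAWN";   /* sample input, illustrative */
        unsigned int freq[256] = {0};        /* one counter per possible byte */
        size_t i, len = strlen(text);

        for (i = 0; i < len; i++)
            freq[(unsigned char)text[i]]++;  /* the byte value is the index */

        for (i = 0; i < 256; i++)
            if (freq[i])                     /* skip characters never seen */
                printf("%c: %u\n", (int)i, freq[i]);
        return 0;
    }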
The term is "linked list".

If you want to collect all the arbitrarily long repeats, you may have to resign yourself to a two-pass algorithm. For small cryptograms like the ones I usually deal with, it's sufficient to use an N^2 algorithm, looking at each starting point and each possible offset to see whether a string of at least k characters is duplicated at that offset.

For a longer file, off the top of my head, you might start with a set of pointers recording where each possible digraph starts: a linked list of all the starting locations for each digraph (a two-dimensional array of pointers, 64K entries in total). On your second pass you can do an N^2 search within those reduced sets to see which repeated digraphs can be extended. It will help a lot if your target ciphertext fits in memory. I make no claim that this is optimal.

Jim Gillogly
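A rough sketch of the N^2 approach for small cryptograms, assuming the whole text is in memory; K (Jim's k), the sample ciphertext, and the reporting format are arbitrary choices, and shorter sub-repeats of a long match get reported too:

    #include <stdio.h>
    #include <string.h>

    #define K 3   /* minimum repeat length worth reporting; arbitrary choice */

    /* O(N^2) scan: for each starting point and each offset, extend the
       match and report runs of K or more identical characters. */
    static void find_repeats(const char *text)
    {
        size_t n = strlen(text), i, off, run;

        for (i = 0; i < n; i++)
            for (off = 1; i + off < n; off++) {
                for (run = 0; i + off + run < n
                           && text[i + run] == text[i + off + run]; run++)
                    ;
                if (run >= K)
                    printf("%.*s at %zu and %zu\n",
                           (int)run, text + i, i, i + off);
            }
    }

    int main(void)
    {
        find_repeats("XQWERTABCDXQWERT");   /* sample ciphertext, illustrative */
        return 0;
    }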
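And one possible reading of the digraph-index idea for longer files: pass 1 builds a 64K table of linked lists of starting positions, pass 2 runs the N^2 extension only within each list. The node layout and names are assumptions, not from the original:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 3   /* minimum repeat length worth reporting; arbitrary choice */

    struct node {               /* one starting location of a digraph */
        size_t pos;
        struct node *next;
    };

    int main(void)
    {
        const char *text = "XQWERTABCDXQWERT"; /* sample ciphertext, illustrative */
        size_t n = strlen(text), i;
        static struct node *heads[256][256];   /* 64K list heads, zero-initialized */

        /* Pass 1: thread every position onto the list for its digraph. */
        for (i = 0; i + 1 < n; i++) {
            unsigned char a = text[i], b = text[i + 1];
            struct node *nd = malloc(sizeof *nd);
            if (!nd) return 1;
            nd->pos = i;
            nd->next = heads[a][b];   /* push front: lists end up in
                                         decreasing position order */
            heads[a][b] = nd;
        }

        /* Pass 2: N^2 only within each digraph's (much smaller) list;
           extend each pair of occurrences as far as the match goes. */
        for (i = 0; i < 256 * 256; i++)
            for (struct node *p = heads[i / 256][i % 256]; p; p = p->next)
                for (struct node *q = p->next; q; q = q->next) {
                    size_t run = 0;   /* p->pos > q->pos, so p bounds the scan */
                    while (p->pos + run < n
                           && text[p->pos + run] == text[q->pos + run])
                        run++;
                    if (run >= K)
                        printf("%.*s at %zu and %zu\n",
                               (int)run, text + q->pos, q->pos, p->pos);
                }
        return 0;   /* nodes are leaked at exit for brevity */
    }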