Re: text analysis
CyberPsychotic writes:
So I was playing with some cryptanalysis algorithms, and some of their steps involve finding the frequency of a character or a set of characters. I wonder what the optimal way (speed, resources, etc.) of coding this would be.

For single-character frequencies, I guess the best way is an array of 256 unsigned ints (I refer to C coding): each time I find a certain character, I just increment the array element at that character's offset. This seems very neat to me (except that some characters may never appear in the text). Anyway, when it comes to sets of 2 characters, I'd need a 65,536-element array (256^2), and so on, which seems quite unreasonable: allocating memory for elements which will probably never be found in the text... I was thinking of another solution and came up with two-way connected lists (correct term?), i.e. I have some structure like:
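A minimal sketch in C of the 256-counter scheme described above; the sample text and variable names here are illustrative, not from the original post:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "ATTACKATDAWN";   /* sample input, illustrative */
        unsigned int freq[256] = {0};        /* one counter per possible byte */
        size_t i, len = strlen(text);

        for (i = 0; i < len; i++)
            freq[(unsigned char)text[i]]++;  /* the byte value is the index */

        for (i = 0; i < 256; i++)
            if (freq[i])                     /* skip characters never seen */
                printf("%c: %u\n", (int)i, freq[i]);
        return 0;
    }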
The term is "linked list".

If you want to collect all the arbitrarily long repeats, you may have to resign yourself to a two-pass algorithm. For small cryptograms like the ones I usually deal with, it's sufficient to use an N^2 algorithm, looking at each starting point and each possible offset to see whether a string of at least k characters is duplicated at that offset.

For a longer file, off the top of my head, you might start with a set of pointers recording where each possible digraph starts: a linked list of all the starting locations for each digraph (a two-dimensional array of pointers, 64K entries in total). On your second pass you can do an N^2 search within those reduced sets to see which repeated digraphs can be extended. It will help a lot if your target ciphertext fits in memory. I make no claim that this is optimal.

Jim Gillogly
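A rough sketch of the N^2 approach for small cryptograms, assuming the whole text is in memory; K (Jim's k), the sample ciphertext, and the reporting format are arbitrary choices, and shorter sub-repeats of a long match get reported too:

    #include <stdio.h>
    #include <string.h>

    #define K 3   /* minimum repeat length worth reporting; arbitrary choice */

    /* O(N^2) scan: for each starting point and each offset, extend the
       match and report runs of K or more identical characters. */
    static void find_repeats(const char *text)
    {
        size_t n = strlen(text), i, off, run;

        for (i = 0; i < n; i++)
            for (off = 1; i + off < n; off++) {
                for (run = 0; i + off + run < n
                           && text[i + run] == text[i + off + run]; run++)
                    ;
                if (run >= K)
                    printf("%.*s at %zu and %zu\n",
                           (int)run, text + i, i, i + off);
            }
    }

    int main(void)
    {
        find_repeats("XQWERTABCDXQWERT");   /* sample ciphertext, illustrative */
        return 0;
    }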
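And one possible reading of the digraph-index idea for longer files: pass 1 builds a 64K table of linked lists of starting positions, pass 2 runs the N^2 extension only within each list. The node layout and names are assumptions, not from the original:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 3   /* minimum repeat length worth reporting; arbitrary choice */

    struct node {               /* one starting location of a digraph */
        size_t pos;
        struct node *next;
    };

    int main(void)
    {
        const char *text = "XQWERTABCDXQWERT"; /* sample ciphertext, illustrative */
        size_t n = strlen(text), i;
        static struct node *heads[256][256];   /* 64K list heads, zero-initialized */

        /* Pass 1: thread every position onto the list for its digraph. */
        for (i = 0; i + 1 < n; i++) {
            unsigned char a = text[i], b = text[i + 1];
            struct node *nd = malloc(sizeof *nd);
            if (!nd) return 1;
            nd->pos = i;
            nd->next = heads[a][b];   /* push front: lists end up in
                                         decreasing position order */
            heads[a][b] = nd;
        }

        /* Pass 2: N^2 only within each digraph's (much smaller) list;
           extend each pair of occurrences as far as the match goes. */
        for (i = 0; i < 256 * 256; i++)
            for (struct node *p = heads[i / 256][i % 256]; p; p = p->next)
                for (struct node *q = p->next; q; q = q->next) {
                    size_t run = 0;   /* p->pos > q->pos, so p bounds the scan */
                    while (p->pos + run < n
                           && text[p->pos + run] == text[q->pos + run])
                        run++;
                    if (run >= K)
                        printf("%.*s at %zu and %zu\n",
                               (int)run, text + q->pos, q->pos, p->pos);
                }
        return 0;   /* nodes are leaked at exit for brevity */
    }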