Hello folks, I have a bit of a practical question. Since I know this list is partly devoted to cryptography matters, I guess this is the right place to ask. I was playing with some cryptanalysis algorithms, and some of their steps involve finding the frequency of a character or a set of characters, so I wonder what the optimal way (speed, resources, etc.) of coding this would be.

For single-character frequencies, I guess the best way would be an array of 256 unsigned ints (I refer to C coding): each time I find a certain character, I just increment the array element at that character's offset. This seems very neat to me (except that some chars may never occur in the text). However, when it comes to sets of 2 characters, I have to get a 1024-entry table, and so on, which looks quite unreasonable to me: allocating memory for elements that will probably never be found in the text. I was thinking of another solution and came to doubly-linked lists (correct term?), i.e. a structure like:

    struct element {
        char value[ELEMENT_LENGTH];
        unsigned int frequency;
        struct element *previous;
        struct element *next;
    };

I could dynamically allocate memory for each newly found element, but this would slow the whole thing down as the list of elements grows, because I would have to look through the whole list to find out whether such an element has already been counted. So I wonder, is there another neat way to accomplish tasks like this? I would appreciate any ideas, hints, or code examples (it would be very helpful to have a look at some code performing such a task; I surfed the web but didn't find much). Thanks beforehand,

Fyodor
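In C, the single-character case can be coded about as simply as described. A minimal sketch (reading from stdin; the counter type and output format are just illustrative choices, not anything from the original post):

    #include <stdio.h>

    int main(void)
    {
        unsigned long freq[256] = {0};   /* one counter per possible byte value */
        int c, i;

        /* bump the counter indexed by each byte read */
        while ((c = getchar()) != EOF)
            freq[(unsigned char)c]++;

        /* print only the characters that actually occurred */
        for (i = 0; i < 256; i++)
            if (freq[i] != 0)
                printf("%3d: %lu\n", i, freq[i]);

        return 0;
    }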
CyberPsychotic wrote:
> [...] when it comes to sets of 2 characters, I have to get a 1024-entry table, and so on, which looks quite unreasonable to me: allocating memory for elements that will probably never be found in the text. I was thinking of another solution and came to doubly-linked lists (correct term?), i.e. a structure like:
>
>     struct element {
>         char value[ELEMENT_LENGTH];
>         unsigned int frequency;
>         struct element *previous;
>         struct element *next;
>     };
>
> I could dynamically allocate memory for each newly found element, but this would slow the whole thing down as the list of elements grows.
I think memory is currently cheap enough that you could do frequency counts of at least trigrams with a one-dimensional array.

M. K. Shen
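For reference, a one-dimensional trigram table over the full 8-bit alphabet might look roughly like the sketch below (about 64 MB of counters at 4 bytes each; with a reduced alphabet the constant NSYM, which is just an illustrative name, shrinks accordingly):

    #include <stdio.h>
    #include <stdlib.h>

    #define NSYM 256                      /* alphabet size; full 8-bit bytes here */

    int main(void)
    {
        /* 256*256*256 = 16,777,216 counters, about 64 MB at 4 bytes each */
        unsigned int *freq = calloc((size_t)NSYM * NSYM * NSYM, sizeof *freq);
        unsigned a = 0, b = 0;
        long n = 0;
        int c;

        if (freq == NULL) {
            perror("calloc");
            return 1;
        }

        while ((c = getchar()) != EOF) {
            if (++n >= 3)                 /* need three bytes before counting */
                freq[((a * NSYM) + b) * NSYM + (unsigned)c]++;
            a = b;
            b = (unsigned)c;
        }

        /* freq[(x*NSYM + y)*NSYM + z] now holds the count of trigram x,y,z */
        free(freq);
        return 0;
    }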
At 01:44 PM 8/6/98 +0100, Mok-Kong Shen wrote:
> CyberPsychotic wrote:
> > [...] when it comes to sets of 2 characters, I have to get a 1024-entry table, and so on, which looks quite unreasonable to me: allocating memory for elements that will probably never be found in the text. I was thinking of another solution and came to doubly-linked lists (correct term?), i.e. a structure like:
> >
> >     struct element { char value[ELEMENT_LENGTH]; unsigned int frequency; struct element *previous; struct element *next; };
> >
> > I could dynamically allocate memory for each newly found element, but this would slow the whole thing down as the list of elements grows.
>
> I think memory is currently cheap enough that you could do frequency counts of at least trigrams with a one-dimensional array.
RAM is cheap - any 486 computer and most 386 computers have at least 4MB. In the US, most new PCs have 16MB or more, and many have 64MB. But learning how to program efficiently is still important :-) And some computers have operating systems that make it hard to use all of the RAM. Disk is cheaper still - almost all new computers have 1GB or more, and most older desktops have at least 200MB.

You probably should use 32-bit integers to store your counters, but you need to make sure your program does the right thing if you exceed counts of 2**31.

Are you counting all kinds of text, or only words made of alphabetic characters plus spaces and punctuation, or only 7-bit ASCII? Your 1024 number is 32*32, which sounds like alphabetic characters and spaces only; if that's good enough, you could do 32*32*32*32 with 4-byte counts and still fit in 4MB. If you want all 256 possible 8-bit characters, two characters require 64K counters (256KB). Trigrams require 16 million counters if you allocate space for them all, which is 64MB, and may be expensive. But you could store 256*256 pointers in 256KB, allocate memory for 256-counter arrays only for the digrams that actually appear, and probably save lots of memory.

You could also store the counts on disk, and keep only the more frequently used arrays of counts in RAM. This was how we did things in the old days, when almost everything was too big for RAM, and it really is much slower if you're not careful :-) If you have a machine with a real operating system, you can keep your counters in virtual memory as if they were all in RAM and let the operating system decide what to page out to disk. Think about this carefully, because it can be very fast or very slow depending on how well you write your programs.

Thanks!
Bill

Bill Stewart, bill.stewart@pobox.com
PGP Fingerprint D454 E202 CBC8 40BF 3C85 B884 0ABE 4639
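The pointer-table idea might look roughly like the following for trigrams (a sketch under the assumptions above, not code from the thread; the array and function names are made up): a 256x256 table of pointers, with a 256-counter block allocated only when its leading digram first occurs.

    #include <stdio.h>
    #include <stdlib.h>

    /* block[a][b] points to 256 counters for trigrams starting with bytes a,b;
       it stays NULL (costing only a pointer) until that digram actually occurs. */
    static unsigned int *block[256][256];

    static void count_trigram(unsigned a, unsigned b, unsigned c)
    {
        if (block[a][b] == NULL) {
            block[a][b] = calloc(256, sizeof(unsigned int));
            if (block[a][b] == NULL) {
                perror("calloc");
                exit(1);
            }
        }
        block[a][b][c]++;
    }

    int main(void)
    {
        unsigned a = 0, b = 0;
        long n = 0;
        int c;

        while ((c = getchar()) != EOF) {
            if (++n >= 3)
                count_trigram(a, b, (unsigned)c);
            a = b;
            b = (unsigned)c;
        }

        /* The pointer table costs 64K pointers; each digram that occurs adds
           one 1 KB counter block, instead of 64 MB allocated up front. */
        return 0;
    }

Lookup and increment stay constant-time, and memory grows only with the number of distinct digrams that actually occur in the text.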
Yeah. From what I have played with so far, allocating the counters in memory as an N-dimensional matrix seems to be the best solution; the other approaches turned out to be pathetically slow. I generally do all my coding under Linux, so there's no problem with memory limits. Thanks again for all the help.

F.

---
Fyodor Yarochkin   tel: [996-3312] 474465   email: fygrave@usa.net
http://www.kalug.lug.net/   X-Noizz   http://infernus.online.kg
echo 'subscribe kalug' | mail majordomo@krsu.edu.kg : join Kyrgyzstani L.U.G.
- "Only Schizophrenia beats being alone." - http://www.kalug.lug.net/fygrave
participants (3)
- Bill Stewart
- CyberPsychotic
- Mok-Kong Shen