
Re: text analysis

CyberPsychotic writes:
>  So I was playing with some cryptanalysis algorithms, and some
> steps of them involve finding the frequency of a character or set of
> characters.  I wonder what the optimal way (speed, resources, etc.)
> of coding this would be.  For single-character frequencies, I guess
> the best approach is a 256-element array of unsigned ints (I refer
> to C coding), and each time I find a certain character I increment
> the array element at that character's offset.  This seems very neat
> to me (except that some characters may never appear in the text).
> Anyway, when it comes to 2-character sets, I would need a
> 65,536-element (256^2) array, and so on, which seems unreasonable:
> allocating memory for elements that will probably never be found in
> the text...  I was thinking of another solution and came up with
> two-way connected lists (correct term?), i.e., I have some structure
> like:
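
Your 256-counter array is the right instinct for single-character
counts.  In rough, untested C it might look like the sketch below
(the function name and the unsigned long counter type are my own
choices):

	#include <stddef.h>

	void count_chars(const unsigned char *text, size_t len,
	                 unsigned long counts[256])
	{
	    size_t i;

	    for (i = 0; i < 256; i++)   /* clear the table first */
	        counts[i] = 0;
	    for (i = 0; i < len; i++)
	        counts[text[i]]++;      /* one bucket per byte value */
	}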

The term is "linked list".  If you want to collect all the arbitrarily
long repeats you may have to resign yourself to a two-pass algorithm.
For small cryptograms like the ones I usually deal with, it's
sufficient to use an N^2 algorithm, looking at each starting point and
each possible offset to see whether a string of at least k characters
is duplicated at that offset.
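
In rough, untested C, that brute-force pass might look like this (K
is an arbitrary minimum length; it also reports overlapping and
interior matches, so the output is noisy):

	#include <stdio.h>

	#define K 3     /* minimum repeat length worth reporting */

	void find_repeats(const unsigned char *text, size_t n)
	{
	    size_t i, off, len;

	    for (i = 0; i + K <= n; i++)
	        for (off = 1; i + off + K <= n; off++) {
	            /* see how far the strings at i and i+off match */
	            for (len = 0; i + off + len < n
	                       && text[i + len] == text[i + off + len];
	                 len++)
	                ;
	            if (len >= K)
	                printf("%lu-character repeat at %lu and %lu\n",
	                       (unsigned long)len, (unsigned long)i,
	                       (unsigned long)(i + off));
	        }
	}
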
For a longer file, off the top of my head, you might start with a set
of pointers for where each possible digraph starts; you can use a
linked list, storing all the starting locations for each digraph (a
two-dimensional array of pointers, 64K entries in all).  On your
second pass you can do an N^2 search within those reduced sets to see
which repeated digraphs can be extended.
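
For concreteness, that first pass might be built along these lines
(the struct and the names are mine, and this is untested):

	#include <stdlib.h>

	struct pos {            /* one starting location of a digraph */
	    size_t where;
	    struct pos *next;
	};

	/* one list head per digraph: 256 * 256 = 64K pointers */
	static struct pos *heads[256][256];

	void index_digraphs(const unsigned char *text, size_t n)
	{
	    size_t i;

	    for (i = 0; i + 1 < n; i++) {
	        struct pos *p = malloc(sizeof *p);

	        if (p == NULL)
	            return;     /* out of memory; give up */
	        p->where = i;
	        /* push this location onto the digraph's list */
	        p->next = heads[text[i]][text[i + 1]];
	        heads[text[i]][text[i + 1]] = p;
	    }
	}

Any digraph whose list collects two or more locations is a candidate
for that second pass.
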
It will help a lot if your target ciphertext fits in memory.  I make
no claim that this is optimal.

	Jim Gillogly