- *To*: [email protected]
- *Subject*: BBS E-mail policy
- *From*: Eric Hughes <[email protected]>
- *Date*: Thu, 22 Oct 92 23:01:43 -0700
- *In-Reply-To*: The Omega's message of Thu, 22 Oct 92 12:23:38 -0400 <[email protected]>

Re: distinguishing between encrypted mail, plaintext mail, and line noise.

I'm really glad this question came up. I passed over it before because I was more interested in the social issue, but the technical one is important.

The basic technique is the foundation of cryptography: information theory. For this application, you can just measure the entropy; it alone should be able to distinguish between the three sources. The entropy measures how well one can statistically predict the output of a source. A random source has eight bits of entropy per byte. As randomness decreases, so does the entropy measure. (Mail me if you want references in order to learn this stuff yourself.)

Now line noise, let's say, will appear random. So its entropy should be right near the maximum, 8 bits. Text encrypted with PGP using the ASCII armor uses only 64 characters out of 256 possible, or one fourth of the total available. Since those 64 characters occur about equally often, its entropy is about log_2 64 = 6 bits per character. English text is usually around four to five bits per character, if I remember right.

To calculate the entropy, you first make a table (of size 256) of character frequencies, normalized so that each lies in the range [0,1] and they sum to 1. Call these p_i. The entropy is then (TeX here)

$ \sum_{i=0}^{255} - p_i \log_2 p_i $

where any term with p_i = 0 counts as zero. (The log base 2 gives bits instead of natural units.)

Now see if this number is in one of the following ranges:

[5.5 .. 6.5] encrypted text
[3 .. 5] regular text
[7 .. 8] line noise

This is a very simple measure. There are other measures that look for the deviation from an expected distribution, which give much more accurate distinctions. One can very easily separate languages from each other just by looking at such measures.

Note that none of these techniques ever look at the content. Nor do they look at digraph (two-letter combinations) or trigraph statistics. In fact, the content is completely destroyed by the scanning process!

Lots of this stuff is known; this is how the big boys crack codes.
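The frequency-table calculation described above can be sketched in a few lines of Python. This is only an illustration, not a tested mail filter: the names byte_entropy and classify are made up for the example, and the thresholds are assumptions. In particular, a uniform 64-character alphabet gives log_2 64 = 6 bits per character, so the sketch places armored ciphertext near 6 bits.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-frequency table, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # byte value -> number of occurrences
    # Bytes that never occur are absent from the table, so they add nothing.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def classify(data: bytes) -> str:
    """Rough three-way guess; the cut-offs below are illustrative assumptions."""
    h = byte_entropy(data)
    if h >= 7.0:
        return "line noise"
    if 5.5 <= h <= 6.5:
        return "armored ciphertext"
    if 3.0 <= h <= 5.0:
        return "plain text"
    return "unknown"


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog"
    print(byte_entropy(sample), classify(sample))
```

One caveat on sample size: a message of N bytes can never measure more than log_2 N bits per byte, so very short messages score low no matter what the source is. The measure only becomes meaningful once the sample is well over 256 bytes.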
I'm glad there arose a natural context to explain some of this stuff.

Eric

**Follow-Ups**:

- **BBS E-mail policy Now see if this number is in one of the following ranges:** *From:* [email protected] (E. Dean Tribble)
- **Re: BBS E-mail policy** *From:* Eric Hollander <[email protected]soda.berkeley.edu>

**References**:

- **BBS E-mail policy** *From:* [email protected] (The Omega)
