[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Statistics on remail message sizes
A couple of weeks ago Eric asked for statistical information on remailer
message sizes. I put in a size-counter a week ago (just piping each message
into wc >> remail2/SIZE.REMAIL) or so, and here are some results. They show
645 messages logged, a sample of what the logs look like, the average size
of a message in characters (counting the header) of about 15K, and a
histogram of message sizes rounded to the nearest 1000. Note that the
histogram is pretty irregular, possibly being affected by repeated
sending of certain messages.
jobe% wc remail2/SIZE.REMAIL
645 1935 16125 remail2/SIZE.REMAIL
jobe% tail remail2/SIZE.REMAIL
58 189 3225
16 90 850
18 121 1016
14 90 896
23 140 1350
653 803 41937
710 860 45666
710 860 45642
20 96 901
28 146 1344
jobe% awk '{sum=sum+$3} END{print sum/NR}' < remail2/SIZE.REMAIL
14794.4
jobe% < remail2/SIZE.REMAIL awk '{print int(($3+500)/1000)*1000}' | sort -n | uniq -c
229 1000
82 2000
50 3000
21 4000
3 5000
45 6000
9 7000
1 8000
1 9000
3 10000
2 11000
1 12000
2 13000
5 14000
3 16000
3 17000
2 18000
1 19000
2 21000
3 23000
1 24000
2 25000
2 26000
2 27000
1 28000
1 30000
1 31000
1 32000
39 34000
37 35000
1 37000
2 38000
2 42000
2 46000
1 48000
1 49000
1 50000
1 51000
1 55000
9 59000
69 60000
I did one other test, which was to see which message sizes were repeated
the most. The first number shows the number of lines which have messages
of exactly the second number of bytes:
jobe% < remail2/SIZE.REMAIL awk '{print }' | sort -n | uniq -c | sort -nr | sed 20q > times2
40 896
40 1350
20 5797
14 1344
11 33845
11 1242
10 892
9 33992
9 1248
8 1753
7 33975
5 1765
5 1757
5 1236
4 901
4 1749
4 1251
3 59725
3 59668
3 5945
It is clear that there is a lot of repetition, probably standard ping
messages and the like. This should give enough info to discard the highly
repeated sets from the histogram above in order to get a possibly more
representative set of numbers.
Hal