
Statistics on remail message sizes



A couple of weeks ago Eric asked for statistics on remailer message
sizes.  About a week ago I added a size counter (just piping each message
into wc >> remail2/SIZE.REMAIL), and here are some results.  Below are the
number of messages logged (645), a sample of what the log looks like, the
average message size in characters, counting the header (about 15K), and a
histogram of message sizes rounded to the nearest 1000.  Note that the
histogram is pretty irregular, possibly because certain messages were
sent repeatedly.


jobe% wc remail2/SIZE.REMAIL
     645    1935   16125 remail2/SIZE.REMAIL
jobe% tail remail2/SIZE.REMAIL
      58     189    3225
      16      90     850
      18     121    1016
      14      90     896
      23     140    1350
     653     803   41937
     710     860   45666
     710     860   45642
      20      96     901
      28     146    1344
jobe% awk '{sum=sum+$3} END{print sum/NR}' < remail2/SIZE.REMAIL
14794.4
jobe% < remail2/SIZE.REMAIL awk '{print int(($3+500)/1000)*1000}' | sort -n | uniq -c
 229 1000
  82 2000
  50 3000
  21 4000
   3 5000
  45 6000
   9 7000
   1 8000
   1 9000
   3 10000
   2 11000
   1 12000
   2 13000
   5 14000
   3 16000
   3 17000
   2 18000
   1 19000
   2 21000
   3 23000
   1 24000
   2 25000
   2 26000
   2 27000
   1 28000
   1 30000
   1 31000
   1 32000
  39 34000
  37 35000
   1 37000
   2 38000
   2 42000
   2 46000
   1 48000
   1 49000
   1 50000
   1 51000
   1 55000
   9 59000
  69 60000

I did one other test, which was to see which message sizes were repeated
the most.  On each line below, the first number is how many messages had
exactly the size in bytes given by the second number (top 20 only):


jobe% < remail2/SIZE.REMAIL awk '{print $3}' | sort -n | uniq -c | sort -nr | sed 20q > times2
  40 896
  40 1350
  20 5797
  14 1344
  11 33845
  11 1242
  10 892
   9 33992
   9 1248
   8 1753
   7 33975
   5 1765
   5 1757
   5 1236
   4 901
   4 1749
   4 1251
   3 59725
   3 59668
   3 5945

It is clear that there is a lot of repetition, probably standard ping
messages and the like.  This should give enough info to discard the highly
repeated sets from the histogram above in order to get a possibly more
representative set of numbers.
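
A minimal sketch of that filtering step, in the same awk/sort/uniq style as
the commands above.  The function name hist_without_repeats and the
default cutoff of 3 repeats are assumptions made for illustration, not
anything taken from the remailer setup.  It reads the log twice: once to
count how often each exact byte size occurs, then again to bin only the
sizes seen fewer than the cutoff number of times.

```shell
# Sketch: rebuild the size histogram after dropping every exact byte count
# that occurs MIN_REPEATS times or more (the default of 3 is an assumed value).
# Usage: hist_without_repeats FILE [MIN_REPEATS]
hist_without_repeats() {
    file=$1
    min_repeats=${2:-3}
    # First pass over FILE (NR == FNR): count each exact size, column 3 of wc.
    # Second pass: keep sizes seen fewer than min times, rounded to nearest 1000.
    awk -v min="$min_repeats" '
        NR == FNR { seen[$3]++; next }
        seen[$3] < min { print int(($3 + 500) / 1000) * 1000 }
    ' "$file" "$file" | sort -n | uniq -c
}
```

For example, "hist_without_repeats remail2/SIZE.REMAIL 5" would drop any
size logged five or more times.  Note that this removes every copy of a
repeated size, so a genuine message that happens to match a ping size in
bytes is discarded along with the pings.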

Hal