Statistics on remail message sizes
A couple of weeks ago Eric asked for statistical information on remailer message sizes. I put in a size-counter a week ago (just piping each message into wc >> remail2/SIZE.REMAIL) or so, and here are some results. They show 645 messages logged, a sample of what the logs look like, an average message size (counting the header) of about 15K characters, and a histogram of message sizes rounded to the nearest 1000. Note that the histogram is pretty irregular, possibly because certain messages were sent repeatedly.

    jobe% wc remail2/SIZE.REMAIL
         645    1935   16125 remail2/SIZE.REMAIL

    jobe% tail remail2/SIZE.REMAIL
          58     189    3225
          16      90     850
          18     121    1016
          14      90     896
          23     140    1350
         653     803   41937
         710     860   45666
         710     860   45642
          20      96     901
          28     146    1344

    jobe% awk '{sum=sum+$3} END{print sum/NR}' < remail2/SIZE.REMAIL
    14794.4

    jobe% < remail2/SIZE.REMAIL awk '{print int(($3+500)/1000)*1000}' | sort -n | uniq -c
     229 1000
      82 2000
      50 3000
      21 4000
       3 5000
      45 6000
       9 7000
       1 8000
       1 9000
       3 10000
       2 11000
       1 12000
       2 13000
       5 14000
       3 16000
       3 17000
       2 18000
       1 19000
       2 21000
       3 23000
       1 24000
       2 25000
       2 26000
       2 27000
       1 28000
       1 30000
       1 31000
       1 32000
      39 34000
      37 35000
       1 37000
       2 38000
       2 42000
       2 46000
       1 48000
       1 49000
       1 50000
       1 51000
       1 55000
       9 59000
      69 60000

I did one other test, which was to see which message sizes were repeated the most. In each line, the first number is how many messages had exactly the size in bytes given by the second number:

    jobe% < remail2/SIZE.REMAIL awk '{print $3}' | sort -n | uniq -c | sort -nr | sed 20q > times2
      40 896
      40 1350
      20 5797
      14 1344
      11 33845
      11 1242
      10 892
       9 33992
       9 1248
       8 1753
       7 33975
       5 1765
       5 1757
       5 1236
       4 901
       4 1749
       4 1251
       3 59725
       3 59668
       3 5945

It is clear that there is a lot of repetition, probably standard ping messages and the like. This should give enough info to discard the highly repeated sets from the histogram above in order to get a possibly more representative set of numbers.

Hal
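The filtering suggested at the end of Hal's message could be sketched roughly as follows. This is only an illustration and was not run on the remailer log: the function name hist_without_repeats and the default cutoff of 10 repetitions are assumptions, and the input is assumed to be in the wc format shown above (lines, words, characters).

```shell
# Hypothetical helper: histogram of message sizes (rounded to the nearest
# 1000) after dropping every exact size that occurs at least CUTOFF times.
# Usage: hist_without_repeats FILE [CUTOFF]
hist_without_repeats() {
    file=$1
    cutoff=${2:-10}    # assumed threshold, not taken from the original post
    # The file is read twice: the first pass counts each exact size,
    # the second pass prints the rounded size of the rarely seen ones.
    awk -v cutoff="$cutoff" '
        NR == FNR { seen[$3]++; next }
        seen[$3] < cutoff { print int(($3 + 500) / 1000) * 1000 }
    ' "$file" "$file" | sort -n | uniq -c
}
```

Something like `hist_without_repeats remail2/SIZE.REMAIL 10` would then print the same kind of count/size table as above, with the suspected ping messages left out. Where to set the cutoff is a judgment call.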
> A couple of weeks ago Eric asked for statistical information on remailer message sizes. I put in a size-counter a week ago [...] or so, and here are some results.

Based on Hal's numbers, I would suggest a reasonable quantization for message sizes be a short set of geometrically increasing values, namely, 1K, 4K, 16K, 64K. In retrospect, this seems like the obvious quantization, and not arithmetic progressions. Live and learn.

Eric
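To see how logged sizes would fall into such buckets, a small awk filter along these lines could be used. It is a sketch under assumptions not stated in the message: "K" is taken to mean 1024 bytes, each message is assigned to the smallest bucket that holds it, and anything larger than 64K is reported as "over". The function name geo_buckets is made up for illustration.

```shell
# Hypothetical helper: count messages per bucket (1K, 4K, 16K, 64K).
# Reads wc-style lines (lines words chars) on standard input.
geo_buckets() {
    awk '
        {
            # Assumption: 1K = 1024 bytes; pick the smallest bucket that fits.
            if      ($3 <= 1024)  b = "1K"
            else if ($3 <= 4096)  b = "4K"
            else if ($3 <= 16384) b = "16K"
            else if ($3 <= 65536) b = "64K"
            else                  b = "over"
            n[b]++
        }
        END {
            # Print buckets in increasing order, skipping empty ones.
            split("1K 4K 16K 64K over", order, " ")
            for (i = 1; i <= 5; i++)
                if (order[i] in n) print n[order[i]], order[i]
        }
    '
}
```

It would be run as `geo_buckets < remail2/SIZE.REMAIL`.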
In article <9408291623.AA29767@ah.com>, Eric Hughes <hughes@ah.com> wrote:
> Based on Hal's numbers, I would suggest a reasonable quantization for message sizes be a short set of geometrically increasing values, namely, 1K, 4K, 16K, 64K. In retrospect, this seems like the obvious quantization, and not arithmetic progressions. Live and learn.
A brief suggestion: Code the progression, not the four values. As time goes on (and lossy sendmails disappear), people are sending larger and larger messages; it's easily conceivable that people could be swapping multiMB files at some point in the not too distant future (indeed, I do occasionally send out files that are 4-5 MB large, uuencoded binaries and tar files). No point in limiting future behavior due to current usage.

--
L. Todd Masco  | "Which part of 'shall not be infringed' didn't
cactus@bb.com  |  you understand?"
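One way to code the progression rather than a fixed table is sketched below. It is an illustration only: the function name quantize_size, the 1024-byte base, and the default ratio of 4 are assumptions chosen to match the 1K, 4K, 16K, 64K series, and nothing here comes from actual remailer code.

```shell
# Hypothetical helper: print the smallest value in the progression
# BASE, BASE*RATIO, BASE*RATIO^2, ... that is >= SIZE.
# Usage: quantize_size SIZE [BASE] [RATIO]
quantize_size() {
    awk -v size="$1" -v base="${2:-1024}" -v ratio="${3:-4}" 'BEGIN {
        # Refuse parameters that would make the loop run forever.
        if (base <= 0 || ratio <= 1) exit 1
        q = base
        while (q < size) q *= ratio
        printf "%.0f\n", q
    }'
}
```

With the defaults, `quantize_size 45666` prints 65536, and a 5,000,000-byte message lands on 16777216 without any change to the code, which is the point of not hard-wiring the four values.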
participants (3)
- cactus@bb.com
- Hal
- hughes@ah.com