Attack (7) is made by an opponent who monitors all network traffic, but has no access to the insides of the remailer nodes. The defense is more subtle, however, than proposed.
(7): Look at all messages coming out of the first remailer, and follow them into their 2nd remailers; take all messages from those and follow them on, and so on. This will eventually lead to a number of destinations, one of which must have been the destination of the original message. Over a period of time, look for correlations between destinations and sources.
Let us assume that these remailers have the basic characteristics of mixes: encryption rewriting, size quantization, and message reordering. Furthermore, let us assume that the defense of using 'large' chains of 'popular' remailers is being used.
With enough mixing at each stage of the chain, the number of possible destinations will become astronomically large,
The possible number of destinations should increase exponentially with each hop. If gather-and-rearrange mixing is done, then the number is the product of the rearrangement thresholds for each remailer. If a radioactive decay model for reordering is used, then it is the expected value of the number of destinations which grows exponentially, that is, the possible number of destinations (those with non-zero probability) grows faster that the expected value of the number of destinations. They are both exponential, but one has a larger base than the other. What is more important than the reordering algorithm is that the expected number of destinations grows exponentially with the number of hops. There will be correlations, but with linear increase in cost we can get rid of them, we hope.
making correlations statistically impossible
What is the nature of the remailer path, however, for which we have an assurance that the correlations are too difficult to carry out? Or to ask a simpler question for a simpler environment where we assume all remailers are equal, just how long does the path have to be? We know that by making the paths "long enough" that we can prevent correlations from becoming significant. The question is how do we find out what is long enough?
such an attack (PROLONGED monitoring of all remailers) would be very difficult to perform, esp. with use of remailer-remailer socket connections.
The fact that it would be difficult is not the issue for the theory, but for the practice. The extremely high cost, however, could be justified for 'national security' reasons against a few targets, or to break the system completely open looking for 'tax evaders.' If our theory is good against an arbitrarily strong opponent, then the system can withstand sustained attack. If the existence of the system is seen as sufficiently threatening, for any number of different threats, we should plan for a sustained attack. We need to know what the limits of the capability are and not just guess. I've been thinking about an invariant for communications systems proof against traffic analysis which I call 'privacy diffusion'. The privacy diffusion is a probability distribution over possible recipients. One characteristic of the privacy diffusion is the expected value of the number of different recipients. This is a good first measure, but I suspect it won't be enough. The expected number of recipients is multiplicative in the diffusion per node, as described above. If different downstream nodes have different mixing thresholds, they'll need to be weighted. Since the system is multiplicative, the weighting should be by geometric mean, i.e. a downstream node with probability 1/10 should multiply by the 10th root of its own threshold. One can see that if all the downstream nodes have equal likelihood and identical thresholds, that this formula degenerates into the simple one above. In fact, there is a simple closed form expression for this value, namely, e^-E(ln p), the inverse exponential of the expected log probability. This is exactly e^H, where H is the entropy of the probability distribution. e^H is also the expected size of the search space, were we looking for encryption keys. On the other hand, this situation is unlike a key search space in that every value is not equally likely and that the priors are not independent. In a phrase, not everybody talks to everyone else, but everyone who talks talks to someone else. We can make a baseline model of a communications graph with probabilities on it. (This doesn't take into account state, e.g. conversations tend to happen alternately.) Most edges on this graph will have p=0, i.e. these two people have never communicated. Let us remove these null edges. What we are left with is a sparse graph with lots of clustering (friends of friends). In this situation, if our message could have gone to ten million people (say, 7 hops each with threshold 10), it is more likely that it went to one of twenty or fifty. Even if you don't know what the graph looks like, you'll know that it is sparse, and you'll have some idea of what the characteristic distributions are. This is exactly the equivalent of studying letter, digram, and higher order statistics for English and other natural languages. The statistics gathered as to the prior distribution will appear in the observed output unless one has some good idea of how to 'confuse and diffuse' them. I am pushing an analogy here between cracking codes and cracking traffic patterns. I am pretty sure that there are more parallels than meet the eye. The appearance of the entropy in the expected number of recipients may be only the tip of a much larger correlation. traffic cipher ======= ====== statistics of letter frequencies interconnection of the plaintext observed messages ciphertext path through key remailers mixing algorithms encryption null messages padding This whole mix system needs a lot more thought before we'll have an assurance that it will be secure against sustained attack. ------------------------------------------------------- On the lighter side, I couldn't resist this next one.
(8): Correlate messages being sent from person A with messages being received a certain time later by person B.
Defense: [...] The sender's traffic pattern is then constant and no information can be gained from it.
And for the receiver, just subscribe to cypherpunks under several different aliases. Eric