How I Learned to Stop Ph34ring NSA and Love the Base Rate Fallacy
Abstract Ladies and gentlemen, I am a Raccoon who has learned to do math. It used to be the case that on the Internet, nobody knew you were a Raccoon. But now that Privacy is dead, I've decided to come clean. I present to you this anonymously authored, non-peer reviewed communication to do with what you will. Should anyone actually cite this work in a published paper, I will ask my brethren to leave their garbage cans unmolested for the rest of their days. Introduction This post performs some basic analysis of the utility of timing correlation attacks against a moderately used anonymous network, specifically with respect to the Base Rate Fallacy[1] of Bayesian statistics. Via that same analysis, it also for the first time begins to quantify the utility that additional users bring to a low latency anonymous network in terms of resistance to timing attacks. You see, one day I was rifling through the local University dumpster looking for a free meal, and I stumbled upon a pile of discarded conference proceedings which I decided to peruse while I dined. After a while, it became apparent that many papers dealing with timing and correlation attacks completely ignore the Base Rate Fallacy and the effect of larger user bases and sample sizes on their results. Worse, while it may be possible that some papers actually DO report Bayesian Detection Rate probabilities, those that do never specify this fact, making their work impossible to differentiate from less appetizing dumpster morsels. It was enough to get this Raccoon down, and I survive contently on moldy bread and discarded hot dogs[2]! Event Notation and Grounding Most timing attack papers deal with the true positive rate and the false positive rate of detection. So let's establish some symbols and formalize these two rates. M = Two chosen entry and exit streams Match (are the same stream) ~M = Two chosen entry and exit streams Don't Match (are not the same) C = Packet/stream timing Correlation predicts a pair is matching ~C = Packet/stream timing Correlation predicts a pair is Non-matching True Positive Rate: P(C|M) False Negative Rate: P(~C|M) = 1-P(C|M) True Negative Rate: P(~C|~M) = 1-P(C|~M) False Positive Rate: P(C|~M) Bayesian Detection Rate Derivation Given a purely hypothetical network with 250,000 concurrent users generating approximately 5000 network-wide concurrent streams and a 99.9% accurate (Equal Error Rate) generic correlation detector, find the probability that we actually DO have a matching entry+exit pair given our Correlator says that we do. This is known as the Bayesian Detection Rate, and is written as P(M|C).
From Bayes: P(M|C) = P(C|M)*P(M)/P(C)
Summing out M to obtain P(C): P(C) = P(M)*P(C|M) + P(~M)*P(C|~M) Combined: P(M|C) = P(C|M)*P(M)/(P(M)*P(C|M) + P(~M)*P(C|~M)) Example 1: Fully Correlating Global Adversary This example deals with an adversary trying to correlate EVERYTHING on a 5000 concurrent stream network at all times. 5000 concurrent streams means 5000 entry connections and 5000 exit connections (as an approximation). This gives 5000^2 possible entry-exit pairings network-wide, thus: P(M) = (1/5000)^2 = 4E-8 99.9% EER generic correlation detector: P(C|M) = 0.999 P(C|~M) = 0.001 P(M|C) = 0.999*4E-8/(4E-8*0.999 + (1-4E-8)*0.001) P(M|C) = .00004 or .004%. In other words, we expect that for every 25000 times the correlator predicts a matching pair, only one of those actually is a valid match. So much for dragnet surveillance. Example 2: Single Site-targeting Global Adversary In this example, the adversary is only interested in connections to a particular site, say wikileaks.org. For this, let's say the adversary only has one exit stream at a given time to correlate to a given entry stream, giving: P(M) = 1/5000 = .0002 P(M|C) = 0.999*.0002/(.0002*0.999 + (1-.0002)*0.001) P(M|C) = .1666 = 16.66% So this means that we expect the adversary to have 6 suspect users for every user that leaks a document to wikileaks.org for a 99.9% accurate correlator. Unlike the dragnet adversary, the targeted adversary is still pretty effective, especially since they will almost certainly have additional a-priori information about the users they suspect to have done the leak. However, if their correlator drops to just 99% EER, however, P(M|C) drops to 0.0194 or 1.94%. At 90% EER, P(M|C) is 0.0018 or 0.18%. These provide with expectations of confusion sets of 52 and 556 users respectively. Example 3: Single Site-targeting "Local" Adversary To extend this into the capabilities of a local adversary (as opposed to global), insert a (c/n)^2 factor into P(M) to account for the likelihood that the adversary will see both sides of a connection, where c is the percentage of network bandwidth they control, and n is the total network capacity. Common accepted reasonable values for c/n are on the order of 0.1, though this may be much higher for IX-level yet not quite fully global adversaries[3]. Let's go with 0.3. P(M) = (1/5000)*(0.3)^2 = .000018 P(M|C) = 0.999*.000018/(.000018*0.999 + (1-.000018)*.001) P(M|C) = 0.0177 = 1.77% So this means that an adversary with control of 30% of the network (such as China, the US, or Germany) can expect to go through 57 other users before coming across a legitimate match predicted by their timing detector. At 99% EER, this number goes up to 562, and at 90% EER, all the way up to 6173. This seems a bit more infeasible, but may still be doable with enough information or many repeated observations. Conclusions So what does this all mean? Well, first of all, it underscores the importance of being absolutely clear in timing attack research about exactly what success probabilities are being reported, so we can better compare both attacks and defenses. In fact, in the opinion of this humble Raccoon, a large body of work is somewhat suspect for lack of clarity, both in this and other respects. Second, it gives us a small glimmer of hope that maybe all is not lost against IX, National, or ISP level adversaries. Especially those with only sampled or connection-level resolution, where fine-grained details on inter-packet timings is not available (as will likely be the case with EU data retention). Finally, it also quantifies that we certainly do benefit from a larger anonymity network not just in terms of nodes, but also in terms of total concurrent users doing similar things (like short-lived web traffic). This quantification strongly indicates that we should avoid splitting the network into segments if we want to gain any additional utility from growing it further, aside from simply supporting more users at some fixed level of anonymity. [1]. http://www.raid-symposium.org/raid99/PAPERS/Axelsson.pdf [2]. http://www.stinkymeat.net/ [3]. http://www.cl.cam.ac.uk/~sjm217/papers/pet07ixanalysis.pdf [4]. "If you only cite a handful of works, either you are doing something incredibly novel, or you're not nearly as novel as you thought." [5]. "Good artists imitate, great artists steal." ----- End forwarded message ----- -- Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
participants (1)
-
The23rd Raccoon