"Welcome to London!" aka, the Internet is now DEF CON Wireless
https://conspicuouschatter.wordpress.com/2016/02/03/a-technical-reading-of-t...

A technical reading of the “HIMR Data Mining Research Problem Book”
3 February 2016

Boing Boing just released a classified GCHQ document that was meant to act as the Sept 2011 guide to open research problems in Data Mining. The intended audience, the Heilbronn Institute for Mathematical Research (HIMR), is part of the University of Bristol and composed of mathematicians working half their time on classified problems with GCHQ.

First off, a quick perusal of the actual publication record of the HIMR makes for sad reading for GCHQ: it seems that very little research on data mining was actually performed in 2011-2014 despite this pitch. I guess this is what you get when you try to make pure mathematicians solve core computer science problems. However, the document presents one of the clearest explanations of GCHQ's operations and their scale at the time, as well as a very interesting list of open problems, along with salient examples.

Overall, reading this document very much resembles reading the needs of any other organization with big data, struggling to process it to get any value. The constraints under which they operate (see below), and in particular the limitation to O(n log n) storage in total and O(1) processing per edge event, are serious; but of course they apply only to un-selected traffic. So the 5000 or so Tor nodes would probably have a little more space and processing allocated to them, and so would known botnets, I presume.

Secondly, there is clear evidence that timing information is recognized as being key to correlating events and streams, and that it is being recorded and stored at increasing granularity. There is no smoking gun as of 2011 to say they casually de-anonymize Tor circuits, but the writing is on the wall for the onion routing system. GCHQ in 2011 had all the ingredients needed to trace Tor circuits. It would take extraordinary incompetence not to have refined their traffic analysis techniques in the past 5 years. The Tor project would do well not to underestimate GCHQ's capabilities on this point.

Thirdly, one should wonder why we have had to wait 3 years for such clear documents from the Snowden revelations to finally be published. If these had been published first, instead of the obscure, misleading and very uninformative slides, it would have saved a lot of time, and might even have engaged the public a bit more than bad PowerPoint did.

Some interesting points in the document, in order:

It turns out that GCHQ has a written innovation strategy, reference [I75]. If someone has a copy it would be great to see it, and also to understand where the ACE program fits in it.

GCHQ at the time relied heavily on two families of technologies: Hadoop, for bulk processing of raw collected data (6 months for meta-data, apparently), and IBM Streams (DISTILLERY) for stream, real-time processing. A lot of the discussion, and of the open problems, relates to the fact that bulk collection can only provide a limited window of visibility, so intelligence-related selection and processing has to happen within this window; hence the interest in processing streaming data (a toy sketch of this window constraint follows below).

Section 2 is probably the clearest explanation of how modern SIGINT works. I would like to congratulate the (anonymous) author, and will be setting it as key reading for my PETS class. It summarizes well what countless crappy articles on the Snowden leaks have struggled to piece together. I wish journalists had just released this first, and skipped the ugly slides.
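To make the limited-window point concrete, here is a minimal sketch (Python, with an invented record format and a made-up selector) of the split the document describes: traffic matching tasked selectors is promoted as it arrives, while bulk un-selected meta-data only survives for a fixed retention window.

    from collections import deque

    WINDOW = 180 * 24 * 3600       # hypothetical retention window, in seconds
    selectors = {"198.51.100.7"}   # hypothetical tasked selectors

    buffer = deque()               # (timestamp, record) kept only within WINDOW
    selected = []                  # records promoted for longer-term analysis

    def ingest(ts, record):
        """Process one meta-data record as it arrives off the bearer."""
        if record.get("src") in selectors or record.get("dst") in selectors:
            selected.append((ts, record))             # selected traffic is kept
        buffer.append((ts, record))
        while buffer and buffer[0][0] < ts - WINDOW:  # un-selected data expires
            buffer.popleft()

    ingest(0, {"src": "203.0.113.5", "dst": "198.51.100.7", "bytes": 1200})

Anything still sitting in the buffer when the window rolls over is gone for good, which is exactly why so many of the later research problems are about doing the analysis on the stream itself.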
The intro provides a hint at the scale of cable interception operations as of 2011. It seems that 200 “bearers” of 10 Gigabit/sec were being collected at any one time, and the document makes clear that many more sources were available to switch to depending on need. That is about 2 Terabit/sec, across 3 sites (Cheltenham, Bude, and LECKWITH).

Section 2.1.2 explains that a lot (the majority) of data is discarded very early on, by special hardware performing simple matching on internet packets. I presume this is to filter out bulk downloads (from CDNs), known sources of spam, YouTube videos, etc. The same section (2.1.2) also explains that all meta-data is pulled from the bearer, and provides an interpretation of what meta-data is. Finally (2.1.2), there is a hint at indexing databases (Query Focused Databases / QFDs) that are specialized to store meta-data, such as IP traffic flow data, for quick retrieval based on selectors (like IP addresses).

Section 2.1.3 explains the problem of “target development”, namely what happens when no good known selectors exist for a target, and it is the job of the analyst to match it through either contact chaining or modus-operandi (MO) matching (a toy sketch of contact chaining follows at the end of this section). It is a very instructive section, and it provides the technical justification underpinning a lot of the mass surveillance going on. The cute example chosen to illustrate it (page 12, end of 2.1.3): apparently GCHQ developed many of those techniques to spy on foreign delegations during the 2009 G20 meeting. Welcome to London!

Section 2.2.2 provides a glimpse of the cybersecurity doctrine and world-view at GCHQ, already in 2011. In particular, there is a vision that CESG will act as a network security service for the nation, blocking attacks at the “firewalls” and doing attribution (as if the attacks will be coming “from outside”). GCHQ would then counter-attack the hostile sources, or simply use the material they intercepted from others (4th party collection, as the euphemism goes).

Section 2.2.3 provides a glimpse of the difficulties of running implants on compromised machines: something that is openly admitted. Apparently ex-filtrating traffic and establishing command-and-control with implants is susceptible to passive SIGINT, which is both a problem and an opportunity.

Section 3 and beyond describe research challenges that are very similar to those of any other large organization or research group: the difficulty of creating labelled data sets for training machine learning models; the challenges of working on partial or streaming data; the need for succinct representations of data structures; and the problem of inferring “information flow”, namely chains of communications that are related to each other.

It seems the technique of choice when it comes to machine learning is Random Decision Forests. Good choice, I also prefer it to others. They have an in-house innovation: they weight the outputs of each decision tree (something that is sometimes called gradual learning in the open literature, I believe; a sketch follows below). Steganography detection seems to be a highlight; however, there is no explanation of whether steganography is a real problem they encountered in the field, or whether it was simply an easy dataset to generate synthetically.
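Weighting the votes of individual trees is easy to reproduce with the open-source stack. Below is a minimal sketch using scikit-learn on a synthetic dataset; the particular weighting rule (per-tree accuracy on a held-out set) is my assumption for illustration, since the problem book only says that tree outputs are weighted.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data: the real features (selectors, flow statistics,
    # and so on) are obviously not public.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

    # Hypothetical weighting rule: score each tree on held-out data and use the
    # normalised scores to weight its vote.
    w = np.array([accuracy_score(y_val, t.predict(X_val)) for t in forest.estimators_])
    w /= w.sum()

    # Weighted vote: combine per-tree class probabilities using the weights.
    probas = np.array([t.predict_proba(X_val) for t in forest.estimators_])
    pred = forest.classes_[np.average(probas, axis=0, weights=w).argmax(axis=1)]

    print("unweighted forest:", accuracy_score(y_val, forest.predict(X_val)))
    print("weighted forest  :", accuracy_score(y_val, pred))

On toy data the difference is negligible; the interesting question is whether it buys anything on the kind of weakly labelled, unbalanced traffic data they describe.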
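And going back to the target development problem of Section 2.1.3: contact chaining itself is just a bounded breadth-first walk over a contact graph. A minimal sketch with made-up identifiers, using the 2-hop default that Section 5.2.4 says analysts favour:

    from collections import defaultdict

    contacts = defaultdict(set)

    def add_contact(a, b):
        """Record a communication event between identifiers a and b."""
        contacts[a].add(b)
        contacts[b].add(a)

    def chain(seed, hops=2):
        """Return every identifier reachable from `seed` within `hops` hops."""
        frontier, seen = {seed}, {seed}
        for _ in range(hops):
            frontier = {c for node in frontier for c in contacts[node]} - seen
            seen |= frontier
        return seen - {seed}

    # Hypothetical events
    add_contact("target@example.org", "alice@example.org")
    add_contact("alice@example.org", "bob@example.org")
    add_contact("bob@example.org", "carol@example.org")

    print(chain("target@example.org", hops=2))
    # {'alice@example.org', 'bob@example.org'}; carol is 3 hops out

The whole debate in Section 5.2.4 is essentially about which value of hops keeps this set useful rather than pulling in half the planet.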
Section 4 deals with research problems of “Information Flow in Graphs”. This is the problem of associating multiple related connections together, including across types of channels, detecting botnet command-and-control nodes, and also tracing Tor connections. Tracing Tor nodes is in fact a stated problem, with a stated solution. Highlights include the simple “remit” algorithm developed by Detica (page 26, Sect. 4.2.1); PRIME TIME, which looks at chains of length 2; and finally HIDDEN OTTER, which specifically targets Tor and botnets. (Apparently an internal group codenamed ICTR-NE developed it.)

Section 4.2.2 looks at communications association through temporal correlation: one more piece of evidence that timing analysis, at a coarse scale, is on the cards for mining associations. What is cute is that the example used is how to detect all GCHQ employees: they are the ones whose phones are not active between 9am and 5pm, when they are at work (a toy sketch of this kind of test follows below). Besides these, they are interested in change / anomaly detection (4.4.1), the spread of information (such as extremist material), etc. Not that dissimilar from, say, an analysis Facebook would perform.

Section 5 poses problems relating to algorithms on streaming graph data. It provides a definition of the tolerable costs of analysis algorithms (the semi-streaming paradigm): for a graph of n vertices (nodes), they can store a little bit of information per vertex, but cannot store all edges, or even process all edges. So they have O(n log n) storage and can only do O(1) processing per event / edge (see the sketch below). That could be distilled into a set of security assumptions.

Sections 5.2.2 / 5.2.3 are an interesting discussion of relaxations of cliques, and also point out that very popular nodes (the pizza delivery line) are probably noise and should be discarded.

As of 2011 (sect. 5.2.4) it was an open problem how far contact chaining needs to go. This is set as an open problem, but the document states that analysts usually use 2 hops from targets. Note that the other plausible numbers are 3, 4, and 5, since after 6 you have probably included nearly everyone in the world. So it is not that exciting a problem, and one cannot blame the pure mathematicians for not tackling it.

Section 5.5.1 asks whether there is an approximation of the correlation matrix, to avoid storing and processing an n x n matrix. It generally seems that matching identifiers with identifiers is big business.

Section 6 poses problems relating to the processing, data mining, and analysis of “expiring” graphs, namely graphs whose edges disappear after a deadline. This is again related to the constraint that storage for bulk un-selected data is limited. Section 6.3.2 reiterates the semi-streaming model, in which only O(n log n) storage in total and O(1) processing per incoming event / edge are allowed.

Appendix A deals with models of academic engagement. I have to say it is very enlightened: it recognizes the value of openly publishing the research, after some sanitization. Nice.

Appendices B and C discuss the technical details and size of the IBM Streams and Hadoop clusters. Appendix D presents the production clusters (652 nodes, 5216 cores in total, and 32 GB of memory per node). Appendix E discusses the legalities of using intercepted data for research, and, bless them, they do try to provide some logging and a Human Rights justification (bulk authorization for research purposes).
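The 9-to-5 example from Section 4.2.2 is simple enough to spell out. A minimal sketch of that kind of temporal test, with an invented activity threshold:

    from datetime import datetime

    def quiet_during_office_hours(timestamps, threshold=0.05):
        """timestamps: datetimes of call/SMS events for one handset."""
        weekday = [t for t in timestamps if t.weekday() < 5]
        if not weekday:
            return False
        in_office = [t for t in weekday if 9 <= t.hour < 17]
        return len(in_office) / len(weekday) < threshold

    events = [datetime(2011, 9, 5, 8, 45), datetime(2011, 9, 5, 18, 10),
              datetime(2011, 9, 6, 7, 55), datetime(2011, 9, 6, 19, 30)]
    print(quiet_during_office_hours(events))  # True: no weekday activity 09:00-17:00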
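To make the semi-streaming budget of Sections 5 and 6.3.2 concrete: one pass over the edge stream, a small bounded amount of state per vertex (think O(log n)), and constant work per edge. The reservoir-sampling trick below is my own illustration of what fits inside that budget, not anything the problem book attributes to GCHQ.

    import random
    from collections import defaultdict

    K = 16                      # per-vertex sample size, think O(log n)

    seen = defaultdict(int)     # how many edge events each vertex has appeared in
    sample = defaultdict(list)  # reservoir of at most K neighbours per vertex

    def observe(u, v):
        """Process one edge event in O(1) time and O(K) space per endpoint."""
        for a, b in ((u, v), (v, u)):
            seen[a] += 1
            if len(sample[a]) < K:
                sample[a].append(b)
            else:
                j = random.randrange(seen[a])
                if j < K:
                    sample[a][j] = b   # reservoir sampling keeps a uniform sample

    for edge in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
        observe(*edge)

    print(seen["a"], sample["a"])      # exact counts plus a bounded neighbour sample

With K on the order of log n, total storage stays around O(n log n) and each edge costs O(1), which is exactly the budget the problem book sets out.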
That all reminds me of nothing so much as the design of particle detectors at places like CERN or Fermilab or ... where the front end's job is to detect as many events as possible and the back end's job is to throw away as much of the detection as possible. --dan