data mine the snowden files [was: open the snowden files]

coderman coderman@gmail.com
Tue Jul 8 12:08:19 PDT 2014


On Sat, Jul 5, 2014 at 11:29 AM, Geert Lovink <geert@desk.nl> wrote:
> ...
> the snowden files are of public interest. but only a small circle of
> people is able to access, read, analyze, interpret and publish them. and
> only a very small percentage of those files has been made available to
> the public...
>
> what can be done about this situation? are we able to find a way to
> "open" this data? and in the course of this create a model for future
> leaks?
> ..
> prior to my intervention harding had already hinted at some very obvious
> limitations of the ongoing investigation, alluding to various reasons
> why those "few lucky ones" are incapable of dealing with the
> investigation challenge in an appropriate manner: "we are not technical
> experts" or "after two hours your eyes pop out". in spite of this,
> harding seemed unprepared to reflect on the possibility of opening the
> small circle of analysts dealing with the snowden files.

an impasse of extremes, with both a full dump and a limited one off the table.

let's find a middle ground. how best to proceed?



> * last but not least: one should work out a concept/model for
> transferring those files into the public domain -- taking also into
> account the obvious problems of "security" and "government pressure".
>
> it would be great if we could start a debate about this in order to
> build a case for the future of handling big data leaks in a more
> democratic and sustainable manner.

very great indeed.  what kind of tools would make the journalists
involved more effective and productive?

1. using the leaks currently published, devise a framework for "data
mining" the leak documents, i.e., generating metadata from the data and
running various match and relevance computations across that metadata
to narrow the search and aggregate related efforts or technologies
across their compartmentalized worlds.

2. #1 requires an index of special terms, techniques, suppliers, code
names, algorithms, etc. that is used to generate the metadata for
deeper search and to tie documents to general themes of surveillance.

3. extrapolating from current leaks, also look toward recent
advancements and specific technical telltales of interest.  doping
silicon as a tailored access technique? this could refer to compromised
runs of security processors for desired targets. etc.

4. justifying technical detail specifically.  we have seen so little
technical detail at the source code / hardware design level.  how best
to justify publishing source code - explaining that the language
choice, the nature of the algorithms, and the structure of the
distributed computing upon which it runs all convey critical technical
details, important to understanding which parts of our technologies are
compromised and to guiding the fixes required to protect against such
compromises?
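to make #1 and #2 concrete, here is a minimal sketch of what such a
framework could look like: a hand-curated term index drives metadata
extraction, and a simple overlap score aggregates related documents.
the index entries, document texts, and function names below are all
illustrative placeholders of my own invention, not from any actual
index or tool:

```python
import re
from itertools import combinations

# illustrative term index (point #2): categories -> special terms,
# code names, techniques, etc.  placeholder entries, not a real index.
TERM_INDEX = {
    "programs": ["prism", "xkeyscore", "bullrun"],
    "techniques": ["tailored access", "implant", "interdiction"],
    "crypto": ["rc4", "dual_ec_drbg", "key escrow"],
}

def extract_metadata(text):
    """point #1: generate metadata -- which indexed terms a document mentions."""
    lowered = text.lower()
    return {
        category: {t for t in terms
                   if re.search(r"\b" + re.escape(t) + r"\b", lowered)}
        for category, terms in TERM_INDEX.items()
    }

def relevance(meta_a, meta_b):
    """jaccard overlap of matched terms -- a crude relevance score for
    aggregating related documents across compartmentalized sets."""
    a = set().union(*meta_a.values())
    b = set().union(*meta_b.values())
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# toy corpus standing in for a leak document set
docs = {
    "doc1": "slide deck on PRISM collection and XKEYSCORE queries",
    "doc2": "memo: XKEYSCORE selectors for tailored access implant tasking",
    "doc3": "budget line items, no indexed terms here",
}
meta = {name: extract_metadata(text) for name, text in docs.items()}
for a, b in combinations(sorted(meta), 2):
    print(a, b, round(relevance(meta[a], meta[b]), 2))
```

a real tool would need smarter matching (stemming, ocr error
tolerance, tf-idf weighting), but even this crude term-overlap score
is enough to surface that doc1 and doc2 describe related programs
while doc3 is unrelated.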



in short, it would behoove us to build tools that make the journalists
more effective, rather than bitch about not being included in the
inner circle.  (sadly, many good knowledge discovery tools are
proprietary and applied to open source intelligence.)


what types of features would you want such a leak-assistant software
to have?  what types of existing tools, if any, would provide these
capabilities?


best regards,


