data mine the snowden files [was: open the snowden files]

On Sat, Jul 5, 2014 at 11:29 AM, Geert Lovink <geert@desk.nl> wrote:
an impasse of extremes, with a full or even a limited dump off the table. let's find a middle ground. how best to proceed?
very great indeed. what kind of tools would make the journalists involved more effective and productive?

1. using the leaks currently published, devise a framework for "data mining" the leak documents - that is, generating metadata from the data and running matching and relevance queries across that metadata to narrow the search and aggregate related efforts or technologies across their compartmentalized worlds.

2. #1 requires an index of special terms, techniques, suppliers, code names, algorithms, etc. that is used to generate the metadata for deeper search and to tie documents to general themes of surveillance.

3. extrapolating from current leaks, also look toward recent advancements and specific technical tell-tales of interest. doping silicon as a tailored access technique? this could refer to compromised runs of security processors for desired targets. etc.

4. justifying technical detail specifically. we have seen so little technical detail at the source code / hardware design level. how best to justify publishing source code - explaining that the language choice, the nature of the algorithms, and the structure of the distributed computing it runs on all convey critical technical details, important for understanding which parts of our technologies are compromised and for guiding the fixes required to protect against such compromises?

in short, it would behoove us to build tools to make the journalists more effective, rather than bitch about not being included in the inner circle. (sadly, many good knowledge discovery tools are proprietary and applied to open source intelligence.)

what types of features would you want such leak-assistant software to have? what types of existing tools, if any, would provide these capabilities?

best regards,
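a minimal sketch of what points 1 and 2 above could look like in python (the term index and file names are hypothetical placeholders, not drawn from any actual corpus): match each document against a hand-built index of code names and techniques, and emit one metadata record per document.

    import json
    import re
    from pathlib import Path

    # hand-curated index of special terms (point 2); extend as research uncovers more.
    # the entries here are illustrative placeholders only.
    TERM_INDEX = {
        "codenames": ["PRISM", "XKEYSCORE", "QUANTUM"],
        "techniques": ["tailored access", "implant", "interdiction"],
    }

    def extract_metadata(text):
        # return which indexed terms appear in a document's text, grouped by category
        found = {}
        for category, terms in TERM_INDEX.items():
            hits = [t for t in terms if re.search(re.escape(t), text, re.IGNORECASE)]
            if hits:
                found[category] = hits
        return found

    def build_corpus_metadata(doc_dir="leak_docs", out_file="metadata.json"):
        # walk a directory of plain-text documents, write one metadata record per file
        records = []
        for path in sorted(Path(doc_dir).glob("*.txt")):
            records.append({"doc": path.name,
                            "terms": extract_metadata(path.read_text(errors="ignore"))})
        Path(out_file).write_text(json.dumps(records, indent=2))
        return records

records that share terms can then be joined to aggregate related programs across the compartmentalized material, which is the "relevance across the metadata" step in #1.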

One approach is to take the existing public data, make some assumptions (educated guesses), and do additional research on top of that. It's what I'm doing right now. It's also what led to the original COINTELPRO revelations. Before the follow-up research, it was a meaningless acronym. Find, extrapolate, expand. ~ Griffin -- Sent from my tracking device. Please excuse brevity and cat photos.

On Tue, Jul 8, 2014 at 1:05 PM, Griffin Boyce <griffin@cryptolab.net> wrote:
hi Griffin! this is the type of effort i was hoping to see undertaken. when you say "additional research", is this organic or structured? tool assisted or old skewl? i too have been building up some terms and technologies, but have yet to put them into any structured format with context. part of my reason for posting is to see how others are handling the vast complexity and extensive compartmentalization embodied in the leaks to date. i also would like to pursue this research anonymously, on hidden services rather than public sites or email. best regards,

On July 8, 2014 4:11:44 PM EDT, coderman <coderman@gmail.com> wrote:
hi Griffin!
this is the type of effort i was hoping to see undertaken.
Me too ^_^ eventually I realized I'd have to do it myself if I wanted more info on Topic X. I obviously don't have access to the source, but there are some clear ways to expand on the material that's been released.
when you say "additional research", is this organic or structured? tool assisted or old skewl?
Right now, the aspect I'm researching requires lots of structured research, but I fully expect to come across something unexpected (a specific sourcing pattern, perhaps). Manual desk research is the new hotness. Well... maybe not. ;) It helps that I'm really good at it, so it doesn't take as much drudgery. Once collected, some things are trimmed and cleaned up using custom tools. But data collection is all manual.
i too have been building up some terms and technologies, but have yet to put them into any structured format with context.
Nice! :D I'd love to hear more about your conclusions sometime. I started by looking at one narrow outcome of the NSA's work that I find horribly disruptive to the ecosystem around my work. Now my task is to find further proof of this activity using unclassified source material, and possibly patterns within their work in this area.
i also would like to pursue this research anonymously, on hidden services rather than public sites or email.
Indeed. Lots of excellent reasons to be light on detail in these types of public forums. ~ Griffin -- Sent from my tracking device. Please excuse brevity and cat photos.
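A "trim and clean up" pass of the kind mentioned above can be very small; a minimal sketch in Python, assuming the manually collected notes sit in a plain-text file (file names are hypothetical):

    from pathlib import Path

    def clean_notes(in_file="collected_notes.txt", out_file="collected_notes.clean.txt"):
        # collapse runs of whitespace and drop exact duplicate lines, preserving order
        seen = set()
        cleaned = []
        for line in Path(in_file).read_text(errors="ignore").splitlines():
            line = " ".join(line.split())
            if line and line not in seen:
                seen.add(line)
                cleaned.append(line)
        Path(out_file).write_text("\n".join(cleaned) + "\n")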

On Tue, Jul 8, 2014 at 4:11 PM, coderman <coderman@gmail.com> wrote:
To do any of this you will need to collect all the releases of docs and images to date, in their original format (not AP newsspeak), in one place. Then dedicate much time to normalizing them, converting them to one format, importing them into a tagged document store, etc. Yes, this could be hosted on the darknet.
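A minimal sketch of that normalize-and-import step, assuming text has already been extracted from the originals and using SQLite as a stand-in for the tagged document store (directory, file, and table names are hypothetical):

    import sqlite3
    from pathlib import Path

    def import_docs(doc_dir="normalized_docs", db_path="leaks.db"):
        # load already-normalized plain-text documents into a small sqlite store
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, source TEXT, body TEXT)")
        db.execute("CREATE TABLE IF NOT EXISTS tags (name TEXT, tag TEXT)")
        for path in sorted(Path(doc_dir).glob("**/*.txt")):
            db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?, ?)",
                       (path.name, str(path.parent), path.read_text(errors="ignore")))
        db.commit()
        return db

    def tag_doc(db, name, tag):
        # attach an analyst-assigned tag (program name, supplier, theme) to a document
        db.execute("INSERT INTO tags VALUES (?, ?)", (name, tag))
        db.commit()

A real pipeline would add full-text indexing and per-release provenance, but the shape is the same: one canonical text per document, plus freely attachable tags.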

I've been working on tools to do exactly this - to make it easier for journalists to rapidly analyze documents and combine different docs and datasets (http://transparencytoolkit.org/).

This mostly includes tools for collecting data (uploading docs and getting them into a standard format, scraping pages, pulling data from APIs), filtering through docs (search/browsing tools, entity extraction, combining and cross-referencing, keyword extraction), and visualizing info (in maps, timelines, network graphs). Where possible, I've been basing these on existing open-source software, but I also frequently build and heavily modify tools. I'd love to hear what suggestions people have for tools to make or use cases to cover.

More info -
Demo: http://demo.transparencytoolkit.org
Analysis Platform: https://github.com/TransparencyToolkit/Transparency-Toolkit
All Tools: https://github.com/transparencytoolkit
Network graph generated with TT from LinkedIn profiles mentioning NSA surveillance programs: http://transparencytoolkit.org/nsanetwork.html
Article about the above: http://america.aljazeera.com/articles/2014/5/29/nsa-contractors-linkedinprof...
Thoughts on how to use tools like this effectively: https://www.theengineroom.org/how-to-find-and-mash-online-info-for-anticorru...

On 07/08/2014 03:27 PM, grarpamp wrote:
-- M. C. McGrath
Transparency Toolkit | http://transparencytoolkit.org
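As an illustration of the cross-referencing and network-graph ideas described above (a sketch only, not Transparency Toolkit's actual code), the following links entities whenever they co-occur in the same record; the input format is an assumed JSON list of per-document entity lists:

    import json
    from collections import Counter
    from itertools import combinations
    from pathlib import Path

    def cooccurrence_edges(records_file="records.json"):
        # count how often each pair of entities appears in the same record
        records = json.loads(Path(records_file).read_text())
        edges = Counter()
        for rec in records:
            for a, b in combinations(sorted(set(rec.get("entities", []))), 2):
                edges[(a, b)] += 1
        return edges

    def write_edge_list(edges, out_file="graph_edges.tsv"):
        # emit a weighted edge list that any graph or visualization tool can ingest
        lines = ["%s\t%s\t%d" % (a, b, w) for (a, b), w in edges.most_common()]
        Path(out_file).write_text("\n".join(lines) + "\n")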

On Tue, Jul 8, 2014 at 3:27 PM, grarpamp <grarpamp@gmail.com> wrote:
indeed. i will also be hosting the complete cryptome archive on a hidden site, as it too is part of the corpus to feed into a normalization and extraction engine of great justice. i am using the various python image processing libraries to accomplish this, but any language or tool could be useful. i had hoped to distribute the cryptome archives further during the Paris hackfest; alas, unexpected events conspired otherwise. anyone who would like to host mirrors is welcome to tell me how they anticipate mirroring ~30G of data as quickly as possible. :)
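one possible shape for the python image processing step above, assuming Pillow and pytesseract are installed along with the tesseract OCR binary (directory names are placeholders):

    from pathlib import Path
    from PIL import Image       # pip install Pillow
    import pytesseract          # pip install pytesseract (needs the tesseract-ocr binary)

    def ocr_directory(img_dir="archive_scans", out_dir="extracted_text"):
        # run OCR over scanned pages so the text can feed normalization / extraction
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        for img_path in sorted(Path(img_dir).glob("*.png")):
            text = pytesseract.image_to_string(Image.open(img_path))
            (out / (img_path.stem + ".txt")).write_text(text)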

Tag the Cryptome Archive: "This is a trap, witting and unwitting. Do not use it or use at own risk. Source and this host is out to pwon and phuck you in complicity with global Internet authorities. Signed Batshit Cryptome and Host, 9 July 2014, 12:16ET." At 10:58 AM 7/9/2014, you wrote:

On Wed, Jul 9, 2014 at 12:17 PM, John Young <jya@pipeline.com> wrote:
Cryptome and JYA's curation, words, and work are important and a monument in their own right. Nuff said. As with other works in this class, I support this and other preservation, distribution, and downstream analysis efforts, and I support carrying whatever tag and preface he wishes to accompany them. Please ensure such frontmatter is attached.

On Wed, Jul 9, 2014 at 9:17 AM, John Young <jya@pipeline.com> wrote:
see attached. onion before torrent; rest TBD. also: http://cryptome.org/donations.htm best regards,
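for reference, serving such a mirror as an onion site typically comes down to a torrc stanza along these lines (the directory and local port are placeholders for whatever the actual host uses):

    # torrc excerpt: publish a local web server holding the mirror as a hidden service
    HiddenServiceDir /var/lib/tor/archive_mirror/
    HiddenServicePort 80 127.0.0.1:8080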


added example privoxy config as http_proxy to Tor, and added a sig note for Update 13. no further updates on list; contact directly if issues are encountered. best regards,
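the privoxy-as-http_proxy-to-Tor setup referenced above usually amounts to two lines in the privoxy config (ports shown are the defaults; the actual attached example may differ):

    # privoxy excerpt: accept local HTTP proxy connections, forward them to Tor's SOCKS port
    listen-address 127.0.0.1:8118
    forward-socks5t / 127.0.0.1:9050 .

clients then set http_proxy=http://127.0.0.1:8118/ and their HTTP traffic is relayed through Tor.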
participants (6)
- coderman
- edhelas
- grarpamp
- Griffin Boyce
- John Young
- M. C. McGrath