cypherpunks
December 2015: 50 participants, 131 discussions
On January 8, 2016 the US Supreme Court will consider a petition for
certiorari in the EPIC v. DHS "Standard Operating Procedure 303" FOIA
suit. SOP 303, also known as the "National Emergency Wireless
Kill-Switch," is the protocol that codifies a "shutdown and restoration
process for use by commercial and private wireless networks during
national crisis."
Moreover, the justices are expected to clarify the scope of FOIA's
Exemption 7(F), which allows the government to withhold "records or
information compiled for law enforcement purposes, but only to the
extent that the production of such law enforcement records or
information … could reasonably be expected to endanger the life or
physical safety of any individual."
The central question raised by Exemption 7(F) is whether the
government has to be able to identify with any specificity the
“individual” whose life or physical safety might be endangered by
disclosure of the requested law enforcement records. In 2008, the Second
Circuit answered that question in the affirmative in ACLU v. Dep’t of
Defense. In its February 2015 decision affirming the government’s
rejection of EPIC’s FOIA request, the D.C. Circuit held expressly to the
contrary. As I explain in the post that follows, not only is this
division of authority sufficiently important so as to justify the
Supreme Court’s intervention no matter how the Court ultimately rules,
but, in my view, the Second Circuit clearly has the better reading of
Exemption 7(F) as a matter of statutory purpose, structure, and policy.
I. The Second Circuit and the PNSDA
The story begins with FOIA requests filed in 2003 by the ACLU and
other organizations seeking records related to the treatment and death
of detainees held in US custody overseas after September 11, and records
related to the practice of “rendering” some of those detainees to
countries known to use torture. The litigation over the ACLU’s request
eventually reduced to a dispute over at least 29 (and perhaps thousands
of) photographs of detainees and detainee abuse in Afghanistan and Iraq.
In June 2006, the district court ordered the release of 21 of those
photographs (with proper redactions to alleviate privacy objections),
and the government appealed, arguing that the photographs were covered
by Exemption 7(F) insofar as their release would likely incite violence
against US personnel in Afghanistan and Iraq.
In full with links at Just Security
https://www.justsecurity.org/28485/foia-circuit-split-supreme-court-resolve/
--
RR
"You might want to ask an expert about that - I just fiddled around
with mine until it worked..."
Two new FOIA requests for DOCSIS tech @FBI, @CIA:
"Any and all "DOCSIS" technology records, including cross-references
and indirect mentions, including records outside the investigation
main file. This is to include a search of each of the following record
stores and interfaces: the Central Records System (CRS), the Automated
Case Support system ("ACS") Investigative Case Management system
("ICM"), the Automated Case Support system ("ACS") Electronic Case
File ("ECF"), and the Automated Case Support system ("ACS") Universal
Index ("UNI"). I also request a search of "ELSUR", the database
containing electronic surveillance information, for any and all
records or activities related to "DOCSIS" or "DOCSIS intercept" or
"DOCSIS access" technology. In addition, please extend the search
criteria across any external storage media, including I-Drives,
S-Drives, or related technologies used during the course of
investigation involving Cable internet data services. DITU
experimental technologies or research also within scope of this
request. Please include processing notes, even if request is denied in
part. Please identify individuals responsible for any aspect of FOIA
processing in the processing notes, along with explanation of their
involvement if not typically assigned FOIA responsibilities for the
record systems above."
- https://www.muckrock.com/foi/united-states-of-america-10/indocsis-19725/
"Any and all records, receipts, training, technology transfer
programs, research, evaluation technologies, or other materials
relevant to "DOCIS" cable communication technology. This is to include
"DOCSIS 1.0", "DOCSIS 2.0", "DOCSIS 3.0", and other relevant DOCSIS
protocols."
- https://www.muckrock.com/foi/united-states-of-america-10/indocsisxfer-19726/
On 7/12/15, coderman <coderman(a)gmail.com> wrote:
> On 7/12/15, Douglas Rankine <douglasrankine2001(a)yahoo.co.uk> wrote:
>> Are they giving reasons for the rejections?
>
> Glomar all around. see also:
>
> "What Is the Big Secret Surrounding Stingray Surveillance?"
> -
> http://www.scientificamerican.com/article/what-is-the-big-secret-surroundin…
>
> ---
>
> What Is the Big Secret Surrounding Stingray Surveillance?
>
> State and local law enforcement agencies across the U.S. are setting
> up fake cell towers to gather mobile data, but few will admit it
> By Larry Greenemeier | June 25, 2015
>
>
> Stung: Law enforcement agencies sometimes use a device called a
> stingray to simulate a cell phone tower, enabling them to gather
> international mobile subscriber identity (IMSI), location and other
> data from mobile phones connecting to them. Pictured here is an actual
> cell tower in Palatine, Ill.
>
>
> Given the amount of mobile phone traffic that cell phone towers
> transmit, it is no wonder law enforcement agencies target these
> devices as a rich source of data to aid their investigations. Standard
> procedure involves getting a court order to obtain phone records from
> a wireless carrier. When authorities cannot or do not want to go that
> route, they can set up a simulated cell phone tower—often called a
> stingray—that surreptitiously gathers information from the suspects in
> question as well as any other mobile device in the area.
>
> These simulated cell sites—which collect international mobile
> subscriber identity (IMSI), location and other data from mobile phones
> connecting to them—have become a source of controversy for a number of
> reasons. National and local law enforcement agencies closely guard
> details about the technology’s use, with much of what is known about
> stingrays revealed through court documents and other paperwork made
> public via Freedom of Information Act (FOIA) requests.
>
> One such document recently revealed that the Baltimore Police
> Department has used a cell site simulator 4,300 times since 2007 and
> signed a nondisclosure agreement with the FBI that instructed
> prosecutors to drop cases rather than reveal the department’s use of
> the stingray. Other records indicate law enforcement agencies have
> used the technology hundreds of times without a search warrant,
> instead relying on a much more generic court order known as a pen
> register and trap and trace order. Last year Harris Corp., the
> Melbourne, Fla., company that makes the majority of cell site
> simulators, went so far as to petition the Federal Communications
> Commission to block a FOIA request for user manuals for some of the
> company’s products.
>
> The secretive nature of stingray use has begun to backfire on law
> enforcement, however, with states beginning to pass laws that require
> police to obtain a warrant before they can set up a fake cell phone
> tower for surveillance. Virginia, Minnesota, Utah and Washington State
> now have laws regulating stingray use, with California and Texas
> considering similar measures. Proposed federal legislation to prevent
> the government from tracking people’s cell phone or GPS location
> without a warrant could also include stingray technology.
>
> Scientific American recently spoke with Brian Owsley, an assistant
> professor of law at the University of North Texas Dallas College of
> Law, about the legal issues and privacy implications surrounding the
> use of a stingray to indiscriminately collect mobile phone data. Given
> the invasive nature of the technology and scarcity of laws governing
> its use, Owsley, a former U.S. magistrate judge in Texas, says the
> lack of reliable information documenting the technology’s use is
> particularly troubling.
>
>
> [An edited transcript of the interview follows.]
>
> When and why did law enforcement agencies begin using international
> cell site simulators to intercept mobile phone traffic and track
> movement of mobile phone users?
>
> Initially, intelligence agencies—CIA and the like—couldn’t get local
> or national telecommunications companies in other countries to
> cooperate with U.S. surveillance operations against nationals in those
> countries. To fill that void companies like the Harris Corp. started
> creating cell site simulators for these agencies to use. Once Harris
> saturated the intelligence and military markets [with] their products,
> they turned to federal agencies operating in the U.S. So the [Drug
> Enforcement Administration], Homeland Security, FBI and others started
> having their own simulated cell sites to use for surveillance.
> Eventually this trickled down further to yet another untapped market:
> state and local law enforcement. That’s where we are today in terms of
> the proliferation of this technology.
>
>
> Under what circumstances do U.S. law enforcement agencies use cell
> site simulators and related technology?
>
> There are three examples of how law enforcement typically use
> stingrays for surveillance: First, law enforcement officials may use
> the cell site simulator with the known cell phone number of a targeted
> individual in order to determine that individual's location. For
> example, officials are searching for a fugitive and have a cell phone
> number that they believe the individual is using. They may operate a
> stingray near areas where they believe that the individual may be,
> such as a relative's home.
>
> Second, law enforcement officials may use the stingray to target a
> specific individual who is using a cell phone, but these officials do
> not know the cell phone number. They follow the targeted individual
> from a site to various other locations over a certain time period. At
> each new location, they activate the stingray and capture the cell
> phone data for all of the nearby cell phones. After they have captured
> the data at a number of sites they can analyze the data to determine
> the cell phone or cell phones used by the targeted individual. This
> approach captures the data of all nearby cell phones, including
> countless cell phones of individuals unrelated to the criminal
> investigation.
>
> Third, law enforcement officials have been known to operate stingray
> at political rallies and protests. Using the stingray at these types
> of events captures the cell phone data of everyone in attendance.
>
>
> How does law enforcement get permission to perform this type of
> surveillance?
>
> Federal law enforcement agencies typically get courts to approve use
> of something like stingray through a pen register application [a pen
> register is a device that records the numbers called from a particular
> phone line]. With that type of application, essentially the government
> says, we want this information. We think it’s going to be relevant to
> an ongoing criminal investigation. As you can imagine, that’s a pretty
> low bar for them to satisfy in the eyes of the court. Just about
> anything could fit into that description. You don’t even have to show
> that such an investigation would lead to an arrest or prosecution. Law
> enforcement is telling the court, look, we’re in the middle of this
> investigation. If we get this information, we think it might lead to
> some other important information.
>
> Different court orders have different standards for approval. The
> highest standard would be for a wiretap. A search warrant likewise has
> a much higher standard than a pen register, requiring law enforcement
> to prove probable cause before a judge will grant permission to use
> additional means of investigation. The problem that I have with a pen
> register to justify use of something like a stingray is that the
> standard for a pen register is much too low, given the invasive nature
> of a pen register. Instead, I think the use of a stingray should be
> consistent with the Fourth Amendment of the Constitution and pursuant
> to a search warrant.
>
>
> Why not explicitly state the type of technology being used and its
> specific purpose when filing for a court order?
>
> [When] law enforcement agencies seek to obtain judicial authorization
> through a pen register, they do not directly indicate that they are
> applying for authorization to use a stingray. Doing so might cause
> some courts to question whether the pen register statute [as opposed
> to some higher standard] is the appropriate basis for authorizing a
> stingray. In addition, law enforcement agencies typically have to sign
> nondisclosure agreements with Harris Corp. in order to receive the
> federal Homeland Security funding needed to purchase the technology.
> So there’s this concern, at least at the local law enforcement level,
> about revealing any information about it because that would violate
> the agreement with Harris and maybe subject them to losing the
> equipment or some other consequences.
>
>
> Why would law enforcement agencies sign a nondisclosure agreement with
> a technology company?
>
> I’m not sure whether the agreements are being driven by the FBI or by
> Harris, but these agreements seem to be getting less relevant insofar
> as [there is less] need to keep the public unaware of the existence of
> this technology. In the last three or so years there’s been a lot more
> awareness about the technology and its use. When agencies were first
> signing these agreements years ago, use of this technology wasn’t
> widely known. Now you are getting situations where criminal defense
> attorneys learn about stingray and similar technologies and the role
> they may be playing in the arrests of some of their clients. Defense
> teams are starting to ask questions and require the government to
> produce documentation such as court orders, and that’s creating the
> confrontation you’re now seeing.
>
>
> Why have law enforcement agencies kept their use of cell site
> simulators so secretive?
>
> Some of it is the cloudy legal issues surrounding the legitimate uses
> of this technology. Law enforcement agencies will also argue that the
> more information that’s available about this technology, the harder it
> is for them to use these devices to fight crime. Yet there’s a growing
> knowledge of this technology, and a serious criminal enterprise is
> already aware of it. People are already using prepaid disposable
> phones [sometimes referred to as “burner phones”] to some extent to
> defeat this technology. Sophisticated criminals are aware that there’s
> electronic surveillance out there in myriad ways, and so they’re going
> to take precautions. From a technology perspective, it’s sort of a
> cat-and-mouse game. There’s also a device that locates cell site
> simulators, something referred to as an IMSI catcher. There’s an arms
> race back and forth to get the best technology and to get the edge.
>
>
> What does it say to you about the whole process that a prosecutor or a
> law enforcement agency is willing to sacrifice a conviction in order
> to keep their methods a secret?
>
> I think it’s a very odd approach. You are throwing away some
> convictions or potential convictions for the sake of secrecy. But it’s
> even harder to understand now that knowledge of the technology is
> becoming so common. There have been documented cases in Baltimore and
> Saint Louis where stingray has supposedly been used. The use of
> stingray and related technologies is a roll of the dice in the sense
> that law enforcement is hoping that either the defense attorneys don’t
> have enough savvy or wherewithal to find out about the technology and
> ask the right questions or, even if that does happen, they’re hoping
> that the judge that they have is favorable to their approach and not
> going to order them to reveal information about its use. In the rare
> occasions when things go against them, they just dismiss it.
>
>
> You yourself denied a law enforcement application three years ago to
> use a stingray. Under what circumstances would you approve its use?
>
> I want to make clear: I don’t have a problem with stingray itself—I
> understand that this can be a valuable tool in law enforcement’s
> arsenal. My problem is that I want it to be used pursuant to a high
> standard of proof that it’s needed, and that I want the approval
> process to be more transparent. One of the reasons I’d like to see
> some more documentation of stingray applications and orders is because
> I have this suspicion—but there’s no way of confirming it one way or
> another—that some judges are signing approvals to use this technology
> thinking that they’re just signing a pen register. If a judge thinks
> it’s [just] another pen register application, they’re just going to
> sign it without giving it much pause.
>
>
> Now that the use of stingrays and related technologies has been
> made public, where will this issue be a year or a few years from now?
>
> A year from now I think we’re in the same position. You’re dealing
> with outdated statutes concerning new and very different technology.
> It’s possible in five years maybe that Congress will step in and do
> something. More likely, state legislatures will take most of the
> action to monitor this type of surveillance. Washington State,
> California [and others] have already acted, and Texas is evaluating
> the standards for approving stingray use.
>
Possible crypto backdoor in RFC-2631 Diffie-Hellman Key Agreement Method
by Georgi Guninski 18 Feb '16
I am n00b at crypto so this might not make any sense.
In DH, if one can select the group parameters (p, q, g), he can break both
parties' private keys very fast IMHO.
The RFC: https://tools.ietf.org/html/rfc2631
The main problem appears:
https://tools.ietf.org/html/rfc2631#section-2.2.2
2.2.2. Group Parameter Validation
The ASN.1 for DH keys in [PKIX] includes elements j and validation-
Parms which MAY be used by recipients of a key to verify that the
group parameters were correctly generated. Two checks are possible:
1. Verify that p=qj + 1. This demonstrates that the parameters meet
the X9.42 parameter criteria.
2. Verify that when the p,q generation procedure of [FIPS-186]
Appendix 2 is followed with seed 'seed', that p is found when
'counter' = pgenCounter.
The main problem is the MAY.
As I read it, an implementation MAY NOT verify it.
Sketch of the attack:
Choose $q$ to be a product of small primes $p_i$.
Solve the discrete logarithm modulo each $p_i$ for the public keys.
Apply the Chinese remainder theorem to get the private keys.
(This is a well-known method for DL, and for this reason
the group order must be prime [160 bits ;)]).
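A minimal sketch of that attack (Pohlig-Hellman plus CRT) in Python,
assuming the attacker already chose the subgroup order q as a product of
small primes; the toy parameters and function names are mine, not from
the RFC:

from math import prod

def dlog_small_prime(g, y, p, q_i):
    # Brute-force discrete log of y to base g inside the order-q_i subgroup mod p.
    acc = 1
    for k in range(q_i):
        if acc == y:
            return k
        acc = (acc * g) % p
    raise ValueError("no discrete log found")

def crt(residues, moduli):
    # Chinese remainder theorem for pairwise-coprime moduli.
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def recover_private_key(g, y, p, small_primes):
    # Recover x from y = g^x mod p when ord(g) = prod(small_primes).
    q = prod(small_primes)
    residues = []
    for q_i in small_primes:
        g_i = pow(g, q // q_i, p)   # generator of the order-q_i subgroup
        y_i = pow(y, q // q_i, p)   # public key projected into that subgroup
        residues.append(dlog_small_prime(g_i, y_i, p, q_i))
    return crt(residues, small_primes)

# Toy parameters (mine, not from the RFC): q = 3*5*7*11 = 1155, p = 2*q + 1 = 2311.
small_primes = [3, 5, 7, 11]
q = prod(small_primes)
p = 2 * q + 1                       # 2311, which is prime
# pick an element of order exactly q (a quadratic residue missing no prime factor)
g = next(h * h % p for h in range(2, p)
         if all(pow(h * h % p, q // qi, p) != 1 for qi in small_primes))
x_secret = 777                      # the victim's private key
y = pow(g, x_secret, p)             # the victim's public key
print(recover_private_key(g, y, p, small_primes))   # prints 777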
I would be interested in how implementations handle this MAY.
Let me know if there is a better list for this.
--
georgi
Re: Fwd: Do ethnic Germans have the right to racial and cultural strength? - was Re: At a Berlin church, Muslim refugees converting in droves.
by Zenaan Harkness 17 Jan '16
On 9/11/15, Nathan wrote:
> WTF? I've read this *twice* and I *still* don't get it..... Is this an
> argument against multiculturalism?
That's not the question - the question is, does each individual human
of a nation have the right to protect their current, existing culture?
Or at least to have a say, i.e. a vote about it?
Some nations, through their power structures, go to extraordinary
lengths to protect their rights to cultural and racial strength, Japan
for example.
Is this the right of the Japanese? Or are they doing the wrong thing?
From their perspective, they are doing what is required to protect
their rights, to self determination as a people, a tribe, a race.
It appears that the rights of individuals are determined (in practice)
by those in power, and not by those who are directly affected by the
decision makers (the people themselves). Is keeping such decisions in
the hands of a few "educated elected" an elitist approach?
If democracy be the will of the people manifested, then ought the
people of any nation be asked directly (nationwide vote) on such
questions as cultural dilution due to immigration and/or asylum?
Or is it appropriate for an "elected" few to impose their personal
preference on a nation?
Regards
Zenaan
08 Jan '16
Most consumer routers and modems are outdated, and use firmware that's
never or rarely updated.
It would be trivial to hack into any such devices used by government
employees at home and correlate the user to a database to gather additional
information on them.
Can foreign governments still teach blackmailed government employees how to
pass a polygraph?
Seems like counterintelligence investigations now need to include labor
intensive firmware dumps of routers.
Not like the government accomplishes anything good nowadays anyway.
De-anonymizing Programmers via Code Stylometry
Aylin Caliskan-Islam, Drexel University
Arvind Narayanan, Princeton University
Richard Harang, U.S. Army Research Laboratory
Clare Voss, U.S. Army Research Laboratory
Andrew Liu, University of Maryland
Fabian Yamaguchi, University of Goettingen
Rachel Greenstadt, Drexel University
Abstract
Source code authorship attribution is a significant privacy threat to
anonymous code contributors. However,
it may also enable attribution of successful attacks from
code left behind on an infected system, or aid in resolving copyright,
copyleft, and plagiarism issues in the programming fields. In this
work, we investigate machine
learning methods to de-anonymize source code authors
of C/C++ using coding style. Our Code Stylometry Feature Set is a
novel representation of coding style found
in source code that reflects coding style from properties
derived from abstract syntax trees.
Our random forest and abstract syntax tree-based approach attributes
more authors (1,600 and 250) with significantly higher accuracy (94%
and 98%) on a larger
data set (Google Code Jam) than has been previously
achieved. Furthermore, these novel features are robust,
difficult to obfuscate, and can be used in other programming
languages, such as Python. We also find that (i) the
code resulting from difficult programming tasks is easier
to attribute than easier tasks and (ii) skilled programmers
(who can complete the more difficult tasks) are easier to
attribute than less skilled programmers.
1 Introduction
Do programmers leave fingerprints in their source code?
That is, does each programmer have a distinctive "coding style"?
Perhaps a programmer has a preference for
spaces over tabs, or while loops over for loops, or,
more subtly, modular rather than monolithic code.
These questions have strong privacy and security implications.
Contributors to open-source projects may
hide their identity whether they are Bitcoin’s creator or
just a programmer who does not want her employer to
know about her side activities. They may live in a regime
that prohibits certain types of software, such as censorship
circumvention tools. For example, an Iranian pro-
grammer was sentenced to death in 2012 for developing
photo sharing software that was used on pornographic
websites [31].
The flip side of this scenario is that code attribution
may be helpful in a forensic context, such as detection of
ghostwriting, a form of plagiarism, and investigation of
copyright disputes. It might also give us clues about the
identity of malware authors. A careful adversary may
only leave binaries, but others may leave behind code
written in a scripting language or source code downloaded into the
breached system for compilation.
While this problem has been studied previously, our
work represents a qualitative advance over the state of the
art by showing that Abstract Syntax Trees (ASTs) carry
authorial ‘fingerprints.’ The highest accuracy achieved
in the literature is 97%, but this is achieved on a set of
only 30 programmers and furthermore relies on using
programmer comments and larger amounts of training
data [12, 14]. We match this accuracy on small programmer sets without
this limitation. The largest scale experiments in the literature use
46 programmers and achieve
67.2% accuracy [10]. We are able to handle orders of
magnitude more programmers (1,600) while using less
training data with 92.83% accuracy. Furthermore, the
features we are using are not trivial to obfuscate. We are
able to maintain high accuracy while using commercial
obfuscators. While abstract syntax trees can be obfuscated to an
extent, doing so incurs significant overhead
and maintenance costs.
Contributions. First, we use syntactic features for
code stylometry. Extracting such features requires parsing of
incomplete source code using a fuzzy parser to
generate an abstract syntax tree. These features add a
component to code stylometry that has so far remained
almost completely unexplored. We provide evidence that
these features are more fundamental and harder to obfuscate. Our
complete feature set consists of a comprehensive set of around 120,000
layout-based, lexical, and
syntactic features. With this complete feature set we are
able to achieve a significant increase in accuracy compared to
previous work. Second, we show that we can
scale our method to 1,600 programmers without losing
much accuracy. Third, this method is not specific to C or
C++, and can be applied to any programming language.
We collected C++ source of thousands of contestants
from the annual international competition "Google Code
Jam". A bagging (portmanteau of "bootstrap aggregating")
classifier, random forest, was used to attribute programmers to source
code.
task, 93% accuracy in
a 1,600-class closed world task, 100% accuracy on average in a
two-class task. Finally, we analyze various
attributes of programmers, types of programming tasks,
and types of features that appear to influence the success
of attribution. We identified the most important 928 features out of
120,000; 44% of them are syntactic, 1% are
layout-based and the rest of the features are lexical. 8
training files with an average of 70 lines of code is sufficient for
training when using the lexical, layout and syntactic features. We
also observe that programmers with
a greater skill set are more easily identifiable compared
to less advanced programmers and that a programmer’s
coding style is more distinctive in implementations of
difficult tasks as opposed to easier tasks.
The remainder of this paper is structured as follows.
We begin by introducing applications of source code authorship
attribution considered throughout this paper in
Section 2, and present our AST-based approach in Section 3. We proceed
to give a detailed overview of the experiments conducted to evaluate
our method in Section 4
and discuss the insights they provide in Section 5. Section 6 presents
related work, and Section 7 concludes.
2 Motivation

Throughout this work, we consider an analyst interested in determining
the programmer of an anonymous fragment of source code purely based on
its style. To do so, the analyst only has access to labeled samples
from a set of candidate programmers, as well as from zero or more
unrelated programmers.
The analyst addresses this problem by converting each labeled sample
into a numerical feature vector, in order to train a machine learning
classifier that can subsequently be used to determine the code's
programmer. In particular, this abstract problem formulation captures
the following five settings and corresponding applications (see
Table 1). The experimental formulations are presented in Section 4.2.
We emphasize that while these applications motivate our work, we have
not directly studied them. Rather, we formulate them as variants of a
machine-learning (classification) problem. Our data comes from the
Google Code Jam competition, as we discuss in Section 4.1. Doubtless
there will be additional challenges in using our techniques for
digital forensics or any of the other real-world applications. We
describe some known limitations in Section 5.
Programmer De-anonymization. In this scenario, the analyst is
interested in determining the identity of an anonymous programmer. For
example, if she has a set of programmers who she suspects might be
Bitcoin's creator, Satoshi, and samples of source code from each of
these programmers, she could use the initial versions of Bitcoin's
source code to try to determine Satoshi's identity. Of course, this
assumes that Satoshi did not make any attempts to obfuscate his or her
coding style. Given a set of probable programmers, this is considered
a closed-world machine learning task with multiple classes where
anonymous source code is attributed to a programmer. This is a threat
to privacy for open source contributors who wish to remain anonymous.
Ghostwriting Detection. Ghostwriting detection is related to but
different from traditional plagiarism detection. We are given a
suspicious piece of code and one or more candidate pieces of code that
the suspicious code may have been plagiarized from. This is a
well-studied problem, typically solved using code similarity metrics,
as implemented by widely used tools such as MOSS [6], JPlag [25], and
Sherlock [24].
For example, a professor may want to determine whether a student's
programming assignment has been written by a student who has
previously taken the class. Unfortunately, even though submissions of
the previous year are available, the assignments may have changed
considerably, rendering code-similarity based methods ineffective.
Luckily, stylometry can be applied in this setting—we find the most
stylistically similar piece of code from the previous year's corpus
and bring both students in for gentle questioning. Given the limited
set of students, this can be considered a closed-world machine
learning problem.
Software Forensics. In software forensics, the analyst assembles a set
of candidate programmers based on previously collected malware samples
or online code repositories. Unfortunately, she cannot be sure that
the anonymous programmer is one of the candidates, making this an open
world classification problem as the test sample might not belong to
any known category.
Copyright Investigation. Theft of code often leads to copyright
disputes. Informal arrangements of hired programming labor are very
common, and in the absence of a written contract, someone might claim
a piece of code was her own after it was developed for hire and
delivered. A dispute between two parties is thus a two-class
classification problem; we assume that labeled code from both
programmers is available to the forensic expert.
Authorship Verification. Finally, we may suspect
that a piece of code was not written by the claimed programmer, but
have no leads on who the actual programmer might be. This is the
authorship verification problem. In this work, we take the textbook
approach and
model it as a two-class problem where positive examples
come from previous works of the claimed programmer
and negative examples come from randomly selected unrelated
programmers. Alternatively, anomaly detection
could be employed in this setting, e.g., using a one-class
support vector machine [see 30].
As an example, a recent investigation conducted by Verizon [17] on a
US company's anomalous virtual private network traffic revealed an
employee who was outsourcing her work to programmers in China. In such
cases, training a classifier on the employee's original code and that
of random programmers, and subsequently testing pieces of recent code,
could demonstrate whether the employee was the actual programmer.
In each of these applications, the adversary may try to
actively modify the program’s coding style. In the software
forensics application, the adversary tries to modify
code written by her to hide her style. In the copyright and
authorship verification applications, the adversary modifies code
written by another programmer to match his
own style. Finally, in the ghostwriting application, two
of the parties may collaborate to modify the style of code
written by one to match the other’s style.
Table 1: Overview of Applications for Code Stylometry

Application             | Learner       | Comments     | Evaluation
De-anonymization        | Multiclass    | Closed world | Section 4.2.1
Ghostwriting detection  | Multiclass    | Closed world | Section 4.2.1
Software forensics      | Multiclass    | Open world   | Section 4.2.2
Copyright investigation | Two-class     | Closed world | Section 4.2.3
Authorship verification | Two/One-class | Open world   | Section 4.2.4

We emphasize that code stylometry that is robust to adversarial
manipulation is largely left to future work. However, we hope that our
demonstration of the power of features based on the abstract syntax
tree will serve as the starting point for such research.

3 De-anonymizing Programmers

One of the goals of our research is to create a classifier that
automatically determines the most likely author of a source file.
Machine learning methods are an obvious choice to tackle this problem,
however, their success crucially depends on the choice of a feature
set that clearly represents programming style. To this end, we begin
by parsing source code, thereby obtaining access to a wide range of
possible features that reflect programming language use (Section 3.1).
We then define a number of different features to represent both syntax
and structure of program code (Section 3.2) and finally, we train a
random forest classifier for classification of previously unseen
source files (Section 3.3). In the following sections, we will discuss
each of these steps in detail and outline design decisions along the
way. The code for our approach is made available as open-source to
allow other researchers to reproduce our results [1].

[1] https://github.com/calaylin/CodeStylometry

3.1 Fuzzy Abstract Syntax Trees

To date, methods for source code authorship attribution focus mostly
on sequential feature representations of code such as byte-level and
feature-level n-grams [8, 13]. While these models are well suited to
capture naming conventions and preference of keywords, they are
entirely language agnostic and thus cannot model author
characteristics that become apparent only in the composition of
language constructs. For example, an author's tendency to create
deeply nested code, unusually long functions or long chains of
assignments cannot be modeled using n-grams alone.
Addressing these limitations requires source code to be parsed.
Unfortunately, parsing C/C++ code using traditional compiler
front-ends is only possible for complete code, i.e., source code where
all identifiers can be resolved. This severely limits their
applicability in the setting of authorship attribution as it prohibits
analysis of lone functions or code fragments, as is possible with
simple n-gram models.
As a compromise, we employ the fuzzy parser Joern that has been
designed specifically with incomplete code in mind [32]. Where
possible, the parser produces abstract syntax trees for code fragments
while ignoring fragments that cannot be parsed without further
information. The produced syntax trees form the basis for our feature
extraction procedure. While they largely preserve the information
required to create n-gram or bag-of-words representations, in
addition, they allow a wealth of features to be extracted that encode
programmer habits visible in the code's structure.
As an example, consider the function foo as shown in Figure 1, and a
simplified version of its corresponding abstract syntax tree in
Figure 2. The function contains a number of common language constructs
found in many programming languages, such as if-statements (line 3 and
7), return-statements (line 4, 8 and 10), and function call
expressions (line 6). For each of these constructs, the abstract
syntax tree contains a corresponding node. While the leaves of the
tree make classical syntactic features such as keywords, identifiers
and operators accessible, inner nodes represent operations showing how
these basic elements are combined to form expressions and statements.
In effect, the nesting of language constructs can also be analyzed to
obtain a feature set representing the code's structure.

[Figure 1: Sample Code Listing]
[Figure 2: Corresponding Abstract Syntax Tree]

3.2 Feature Extraction

Analyzing coding style using machine learning approaches is not
possible without a suitable representation of source code that clearly
expresses program style. To address this problem, we present the Code
Stylometry Feature Set (CSFS), a novel representation of source code
developed specifically for code stylometry. Our feature set combines
three types of features, namely lexical features, layout features and
syntactic features. Lexical and layout features are obtained from
source code while the syntactic features can only be obtained from
ASTs. We now describe each of these feature types in detail.

3.2.1 Lexical and Layout Features

We begin by extracting numerical features from the source code that
express preferences for certain identifiers and keywords, as well as
some statistics on the use of functions or the nesting depth. Lexical
and layout features can be calculated from the source code, without
having access to a parser, with basic knowledge of the programming
language in use. For example, we measure the number of functions per
source line to determine the programmer's preference of longer over
shorter functions. Furthermore, we tokenize the source file to obtain
the number of occurrences of each token, so called word unigrams.
Table 2 gives an overview of lexical features. In addition, we
consider layout features that represent code indentation. For example,
we determine whether the majority of indented lines begin with
whitespace or tabulator characters, and we determine the ratio of
whitespace to the file size. Table 3 gives a detailed description of
these features.

Table 2: Lexical Features

Feature                 | Definition                                                                 | Count
WordUnigramTF           | Term frequency of word unigrams in source code                             | dynamic*
ln(numkeyword/length)   | Log of the number of occurrences of keyword divided by file length in characters, where keyword is one of do, else-if, if, else, switch, for or while | 7
ln(numTernary/length)   | Log of the number of ternary operators divided by file length in characters | 1
ln(numTokens/length)    | Log of the number of word tokens divided by file length in characters      | 1
ln(numComments/length)  | Log of the number of comments divided by file length in characters         | 1
ln(numLiterals/length)  | Log of the number of string, character, and numeric literals divided by file length in characters | 1
ln(numKeywords/length)  | Log of the number of unique keywords used divided by file length in characters | 1
ln(numFunctions/length) | Log of the number of functions divided by file length in characters        | 1
ln(numMacros/length)    | Log of the number of preprocessor directives divided by file length in characters | 1
nestingDepth            | Highest degree to which control statements and loops are nested within each other | 1
branchingFactor         | Branching factor of the tree formed by converting code blocks of files into nodes | 1
avgParams               | The average number of parameters among all functions                       | 1
stdDevNumParams         | The standard deviation of the number of parameters among all functions     | 1
avgLineLength           | The average length of each line                                            | 1
stdDevLineLength        | The standard deviation of the character lengths of each line               | 1

*About 55,000 for 250 authors with 9 files.
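As a rough illustration only (not the paper's implementation), a few of
the lexical and layout metrics from Tables 2 and 3 could be computed
from raw source text along these lines; the +1 smoothing and the helper
names are assumptions of this sketch:

import math
import re

KEYWORDS = ("do", "else if", "if", "else", "switch", "for", "while")

def lexical_layout_features(src: str) -> dict:
    length = max(len(src), 1)                      # file length in characters
    lines = src.splitlines()
    ws = sum(c in " \t\n" for c in src)
    feats = {
        # ln(numTabs/length), ln(numSpaces/length)  (layout features, Table 3)
        "ln_tabs_per_len": math.log((src.count("\t") + 1) / length),
        "ln_spaces_per_len": math.log((src.count(" ") + 1) / length),
        # whiteSpaceRatio: whitespace vs. non-whitespace characters
        "whitespace_ratio": ws / max(len(src) - ws, 1),
        # avgLineLength (Table 2)
        "avg_line_length": sum(len(l) for l in lines) / max(len(lines), 1),
        # tabsLeadLines: do more indented lines start with a tab than a space?
        "tabs_lead_lines": sum(l.startswith("\t") for l in lines)
                           > sum(l.startswith(" ") for l in lines),
    }
    # ln(numkeyword/length) for each control keyword (Table 2, count 7)
    for kw in KEYWORDS:
        pattern = r"\b" + kw.replace(" ", r"\s+") + r"\b"
        n = len(re.findall(pattern, src))
        # +1 is our assumption, purely to avoid log(0) on absent keywords
        feats["ln_" + kw.replace(" ", "_") + "_per_len"] = math.log((n + 1) / length)
    return feats

print(lexical_layout_features("int main() {\n\tfor (;;) { if (1) break; }\n}\n"))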
Table 3: Layout Features

Feature                  | Definition                                                                 | Count
ln(numTabs/length)       | Log of the number of tab characters divided by file length in characters   | 1
ln(numSpaces/length)     | Log of the number of space characters divided by file length in characters | 1
ln(numEmptyLines/length) | Log of the number of empty lines divided by file length in characters, excluding leading and trailing lines between lines of text | 1
whiteSpaceRatio          | The ratio between the number of whitespace characters (spaces, tabs, and newlines) and non-whitespace characters | 1
newLineBeforeOpenBrace   | A boolean representing whether the majority of code-block braces are preceded by a newline character | 1
tabsLeadLines            | A boolean representing whether the majority of indented lines begin with spaces or tabs | 1

3.2.2 Syntactic Features

The syntactic feature set describes the properties of the language
dependent abstract syntax tree, and keywords. Calculating these
features requires access to an abstract syntax tree. All of these
features are invariant to changes in source-code layout, as well as
comments. Table 4 gives an overview of our syntactic features. We
obtain these features by preprocessing all C++ source files in the
dataset to produce their abstract syntax trees. An abstract syntax
tree is created for each function in the code. There are 58 node types
in the abstract syntax tree (see Appendix A) produced by Joern [33].

Table 4: Syntactic Features

Feature               | Definition                                                                | Count
MaxDepthASTNode       | Maximum depth of an AST node                                              | 1
ASTNodeBigramsTF      | Term frequency of AST node bigrams                                        | dynamic*
ASTNodeTypesTF        | Term frequency of 58 possible AST node types excluding leaves             | 58
ASTNodeTypesTFIDF     | Term frequency inverse document frequency of 58 possible AST node types excluding leaves | 58
ASTNodeTypeAvgDep     | Average depth of 58 possible AST node types excluding leaves              | 58
cppKeywords           | Term frequency of 84 C++ keywords                                         | 84
CodeInASTLeavesTF     | Term frequency of code unigrams in AST leaves                             | dynamic**
CodeInASTLeavesTFIDF  | Term frequency inverse document frequency of code unigrams in AST leaves  | dynamic**
CodeInASTLeavesAvgDep | Average depth of code unigrams in AST leaves                              | dynamic**

*About 45,000 for 250 authors with 9 files.
**About 7,000 for 250 authors with 9 files.
**About 4,000 for 150 authors with 6 files.
**About 2,000 for 25 authors with 9 files.

The AST node bigrams are the most discriminating features of all. AST
node bigrams are two AST nodes that are connected to each other. In
most cases, when used alone, they provide similar classification
results to using the entire feature set.
The term frequency (TF) is the raw frequency of a node found in the
abstract syntax trees for each file. The term frequency inverse
document frequency (TFIDF) of nodes is calculated by multiplying the
term frequency of a node by the inverse document frequency. The goal
in using the inverse document frequency is normalizing the term
frequency by the number of authors actually using that particular type
of node. The inverse document frequency is calculated by dividing the
number of authors in the dataset by the number of authors that use
that particular node. Consequently, we are able to capture how rare of
a node it is and weight it more according to its rarity.
The maximum depth of an abstract syntax tree reflects the deepest
level a programmer nests a node in the solution. The average depth of
the AST nodes shows how nested or deep a programmer tends to use
particular structural pieces. And lastly, term frequency of each C++
keyword is calculated. Each of these features is written to a feature
vector to represent the solution file of a specific author and these
vectors are later used in training and testing by machine learning
classifiers.

3.3 Classification

Using the feature set presented in the previous section, we can now
express fragments of source code as numerical vectors, making them
accessible to machine learning algorithms. We proceed to perform
feature selection and train a random forest classifier capable of
identifying the most likely author of a code fragment.

3.3.1 Feature Selection

Due to our heavy use of unigram term frequency and TF/IDF measures,
and the diversity of individual terms in the code, our resulting
feature vectors are extremely large and sparse, consisting of tens of
thousands of features for hundreds of classes. The dynamic Code
Stylometry Feature Set, for example, produced close to 120,000
features for 250 authors with 9 solution files each.
In many cases, such feature vectors can lead to overfitting (where a
rare term, by chance, uniquely identifies a particular author).
Extremely sparse feature vectors can also damage the accuracy of
random forest classifiers, as the sparsity may result in large numbers
of zero-valued features being selected during the random subsampling
of the features to select a best split. This reduces the number of
'useful' splits that can be obtained at any given node, leading to
poorer fits and larger trees. Large, sparse feature vectors can also
lead to slowdowns in model fitting and evaluation, and are often more
difficult to interpret. By selecting a smaller number of more
informative features, the sparsity in the feature vector can be
greatly reduced, thus allowing the classifier to both produce more
accurate results and fit the data faster.
We therefore employed a feature selection step using WEKA's
information gain [26] criterion, which evaluates the difference
between the entropy of the distribution of classes and the entropy of
the conditional distribution of classes given a particular feature:

    IG(A, M_i) = H(A) − H(A|M_i)                                    (1)
where A is the class corresponding to an author, H is
Shannon entropy, and Mi is the ith feature of the dataset.
Intuitively, the information gain can be thought of as
measuring the amount of information that the observation of the value
of feature i gives about the class label
associated with the example.
To reduce the total size and sparsity of the feature vector, we
retained only those features that individually had
non-zero information gain. (These features can be referred to as
IG-CSFS throughout the rest of the paper.)
Note that, as H(A|Mi ) ≤ H(A), information gain is always
non-negative. While the use of information gain
on a variable-per-variable basis implicitly assumes independence
between the features with respect to their impact on the class label,
this conservative approach to feature selection means that we only use
features that have
demonstrable value in classification.
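A minimal sketch of this information-gain filter (Equation 1), assuming
feature values are already discretized; the function names and toy data
below are illustrative, not from the paper:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # H(A) - H(A | M_i) for one (discrete) feature column.
    n = len(labels)
    cond = 0.0
    for v, count in Counter(feature_values).items():
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        cond += (count / n) * entropy(subset)
    return entropy(labels) - cond

def select_nonzero_ig(columns, labels, eps=1e-12):
    # Keep indices of features with non-zero information gain (IG-CSFS style).
    return [i for i, col in enumerate(columns) if information_gain(col, labels) > eps]

# Toy usage: feature 0 separates the two authors, feature 1 is constant noise.
cols = [[0, 0, 1, 1], [5, 5, 5, 5]]
labels = ["alice", "alice", "bob", "bob"]
print(select_nonzero_ig(cols, labels))   # -> [0]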
To validate this approach to feature selection, we applied this method
to two distinct sets of source code files,
and observed that sets of features with non-zero information gain were
nearly identical between the two sets, and
the ranking of features was substantially similar between
the two. This suggests that the application of information
gain to feature selection is producing a robust and consistent set of
features (see Section 4 for further discussion). All the results are
calculated by using CSFS and
IG-CSFS. Using IG-CSFS on all experiments demonstrates how these
features generalize to different datasets
that are larger in magnitude. One other advantage of IG-CSFS is that it
consists of a few hundred features that
result in non-sparse feature vectors. Such a compact representation of
coding style makes de-anonymizing thousands of programmers possible in
minutes.
3.3.2 Random Forest Classification

We used the random forest ensemble classifier [7] as our classifier
for authorship attribution. Random forests are ensemble learners built
from collections of decision trees, each of which is grown by randomly
sampling N training samples with replacement, where N is the number of
instances in the dataset. To reduce correlation between trees,
features are also subsampled; commonly (log M) + 1 features are
selected at random (without replacement) out of M, and the best split
on these (log M) + 1 features is used to split the tree nodes. The
number of selected features represents one of the few tuning
parameters in random forests: increasing the number of features
increases the correlation between trees in the forest, which can harm
the accuracy of the overall ensemble; however, increasing the number
of features that can be chosen at each split increases the
classification accuracy of each individual tree, making them stronger
classifiers with low error rates. The optimal range of the number of
features can be found using the out-of-bag (oob) error estimate, or
the error estimate derived from those samples not selected for
training on a given tree.
During classification, each test example is classified via each of the
trained decision trees by following the binary decisions made at each
node until a leaf is reached, and the results are then aggregated. The
most populous class can be selected as the output of the forest for
simple classification, or classifications can be ranked according to
the number of trees that 'voted' for a label when performing relaxed
attribution (see Section 4.3.4).
We employed random forests with 300 trees, which empirically provided
the best trade-off between accuracy and processing time. Examination
of numerous oob values across multiple fits suggested that (log M) + 1
random features (where M denotes the total number of features) at each
split of the decision trees was in fact optimal in all of the
experiments (listed in Section 4), and was used throughout. Node
splits were selected based on the information gain criteria, and all
trees were grown to the largest extent possible, without pruning.
The data was analyzed via k-fold cross-validation, where the data was
split into training and test sets stratified by author (ensuring that
the number of code samples per author in the training and test sets
was identical across authors). k varies according to datasets and is
equal to the number of instances present from each author. The
cross-validation procedure was repeated 10 times, each with a
different random seed. We report the average results across all
iterations in the results, ensuring that they are not biased by
improbably easy or difficult to classify subsets.
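A hedged sketch of the classification setup described above, using
scikit-learn rather than WEKA (the paper's toolkit): 300 trees, roughly
(log M) + 1 candidate features per split, entropy-based splits,
unpruned trees, and author-stratified cross-validation. The data here
is a random placeholder, not the paper's features:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((90, 500))          # placeholder feature matrix (e.g., IG-CSFS vectors)
y = np.repeat(np.arange(10), 9)    # 10 "authors", 9 samples each

n_features = X.shape[1]
clf = RandomForestClassifier(
    n_estimators=300,                                   # 300 trees
    max_features=int(np.log2(n_features)) + 1,          # ~ (log M) + 1 per split
    criterion="entropy",                                # information-gain style splits
    max_depth=None,                                     # grow trees fully, no pruning
    random_state=0,
)

# k equal to the number of samples per author, stratified by author.
cv = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.3f}")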
4 Evaluation
In the evaluation section, we present the results to the
possible scenarios formulated in the problem statement
and evaluate our method. The corpus section gives an
overview of the data we collected. Then, we present the
main results to programmer de-anonymization and how
it scales to 1,600 programmers, which is an immediate
privacy concern for open source contributors that prefer
to remain anonymous. We then present the training data
requirements and efficacy of types of features. The obfuscation
section discusses a possible countermeasure to
programmer de-anonymization. We then present possible machine learning
formulations along with the verification section that extends the
approach to an open world
problem. We conclude the evaluation with generalizing
the method to other programming languages and providing software
engineering insights.
4.1 Corpus

One concern in source code authorship attribution is that we are
actually identifying differences in coding style, rather than merely
differences in functionality. Consider the case where Alice and Bob
collaborate on an open source project. Bob writes user interface code
whereas Alice works on the network interface and backend analytics. If
we used a dataset derived from their project, we might differentiate
differences between frontend and backend code rather than differences
in style.
In order to minimize these effects, we evaluate our method on the
source code of solutions to programming tasks from the international
programming competition Google Code Jam (GCJ), made public in 2008
[2]. The competition consists of algorithmic problems that need to be
solved in a programming language of choice. In particular, this means
that all programmers solve the same problems, and hence implement
similar functionality, a property of the dataset crucial for code
stylometry analysis.
The dataset contains solutions by professional programmers, students,
academics, and hobbyists from 166 countries. Participation statistics
are similar over the years. Moreover, it contains problems of
different difficulty, as the contest takes place in several rounds.
This allows us to assess whether coding style is related to programmer
experience and problem difficulty.
The most commonly used programming language was C++, followed by Java,
and Python. We chose to investigate source code stylometry on C++ and
C because of their popularity in the competition and having a parser
for C/C++ readily available [32]. We also conducted some preliminary
experimentation on Python.
A validation dataset was created from 2012's GCJ competition. Some
problems had two stages, where the second stage involved answering the
same problem in a limited amount of time and for a larger input. The
solution to the large input is essentially a solution for the small
input but not vice versa. Therefore, collecting both of these
solutions could result in duplicate and identical source code. In
order to avoid multiple entries, we only collected the small input
versions' solutions to be used in our dataset.
The programmers had up to 19 solution files in these datasets.
Solution files have an average of 70 lines of code per programmer.
To create our experimental datasets, which are discussed in further
detail in the results section:
(i) We first partitioned the corpus of files by year of competition.
The "main" dataset includes files drawn from 2014 (250 programmers).
The "validation" dataset files come from 2012, and the "multi-year"
dataset files come from years 2008 through 2014 (1,600 programmers).
(ii) Within each year, we ordered the corpus files by the round in
which they were written, and by the problem within a round, as all
competitors proceed through the same sequence of rounds in that year.
As a result, we performed stratified cross validation on each program
file by the year it was written, by the round in which the program was
written, by the problems solved in the round, and by the author's
highest round completed in that year.
Some limitations of this dataset are that it does not allow us to
assess the effect of style guidelines that may be imposed on a project
or attributing code with multiple/mixed programmers. We leave these
interesting questions for future work, but posit that our improved
results with basic stylometry make them worthy of study.

4.2 Applications
In this section, we will go over machine learning task
formulations representing five possible real-world applications
presented in Section 2.
4.2.1 Multiclass Closed World Task
This section presents our main experiment—de-anonymizing 250
programmers in the difficult scenario
where all programmers solved the same set of problems. The machine
learning task formulation for
de-anonymizing programmers also applies to ghostwriting detection. The
biggest dataset formed from 2014’s
Google Code Jam Competition with 9 solution files to
the same problem had 250 programmers. These were the
easiest set of 9 problems, making the classification more
challenging (see Section 4.3.6). We reached 91.78%
accuracy in classifying 250 programmers with the Code
Stylometry Feature Set. After applying information gain
and using the features that had information gain, the
accuracy was 95.08%.
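As a concrete (if simplified) illustration of this feature-selection
step, the sketch below keeps only features with positive estimated
information gain and then trains a random forest. It uses
scikit-learn's mutual_info_classif as a stand-in for the paper's
information gain computation and synthetic data in place of the CSFS
features, so it approximates the procedure rather than reproducing the
authors' implementation.

  # Sketch: keep features with positive (estimated) information gain, then classify.
  # Synthetic feature matrix X and author labels y stand in for the real CSFS data.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.feature_selection import mutual_info_classif
  from sklearn.model_selection import cross_val_score

  def select_ig_features(X, y, threshold=0.0):
      """Return indices of columns whose estimated information gain exceeds threshold."""
      ig = mutual_info_classif(X, y, random_state=0)
      return np.where(ig > threshold)[0]

  if __name__ == "__main__":
      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 50))
      y = rng.integers(0, 10, size=200)          # 10 pretend authors
      keep = select_ig_features(X, y)
      X_sel = X[:, keep] if len(keep) else X     # fall back if nothing passes
      clf = RandomForestClassifier(n_estimators=300, random_state=0)
      print(len(keep), "features kept; accuracy:",
            cross_val_score(clf, X_sel, y, cv=3).mean())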
We also took 250 programmers from different years
and randomly selected 9 solution files for each one of
them. We used the information gain features obtained
from 2014’s dataset to see how well they generalize.
We reached 98.04% accuracy in classifying 250 programmers. This is 3%
higher than the controlled large
dataset's results. The accuracy is probably higher because the mixed
set of Google Code Jam problems lets properties of the particular
solutions leak in alongside the programmers' coding style, which makes
the code more distinct.
We wanted to evaluate our approach and validate our
method and important features. We created a dataset
from 2012’s Google Code Jam Competition with 250
programmers who had the solutions to the same set of
9 problems. We extracted only the features that had positive
information gain in 2014's dataset, which was used as the main
dataset, and applied the approach with them. The classification
accuracy was 96.83%, which is higher than the 95.07% accuracy obtained
on 2014's dataset.
The high accuracy of the validation results in Table 5 shows that we
identified the important features of code stylometry and found a
stable feature set. This feature set does
not necessarily represent the exact features for all possible
datasets. For a given dataset that has ground truth
information on authorship, following the same approach
should generate the most important features that represent coding
style in that particular dataset.
A = #programmers, F = max #problems completed,
N = #problems included in dataset (N ≤ F)

  Dataset             F                 N      Accuracy (avg. of 10 iterations, IG-CSFS)
  A = 250 from 2014   F = 9 from 2014   N = 9  95.07%
  A = 250 from 2012   F = 9 from 2014   N = 9  96.83%
  A = 250 all years   F ≥ 9 all years   N = 9  98.04%

Table 5: Validation Experiments
4.2.2 Multiclass Open World Task
The experiments in this section can be used in software
forensics to find out the programmer of a piece of malware. In
software forensics, the analyst does not know if
source code belongs to one of the programmers in the
candidate set of programmers. In such cases, we can
classify the anonymous source code, and if the fraction of trees in
the random forest voting for the winning class is below a certain
threshold, we can reject the classification, considering the
possibility that it might not belong to any of the
classes in the training data. By doing so, we can scale
our approach to an open world scenario, where we might
not have encountered the suspect before. As long as we
determine a confidence threshold based on training data
[30], we can calculate the probability that an instance
belongs to one of the programmers in the set and accordingly accept or
reject the classification.
We performed 270 classifications in a 30-class problem using all the
features to determine the confidence
threshold based on the training data. The accuracy was
96.67%. There were 9 misclassifications and all of them
were classified with less than 15% confidence by the
classifier. The class probability or classification confidence that
source code fragment C is of class i is calculated by taking the
percentage of trees in the random
forest that voted for that particular class, as follows:
P(C_i) = ( ∑_j V_j(i) ) / |T|_f                (2)

where V_j(i) = 1 if the jth tree voted for class i and 0 otherwise,
and |T|_f denotes the total number of trees in forest f. Note that by
construction, ∑_i P(C_i) = 1 and P(C_i) ≥ 0 for all i, allowing us to
treat P(C_i) as a probability measure.

There was one correct classification made with 13.7% confidence. This
suggests that we can use a threshold between the 13.7% and 15%
confidence levels for verification, and manually analyze the
classifications that did not pass the confidence threshold or exclude
them from the results. We picked an aggressive threshold of 15% and,
to validate it, we trained a random forest classifier on the same set
of 30 programmers' 270 code samples. We tested on 150 different files
from the programmers in the training set. There were 6 classifications
below the 15% threshold, and two of them were misclassified. We took
another set of 420 test files from 30 programmers that were not in the
training set. All the files from these 30 programmers were attributed
to one of the 30 programmers in the training set, since this is a
closed world classification task; however, the highest confidence
level in these classifications was 14.7%. The 15% threshold catches
all the instances that do not belong to the programmers in the suspect
set, while discarding 2 misclassifications and 4 correct
classifications. Consequently, when we see a classification with less
than the threshold value, we can reject the classification and
attribute the test instance to an unknown suspect.
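To make this concrete, here is a rough sketch of applying such a
confidence threshold with a random forest. It uses scikit-learn's
predict_proba, which averages the trees' probability estimates, as a
stand-in for the vote fraction P(C_i) in Equation 2, and synthetic
data in place of real stylometry features, so it approximates rather
than reproduces the procedure above.

  # Sketch: reject low-confidence attributions in an open-world setting.
  # predict_proba approximates the vote fraction P(C_i) described above;
  # the 15% threshold and the data are illustrative.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  def attribute_or_reject(forest, X, threshold=0.15):
      """Return predicted labels, with None where confidence falls below threshold."""
      proba = forest.predict_proba(X)                  # shape (n_samples, n_classes)
      best = np.argmax(proba, axis=1)
      confident = proba[np.arange(len(X)), best] >= threshold
      labels = forest.classes_[best]
      return [label if ok else None for label, ok in zip(labels, confident)]

  if __name__ == "__main__":
      rng = np.random.default_rng(0)
      X_train = rng.normal(size=(300, 20))
      y_train = rng.integers(0, 30, size=300)          # 30 pretend programmers
      forest = RandomForestClassifier(n_estimators=300, random_state=0)
      forest.fit(X_train, y_train)
      X_test = rng.normal(size=(5, 20))
      print(attribute_or_reject(forest, X_test))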
4.2.3 Two-class Closed World Task
Source code author identification could automatically
deal with source code copyright disputes without requiring manual
analysis by an objective code investigator.
A copyright dispute on code ownership can be resolved
by comparing the styles of both parties claiming to have
generated the code. The style of the disputed code can
be compared to both parties’ other source code to aid in
the investigation. To imitate such a scenario, we took
60 different pairs of programmers, each with 9 solution
files. We used a random forest and 9-fold cross validation
to classify two programmers’ source code. The average
classification accuracy using the CSFS set is 100.00%, and 100.00%
with the information gain features.
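A minimal sketch of this pairwise setup, assuming each party's
solutions have already been turned into numeric feature vectors
(synthetic vectors stand in for the real features here), evaluated
with 9-fold cross validation as in the text:

  # Sketch: two-class closed-world attribution for one pair of programmers.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  def pairwise_accuracy(X_a, X_b, n_folds=9):
      """Cross-validated accuracy of telling programmer A's files from B's."""
      X = np.vstack([X_a, X_b])
      y = np.array([0] * len(X_a) + [1] * len(X_b))
      clf = RandomForestClassifier(n_estimators=300, random_state=0)
      return cross_val_score(clf, X, y, cv=n_folds).mean()

  if __name__ == "__main__":
      rng = np.random.default_rng(1)
      X_a = rng.normal(loc=0.0, size=(9, 40))    # 9 solution files per programmer
      X_b = rng.normal(loc=1.0, size=(9, 40))
      print("pairwise accuracy:", pairwise_accuracy(X_a, X_b))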
4.2.4 Two-class/One-class Open World Task
Another two-class machine learning task can be formulated for
authorship verification. We suspect Mallory of
plagiarizing, so we mix in some code of hers with a large
sample of other people, test, and see if the disputed code
gets classified as hers or someone else’s. If it gets classified as
hers, then it was with high probability really
written by her. If it is classified as someone else's, it was most
likely someone else's code. This could be an open
world problem and the person that originally wrote the
code could be a previously unknown programmer.
This is a two-class problem with classes Mallory and
others. We train on Mallory’s solutions to problems a,
b, c, d, e, f, g, h. We also train on programmer A’s solution to
problem a, programmer B’s solution to problem b,
programmer C’s solution to problem c, programmer D’s
solution to problem d, programmer E’s solution to problem e,
programmer F’s solution to problem f, programmer G’s solution to
problem g, programmer H’s solution
to problem h and put them in one class called ABCDEFGH. We train a
random forest classifier with 300 trees
on classes Mallory and ABCDEFGH. We have 6 test instances from Mallory
and 6 test instances from another
programmer ZZZZZZ, who is not in the training set.
These experiments have been repeated in the exact same setting with 80
different sets of programmers
ABCDEFGH, ZZZZZZ and Mallorys. The average classification accuracy for
Mallory using the CSFS set is
100.00%. ZZZZZZ’s test instances are classified as programmer
ABCDEFGH 82.04% of the time, and classified as Mallory for the rest of
the time while using the
CSFS. Depending on the number of false positives we
are willing to accept, we can change the operating point
on the ROC curve.
These results are also promising for use in cases where
a piece of code is suspected to be plagiarized. Following
the same approach, if the classification result of the piece
of code is someone other than Mallory, that piece of code
was with very high probability not written by Mallory.
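A sketch of how this two-class verification setup could be assembled,
again over synthetic feature vectors; the 80 repetitions and the real
feature extraction from the text are reduced to a single illustrative
run with made-up data:

  # Sketch: train "Mallory" vs. a composite "ABCDEFGH" class (one solution each
  # from eight other programmers), then test files from Mallory and from a
  # programmer who is not in the training set. Features are synthetic.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  rng = np.random.default_rng(2)
  n_features = 40

  mallory_train = rng.normal(loc=0.0, size=(8, n_features))   # problems a..h
  others_train = rng.normal(loc=1.0, size=(8, n_features))    # A..H, one problem each

  X = np.vstack([mallory_train, others_train])
  y = np.array(["Mallory"] * 8 + ["ABCDEFGH"] * 8)
  clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

  mallory_test = rng.normal(loc=0.0, size=(6, n_features))
  unseen_test = rng.normal(loc=1.5, size=(6, n_features))     # programmer ZZZZZZ analogue

  print("Mallory test files ->", clf.predict(mallory_test))
  print("Unseen programmer  ->", clf.predict(unseen_test))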
4.3 Additional Insights

4.3.1 Scaling

We collected a larger dataset of 1,600 programmers from various years.
Each of the programmers had 9 source code samples. We created 7
subsets of this large dataset in differing sizes, with 250, 500, 750,
1,000, 1,250, 1,500, and 1,600 programmers. These subsets are useful
for understanding how well our approach scales. From this large
dataset, we extracted the specific features that had information gain
in the main 250-programmer dataset. In theory, we need to use more
trees in the random forest as the number of classes increases in order
to decrease variance, but we used fewer trees compared to the smaller
experiments. We used 300 trees in the random forest to run the
experiments in a reasonable amount of time with a reasonable amount of
memory. The accuracy did not decrease much when increasing the number
of programmers. This result shows that information gain features are
robust against changes in class and are important properties of
programmers' coding styles. Figure 3 demonstrates how well our method
scales. We are able to de-anonymize 1,600 programmers using 32GB of
memory within one hour. Alternately, we can use 40 trees and get
nearly the same accuracy (within 0.5%) in a few minutes.

Figure 3: Large Scale De-anonymization
4.3.2 Training Data and Features
We selected different sets of 62 programmers that had F
solution files, from 2 up to 14. Each dataset has the solutions to the
same set of F problems by different sets
of programmers. Each dataset consisted of programmers
that were able to solve exactly F problems. Such an experimental setup
makes it possible to investigate the effect of programmer skill set on
coding style. The size of the datasets was limited to 62 because there
were only 62 contestants with 14 files. There were a few contestants
with up to 19 files, but we had to exclude them since there were not
enough such programmers to compare against. The same set of F problems
was used to ensure that
the coding style of the programmer is being classified
and not the properties of possible solutions of the problem itself. We
were able to capture personal programming style since all the
programmers are coding the same
functionality in their own ways.
Stratified F-fold cross validation was used by training
on everyone’s (F − 1) solutions and testing on the F th
problem that did not appear in the training set. As a result, the
problems in the test files were encountered for
the first time by the classifier.
We used a random forest with 300 trees and (logM)+1
features with F-fold stratified cross validation, first with
the Code Stylometry Feature Set (CSFS) and then with
the CSFS’s features that had information gain.
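A sketch of this evaluation loop, using scikit-learn's LeaveOneGroupOut
with the problem index as the cross-validation group so that the test
fold always contains a problem unseen during training. The (log M) + 1
feature sampling is approximated here with max_features="log2", and
the features are synthetic; this illustrates the setup rather than
reproducing the authors' pipeline.

  # Sketch: train on everyone's solutions to F-1 problems, test on the held-out
  # problem, using the problem id as the cross-validation group. Synthetic data.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

  rng = np.random.default_rng(3)
  n_programmers, n_problems, n_features = 62, 9, 40

  # One feature vector per (programmer, problem) pair.
  X = rng.normal(size=(n_programmers * n_problems, n_features))
  y = np.repeat(np.arange(n_programmers), n_problems)       # author labels
  groups = np.tile(np.arange(n_problems), n_programmers)    # which problem each file solves

  clf = RandomForestClassifier(n_estimators=300, max_features="log2", random_state=0)
  scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
  print("accuracy per held-out problem:", np.round(scores, 3))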
Figure 4 shows the accuracy from 13 different sets of
62 programmers with 2 to 14 solution files, and consequently 1 to 13
training files. The CSFS reaches an optimal training set size at 9
solution files, where the classifier trains on 8 (F − 1) solutions.
In the datasets we constructed, as the number of files increases and
problems from more advanced rounds are included, the average lines of
code (LOC) per file also increases. The average lines of code per
source file in the dataset is 70. An increased number of lines of code
might have a positive effect on accuracy, but at the same time it
reveals the programmer's choice of program
length in implementing the same functionality. On the
other hand, the average line of code of the 7 easier (76
LOC) or difficult problems (83 LOC) taken from contestants that were
able to complete 14 problems, is higher
than the average line of code (68) of contestants that
were able to solve only 7 problems. This shows that
programmers with better skills tend to write longer code
to solve Google Code Jam problems. The mainstream
idea is that better programmers write shorter and cleaner
code, which contradicts the lines-of-code statistics in our
datasets. Google Code Jam contestants are supposed to
optimize their code to process large inputs with faster
performance. This implementation strategy might be
leading to advanced programmers implementing longer
solutions for the sake of optimization.
We took the dataset with 62 programmers each with
9 solutions. We get 97.67% accuracy with all the features and 99.28%
accuracy with the information gain features. We excluded all the
syntactic features and the accuracy dropped to 88.89% with all the
non-syntactic features and 88.35% with the information gain features
of
the non-syntactic feature set. We ran another experiment
using only the syntactic features and obtained 96.06%
with all the syntactic features and 96.96% with the information gain
features of the syntactic feature set. Most
of the classification power is preserved with the syntactic features,
and using non-syntactic features leads to a
significant decline in accuracy.
Figure 4: Training Data

4.3.3 Obfuscation

We took a dataset with 9 solution files and 20 programmers and
obfuscated the code using an off-the-shelf C++ obfuscator called
stunnix [3]. The accuracy with the information gain code stylometry
feature set on the obfuscated dataset is 98.89%. The accuracy on the
same dataset when the code is not obfuscated is 100.00%. The
obfuscator refactored function and variable names, as well as
comments, and stripped all the spaces, preserving the functionality of
the code without changing the structure of the program. Obfuscating
the data produced little detectable change in the performance of the
classifier for this sample. The results are summarized in Table 6.

  Obfuscator   Programmers   Lang   Results w/o Obfuscation   Results w/ Obfuscation
  Stunnix      20            C++    98.89%                    100.00%
  Stunnix      20            C++    98.89*%                   98.89*%
  Tigress      20            C      93.65%                    58.33%
  Tigress      20            C      95.91*%                   67.22*%
  *Information gain features

Table 6: Effect of Obfuscation on De-anonymization

We took the maximum number of programmers, 20, that had solutions to 9
problems in C and obfuscated the code (see the example in Appendix B)
using a much more sophisticated open source obfuscator called Tigress
[1]. In particular, Tigress implements function virtualization, an
obfuscation technique that turns functions into interpreters and
converts the original program into corresponding bytecode. After
applying function virtualization, we were less able to effectively
de-anonymize programmers, so it has potential as a countermeasure to
programmer de-anonymization. However, this obfuscation comes at a
cost. First of all, the obfuscated code is neither readable nor
maintainable, and is thus unsuitable for an open source project.
Second, the obfuscation adds significant overhead (9 times slower) to
the runtime of the program, which is another disadvantage.

The accuracy with the information gain feature set on the obfuscated
dataset is reduced to 67.22%. When we limit the feature set to AST
node bigrams, we get 18.89% accuracy, which demonstrates the need for
all feature types in certain scenarios. The accuracy on the same
dataset when the code is not obfuscated is 95.91%.
4.3.4 Relaxed Classification
The goal here is to determine whether it is possible to reduce the
number of suspects using code stylometry. Reducing the set of suspects
in challenging cases, such as
having too many suspects, would reduce the effort required to manually
find the actual programmer of the
code.
In this section, we performed classification on the
main 250 programmer dataset from 2014 using the information gain
features. The classification was relaxed
to a set of top R suspects instead of exact classification
of the programmer. The relaxed factor R varied from 1
to 10. Instead of taking only the class with the highest majority vote
of the decision trees in the random forest, the R highest-voted
classes were taken, and the classification result
was considered correct if the programmer was in the set
of top R highest voted classes. The accuracy does not
improve much after the relaxed factor is larger than 5.
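A sketch of this relaxed, top-R scoring, computed here from
scikit-learn's averaged tree probabilities rather than raw vote
counts, so it approximates the procedure described above on synthetic
data:

  # Sketch: count a prediction as correct if the true author is among the R
  # classes with the highest class probability.
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  def top_r_accuracy(forest, X, y_true, r=5):
      proba = forest.predict_proba(X)
      top_r = np.argsort(proba, axis=1)[:, -r:]        # r most probable classes per sample
      index_of = {c: i for i, c in enumerate(forest.classes_)}
      hits = [index_of.get(label, -1) in row for label, row in zip(y_true, top_r)]
      return float(np.mean(hits))

  if __name__ == "__main__":
      rng = np.random.default_rng(4)
      X = rng.normal(size=(500, 30))
      y = rng.integers(0, 25, size=500)                # 25 pretend authors
      forest = RandomForestClassifier(n_estimators=300, random_state=0)
      forest.fit(X[:400], y[:400])
      for r in (1, 5, 10):
          print("top-%d accuracy:" % r, top_r_accuracy(forest, X[400:], y[400:], r=r))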
Figure 5: Relaxed Classification with 250 Programmers

4.3.5 Generalizing the Method
Features derived from ASTs can represent coding styles
in various languages. These features are particularly useful in cases
where lexical and layout features are less discriminating, for example
when a formatting standard such as Python's PEP8 is enforced or when
whitespace itself serves as syntax. To show that our method
generalizes, we collected source code of 229 Python programmers from
GCJ’s 2014 competition. 229 programmers had exactly 9 solutions.
Using only the Python
equivalents of syntactic features listed in Table 4 and
9-fold cross-validation, the average accuracy is 53.91%
for top-1 classification, 75.69% for top-5 relaxed attribution. The
largest set of programmers to all work on
the same set of 9 problems was 23 programmers. The
average accuracy in identifying these 23 programmers is
87.93% for top-1 and 99.52% for top-5 relaxed attribution. The same
classification tasks using the information
gain features are also listed in Table 7. The overall accuracy in
datasets composed of Python code are lower
than C++ datasets. In Python datasets, we only used
syntactic features from ASTs that were generated by a
parser that was not fuzzy. The lack of quantity and specificity of
features accounts for the decreased accuracy.
The Python dataset's information gain features are significantly fewer
in number than the C++ dataset's information gain features.
Information gain only keeps features that have discriminative value on
their own. If two features only provide discriminative value when used
together, then information gain will discard them. So if many of the
features for the Python set are only jointly discriminative (and not
individually discriminative), then the information gain criterion may
be removing
features that in combination could effectively discriminate between
authors. This might account for the decrease when using information
gain features. While in
the context of other results in this paper the results in Table 7
appear lackluster, it is worth noting that even this
preliminary test using only syntactic features has comparable
performance to other prior work at a similar scale
(see Section 6 and Table 9), demonstrating the utility
of syntactic features and the relative ease of generating
them for novel programming languages. Nevertheless, a
CSFS equivalent feature set can be generated for other
programming languages by implementing the layout and lexical features
as well as by using a fuzzy parser.

  Lang.    Programmers   Classification   IG       Top-5    Top-5 IG
  Python   23            87.93%           79.71%   99.52%   96.62%
  Python   229           53.91%           39.16%   75.69%   55.46%

Table 7: Generalizing to Other Programming Languages
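As a rough illustration of the kind of Python-side syntactic features
referred to above, the sketch below extracts AST node-type unigrams
and parent-child bigrams with Python's built-in ast module. The
paper's actual feature set (and its Joern-based C/C++ extraction) is
richer than this; the sketch only shows how easily such features can
be generated for a new language.

  # Sketch: count AST node-type unigrams and parent-child bigrams for a Python
  # source string. Illustrative only; not the paper's full feature set.
  import ast
  from collections import Counter

  def ast_node_features(source):
      tree = ast.parse(source)
      unigrams, bigrams = Counter(), Counter()
      for node in ast.walk(tree):
          name = type(node).__name__
          unigrams[name] += 1
          for child in ast.iter_child_nodes(node):
              bigrams[(name, type(child).__name__)] += 1
      return unigrams, bigrams

  if __name__ == "__main__":
      sample = "def solve(n):\n    return sum(i * i for i in range(n))\n"
      uni, bi = ast_node_features(sample)
      print(uni.most_common(5))
      print(bi.most_common(5))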
4.3.6 Software Engineering Insights

We wanted to investigate whether programming style is consistent over
the years. We found the contestants that had
the same username and country information both in 2012
and 2014. We assumed that these are the same people but
there is a chance that they might be different people. In
2014, someone else might have picked up the same username from the
same country and started using it. We are
going to ignore such a ground truth problem for now and
assume that they are the same people.
We took a set of 25 programmers from 2012 that were
also contestants in 2014’s competition. We took 8 files
from their submissions in 2012 and trained a random forest classifier
with 300 trees using CSFS. We had one instance from each one of the
contestants from 2014. The
correct classification accuracy on these test instances from 2014 is
96.00%. The accuracy dropped to 92.00% when using
only information gain features, which might be due to the
aggressive elimination of pairs of features that are jointly
discriminative. These 25 programmers’ 9 files from 2014
had a correct classification accuracy of 98.04%. These
results indicate that coding style is preserved to some degree over
the years.
To investigate problem difficulty’s effect on coding
style, we created two datasets from 62 programmers that
had exactly 14 solution files. Table 8 summarizes the
following results. A dataset with 7 of the easier problems out of 14
resulted in 95.62% accuracy. A dataset
with 7 of the more difficult problems out of 14 resulted
in 99.31% accuracy. This might imply that more difficult
coding tasks have a more prevalent reflection of coding
style. On the other hand, the dataset that had 62 programmers with
exactly 7 of the easier problems resulted
in 91.24% accuracy, which is considerably lower than the accuracy
obtained from the dataset whose programmers were able to advance to
solve 14 problems. This might indicate that programmers who are
advanced enough to answer 14 problems likely have
more unique coding styles
compared to contestants that were only able to solve the
first 7 problems.
To investigate the possibility that contestants who are
able to advance further in the rounds have more unique
coding styles, we performed a second round of experiments on
comparable datasets. We took the dataset with
12 solution files and 62 programmers. A dataset with 6
of the easier problems out of 12 resulted in 91.39% accuracy. A
dataset with 6 of the more difficult problems
out of 12 resulted in 94.35% accuracy. These results are
higher than the dataset whose programmers were only
able to solve the easier 6 problems. The dataset that had
62 programmers with exactly 6 of the easier problems
resulted in 90.05% accuracy.
A = #programmers, F = max #problems completed,
N = #problems included in dataset (N ≤ F); A = 62 in all cases

  F        N          CSFS (avg. of 10 iterations)   IG CSFS (avg. of 10 iterations)
  F = 14   N = 7       99.31%                         99.38%
  F = 14   N = 7 (2)   95.62%                         98.62%
  F = 7    N = 7 (1)   91.24%                         96.77%
  F = 12   N = 6       94.35%                         96.69%
  F = 12   N = 6 (2)   91.39%                         95.43%
  F = 6    N = 6 (1)   90.05%                         94.89%

  (1) Drop in accuracy due to programmer skill set.
  (2) Coding style is more distinct in more difficult tasks.

Table 8: Effect of Problem Difficulty on Coding Style

5 Discussion

In this section, we discuss the conclusions we draw from the
experiments outlined in the previous section, their limitations, and
questions raised by our results. In particular, we discuss the
difficulty of the different settings considered, the effects of
obfuscation, and the limitations of our current approach.

Problem Difficulty. The experiment with random problems from random
authors among seven years most closely resembles a real world
scenario. In such an experimental setting, there is a chance that
instead of only identifying authors we are also identifying the
properties of a specific problem's solution, which results in a boost
in accuracy.

In contrast, our main experimental setting, where all authors have
only answered the nine easiest problems, is possibly the hardest
scenario, since we are training on the same set of eight problems that
all the authors have algorithmically solved and trying to identify the
authors from test instances that are all solutions of the 9th problem.
On the upside, these test instances help us precisely capture the
differences in individual coding style that represent the same
functionality. We also see that such a scenario is harder, since the
randomized dataset has higher accuracy.

Classifying authors that have implemented the solution to a set of
difficult problems is easier than identifying authors with a set of
easier problems. This shows that coding style is reflected more
through difficult programming tasks. This might indicate that
programmers come up with unique solutions and preserve their coding
style more when problems get harder. On the other hand, programmers
with a better skill set have a prevalent coding style which can be
identified more easily compared to contestants who were not able to
advance as far in the competition. This might indicate that as
programmers become more advanced, they build a stronger coding style
compared to novices. Another possibility is that better programmers
simply start out with a more unique coding style.
Effects of Obfuscation. A malware author or plagiarizing programmer
might deliberately try to hide his
source code by obfuscation. Our experiments indicate
that our method is resistant to simple off-the-shelf obfuscators such
as stunnix, that make code look cryptic while
preserving functionality. The reason for this success is
that the changes stunnix makes to the code have no effect
on syntactic features, e.g., removal of comments, changing of names,
and stripping of whitespace.
In contrast, sophisticated obfuscation techniques such
as function virtualization hinder de-anonymization to
some degree, however, at the cost of making code
unreadable and introducing a significant performance
penalty. Unfortunately, unreadability of code is not acceptable for
open-source projects, while it is no problem
for attackers interested in covering their tracks. Developing methods
to automatically remove stylometric information from source code
without sacrificing readability
is therefore a promising direction for future research.
Limitations. We have not considered the case where
a source file might be written by a different author than
the stated contestant, which is a ground truth problem
that we cannot control. Moreover, it is often the case that
code fragments are the work of multiple authors. We
plan to extend this work to study such datasets. To shed
light on the feasibility of classifying such code, we are
currently working with a dataset of git commits to open
source projects. Our parser works on code fragments
rather than complete code; consequently, we believe this analysis will
be possible.
Another fundamental problem for machine learning
classifiers is mimicry attacks. For example, our classifier may be
evaded by an adversary by adding extra
dummy code to a file that closely resembles that of another
programmer, albeit without affecting the program’s
behavior. This evasion is possible, but trivial to resolve
when an analyst verifies the decision.
Finally, we cannot be sure whether the original author is actually a
Google Code Jam contestant. In this case, we can detect such instances
with a classify-and-verify approach, as explained in Stolerman et
al.'s work [30].
Each classification could go through a verification step
to eliminate instances where the classifier's confidence is below a
threshold. After the verification step, instances that do not belong
to the set of known authors can be separated from the dataset to be
excluded or for further manual analysis.
6 Related Work
Our work is inspired by the research done on authorship
attribution of unstructured or semi-structured text [5, 22].
In this section, we discuss prior work on source code
authorship attribution. In general, such work (Table 9)
looks at smaller scale problems, does not use structural
features, and achieves lower accuracies than our work.
The highest accuracies in the related work are
achieved by Frantzeskou et al. [12, 14]. They used 1,500
7-grams to reach 97% accuracy with 30 programmers.
They investigated the high-level features that contribute
to source code authorship attribution in Java and Common Lisp. They
determined the importance of each feature by iteratively excluding one
of the features from the
feature set. They showed that comments, layout features
and naming patterns have a strong influence on the author
classification accuracy. They used more training
data (172 lines of code on average) than us (70 lines of
code). We replicated their experiments on a 30 programmer subset of
our C++ data set, with eleven files containing 70 lines of code on
average and no comments. We
reach 76.67% accuracy with 6-grams, and 76.06% accuracy with 7-grams.
When we used a 6 and 7-gram feature set on 250 programmers with 9
files, we got 63.42%
accuracy. With our original feature set, we get 98% accuracy on 250 programmers.
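As a rough sketch of the kind of byte-level n-gram baseline this
comparison refers to, the snippet below builds character 6-gram counts
with scikit-learn and classifies with a simple Naive Bayes model. This
is a simplified stand-in for the SCAP-style profile method of
Frantzeskou et al., not the authors' replication, and the toy "source
files" are made up.

  # Sketch: a simplified byte-level n-gram baseline for author attribution.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  train_sources = [
      "for(int i=0;i<n;i++){sum+=a[i];}",
      "for (int i = 0; i < n; ++i) { total += values[i]; }",
      "while(t--){int x;cin>>x;cout<<solve(x)<<endl;}",
      "while (cases--) { long long x; std::cin >> x; std::cout << solve(x) << std::endl; }",
  ]
  train_authors = ["alice", "bob", "alice", "bob"]

  model = make_pipeline(
      CountVectorizer(analyzer="char", ngram_range=(6, 6), lowercase=False),
      MultinomialNB(),
  )
  model.fit(train_sources, train_authors)
  print(model.predict(["for(int j=0;j<m;j++){sum+=b[j];}"]))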
The largest number of programmers studied in the related work was 46
programmers with 67.2% accuracy.
Ding and Samadzadeh [10] use statistical methods for
authorship attribution in Java. They show that among
lexical, keyword and layout properties, layout metrics
have a more important role than others, which is not the
case in our analysis.
There are also a number of smaller scale, lower accuracy approaches in
the literature [9, 11, 18–21, 28],
shown in Table 9, all of which we significantly outperform. These
approaches use a combination of layout and
lexical features.
The only other work to explore structural features is
by Pellin [23], who used manually parsed abstract syntax
trees with an SVM that has a tree based kernel to classify
functions of two programmers. He obtains an average of
73% accuracy in a two-class classification task. His approach, as
explained in the white paper, could be extended to ours, making it the
closest to our work in the literature. This
work demonstrates that it is non-trivial to
use ASTs effectively. Our work is the first to use structural features
to achieve higher accuracies at larger scales and the first to study
how code obfuscation affects code stylometry.

There has also been some code stylometry work that focused on manual
analysis and case studies. Spafford and Weeber [29] suggest that the
use of lexical features such as variable names, formatting and
comments, as well as some syntactic features such as usage of
keywords, scoping and presence of bugs, could aid in source code
attribution, but they do not present results or a case study
experiment with a formal approach. Gray et al. [15] identify three
categories in code stylometry: the layout of the code, variable and
function naming conventions, and the types of data structures being
used, together with the cyclomatic complexity of the code obtained
from the control flow graph. They do not mention anything about the
syntactic characteristics of code, which could potentially be a great
marker of coding style that reveals the usage of the programming
language's grammar. Their case study is based on a manual analysis of
three worms, rather than a statistical learning approach. Hayes and
Offutt [16] examine coding style in source code via their consistent
programmer hypothesis. They focused on lexical and layout features,
such as the occurrence of semicolons, operators and constants. Their
dataset consisted of 20 programmers and the analysis was not
automated. They concluded that coding style exists through some of
their features and that professional programmers have a stronger
programming style compared to students. In our results in Section
4.3.6, we also show that more advanced programmers have a more
identifying coding style.

There is also a great deal of research on plagiarism detection, which
is carried out by identifying the similarities between different
programs. For example, there is a widely used tool called Moss that
originated from Stanford University for detecting software plagiarism.
Moss [6] is able to analyze the similarities of code written by
different programmers. Rosenblum et al. [27] present a novel program
representation and techniques that automatically detect the stylistic
features of binary code.
  Related Work                 # of Programmers   Results
  Pellin [23]                  2                  73%
  MacDonell et al. [21]        7                  88.00%
  Frantzeskou et al. [14]      8                  100.0%
  Burrows et al. [9]           10                 76.78%
  Elenbogen and Seliya [11]    12                 74.70%
  Kothari et al. [18]          12                 76%
  Lange and Mancoridis [20]    20                 75%
  Krsul and Spafford [19]      29                 73%
  Frantzeskou et al. [14]      30                 96.9%
  Ding and Samadzadeh [10]     46                 67.2%
  This work                    8                  100.00%
  This work                    35                 100.00%
  This work                    250                98.04%
  This work                    1,600              92.83%

Table 9: Comparison to Previous Results
7 Conclusion and Future Work
Source code stylometry has direct applications for privacy, security,
software forensics, plagiarism, copyright infringement disputes, and
authorship verification.
Source code stylometry is an immediate concern for programmers who
want to contribute code anonymously because de-anonymization is quite
possible. We introduce
the first principled use of syntactic features along with
lexical and layout features to investigate style in source
code. We can reach 94% accuracy in classifying 1,600
authors and 98% accuracy in classifying 250 authors
with eight training files per class. This is a significant
increase in accuracy and scale in source code authorship
attribution. In particular, it shows that source code authorship
attribution with the Code Stylometry Feature Set
scales even better than regular stylometric authorship attribution, as
these methods can only identify individuals
in sets of 50 authors with slightly over 90% accuracy [see
4]. Furthermore, this performance is achieved by training
on only 550 lines of code or eight solution files, whereas
classical stylometric analysis requires 5,000 words.
Additionally, our results raise a number of questions
that motivate future research. First, as malicious code
is often only available in binary format, it would be interesting to
investigate whether syntactic features can be
partially preserved in binaries. This may require our feature set to
be improved in order to incorporate information obtained from control
flow graphs.
Second, we would also like to see if classification accuracy can be
further increased. For example, we would
like to explore whether using features that have joint information
gain alongside features that have information
gain by themselves improves performance. Moreover, designing features
that capture larger fragments of the abstract syntax tree could
provide improvements. These
changes (along with adding lexical and layout features)
may provide significant improvements to the Python results and help
generalize the approach further.
Finally, we would like to investigate whether code can
be automatically normalized to remove stylistic information while
preserving functionality and readability.
8 Acknowledgments

This material is based on work supported by the ARO (U.S. Army
Research Office) Grant W911NF-14-10444, the DFG (German Research
Foundation) under the project DEVIL (RI 2469/1-1), and an AWS in
Education Research Grant award. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the authors and do not necessarily reflect those of the ARO, DFG, or
AWS.

References

[1] The Tigress diversifying C virtualizer, http://tigress.cs.arizona.edu.
[2] Google code jam, https://code.google.com/codejam, 2014.
[3] Stunnix, http://www.stunnix.com/prod/cxxo/, November 2014.
[4] A BBASI , A., AND C HEN , H. Writeprints: A stylometric approach
to identity-level identification and similarity detection in
cyberspace. ACM Trans. Inf. Syst. 26, 2 (2008), 1–29.
[5] A FROZ , S., B RENNAN , M., AND G REENSTADT, R. Detecting
hoaxes, frauds, and deception in writing style online. In Security and
Privacy (SP), 2012 IEEE Symposium on (2012), IEEE,
pp. 461–475.
[6] A IKEN , A., ET AL . Moss: A system for detecting software
plagiarism. University of California–Berkeley. See www. cs.
berkeley. edu/aiken/moss. html 9 (2005).
[7] B REIMAN , L. Random forests. Machine Learning 45, 1 (2001),
5–32.
[8] B URROWS , S., AND TAHAGHOGHI , S. M. Source code authorship
attribution using n-grams. In Proc. of the Australasian Document
Computing Symposium (2007).
[9] B URROWS , S., U ITDENBOGERD , A. L., AND T URPIN , A. Application
of information retrieval techniques for source code authorship
attribution. In Database Systems for Advanced Applications (2009),
Springer, pp. 699–713.
[10] D ING , H., AND S AMADZADEH , M. H. Extraction of java program
fingerprints for software authorship identification. Journal
of Systems and Software 72, 1 (2004), 49–57.
[11] E LENBOGEN , B. S., AND S ELIYA , N. Detecting outsourced student
programming assignments. Journal of Computing Sciences
in Colleges 23, 3 (2008), 50–57.
[12] F RANTZESKOU , G., M AC D ONELL , S., S TAMATATOS , E., AND
G RITZALIS , S. Examining the significance of high-level programming
features in source code author classification. Journal
of Systems and Software 81, 3 (2008), 447–460.
[13] F RANTZESKOU , G., S TAMATATOS , E., G RITZALIS , S.,
C HASKI , C. E., AND H OWALD , B. S. Identifying authorship
by byte-level n-grams: The source code author profile (scap)
method. International Journal of Digital Evidence 6, 1 (2007),
1–18.
[14] F RANTZESKOU , G., S TAMATATOS , E., G RITZALIS , S., AND
K ATSIKAS , S. Effective identification of source code authors
using byte-level information. In Proceedings of the 28th International
Conference on Software Engineering (2006), ACM,
pp. 893–896.
[15] G RAY, A., S ALLIS , P., AND M AC D ONELL , S. Software
forensics: Extending authorship analysis techniques to computer
programs.
[16] H AYES , J. H., AND O FFUTT, J. Recognizing authors: an
examination of the consistent programmer hypothesis. Software Testing,
Verification and Reliability 20, 4 (2010), 329–356.
[17] I NOCENCIO , R. U.s. programmer outsources own job to china,
surfs cat videos, January 2013.
[18] KOTHARI , J., S HEVERTALOV, M., S TEHLE , E., AND M AN CORIDIS ,
S. A probabilistic approach to source code authorship
identification. In Information Technology, 2007. ITNG’07. Fourth
International Conference on (2007), IEEE, pp. 243–248.
[19] K RSUL , I., AND S PAFFORD , E. H. Authorship analysis:
Identifying the author of a program. Computers & Security 16, 3
(1997), 233–257.
[20] L ANGE , R. C., AND M ANCORIDIS , S. Using code metric histograms
and genetic algorithms to perform author identification
for software forensics. In Proceedings of the 9th Annual Conference on
Genetic and Evolutionary Computation (2007), ACM,
pp. 2082–2089.
[21] M AC D ONELL , S. G., G RAY, A. R., M AC L ENNAN , G., AND
S ALLIS , P. J. Software forensics for discriminating between
program authors using case-based reasoning, feedforward neural
networks and multiple discriminant analysis. In Neural Information
Processing, 1999. Proceedings. ICONIP’99. 6th International
Conference on (1999), vol. 1, IEEE, pp. 66–71.
[22] NARAYANAN , A., PASKOV, H., G ONG , N. Z., B ETHENCOURT,
J., S TEFANOV, E., S HIN , E. C. R., AND S ONG , D. On the
feasibility of internet-scale author identification. In Security and
Privacy (SP), 2012 IEEE Symposium on (2012), IEEE, pp. 300–
314.
[23] P ELLIN , B. N. Using classification techniques to determine
source code authorship. White Paper: Department of Computer
Science, University of Wisconsin (2000).
[24] P IKE , R. The sherlock plagiarism detector, 2011.
[25] P RECHELT, L., M ALPOHL , G., AND P HILIPPSEN , M. Finding
plagiarisms among a set of programs with jplag. J. UCS 8, 11
(2002), 1016.
[26] Q UINLAN , J. Induction of decision trees. Machine learning 1, 1
(1986), 81–106.
[27] ROSENBLUM , N., Z HU , X., AND M ILLER , B. Who wrote this
code? identifying the authors of program binaries. Computer
Security–ESORICS 2011 (2011), 172–189.
[31] W IKIPEDIA. Saeed Malekpour, 2014. [Online; accessed 04November-2014].
[32] YAMAGUCHI , F., G OLDE , N., A RP, D., AND R IECK , K. Modeling
and discovering vulnerabilities with code property graphs. In
Proc of IEEE Symposium on Security and Privacy (S&P) (2014).
[33] YAMAGUCHI , F., W RESSNEGGER , C., G ASCON , H., AND
R IECK , K. Chucky: Exposing missing checks in source code
for vulnerability discovery. In Proceedings of the 2013 ACM
SIGSAC Conference on Computer & Communications Security
(2013), ACM, pp. 499–510.
[28] S HEVERTALOV, M., KOTHARI , J., S TEHLE , E., AND M AN CORIDIS ,
S. On the use of discretized source code metrics for author
identification. In Search Based Software Engineering, 2009
1st International Symposium on (2009), IEEE, pp. 69–78.
[29] S PAFFORD , E. H., AND W EEBER , S. A. Software forensics:
Can we track code to its authors? Computers & Security 12, 6
(1993), 585–595.
[30] S TOLERMAN , A., OVERDORF, R., A FROZ , S., AND G REEN STADT, R.
Classify, but verify: Breaking the closed-world assumption in
stylometric authorship attribution. In IFIP Working
Group 11.9 on Digital Forensics (2014), IFIP.
A Appendix: Keywords and Node Types

Table 10 lists the AST node types generated by Joern that were
incorporated into the feature set. Table 11 shows the C++ keywords
used in the feature set.

Table 10: Abstract syntax tree node types
AdditiveExpression, AndExpression, Argument, ArgumentList,
ArrayIndexing, AssignmentExpr, BitAndExpression, BlockStarter,
BreakStatement, Callee, CallExpression, CastExpression, CastTarget,
CompoundStatement, Condition, ConditionalExpression, ContinueStatement,
DoStatement, ElseStatement, EqualityExpression, ExclusiveOrExpression,
Expression, ExpressionStatement, ForInit, ForStatement, FunctionDef,
GotoStatement, Identifier, IdentifierDecl, IdentifierDeclStatement,
IdentifierDeclType, IfStatement, IncDec, IncDecOp,
InclusiveOrExpression, InitializerList, Label, MemberAccess,
MultiplicativeExpression, OrExpression, Parameter, ParameterList,
ParameterType, PrimaryExpression, PtrMemberAccess,
RelationalExpression, ReturnStatement, ReturnType, ShiftExpression,
Sizeof, SizeofExpr, SizeofOperand, Statement, SwitchStatement,
UnaryExpression, UnaryOp, UnaryOperator, WhileStatement

Table 11: C++ keywords
alignas, alignof, and, and_eq, asm, auto, bitand, bitor, bool, break,
case, catch, char, char16_t, char32_t, class, compl, const, constexpr,
const_cast, continue, decltype, default, delete, do, double,
dynamic_cast, else, enum, explicit, export, extern, false, float, for,
friend, goto, if, inline, int, long, mutable, namespace, new, noexcept,
not, not_eq, nullptr, operator, or, or_eq, private, protected, public,
register, reinterpret_cast, return, short, signed, sizeof, static,
static_assert, static_cast, struct, switch, template, this,
thread_local, throw, true, try, typedef, typeid, typename, union,
unsigned, using, virtual, void, volatile, wchar_t, while, xor, xor_eq
B Appendix: Original vs Obfuscated Code
Figure 6: A code sample X
Figure 6 shows a source code sample X from our
dataset that is 21 lines long. After obfuscation with Tigress, sample
X became 537 lines long. Figure 7 shows
the first 13 lines of the obfuscated sample X.
Figure 7: Code sample X after obfuscation
Above my 'paygrade' but someone here might be interested
Runs in WINE.
"Flare-dbg Tool: To Aid Malware Reverse Engineers in Developing Debugger
Scripts"
http://blog.hackersonlineclub.com/2015/12/flare-dbg-to-aid-malware-reverse.…
--
RR
"You might want to ask an expert about that - I just fiddled around
with mine until it worked..."
THE BOY WHO COULD CHANGE THE WORLD: THE WRITINGS OF AARON SWARTZ by Aaron Swartz, The New Press
by coderman 31 Dec '15
https://newrepublic.com/article/126674/reading-everything-aaron-swartz-wrote
'''
... In a way, Aaron is a cautionary tale for unschooling. One of the
lessons that school teaches is that the people who make the rules
don’t really have to follow them. It’s something even the most
rebellious students learn one way or another, but Aaron looked up a
different set of rules and hacked his way out of school instead. On
one hand Aaron was happy with his choice and felt more engaged and
happier with online peers, on the other he absorbed a dangerous lesson
about navigating bureaucratic systems. Plenty of legal scholars and
technology experts thought Aaron had kept on the right side of the
letter of the law, but the criminal justice system is resistant to the
kind of hacking he tended to practice. I don’t know if he considered
fleeing the country, but I doubt it. Maybe if he had lived to see
Edward Snowden make dodging extradition look good, things would have
been different.
I was surprised when I saw the security footage of Aaron entering the
MIT building, his bike helmet held half-heartedly in front of his
face, his telltale hair poking out the sides. I had read the
Manifesto, but I didn’t think it really reflected Aaron’s intentions.
I was worried about what could happen to him, but not that worried. I
figured he had enough institutional support to keep his punishment to
a slap on the wrist. Mostly I was angry that he hadn’t taken what he
was doing seriously enough; with a team and a little bit of planning,
there’s no reason the authorities should have been able to tie Aaron
to the action. But covert ops wasn’t one of his strengths, and he
never got the chance to learn.
If I’m part of the we that counted on Aaron, then I’m also part of the
we that failed him. I thought his connections and credibility and
reputation would keep him safe, and maybe he did too. Maybe we
convinced him that a boy like him could change the world, or at least
always hack an escape route. But there’s no individual who can’t be
picked off if they cross the wrong line, or just the wrong prosecutor.
'''
On Tue, Dec 29, 2015 at 1:34 PM, Ray Dillinger <bear(a)sonic.net> wrote:
>
> On 12/28/2015 02:02 PM, Henry Baker wrote:
>> At 11:45 AM 12/28/2015, Ray Dillinger wrote:
>>> Maybe I've always been a suspicious bastard where software is concerned
>>
>> Apparently, not suspicious enough.
>>
>> It's nine o'clock; do you know what all of your processes (i.e., ps ax or equivalent) are doing?
>
> Oh hell no, there's hundreds of running processes now. I'm still a
> suspicious bastard, but these days I feel utterly helpless to do
> anything about it. There is no practical alternative to complicated
> OSes that start processes without permission and refuse to explain
> what they are and why they're running. I look at my process table,
> I see $NAME I don't recognize, I ask "man $NAME" and there is no
> such documentation. One more mysterious thing that will crash my
> system if I kill it but I have to look elsewhere to explain what it
> is and why it's running.
>
>> If you're running Win10, a goodly amount of disk & net traffic has to do with surveilling you
>
> This is one of the reasons I don't. The other is that there is even
> less explanation of what the hell is running, and you can't even get
> the code to check what the hell it does.
The minimal unix userland process count up to a basic window
manager is about ten, with all remote network bindings for same closed.
Learning unix is essential for those who want to do that.
BSD's usually have fewer layers of crap installed by default and
are generally more discrete, thus easier to minimize than Linux's.
As in this thread, minimization is only part of security.
https://archive.org/details/internet-mapping
Includes datasets on Internet and DNS censuses, vulnerability scans, sets
focused on specific weaknesses, sets focused on specific events (e.g.
Hurricane Sandy), ecosystems for encryption protocols, and modeling of
the strengths and weaknesses of the internet. Potentially useful for
large scale data analysis and things I'm not clever or tech-savvy
enough to figure out.
More should be added next year, but for now I'm focused on finishing
uploading and organizing the CIA files
<http://that1archive.neocities.org/subfolder1/early-cia.html>.
Enjoy or ignore.
--Mike