Realities of Data Mining / Brokering You and Beyond

grarpamp grarpamp at gmail.com
Thu Mar 30 21:22:27 PDT 2017


https://www.reddit.com/r/news/comments/62fjdr/cards_against_humanity_creator_wants_to_buy/dfmmkeq/
https://www.reddit.com/user/delftblauw
http://www.stopdatamining.me/opt-out-list/

I wanted to counter some notions that this is not personally
identifiable information. On its own, the data from ISPs is not. It is
true that you won't be able to call the ISP and ask for the search
history of /u/delftblauw (I am boring anyway). What matters is how
that little slice of data the ISPs provide becomes part of the whole
pie that is your digital profile. I worked in the industry that
aggregates and models your data and wanted to give some insight into
how this all works for those willing to read this wall of text.

I started in IT fresh out of college for a small marketing company in
the Midwest that aggregated personal information and demographics to
sell lists for companies to market to. We worked with a larger company
that provided us a "master list" of persons in the US. This master
list was kind of like a digital phone book, with a lot of aggregate
demographics attached to the record. All sorts of information was
collected in aggregate from private and public sources this larger
company gathered. Attributes such as political affiliation are
garnered from lists bought from the political parties and candidates
you donated to. Personal interests come from the magazines you
subscribe to, and even the charities and causes you donate to. Of
course your age, estimated income, household size, education, etc. are
all folded in.

The aggregation here works off of semi-anonymous identifiers and keys
that are incredibly interconnected digitally such as your home
address, email address, phone numbers, and even usernames for niche
topics. The aggregation is pulled together from disparate sources. For
example, let's say Company A gave us your name and email address along
with the fact that you are "likely" a Democrat since you contributed
$10 to a Bernie Sanders PAC. Company B is "more secure", so they only
provide your email address and the fact that you said you make over
$100k on their survey. Company C gives us your name and home address
along with the fact that you subscribe to National Geographic. We can
query census data to find the average income in your census block
based on your address. Also, after linking the public record provided
by your county government, I see your home address has a lien on it
from the HOA dues you didn't pay.
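
To make the Company A/B/C example concrete, here is a minimal sketch of
that kind of linkage. It is not the tooling we actually used. The field
names, the sample person, and the greedy "merge on any shared key"
rule are all assumptions made up for illustration.

```python
# Hypothetical feeds. Every field name and value here is invented.
company_a = {"name": "Pat Doe", "email": "pat@example.com",
             "party_guess": "Democrat"}
company_b = {"email": "pat@example.com", "income_band": "100k+"}
company_c = {"name": "Pat Doe", "address": "12 Elm St",
             "subscriptions": ["National Geographic"]}
county_liens = {"address": "12 Elm St", "lien": "unpaid HOA dues"}

# Semi-anonymous identifiers used to join records (an assumed set).
LINK_KEYS = ("name", "email", "address")


def link_records(records):
    """Greedily merge records that share a value for any linking key."""
    profiles = []
    for rec in records:
        # Find existing profiles that agree with this record on some key.
        matches = [p for p in profiles
                   if any(k in rec and p.get(k) == rec[k] for k in LINK_KEYS)]
        merged = dict(rec)
        for p in matches:
            profiles.remove(p)
            for key, value in p.items():
                # Keep the newest value, fill gaps from the older profile.
                merged.setdefault(key, value)
        profiles.append(merged)
    return profiles


profiles = link_records([company_a, company_b, company_c, county_liens])
```

No single feed carries the whole picture, but after linking there is
one profile holding the name, email, address, party guess, income band,
subscription, and lien. A real system would need fuzzy matching and
confidence scores rather than exact string equality.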

I say it's semi-anonymous because on their own these elements don't
identify you, and they are subject to change. However, these attributes
have varying degrees of permanency. You are far more likely to change
your home address than your email address, and you are more likely to
have multiple email addresses than you are phone numbers. There is a
reason LinkedIn and Facebook Messenger will not shut up about asking
for your phone number. Now think of things like Reddit. How many
accounts do you have? One, for sure, right? Maybe another for a
throwaway? Your main account has all that sweet karma though, no way
you're going to abandon that. If Reddit sells your username together
with the email address you registered (no idea if they do this, this
is purely a hypothetical everyone reading here can relate to), all of
a sudden the "master list" now has your Reddit username. Tack on the
fact that the subreddits you are subscribed to indicate likely
interests. For instance, if you are subscribed to a very loud subreddit
dedicated to the current US president along with other conservative
subreddits, you are likely a Republican. If you are subscribed to
those along with Democrat-leaning subreddits, you are likely highly
interested in politics in general.
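
The subreddit inference above can be sketched as a few set
intersections. This is only a toy rule, not a real model: the
subreddit names are placeholders, and the "likely Democrat" branch is
an assumption added by symmetry, not something stated above.

```python
# Placeholder subreddit names, invented for illustration only.
CONSERVATIVE_SUBS = {"conservative_sub_1", "conservative_sub_2"}
DEMOCRAT_LEANING_SUBS = {"democrat_sub_1", "democrat_sub_2"}


def infer_political_interest(subscriptions):
    """Return a rough label from the set of subscribed subreddits."""
    subs = set(subscriptions)
    has_conservative = bool(subs & CONSERVATIVE_SUBS)
    has_democrat = bool(subs & DEMOCRAT_LEANING_SUBS)
    if has_conservative and has_democrat:
        # Subscribed to both sides: interested in politics in general.
        return "interested in politics generally"
    if has_conservative:
        return "likely Republican"
    if has_democrat:
        # Assumed by symmetry with the conservative case.
        return "likely Democrat"
    return "unknown"


label_one_side = infer_political_interest(["conservative_sub_1", "cats"])
label_both = infer_political_interest(["conservative_sub_2", "democrat_sub_1"])
label_none = infer_political_interest(["cats", "woodworking"])
```

Once the username is tied to an email address on the master list, a
label like this becomes one more attribute on your record.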

My job was to build everything from queries to ANN models to create a
profile for you. Given the basic example above, three companies and
public data just let me identify or at least make an educated guess
linking your name, email address, phone number, address, hobbies,
income, and public delinquencies. This is basic, and there is a lot
more that goes into this, but it should give a rough idea here. I did
it. Facebook does it. Reddit does it. And now your ISPs are doing it
and will continue to do so.

Now, the way we monetize this is from the aggregate list. The more
data we have on you, and the higher confidence we have in that data,
the more valuable you are to us. One of our partners was a large
website used for researching new car purchases. They would give us
their data to build into our models and sell on their behalf. When a
car manufacturer would call us and ask for a list of households with
5+ people, income of 75k+, a car first registered 4+ years ago, and
recent research into competitor models, we could look at our models
and pull back a list of names, email addresses, phone numbers, etc.
they could market to. This is how you get the email that says, "Instead of
the CX9, take a look at the all new 2018 Subaru Ascent!" just a few
days after you were casually looking at a car. For the digital
companies like Facebook, Google, and Reddit, this data is largely
self-serving to be able to target ads to you as you use the site.
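
The car manufacturer request above boils down to a filter over the
aggregate list. A minimal sketch, assuming a made-up Household record.
The field names, the sample rows, and the build_segment helper are all
hypothetical and not the schema we actually had.

```python
from dataclasses import dataclass


@dataclass
class Household:
    # Hypothetical fields, invented for this example.
    email: str
    size: int
    income: int
    car_age_years: int
    researched_models: tuple


def build_segment(households, competitor_models):
    """Emails of households matching the manufacturer's criteria."""
    wanted = set(competitor_models)
    return [
        h.email
        for h in households
        if h.size >= 5                        # 5+ people
        and h.income >= 75_000                # income 75k+
        and h.car_age_years >= 4              # car registered 4+ years ago
        and wanted & set(h.researched_models) # looked at a competitor model
    ]


households = [
    Household("a@example.com", 5, 90_000, 6, ("CX9",)),
    Household("b@example.com", 2, 120_000, 7, ("CX9",)),
    Household("c@example.com", 6, 80_000, 1, ("CX9",)),
    Household("d@example.com", 5, 76_000, 4, ("Some Other SUV",)),
]
segment = build_segment(households, ["CX9"])
```

Only the first household passes all four tests, so it is the one that
gets the marketing email.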

Now that said, this is all absolutely personally identifiable. I
routinely queried myself as well as friends and family to pull back
their "full profile". Everyone I showed had the same reaction that
goes from utter surprise to a mix of embarrassment and vulnerability.
I always tried to reassure them that there was nothing "crazy" we
could find out, but if I wanted to, I could have checked the debt and
credit history, political party, general interests, and gender (no
surprises!) of all the dates I had in my 20s. Just joking, I was a
data engineer, I didn't have any dates!

After I left there, I made sure to opt out of the master list we
received from the larger company. There are so many damn companies
doing this now it's exhausting. Once your data is out there, it's out
there forever. If my old company was the only company that had, say,
the ability to know if I contributed to the Endangered Toucan Fund,
and a company came along asking for a list of people who support
endangered wildlife, they could purchase my email address and now all
of a sudden my old company just became Company N from the example
listed above. I have learned to just accept this as a fact of the
digital age.

Edit: For those looking to opt out of the large private data
aggregation and mining that is going on all over, the
stopdatamining.me opt-out list linked at the top is a good place to
start. It's whack-a-mole though, guys and gals. For every one you opt
out of, a new startup is niche mining something else. Try to make
peace or push public policy.

