Your Data is Mine: What the Rise of Big Data Means for Digital Rights

Data is an increasingly important part of modern life. What was once a simple byproduct of our online interactions has become the lifeblood of business, and it is more important than ever that data is managed, safeguarded, and handled responsibly.

And at this very moment, odds are your personal data is helping crunch analytics reports for a company you’ve never even heard of.

A Data-Driven Society

You own your laptop, just as you own the car you drive and the lunch you bought with your cash. A random stranger can't walk up to you and glue advertisements to your laptop.

So why is it that companies like AT&T and Google can sell and profit from your personal data? Your search history and geographic location? The answer is simple:

Just because your data is “yours” doesn’t mean you own it.

In fact, the legal concept of data ownership is quite fluid. We may have privacy laws that prevent invasive spying and copyright laws that protect intellectual property, but the entire notion of personal data as a valued commodity is just beginning to take hold.

Each day we leak a little of our own data. We search online. We share our lives on Facebook. We shop for clothing and games. We get points with our rewards card. We call our family and friends. It’s incredibly difficult, in fact, to avoid leaking any data at all. And when one of the world’s biggest data holders is also the owner of the world’s most popular internet browser, how much protection can we expect? [1]


The name of the game is “online tracking,” a constant battle between privacy advocates and marketing magnates that is tilting further and further out of consumer control. Think of how Cartoon Network runs sugary cereal ads to target kids, and now imagine how a company like Google might target ads to you knowing exactly what you’ve been looking for online.

In this Information Age, it is naive to suggest that any single drop of personal data might pass by uncollected. In 2014 it was estimated that 2.5 quintillion bytes of data were created each day, with 90% of the world’s data created in the previous two years alone. [2] The amount of data in existence continues to grow with no signs of slowing down.

The profitability of data solutions has risen remarkably as the costs of data acquisition, storage, and processing have declined. Amazon and other technology giants have invested massive amounts of capital in developing powerful cloud-computing networks that companies can “rent” to build and run their own data solutions. The rise of cloud computing was itself a reaction to the data-driven successes of the early internet.

Having a greater amount of data allows companies to create “qualitative solutions,” which is essentially a code-word for targeted advertising. Google’s AdWords program accounted for an estimated 90% of its $75 billion revenue in 2015. [3] Clicking a product on Amazon displays sections for “frequently bought together”, “customers who bought this item also bought”, and “sponsored products related to this item,” using data from previous purchases to drive future ones. Netflix and Spotify run complex algorithms to pair users with content they are expected to enjoy.

Is it within the rights of companies to harness the data of users who visit their websites and access their content? A recent survey found 68% of consumers would opt out of tracking if they had an easily available option. [4] Additionally, only 14% of consumers believe companies are honest about how they use personal data. There is a significant disparity between the expectations of the general public and the reality that goes on behind the scenes.

The War on Personal Information

Traditionally, the primary tracking methods have been IP address tracing and browser “cookies”. By tracing a user’s IP address, websites can determine the user’s approximate geographic location. This is a relatively basic form of tracking, and because many users can share the same IP address, it is generally unreliable for identifying individuals. Cookies, unlike IP addresses, store specific information on an individual user’s computer. Cookies are helpful in many ways, such as keeping you logged in to your accounts. However, they are also used extensively to identify individual users for tracking purposes.
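As a sketch of how cookie-based tracking works in practice, the snippet below uses only Python’s standard library; the cookie name `visitor_id` is an invented example. The site issues a random identifier on a visitor’s first request and reads it back on every later one:

```python
import http.cookies
import uuid

def issue_tracking_cookie() -> str:
    """Build a Set-Cookie header assigning a fresh visitor ID.

    Real trackers persist this ID server-side and read it back on
    every later request to link page views to the same browser.
    """
    cookie = http.cookies.SimpleCookie()
    cookie["visitor_id"] = uuid.uuid4().hex
    cookie["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # persist for a year
    cookie["visitor_id"]["path"] = "/"
    return cookie.output(header="Set-Cookie:")

def read_visitor_id(cookie_header: str) -> str:
    """Parse the Cookie header a returning browser sends back."""
    cookie = http.cookies.SimpleCookie()
    cookie.load(cookie_header)
    return cookie["visitor_id"].value

# e.g. Set-Cookie: visitor_id=<32 hex chars>; Max-Age=31536000; Path=/
print(issue_tracking_cookie())
```

Note the year-long `max-age`: the identifier survives browser restarts, which is exactly what makes it useful for long-term tracking.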

Recently, an even more invasive method has begun to appear – browser fingerprinting. Browser fingerprinting tracks web browsers by discreetly collecting information about their configuration and settings, making the tracking difficult to detect and even more difficult to avoid. If your browser configuration is unique, online trackers can build entire files of personal information about you, including but not limited to the websites you frequently visit. This process allows trackers to profile users and infer demographics, preferences, and other personal information.
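A minimal illustration of the idea in Python: the tracker stores nothing on the user’s machine and simply hashes whatever configuration details the browser exposes. The attribute names and values here are invented examples of the kind of data a real fingerprinting script collects.

```python
import hashlib
import json

def browser_fingerprint(attributes: dict) -> str:
    """Hash a browser's observable configuration into a stable ID.

    No cookie is stored: as long as the configuration stays the same,
    the same hash comes back on every visit, so clearing cookies
    does nothing to break the link.
    """
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Invented examples of attributes a tracking script can read silently.
visitor = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/52.0",
    "screen": "1920x1080x24",
    "timezone_offset": -300,
    "language": "en-US",
    "installed_fonts": ["Arial", "Comic Sans MS", "Ubuntu"],
}

print(browser_fingerprint(visitor))
```

The more attributes collected, the more likely the combination is unique, which is why real fingerprinting scripts probe everything from canvas rendering to audio processing quirks.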

Naturally, tracking data must be treated extremely carefully. High-profile data breaches are constantly in the news, and an even greater number of breaches go undetected or are covered up for fear of the economic impact. [5] Concerningly, if fingerprint data were leaked, it would compromise the online activity of thousands of users.

While it is hard to defend against every attack, companies have an ethical duty to do whatever is in their power to prevent a breach. However, many kinds of breaches do not directly affect a company’s revenue, giving it little incentive to build stronger defenses. [6] The European Union has unveiled a program to penalize companies that cover up data breaches with fines of up to 20 million euros, but the US is still lagging behind. [7]

Even if a company doesn’t collect your name during tracking, it doesn’t need it to identify you. With only a ZIP code, gender, and date of birth, a company could uniquely identify almost 90% of all Americans. [8]
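The underlying problem is that a handful of innocuous attributes, taken together, form a “quasi-identifier.” This toy Python sketch, with invented records, counts how many people share each ZIP/gender/birth-date combination; a count of 1 means the combination singles out one individual even though the dataset contains no names:

```python
from collections import Counter

def quasi_identifier_counts(records):
    """Count how many records share each (zip, gender, dob) combination.

    A count of 1 means that combination pins down a single person,
    even though no name appears anywhere in the data.
    """
    return Counter((r["zip"], r["gender"], r["dob"]) for r in records)

# Invented "anonymized" dataset: no names, yet most rows are unique.
records = [
    {"zip": "02139", "gender": "F", "dob": "1955-07-12", "diagnosis": "flu"},
    {"zip": "02139", "gender": "M", "dob": "1988-03-04", "diagnosis": "asthma"},
    {"zip": "02139", "gender": "M", "dob": "1988-03-04", "diagnosis": "flu"},
    {"zip": "60614", "gender": "F", "dob": "1972-11-30", "diagnosis": "anemia"},
]

counts = quasi_identifier_counts(records)
unique = [combo for combo, n in counts.items() if n == 1]
print(f"{len(unique)} of {len(counts)} combinations identify exactly one person")
```

Anyone holding a second dataset that maps those same three fields to names (a voter roll, for instance) can then attach a name, and a diagnosis, to each unique combination.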


Companies have increased their collection and usage of data to stay competitive in the market, but at the same time, many have neglected to protect that data properly. These aren’t small, novice software companies. They’re large A-list companies that we talk about in everyday conversation. Target, Yahoo, and eBay have all suffered massive data breaches within the past five years, leaking information from over 700 million accounts combined. That’s twice the population of the United States. [9]

A Special Offer, Just for You

Companies like Google and Facebook have been tracking personal data for years. The monstrous amount of data they collect, from phone location to social engagement, is used to help tailor user experience and improve their products. There’s a simple experiment you can conduct to witness this effect for yourself. Google the exact same topic with a friend, and the two of you may see drastically different results depending on your search history, browser, location and time of day.

Facebook uses their extensive data about your social groups to properly distribute content via a so-called “organic reach” algorithm to ensure that users are not overloaded. YouTube tracks how long you watch a particular video to suggest you “watch next” videos and to provide producers insight about their videos and audiences. Overall, these techniques provide users access to content they are more likely to enjoy (and be more likely to spend money on).

Although Google and Facebook profit from your data, they do not explicitly sell it. The data is frankly more valuable if they keep it to themselves. Instead, they sell access to internal systems such as AdWords that target advertisements to the users most likely to respond. Third-party advertisers can only set parameters and select target groups to fine-tune how Google or Facebook distributes their ads.


However, there is a “creepiness factor” associated with ads that are considered “too personal.” Researchers at Ithaca College found that while tailoring ads to individuals generally led to an increase in purchases, overly personal ads actually reduced the likelihood that a consumer would follow through on a purchase. [10] The disturbing aspect of these ads is that they were based on information collected without knowledge or consent. Even more disturbing, Facebook is very likely already aware of this and tailors its algorithms specifically to avoid the creepiness factor. In general, this is a widely accepted model for online advertisement. But it would be naive to assume this is the furthest extent of data tracking.

In 2014, Verizon’s Select and Relevant Mobile Ad programs fell under scrutiny for their use of “supercookies” on devices that accessed affiliated sites. By covertly injecting a unique identifier into every HTTP request, which is sent each time we access a website, Verizon could easily trace users’ online behavior, bypassing the common defenses consumers have against tracking. It is practically impossible to disassociate one’s identity from these supercookies. Worst of all, any third-party website can also track users with the exposed identifier. The original goal was to provide a reliable way to profile online behavior across devices, a “holy grail” in the advertising industry, but what emerged was a consumer nightmare riddled with major security and privacy flaws.
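Conceptually, the injection happens at the carrier’s gateway, entirely outside the user’s control. The Python sketch below is a hypothetical illustration of that step; `X-UIDH` was the header name Verizon actually used, while the subscriber ID and helper function are invented:

```python
def inject_supercookie(request_headers: dict, subscriber_id: str) -> dict:
    """Sketch of carrier-side header injection.

    The carrier's gateway sits between the phone and the web, so it can
    append an identifying header to every unencrypted HTTP request.
    The browser never sees this header, so clearing cookies cannot
    remove it, and any site the user visits can read it.
    """
    tagged = dict(request_headers)       # copy the phone's original headers
    tagged["X-UIDH"] = subscriber_id     # the header name Verizon used
    return tagged

# What the phone sent vs. what the website actually received.
outgoing = {"Host": "example.com", "User-Agent": "Mobile Safari"}
print(inject_supercookie(outgoing, "subscriber-12345"))
```

Because the tag is added in transit rather than in the browser, no browser setting, extension, or private-browsing mode could strip it; only HTTPS traffic, which the gateway cannot rewrite, escaped tagging.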

Verizon has since allowed customers to opt out of these programs, and a $1.35 million settlement with the FCC made it mandatory to obtain user consent before sharing tracking data. Nevertheless, it is a grim illustration of how far service providers are willing to go for a competitive edge in advertising revenue and data collection. Tracking consumers through immutable fingerprints seems to be the next paradigm shift in digital marketing, if it isn’t here already.

The Dark Market of Data Brokerage

Having the right data at the right time can determine whether a business sinks or swims, but many businesses lack the ability to collect or analyze data themselves. This shortage of supply makes data a lucrative market for mobile carriers and service providers willing to sell it.

“The value of mobile subscribers is flattening out, and wireless operators are all interested in new ways to generate revenue,” says Cy Smith, chief executive officer of AirSage.

To sidestep the privacy risks of explicitly selling personal data, providers usually sell data in anonymized form, stripping details such as names, complete addresses, and credit card numbers. Aggregated market data is a powerful tool for investors and business owners, but the real power lies in identifiable data.

In fact, healthcare companies are now using your transaction history to compute risk scores. From gym membership payments and late-night GrubHub orders to purchases at home-improvement stores, insurers can predict the likelihood you will get sick or become depressed before you notice a single symptom.

“I think I could better predict someone’s risk of a heart attack based upon their Visa bill than their genome,” said Dr. Harry Greenspun, a director at Deloitte who leads a team that mines data for health insurers.

The emerging sector of data brokerage is slowly but surely taking over this untapped market. By combining public government records, information sold by retailers, anonymized data, and tidbits from social media, data brokers analyze and package data for marketers. Custom solutions may assign individuals “scores” based on, say, their “vulnerability” to gun advertisements, their likelihood of sending their children to private school, or even their life expectancy. If you would like to know just how much your intimate and private information is worth in the vast new world of data brokerage, feel free to give it a try. It just might be worth a fortune (or, more likely, a penny).

But the question remains: in a consumer ecosystem where every corporation claims to have some form of privacy protections, how is it possible that there even exists a comprehensive profile of your identity and habits? As it turns out, it is not too difficult to recover personal identities from leaked datasets.

In 2006, Netflix released a dataset of user viewing habits for a public competition, asking data engineers to help improve its recommendation system. The dataset was supposed to be anonymized, but two researchers from the University of Texas were able to link real identities to the dataset using statistical inference and public data from IMDb. When AOL leaked the “anonymous” web searches of 650,000 users in 2006, New York Times reporters identified one user as Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia, who had searched for topics from “numb fingers” to “60 single men”.
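The technique behind both incidents is a linkage attack: matching records in the “anonymous” dataset against a public dataset that shares some attributes. A toy Python version of the idea, with invented names, titles, and field names:

```python
def link_identities(anonymous_ratings, public_reviews):
    """Re-identify anonymized users by matching rating patterns.

    If a user rated the same obscure movies on roughly the same dates
    in both datasets, the two records almost certainly belong to the
    same person, and the public record carries a real name.
    """
    matches = {}
    for anon in anonymous_ratings:
        anon_key = frozenset((r["movie"], r["date"]) for r in anon["ratings"])
        for public in public_reviews:
            pub_key = frozenset((r["movie"], r["date"]) for r in public["ratings"])
            if anon_key & pub_key == anon_key:  # every anonymous rating matched
                matches[anon["user_id"]] = public["name"]
    return matches

# "Anonymized" dataset: numeric IDs only, no names.
anonymous_ratings = [
    {"user_id": 1812, "ratings": [
        {"movie": "Obscure Film A", "date": "2005-06-01"},
        {"movie": "Obscure Film B", "date": "2005-06-03"},
    ]},
]
# Public dataset: the same person reviewing under her real name.
public_reviews = [
    {"name": "Jane Doe", "ratings": [
        {"movie": "Obscure Film A", "date": "2005-06-01"},
        {"movie": "Obscure Film B", "date": "2005-06-03"},
        {"movie": "Popular Film C", "date": "2005-07-10"},
    ]},
]

print(link_identities(anonymous_ratings, public_reviews))  # {1812: 'Jane Doe'}
```

The real attack was more tolerant, allowing fuzzy matches on dates and ratings, but the principle is the same: stripping names does not anonymize data when the remaining fields are distinctive enough to cross-reference.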

The Dangers of the Data “Black Box”

Data analysis programs are often protected from outside scrutiny because they are considered trade secrets, leading to the crux of the “black box” problem. We must be cautious of how much trust we place in data analysis without understanding how the program arrived at that answer.


A White House report published in May 2016 described how data, used carelessly, could lead to serious discrimination issues. [12] The report’s concern is not just the ethics of how groups collect data, but how that data is analyzed. Even when companies have good intentions, the dataset they use can lead to biased results. Without proper attention, data can easily reinforce negative stereotypes.

Facebook, naturally, has one of the largest and most extensive collections of user data in the world. They not only conduct observational studies on people’s habits, but also manipulate what people see when they log in to Facebook. In the 2010 congressional elections, Facebook conducted a study across 61 million users on how to get people to vote. [13] There’s an excellent chance you unknowingly participated in this study.

The concern escalates when you realize this had real impacts on the election itself. Quietly conducting an experiment on voter turnout in the middle of an election is unethical at best, and dangerous at worst. Facebook is expected to be politically impartial in how it conducts these studies, but even so, the set of Facebook users is not representative of the entire population. No matter what Facebook does, any significant effect on its users will disproportionately affect parts of the voting population.

In another case, an unintentionally racist algorithm could mean years behind bars. One company, Northpointe, designed software that calculates a risk profile for offenders. Its framework asks questions about an offender and uses the answers to determine how likely they are to “recidivate”, or commit another crime after release. The framework never mentions race, yet the model wrongly flagged black defendants as high-risk twice as often as it did white defendants. While some have criticized the software, it is still used to inform prison sentences without a second thought. [14]
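The disparity ProPublica reported can be measured without ever looking inside the model: compare false-positive rates, that is, the share of each group labeled high-risk who did not go on to reoffend. A sketch with hypothetical audit data:

```python
def false_positive_rate(records, group):
    """Share of a group's non-reoffenders who were labeled high-risk."""
    in_group = [r for r in records if r["group"] == group and not r["reoffended"]]
    flagged = [r for r in in_group if r["predicted_high_risk"]]
    return len(flagged) / len(in_group)

# Hypothetical audit data: each record is one released offender.
records = [
    {"group": "A", "predicted_high_risk": True,  "reoffended": False},
    {"group": "A", "predicted_high_risk": True,  "reoffended": False},
    {"group": "A", "predicted_high_risk": False, "reoffended": False},
    {"group": "A", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": True,  "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
]

# The model never sees "group", yet non-reoffenders in group A are
# flagged twice as often as those in group B.
print(false_positive_rate(records, "A"), false_positive_rate(records, "B"))
```

This kind of outcome audit is one of the few checks available when the model itself is a trade secret, which is precisely why access to predictions and outcomes matters.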

Unless an outside analysis of the algorithm is conducted, as happened in Northpointe’s case, the algorithm behind the data remains a complete mystery. Unfortunately, as data is applied to fields as varied as voting and prison sentencing, the supposed truths seem to come from a black box.

A Wiser Data-Driven Society

Is this really the end of the story?

Are we content with the black box that is the data market?

Are we content not knowing who gets their hands on our personal data, not knowing what it will be used for, not knowing how it’s being used?

We cannot turn a blind eye to these issues. The prevalence of personal data in data analytics will remain so long as the majority of the population is uninformed about its dangers. If we as a society hope to benefit from the rise of big data without sacrificing our personal rights, the standards for data collection must be clearly stated and follow ethical principles. There must be accountability for algorithms that influence the daily lives of millions, possibly billions of people.

We hope that this article has helped broaden your understanding of a critical issue largely viewed as unimportant and inconsequential by mass media. There are real, serious concerns associated with the growth of these programs, and it is only right that people are informed about the uses of their personal data.



Research and analysis conducted by Ryan Havens, Rachel Lai, Justin Lee, and Emerson Wenzel. Additional thanks to Astrid Weng for designing the opening graphic.















