Data Anonymization Doesn’t Work
The claim that anonymizing data protects consumer privacy is as deceptive as it gets
Note: In August 2022, the Federal Trade Commission announced an Advance Notice of Proposed Rulemaking (ANPR), which is part of the informal process of notice and comment rulemaking required by federal law when agencies want to create new administrative regulations or “rules”. An advance notice is sometimes used when an agency wants comments from the public to determine the appropriate scope of the rulemaking or input on specific topics. In this case, the FTC is considering new rules regarding commercial surveillance and data security, and is seeking comments from the public based on specific questions ranging from data extraction practices to protecting children to algorithmic bias to security. As an interested party, I submitted a comment with some thoughts on data anonymization. This essay is a lightly edited version of the comment.
***
The FTC asks: Which measures do companies use to protect consumer data? Which of these measures or practices are prevalent? Data anonymization is one such prevalent practice, but its effectiveness is flagrantly overstated. Describing data anonymization as a privacy safeguard is a deceptive practice that harms consumers.
The role of data anonymization as a source of reassurance for consumers, businesses, and regulators has a long history, one that has become increasingly detached from the reality of modern data science and analytics. One could argue that data anonymization has never worked in any context: computer scientists of the 1950s employed “computational automation of simple anonymization methods such as rounding and aggregation,” and well-known database security expert Dorothy Denning wrote in the 1980s that “when working with data it can probably never be completely ensured that no sensitive information is revealed.” Data anonymization does not work.
Data brokers in particular are at the nexus of this deceptive practice, with claims of data anonymization used as a shield against critique of what consumers increasingly see as an unsavory business that profits from buying and selling their personal information. As the Electronic Frontier Foundation (EFF) writes,
“Apps and data brokers claim they are only sharing so-called ‘anonymized’ data. But that’s simply not possible. Data brokers sell rich profiles with more than enough information to link sensitive data to real people, even if the brokers don’t include a legal name.”
EFF’s assertion is supported by peer-reviewed research and investigative reporting. For example, in 2013 researchers studied fifteen months of human mobility data for one and a half million individuals and found that even coarse-grained data that specifies a person’s location hourly was “enough to uniquely identify ninety-five percent of the individuals.” In 2015, researchers studied three months of credit card records for 1.1 million people and found that “four spatiotemporal points are enough to uniquely reidentify 90% of individuals … [and] that even data sets that provide coarse information … provide little anonymity and that women are more reidentifiable than men in credit card metadata.” In 2018, New York Times reporters reviewed a database of more than a million phones in the New York City area, in which anonymous location data was recorded as often as every two seconds. Reporters were able to uniquely identify several individuals who ended up being interviewed for the story and noted that at least seventy-five companies obtain location data from apps, several of which track up to two hundred million mobile devices in the United States. These companies concede that:
“[T]he information apps collect is tied not to someone’s name or phone number but to a unique ID. But those with access to the raw data — including employees or clients — could still identify a person without consent. They could follow someone they knew, by pinpointing a phone that regularly spent time at that person’s home address.”
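To make the mechanics concrete, here is a minimal sketch of why so few points suffice. It uses entirely synthetic data, and the population size, grid resolution, and number of known points are illustrative assumptions rather than any broker’s actual figures; the point is simply that when each person’s trace is long and the space of (time, place) combinations is large, a handful of coarse observations almost always matches exactly one record.

```python
# Minimal sketch: how few coarse spatiotemporal points it takes to single out
# an individual in an "anonymized" location dataset. All data is synthetic;
# the population size, grid resolution, and point count are illustrative
# assumptions, not any broker's actual figures.
import random

random.seed(0)

NUM_PEOPLE = 10_000    # records in the "anonymized" dataset
HOURS = 24 * 7         # one week of hourly observations per person
CELLS = 200            # coarse location cells (roughly cell-tower resolution)
POINTS_KNOWN = 4       # (hour, cell) points an attacker already knows

# Each person's trace is a set of (hour, cell) observations keyed only by a
# random ID, with no name or phone number attached.
traces = {
    pid: {(hour, random.randrange(CELLS)) for hour in range(HOURS)}
    for pid in range(NUM_PEOPLE)
}

def matching_ids(known_points):
    """Return every ID whose trace contains all of the known points."""
    return [pid for pid, trace in traces.items() if known_points <= trace]

# For a sample of targets, see how often a few known points (say, home in the
# evening, the office at 9 a.m., a gym, a coffee shop) match exactly one record.
sample = random.sample(list(traces), 200)
unique = sum(
    1 for target in sample
    if matching_ids(set(random.sample(list(traces[target]), POINTS_KNOWN))) == [target]
)

print(f"{unique / len(sample):.0%} of targets uniquely identified "
      f"from only {POINTS_KNOWN} coarse points")
```

The studies cited above apply essentially this uniqueness test to real mobility and credit card data, which is where the 90–95 percent figures come from.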
The rapid increase in available data, combined with more powerful tools for analyzing patterns, makes de-anonymization easier and more cost-effective than ever before. As EFF attorney Nate Cardozo notes, “Once you add up multiple data points—like, say, the route you take to work, coupled with your browsing habits—it’s relatively trivial to narrow it down to a single individual.” The commercial opportunity to monetize consumer data through de-anonymization has become substantial enough to sustain a growing industry dedicated to doing it at scale.
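Cardozo’s point about adding up data points describes the classic linkage attack: a dataset with no names is joined to another dataset on whatever attributes the two share. The sketch below is a toy illustration with invented records and an arbitrary choice of quasi-identifiers (ZIP code, birth year, sex); real attacks use whatever overlapping fields are available, such as the commute routes and browsing habits he describes.

```python
# Toy linkage attack: join a "de-identified" dataset to a public record on
# shared quasi-identifiers. Every record and field here is invented for
# illustration; real attacks use whatever overlapping attributes exist.
from dataclasses import dataclass

@dataclass
class AnonRecord:            # a row a broker might sell as "anonymized"
    zip_code: str
    birth_year: int
    sex: str
    sensitive: str           # the attribute the subject assumed was private

@dataclass
class PublicRecord:          # e.g., a voter roll or people-search listing
    name: str
    zip_code: str
    birth_year: int
    sex: str

anon = [
    AnonRecord("02139", 1984, "F", "visited an oncology clinic"),
    AnonRecord("02139", 1991, "M", "purchased prenatal vitamins"),
]
public = [
    PublicRecord("Alice Smith", "02139", 1984, "F"),
    PublicRecord("Bob Jones", "02139", 1991, "M"),
]

def link(anon_rows, public_rows):
    """Yield (name, sensitive attribute) for every anonymized row whose
    quasi-identifiers match exactly one person in the public data."""
    for a in anon_rows:
        matches = [
            p for p in public_rows
            if (p.zip_code, p.birth_year, p.sex) == (a.zip_code, a.birth_year, a.sex)
        ]
        if len(matches) == 1:
            yield matches[0].name, a.sensitive

for name, sensitive in link(anon, public):
    print(f"{name} -> {sensitive}")
```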
Compounding the erosion of consumer privacy through the buying and selling of easily de-anonymized data, law enforcement can lawfully access this data under the Stored Communications Act (SCA), which provides that “A [remote computing service] provider … may divulge a record ... to any person other than a governmental entity.” Because data brokers are not governmental entities, they can lawfully receive this data and then sell it onward, including to the government. This is the well-documented “data broker loophole,” the subject of legislative proposals to close it, most notably from U.S. Sen. Ron Wyden (D-OR).
While amending the SCA is far outside the authority and statutory purview of the FTC, this state of affairs provides context for the harm done to consumers by the extraction of personal data. As a contemporary example, consider the U.S. Supreme Court’s decision in Dobbs v. Jackson Women’s Health Organization, which has not only paved the way for criminalized abortion in several states but has also given law enforcement the ability to investigate illegal abortions using data obtained through the data broker loophole. This should concern the FTC, which asks, “How, if at all, should potential new trade regulation rules address harms to different consumers across different sectors?” and “Should, for example, the Commission impose limits on data use for essential services such as … healthcare …?” This approach raises a fundamental question: what does “healthcare data” encompass, and does a sector-specific approach appropriately safeguard it? Pharmaceutical purchases, web searches, book purchases, and location data showing visits to medical facilities are technically not healthcare data under current law, but they can nonetheless be combined to deduce a consumer’s health status without ever accessing formal health records. As the Washington Post reports,
“Users have few protections under the Health Insurance Portability and Accountability Act (HIPAA) when it comes to digital data, and popular health apps share information with a broad collection of advertisers, according to our investigation.”
At a minimum, such inferences can be used to market and sell health-related products without the consumer’s consent; they can also be used by law enforcement to open investigations based on reasonable suspicion of illegal medical procedures, e.g., abortions in certain states.
The FTC asks if data minimization requirements should be instituted to limit data collection, retention, and transfer. Yes, they should. Support for the practice of data minimization from academia and from privacy advocates in civil society is well-documented. In addition, its inclusion as a provision of the EU General Data Protection Regulation (GDPR) has given researchers a large corpus of data with which to study its effects, both positive and negative. As such, I won’t include a multi-page literature review in this essay. It is worth noting, however, that the principle of proportionality described in GDPR Article 5 should apply: the amount of data collected and used should be proportional to the purpose for which it is collected. As renowned tech ethicist Joanna Bryson warns,
“The principal utility of large amounts of data on humans is surveillance. If you want to manipulate a population politically or economically (police or upsell) then the marginal returns for data are stable.”
In a subsequent explainer, she goes on to write,
“Data might display diminishing returns from an ‘improving your AI algorithm’ perspective, while displaying stable or increasing marginal returns from a surveillance perspective. This is true whether surveillance is political or economic (as in surveillance capitalism); and whether it is benevolent (for the benefit of the individual surveilled) or exploitative / autocratic. [S]ome applications or organizations could easily be in the business of solving all or any of these goals simultaneously. For example, a company or government might mostly use a set of data to provide a service that helps people keep healthy, but then also use it to recognize when people illegally cross a border.”
There is no shortage of opinions in the public domain on the costs and benefits of data minimization and its broad-based effects on innovation and competition, among other things. But as a defining principle to underpin the FTC’s rulemaking, I agree with Bryson, who gets to the heart of the matter: the only reason to collect more data than is necessary is manipulation and surveillance.
/end