The 2020 Census and Beyond: Why Differential Privacy Should Be Implemented to Protect Confidentiality – N.Y.U. Journal of Legislation & Public Policy

By: Michelle Liu

November 25, 2019

Political power is anticipated to shift after the 2020 Census, with congressional reapportionment expected to move seats west and south in line with population growth in western and southern states. However, the 2020 Census has been shrouded in controversy – namely the citizenship question and whether adding that question would affect the accuracy of the count.

The Census Bureau is bound by Title 13 of the United States Code, which includes some of the strictest privacy and confidentiality protections in federal law. Under Title 13, it is forbidden to: (1) publish private information that identifies an individual or business, (2) use the information against respondents in any government agency or court, and (3) violate confidentiality. A violation could result in a five-year prison sentence, a fine of up to $250,000, or both.

What is the Concern About Confidential Information and Data Privacy in the 2020 Census?

For one, Japanese Americans who were incarcerated during World War II expressed concern that the data collected from the citizenship question could be misused. The confidentiality of Japanese citizens’ information was legally suspended under the Second War Powers Act in 1942. The expansive war power rationale rendered it legal for the Census Bureau to share personal information with the Secret Service, including a list of the names, addresses, and citizenship status of people of Japanese descent living in Washington, D.C. This concern has abated now that the Supreme Court has upheld a lower federal court’s decision barring the citizenship question from appearing on the 2020 Census.

Even though the citizenship question is no longer a concern, there are still broader concerns about data security surrounding the 2020 Census. One concern is database reconstruction: combining separate sets of public information to deduce confidential information. The New York Times provided an extreme example from the 2010 Census, reporting that census data listed a 63-year-old Asian man and a 58-year-old Asian woman as living on Liberty Island. There are only two residents of Liberty Island, the superintendent of the Statue of Liberty and his wife – the Luchsingers – and they happen to be white. Switching the Luchsingers’ race was an effort to protect their confidential information. The risk that sets of public information can be used to deduce confidential information may be particularly acute in rural and sparsely populated areas. Recently, a research director at the Census Bureau was able to use public Census data from across the country to fill in the fields of block address, sex, age, race, and ethnicity for individual records. He then compared the reconstructed records with the actual, confidential records and found that 50% of people matched exactly and over 90% matched except for a difference of a year or two in age.
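
To make the idea of database reconstruction concrete, the following is a minimal sketch in Python rather than a description of the Census Bureau’s own experiment. It assumes a hypothetical three-person block for which only summary tables are published (a count, a mean age, a median age, and optionally the number of residents under 18) and lists every set of ages consistent with those tables. The function name, the age cap, and all of the figures are illustrative assumptions.

```python
from itertools import combinations_with_replacement

MAX_AGE = 115  # assumed upper bound on age, for illustration only


def consistent_households(count, mean_age, median_age, minors=None):
    """Return every sorted tuple of ages that matches the published statistics.

    All figures are hypothetical. `minors` is the published number of
    residents under 18, or None if that table was not published.
    """
    matches = []
    for ages in combinations_with_replacement(range(MAX_AGE + 1), count):
        if sum(ages) != mean_age * count:
            continue
        if ages[count // 2] != median_age:  # middle value; assumes an odd count
            continue
        if minors is not None and sum(a < 18 for a in ages) != minors:
            continue
        matches.append(ages)
    return matches


# A hypothetical block: 3 residents, mean age 44, median age 30.
without_minor_table = consistent_households(3, 44, 30)
with_minor_table = consistent_households(3, 44, 30, minors=1)
print(len(without_minor_table), len(with_minor_table))
```

In this toy block, the first three tables leave 31 possible sets of ages, and adding the single table on minors cuts that to 18. Each further published table narrows the list again, which is why small and sparsely populated areas are the most exposed.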

The Census Bureau is working to address these privacy concerns before 2020 by rolling out a new statistical method for aggregating and releasing data, one that uses differential privacy rather than the “swapping” method. Social scientists have already criticized the new method, arguing that it will restrict access to information and dampen research. However, given the grave privacy concerns and potential political ramifications, differential privacy is a step in the right direction.

What is Differential Privacy?

Differential privacy has a precise mathematical definition, but the goal is to add “noise” to an underlying database. Adding “noise” obscures any individual’s identity so that statistical analyses run on the database produce nearly the same result whether or not that individual’s record is included.
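
One common way to add such noise to a simple count is the Laplace mechanism, sketched below in Python. This is an illustration under stated assumptions, not the Census Bureau’s production method: the function names are hypothetical, `epsilon` stands for the privacy-loss parameter (smaller means more privacy), and the count of 300 is a made-up figure.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(2020)  # fixed seed so the sketch is repeatable
# Five releases of the same hypothetical count, each slightly different.
print([round(noisy_count(300, 1.0, rng), 1) for _ in range(5)])
```

Any single release is off by a small random amount, but the noise averages out to zero, so aggregate patterns remain usable while one person’s presence or absence is hard to detect.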

Woods et al. explain differential privacy using an example that concerns household income and the number of students at a university. Professor A uses the dataset and releases a statistic that out of 5,000 students, there are 300 students from households with incomes greater than $350,000 a year. Professor B uses the same dataset and finds that there are 299 students from households with incomes exceeding $350,000 a year because one student, Z, dropped out of school. It is then possible to deduce that student Z also has a family income of greater than $350,000 a year and to match the individual student, Z, to an aggregate statistical measurement. In contrast, differential privacy would add “noise” by either creating a band around family income or by approximating the number of students in the dataset so that the measurement reported is “out of approximately 300 students,” making it less likely that an individual would be matched to a statistic.
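
The two-professor example can be simulated. The sketch below is a hedged illustration in Python: it assumes each professor releases a count with Laplace noise at an assumed `epsilon` of 0.5, and it models a hypothetical attacker who subtracts the two releases and guesses that Z is high-income when the gap is closer to 1 than to 0. None of these names or parameter values come from the original example.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def attacker_guess_rate(true_difference, epsilon, trials, rng):
    """Fraction of trials in which the attacker concludes Z is high-income.

    true_difference is 1 if Z really was one of the high-income students
    and 0 if not. The attacker only ever sees the two noisy counts.
    """
    scale = 1.0 / epsilon  # a count has sensitivity 1
    hits = 0
    for _ in range(trials):
        noisy_a = 299 + true_difference + laplace_noise(scale, rng)
        noisy_b = 299 + laplace_noise(scale, rng)
        if noisy_a - noisy_b > 0.5:
            hits += 1
    return hits / trials


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# With exact counts, 300 - 299 = 1 and the attacker is always right about Z.
rate_if_high_income = attacker_guess_rate(1, 0.5, 20_000, rng)
rate_if_not = attacker_guess_rate(0, 0.5, 20_000, rng)
print(rate_if_high_income, rate_if_not)
```

With noise added, the attacker reaches the “high-income” conclusion only somewhat more often when it is true than when it is false, so the subtraction no longer reveals Z’s income with any confidence.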

Currently, a variety of calculations, including counts, regressions, and sampling, are possible with differential privacy guarantees. However, research also shows that there is a tradeoff between accuracy and privacy. In 2014, researchers created an algorithm from genetic markers to model the optimal amount of warfarin, an anticoagulant, to give stroke patients. The more differentially private the dataset, the less accurate the algorithm, and the higher the likelihood that the algorithm suggested an incorrect patient dosage. In contrast, a highly accurate dosage required a lot of demographic information (which is a good proxy for genetic markers) from the individual patients.
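
The tradeoff can be quantified for the simplest case. The sketch below does not model the warfarin study; it assumes a single count released with Laplace noise, where `epsilon` is the privacy-loss parameter (smaller values mean stronger privacy) and the function name is hypothetical. For Laplace noise with scale b, the average absolute error is b, and 95% of errors fall below b times ln(20).

```python
import math


def laplace_error_summary(epsilon, sensitivity=1.0):
    """Return (average absolute error, 95% error bound) for Laplace noise."""
    scale = sensitivity / epsilon
    average_error = scale              # E|X| for Laplace(0, scale)
    bound_95 = scale * math.log(20)    # solves exp(-t / scale) = 0.05 for t
    return average_error, bound_95


for epsilon in (0.1, 0.5, 1.0, 5.0):
    avg, p95 = laplace_error_summary(epsilon)
    print(f"epsilon={epsilon}: average error {avg:.2f}, 95% of errors below {p95:.2f}")
```

Cutting epsilon by a factor of ten, which strengthens privacy, makes the typical error ten times larger. That is the accuracy cost that critics of the new method worry about.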

How is Differential Privacy Applied Elsewhere?

Google, Uber, and Apple all use differential privacy to preserve individual user privacy when user data is analyzed as part of a larger database, so that the input used to produce aggregate statistics cannot be traced to a particular customer. Additionally, it is predicted that large tech firms will adopt a technology called “Federated Learning” for their artificial intelligence algorithms. Federated Learning allows customer mobile data to stay on mobile devices without being uploaded to a central server or cloud when running background machine learning.
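
The basic idea behind Federated Learning can be sketched in a few lines of Python. This is a toy illustration using a simple averaging scheme, not any company’s actual system: each hypothetical device fits a one-parameter model on its own data and sends back only the updated parameter and a record count, and the server averages those updates. All names, data, and settings are assumptions.

```python
def local_update(weight, data, learning_rate=0.01, steps=5):
    """Run a few gradient steps on one device; the raw data never leaves it."""
    for _ in range(steps):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight, len(data)


def federated_average(updates):
    """Server side: average the device weights, weighted by record count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(devices, rounds=50):
    weight = 0.0
    for _ in range(rounds):
        updates = [local_update(weight, data) for data in devices]
        weight = federated_average(updates)  # only (weight, count) pairs are seen
    return weight


# Hypothetical on-device data, all following y = 3 * x.
devices = [
    [(1, 3), (2, 6)],
    [(3, 9)],
    [(1, 3), (4, 12), (2, 6)],
]
learned_weight = train(devices)
print(round(learned_weight, 4))
```

The server ends up with a model close to the true relationship without ever receiving an individual record. Keeping data on the device is a separate safeguard from differential privacy, and the two can be combined.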

Wrapping Up

This Census will play a large role in shaping politics for at least the next decade. It will also be the first time that the internet and technology play a significant role in how people respond to a decennial census. Given the stakes – reapportionment of Congressional seats and reallocation of federal funds – that come with every decennial Census, privacy may seem less critical. Nevertheless, differential privacy is required to prevent an erosion of public trust and ensure that every person is counted.

Michelle Liu, J.D. Class of 2020, N.Y.U. School of Law.

Suggested Citation: Michelle Liu, The 2020 Census and Beyond: Why Differential Privacy Should Be Implemented to Protect Confidentiality, N.Y.U. J. Legis. & Pub. Pol’y Quorum (2019).