Implications of Differential Privacy on Decennial Census Data Accuracy and Utility
The US Census Bureau will use a new disclosure avoidance technique based on differential privacy for the 2020 Decennial Census. This technique injects noise into nearly all published statistics, potentially impacting the utility of decennial data. The overarching goal of this project is to investigate the impact the new technique has on four typical use cases that rely on decennial data.
Four research teams have spent the past year studying these use cases, using demonstration datasets published by the US Census Bureau. This website describes the four use cases and provides links to working papers and data. Funding for this project was provided by the Alfred P. Sloan Foundation (G-2019-12589).
The four research teams come from IPUMS, University of Washington, University of Tennessee, and NORC. We first met at a workshop on differential privacy hosted by IPUMS in August 2019, and then we developed our use cases throughout the fall of that year. In addition to the Sloan Foundation grant, the IPUMS team has support from the National Science Foundation, and the Washington team benefitted from support from the University of Washington's CSDE Shanahan Endowed Fellowship.
- IPUMS
- University of Washington
- University of Tennessee
- NORC at the University of Chicago
The four research teams presented their preliminary results at the 2020 Association for Public Policy Analysis & Management (APPAM) Fall Research Conference. Connie Citro, senior scholar at the Committee on National Statistics, moderated the session, which also included Amy O'Hara, research professor in the Massive Data Institute and executive director of the Federal Statistical Research Data Center at Georgetown University. Abstracts from the presentations are below.
Measuring Racial Residential Segregation Using Data from the Census Bureau's Disclosure Avoidance System
David Van Riper, Jonathan Schroeder PhD, Tracy Kugler PhD
IPUMS
Social scientists have identified racial residential segregation, often induced and enforced by sociopolitical mechanisms, as a major driver of racial inequality in the U.S. Racial residential segregation is measured with high-quality, geographically detailed decennial census data. The Census Bureau’s new disclosure avoidance system modifies the race/ethnicity counts in geographic units through noise injection and post-processing, potentially enhancing or muting segregation measures. To assess the impact that the new disclosure avoidance system may have on the measurement of segregation, we will compute multiple segregation indices, including aspatial and spatial measures, from five versions of the 2010 decennial census—the original data published in 2011 protected with traditional statistical disclosure limitation techniques and four datasets produced by various iterations of the Census Bureau’s disclosure avoidance system. In addition to traditional metropolitan statistical area-based results, we will also compute municipality-based results to assess the impact of non-nested geographies on the measurement of segregation.
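As a rough illustration of the kind of comparison described above, the sketch below computes one common aspatial measure, the index of dissimilarity, from two versions of the same counts. The tract counts are made up for illustration and are not drawn from the 2010 data or the demonstration products; the abstract does not specify which indices the team uses, so the choice of index here is an assumption.

```python
def dissimilarity(group_a, group_b):
    """Index of dissimilarity D between two groups across areal units.

    D = 0.5 * sum_i |a_i / A - b_i / B|, where a_i and b_i are the counts
    of each group in unit i and A and B are the group totals.
    """
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs a positive total")
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


# Hypothetical counts for four tracts (illustrative only).
original_a = [900, 800, 100, 50]
original_b = [50, 100, 700, 900]

# Hypothetical noise-infused counts for the same tracts.
noisy_a = [905, 790, 104, 47]
noisy_b = [46, 108, 695, 903]

d_original = dissimilarity(original_a, original_b)
d_noisy = dissimilarity(noisy_a, noisy_b)
print(f"D (original): {d_original:.4f}")
print(f"D (noisy):    {d_noisy:.4f}")
print(f"difference:   {d_noisy - d_original:+.4f}")
```

The same function can be applied to each version of the data in turn, so that differences in the index can be attributed to the disclosure avoidance step rather than to the index calculation.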
Lee Fiorio, Neal Marquez, Sara Curran PhD, Mark Ellis PhD
University of Washington
This paper will investigate the impact of the Census Bureau's implementation of differential privacy on the scalar decomposition of Theil's H, a multi-group entropy-based index of segregation. We will examine the extent to which groups live apart at different scales, using Theil's H to calculate the between and within components of segregation for each subunit geography (e.g., is a metropolitan area segregated because groups live in different municipalities or because they live in different neighborhoods within municipalities?). Using the 2010 test data, we plan to decompose Theil's H along a series of nested scales: from region, to metro area, to city-suburb, to place, to neighborhood, to block, and so on. We also plan to decompose H calculated for different population and household characteristics: racial residential segregation, segregation by household type, segregation by sex, segregation by householder race, etc. We will compare results from the differentially private data with results from the 2010 published data. Even if the Census Bureau argues that the block level is not a meaningful scale of analysis, our decomposition analysis will demonstrate the importance of block-level data for interpreting patterns at higher scales.
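The between/within decomposition described above can be sketched for a single pair of nested scales. The example below is a minimal illustration, assuming the standard additive decomposition of the entropy index; the municipality names and block counts are hypothetical, and the team's actual implementation and scales may differ.

```python
import math


def entropy(counts):
    """Shannon entropy (natural log) of the group shares in `counts`."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def theil_h_decomposition(blocks):
    """Return (total, between, within) components of Theil's H.

    `blocks` is a list of (municipality_id, [count per group]) tuples.
    The metro-wide entropy must be positive (more than one group present).
    """
    n_groups = len(blocks[0][1])
    metro = [sum(counts[g] for _, counts in blocks) for g in range(n_groups)]
    pop = sum(metro)
    e_metro = entropy(metro)

    # Aggregate block counts up to municipalities.
    munis = {}
    for muni, counts in blocks:
        agg = munis.setdefault(muni, [0] * n_groups)
        for g, c in enumerate(counts):
            agg[g] += c

    scale = pop * e_metro
    total = sum(sum(c) * (e_metro - entropy(c)) for _, c in blocks) / scale
    between = sum(sum(c) * (e_metro - entropy(c)) for c in munis.values()) / scale
    # Within term: blocks compared with their own municipality's entropy.
    within = sum(sum(c) * (entropy(munis[m]) - entropy(c)) for m, c in blocks) / scale
    return total, between, within


# Hypothetical two-group block counts in two municipalities.
blocks = [
    ("central_city", [80, 20]),
    ("central_city", [30, 70]),
    ("suburb", [95, 5]),
    ("suburb", [90, 10]),
]
h_total, h_between, h_within = theil_h_decomposition(blocks)
print(f"H = {h_total:.4f} = between {h_between:.4f} + within {h_within:.4f}")
```

Running the same function on original and noise-infused counts would show whether noise shifts segregation between the two components, which is the question the decomposition is meant to answer.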
Impacts of Differentially Private Noise Injection on Allocation of Federal and State Funds
Nicholas Nagle PhD
University of Tennessee
Differential privacy creates sampling error and stochastic variation in population counts that can impact funding formulas in at least two ways. The first impact arises from sampling error, that is, the difference between the published statistic and the population truth. Increases in sampling error will lead to misallocation and inefficient use of resources. The second impact arises from the stochastic nature of the differential privacy mechanism. Variations from census to census may reflect not population changes but stochastic variation in the differential privacy mechanism. We anticipate that this volatility will be greater than that introduced by the existing Statistical Disclosure Control (SDC) methods and that it will be detrimental to planning for small governmental agencies. Even if the previous SDC mechanisms created some stochastic variation across censuses, it is very likely that the amount of volatility from census to census was much less than it will be under differential privacy. For example, an individual in a hypothetical, stable census tract would presumably have received similar SDC protections in successive censuses. Existing SDC methods also hold total population invariant for census-defined places; in other words, SDC does not change total population but may change characteristics of the population.
We will conduct three assessments of the impact of differential privacy on government funding and planning by comparing original 2010 data to the 2010 demonstration data:
- We will compare population counts for tribal areas nationally. These variations in population will be fed through the funding formula used by the Indian Housing Block Grant (IHBG) program to determine variation in funding allocations. This funding allocation also depends on the American Community Survey (ACS), which has published margins of error. We will approximate the magnitude of uncertainty owing to the differential privacy mechanism relative to the magnitude of error owing to the ACS sample design.
- We will calculate the volatility in estimates for all counties and municipalities in Tennessee. Counties in Tennessee range from fewer than 7,000 people to almost 800,000 people, providing a wide range of population sizes.
- We will estimate the population under age five in each Local Educational Agency (LEA) and project the enrollments of each LEA. We will compare these with actual LEA enrollments between 2010 and 2015 and assess the relative accuracy of the official and the differentially private 2010 counts.
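To make the volatility concern concrete, the sketch below simulates repeated noisy releases of the same population counts and passes each through a simple proportional allocation. Everything here is an assumption made for illustration: the proportional formula is not the IHBG formula, the Laplace noise is a textbook stand-in and not the Census Bureau's disclosure avoidance system, and the area populations, budget, and noise scale are invented.

```python
import random
import statistics


def laplace_noise(rng, scale):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def allocate(counts, budget):
    """Split a fixed budget in proportion to population (hypothetical formula)."""
    total = sum(counts)
    return [budget * c / total for c in counts]


true_counts = [500, 5000, 50000]  # hypothetical small, medium, large areas
budget = 1_000_000.0              # hypothetical fixed budget
noise_scale = 10.0                # hypothetical noise scale, same for all areas
rng = random.Random(2020)         # fixed seed so the sketch is repeatable

# Simulate 200 independent noisy releases of the same true population.
draws = []
for _ in range(200):
    noisy = [max(0.0, c + laplace_noise(rng, noise_scale)) for c in true_counts]
    draws.append(allocate(noisy, budget))

baseline = allocate(true_counts, budget)
relative_sd = [
    statistics.pstdev([d[i] for d in draws]) / baseline[i]
    for i in range(len(true_counts))
]
for count, rel in zip(true_counts, relative_sd):
    print(f"population {count:>6}: relative SD of allocation = {rel:.4%}")
```

Because the same absolute noise is a larger fraction of a small count, the smallest area's allocation varies most in relative terms, which is the pattern the assessments above are designed to measure with real data.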
Quentin Brummet PhD, Kirk Wolter PhD
NORC at the University of Chicago
We consider the effect on survey operations of the U.S. Census Bureau's announcement that it will use differentially private (DP) noise injection in the 2020 Census. Sample surveys rely on census data to function appropriately, and the use of DP will likely lead to lower data utility because of the additional noise. Therefore, it is important to understand how the noise will affect downstream survey operations that form the backbone for a variety of social science research and public planning.
We first discuss at a high level the various pieces of survey operations that rely on census data. We then perform comparisons between released 2010 Census data and a series of DP demonstration files released by the Census Bureau. Our analysis first documents descriptive differences between the DP data and the original data, showing the relative error in DP data across geographies and variables. These results vary somewhat across data releases, but in general point to noisier data for small geographies. We then examine in depth the effect of DP data on sample design. Our results show that for large-scale survey operations there are likely to be very few if any effects on sample efficiency or coverage. However, surveys targeting smaller populations may see decreases in survey coverage. Therefore, survey collectors may be faced with a decision of whether to accept the worse coverage or increase costs by pursuing additional methodologies or fielding strategies to mitigate coverage losses. These results are consistent across data releases, with modest improvements in newer versions of the DP demonstration data.
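The descriptive comparison of relative error across geographies can be sketched as follows. The records, geography labels, and counts are hypothetical and chosen only to show the calculation; they are not taken from the 2010 Census or the demonstration files, and the authors' own error metrics may differ.

```python
from collections import defaultdict

# Hypothetical (geography level, original count, DP count) records.
records = [
    ("county", 52000, 52010),
    ("county", 7100, 7085),
    ("tract", 4200, 4230),
    ("tract", 3100, 3080),
    ("block", 40, 46),
    ("block", 12, 7),
]


def mean_abs_relative_error(rows):
    """Mean of |dp - original| / original for each geography level.

    Units with an original count of zero are skipped because the
    relative error is undefined for them.
    """
    by_level = defaultdict(list)
    for level, original, dp in rows:
        if original > 0:
            by_level[level].append(abs(dp - original) / original)
    return {level: sum(errs) / len(errs) for level, errs in by_level.items()}


errors = mean_abs_relative_error(records)
for level, err in errors.items():
    print(f"{level:<7} mean absolute relative error = {err:.2%}")
```

Skipping zero-count units is one design choice among several; an analysis focused on very small geographies might instead report absolute error for those units.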
The US Census Bureau published four demonstration datasets, based on the Census Edited File from the 2010 Decennial Census of Population and Housing, using various iterations of its disclosure avoidance system. The first dataset, the Demonstration Data Product, was released in October 2019 and consists of demographic and housing counts for all geographic units used in 2010. Three additional datasets, the Privacy Protected Microdata Files (PPMF), were released in 2020. These datasets consist of individual person or housing unit records (i.e., microdata) with state, county, census tract, and census block identifiers. IPUMS generated tabulations for a variety of geographic summary levels, including states, counties, places, county subdivisions, American Indian and Alaska Native tribal lands, and census blocks, for dissemination to the research community.
In order to increase the utility of the demonstration datasets, IPUMS includes comparable statistics from the original 2010 Census Summary Files. By including differentially private and original statistics in the same data files, data users can quickly compare statistics instead of having to track down original 2010 data and merge it with the demonstration data.