Sampling from an Alternate Universe: Overview of Privacy-preserving Synthetic Data

Christine Task
Knexus Research Corporation
Friday, September 27, 2019 - 1:25pm to 2:25pm
Lind 305

Data accessibility is important--publicly available datasets support vital social science research, social programs, and data-informed governance. In recent years, an increasing amount of data has been curated and made generally available through sites like data.gov, IPUMS, and other resources, fueling the progress of research in Big Data. However, the data with the most potential value for public good can also be the most privacy-sensitive--such as data on abuse, STDs, extreme poverty, or mental health. These datasets exist, but they may be redacted or entirely withheld from public view due to legal restrictions and the very real danger that anonymized individuals may be re-identified.

Privacy-preserving synthetic data provides a pathway to publicly release datasets that contain very sensitive information. The basic process consists of three steps: a generative model is built that captures the distribution of the original sensitive data, perturbation steps are applied to the model to improve its privacy properties (with either formal or heuristic guarantees), and then the model is used to synthesize a new dataset of synthetic individuals. The synthetic dataset preserves the significant properties of the original data, but because it contains no real people, it can be safely released to the public. When the distributional difference between the real and synthetic data mimics the difference between two subsamples of the original data, i.e., when privacy error mimics sampling error, we can think of the synthetic data as survey results from a parallel dimension: the same pattern of information as the original data, with no real people.
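The three steps above can be sketched in miniature. This illustrative example uses a one-dimensional histogram as the "generative model" and Laplace noise as the formal privacy perturbation; real synthesizers model much richer, high-dimensional distributions, and all names here are my own, not from the talk.

```python
import numpy as np

def synthesize(sensitive_values, bins, epsilon, n_synthetic, rng=None):
    """Toy synthesizer: noisy-histogram model -> synthetic sample."""
    rng = np.random.default_rng(rng)
    # Step 1: build the generative model (here, just a histogram of counts).
    counts, edges = np.histogram(sensitive_values, bins=bins)
    # Step 2: perturb the model for privacy. Adding or removing one record
    # changes one bin count by 1, so Laplace(1/epsilon) noise gives the
    # released histogram epsilon-differential privacy.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()
    # Step 3: synthesize new "individuals" by sampling from the noisy model.
    which_bin = rng.choice(len(probs), size=n_synthetic, p=probs)
    low, high = edges[which_bin], edges[which_bin + 1]
    return rng.uniform(low, high)
```

Because only the noised histogram touches the sensitive values, the synthetic records carry the model's distribution rather than any real person's row.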

In this talk, I'll cover approaches to creating synthetic data, the difference between formal and heuristic-based privacy, and, importantly, the quality metrics used to verify that the synthetic data is a good substitute for the original data (itself a challenging problem in a high-dimensional feature space). High-quality synthetic data is a rapidly progressing research area, with both promising success stories and an exciting frontier of open problems.
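One simple member of that family of quality metrics can be sketched as follows: compare a feature's marginal distribution in the real versus synthetic data, and judge the gap against the sampling error between two halves of the real data (the "privacy error mimics sampling error" standard). This is an illustrative assumption on my part, not the talk's specific metrics, and a practical evaluation must also cover higher-order correlations.

```python
import numpy as np

def tvd(a, b, bins):
    """Total variation distance between two samples' binned marginals."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return 0.5 * np.abs(pa / pa.sum() - pb / pb.sum()).sum()

def quality_report(real, synthetic, bins=20, rng=None):
    """Compare real-vs-synthetic error to real-vs-real sampling error."""
    rng = np.random.default_rng(rng)
    shuffled = rng.permutation(real)
    half_a, half_b = shuffled[: len(real) // 2], shuffled[len(real) // 2:]
    return {
        "privacy_error": tvd(real, synthetic, bins),   # real vs. synthetic
        "sampling_error": tvd(half_a, half_b, bins),   # subsample baseline
    }
```

If the two numbers are of comparable size, the synthetic marginal is about as far from the real data as another survey sample would be.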