Disclosure Limitation in Large Statistical Databases

CMU/Pitt Applied Decision Modeling Seminar • Disclosure Limitation in Large Statistical Databases • Stephen F. Roehrig

Collaborators • George Duncan, Heinz/Statistics, CMU • Stephen Fienberg, Statistics, CMU • Adrian Dobra, Statistics, Duke • Larry Cox, Center for Health Statistics, H&HS • Jesus De Loera, Math, UC Davis • Bernd Sturmfels, Math, UC Berkeley • JoAnne O’Roarke, Institute for Social Research, U. Michigan

Funding • NSF • NCES • NIA • NCHS • NISS • Census • BLS

How Should Government Distribute What It Knows? • Individuals and businesses contribute data about themselves, as mandated by government • Government summarizes and returns these data to policy makers, policy researchers, individuals and businesses • Everyone is supposed to see the value, but with no privacy downside • Obligations: return as much information as possible • Restrictions: don’t disclose sensitive information

Who Collects Data? • Federal, state and local government: • Census • Bureau of Labor Statistics • National Center for Health Statistics, etc. • Federally funded surveys: • Health and Retirement Study (NIA) • Treatment Episode Data Set (NIH) • Industry • Health care • Insurance • “Dataminers”

Real-World Example I • The U.S. Census Bureau wants to allow online queries against its census data. • It builds a large system, with many safeguards, then calls in “distinguished university statisticians” to test it. • Acting as “data attackers”, they attempt to discover sensitive information. • They do, and so the Census Bureau’s plans must be scaled back. (Source: Duncan, Roehrig and Kannan)

Real-World Example II • To better understand health care usage patterns, a researcher wants to measure the distance from patients’ residences to the hospitals where their care was given. • But because of confidentiality concerns, it’s only possible to get the patients’ ZIP codes, not addresses. • The researcher uses ZIP code centroids instead of addresses, causing large errors in the analysis. (Source: Marty Gaynor)

Real-World Example III • To understand price structures in the health care industry, a researcher wants to compare negotiated prices to list prices for health services. • One data set has hospital IDs and list prices for various services. Another data set gives negotiated prices for actual services provided, but for confidentiality reasons, no hospital IDs can be provided. • Thus matching is impossible. (Source: Marty Gaynor)

Data Utility vs. Disclosure Risk • Data utility: a measure of the usefulness of a data set to its intended users • Disclosure risk: the degree to which a data set reveals sensitive information • The tradeoff is a hard one, currently judged mostly heuristically • Disclosure limitation, in theory and practice, examines this tradeoff.

The R-U Confidentiality Map • Figure: Disclosure Risk plotted against Data Utility, showing the points Original Data and No Data, two possible releases of the same data, and the Disclosure Threshold

The R-U Confidentiality Map • For many disclosure limitation methods, we can choose one or more parameters. • Figure: Disclosure Risk plotted against Data Utility, showing Original Data, No Data, the Disclosure Threshold, DL Method 1 and DL Method 2

Traditional Methods: Microdata • Microdata: a set of records containing information on individual respondents • Suppose you are asked to release microdata about individuals, for the public good. The data include: • Name • Address • City of residence • Occupation • Criminal record

Microdata • Typical safeguard: delete “identifiers” • So release a dataset containing only city of residence, occupation and criminal record… • Residence = Amsterdam • Occupation = Mayor • Criminal history = has criminal record • HIPAA says this is OK, so long as a researcher “promises not to attempt re-identification” • Is this far-fetched?

Microdata • Unique (or almost unique) identifiers are common • The 1997 voting list for Cambridge, MA has 54,805 voters:

Uniqueness of demographic fields
  birth date alone                   12%
  birth date and gender              29%
  birth date and 5-digit ZIP         69%
  birth date and full postal code    97%
(Source: Latanya Sweeney)
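A uniqueness figure of this kind can be computed by counting how many records have a combination of field values that no other record shares. The following is a minimal sketch under stated assumptions: the function name `fraction_unique`, the field names, and the four toy records are all invented for illustration and are not the Cambridge voter data.

```python
from collections import Counter

def fraction_unique(records, fields):
    """Share of records whose combination of `fields` occurs exactly once."""
    keys = [tuple(rec[f] for f in fields) for rec in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical toy voter list (invented values, for illustration only).
voters = [
    {"birth_date": "1960-01-02", "gender": "F", "zip": "02138"},
    {"birth_date": "1960-01-02", "gender": "M", "zip": "02138"},
    {"birth_date": "1960-01-02", "gender": "M", "zip": "02139"},
    {"birth_date": "1971-05-17", "gender": "F", "zip": "02139"},
]

for fields in (["birth_date"],
               ["birth_date", "gender"],
               ["birth_date", "gender", "zip"]):
    print(fields, fraction_unique(voters, fields))
```

Adding fields can only split groups of identical keys further, so the unique share never decreases as more fields are combined, which is the pattern in the figures above.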

Traditional Solutions for Microdata • Sampling • Adding noise • Global recoding (coarsening) • Local suppression • Data swapping • Micro-aggregation

Example: Data Swapping

Location    Age    Sex  Candidate
Pittsburgh  Young  M    Bush
Pittsburgh  Young  M    Gore
Pittsburgh  Young  F    Gore
Pittsburgh  Old    M    Gore
Pittsburgh  Old    M    Bush
Cleveland   Young  F    Bush
Cleveland   Young  F    Gore
Cleveland   Old    M    Gore
Cleveland   Old    F    Bush
Cleveland   Old    F    Gore

• Unique on location, age and sex • Find a match in another location • Flip a coin to see if swapping is done

Data Swapping Terms • Uniques Key: Variables that are used to identify records that pose a confidentiality risk. • Swapping Key: Variables that are used to identify which records will be swapped. • Swapping Attribute: Variables over which swapping will occur. • Protected Variables: Other variables, which may or may not be sensitive.

Swapping Attribute  Uniques Key & Swapping Key  Protected Variable
Location            Age    Sex                  Candidate
Pittsburgh          Young  M                    Bush
Pittsburgh          Young  M                    Gore
Pittsburgh          Young  F                    Gore
Pittsburgh          Old    M                    Gore
Pittsburgh          Old    M                    Bush
Cleveland           Young  F                    Bush
Cleveland           Young  F                    Gore
Cleveland           Old    M                    Gore
Cleveland           Old    F                    Bush
Cleveland           Old    F                    Gore
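The swapping procedure in this example can be sketched in a few lines. This is an illustration under assumptions, not the procedure used for any released data set: records are flagged as unique on (location, age, sex), partners are records in another location with the same (age, sex), and a coin with probability `swap_prob` (a hypothetical parameter) decides whether the two locations are exchanged. The function name `swap_uniques` is also invented here.

```python
import random
from collections import Counter

COLUMNS = ("location", "age", "sex", "candidate")
ROWS = [
    ("Pittsburgh", "Young", "M", "Bush"),
    ("Pittsburgh", "Young", "M", "Gore"),
    ("Pittsburgh", "Young", "F", "Gore"),
    ("Pittsburgh", "Old", "M", "Gore"),
    ("Pittsburgh", "Old", "M", "Bush"),
    ("Cleveland", "Young", "F", "Bush"),
    ("Cleveland", "Young", "F", "Gore"),
    ("Cleveland", "Old", "M", "Gore"),
    ("Cleveland", "Old", "F", "Bush"),
    ("Cleveland", "Old", "F", "Gore"),
]

def swap_uniques(rows, rng, swap_prob=0.5):
    """Swap the location of unique records with a matching record elsewhere."""
    records = [dict(zip(COLUMNS, row)) for row in rows]

    def uniques_key(r):
        return (r["location"], r["age"], r["sex"])

    def swapping_key(r):
        return (r["age"], r["sex"])

    counts = Counter(uniques_key(r) for r in records)
    uniques = [i for i, r in enumerate(records) if counts[uniques_key(r)] == 1]

    swapped, used = [], set()
    for i in uniques:
        if i in used:
            continue
        # Candidate partners: same age and sex, different location, not yet swapped.
        partners = [j for j, other in enumerate(records)
                    if j not in used
                    and other["location"] != records[i]["location"]
                    and swapping_key(other) == swapping_key(records[i])]
        # Flip a (biased) coin to decide whether the swap is carried out.
        if partners and rng.random() < swap_prob:
            j = rng.choice(partners)
            records[i]["location"], records[j]["location"] = (
                records[j]["location"], records[i]["location"])
            used.update((i, j))
            swapped.append((i, j))
    return records, uniques, swapped

released, uniques, swapped = swap_uniques(ROWS, random.Random(0))
print("unique records (row indices):", uniques)
print("swapped pairs:", swapped)
```

Because partners agree on age and sex, the counts by location, age and sex are unchanged by a swap; what is perturbed is the link between location and the remaining variable (candidate).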

Data Swapping In Use • The Treatment Episode Data Set (TEDS) • National admissions to drug treatment facilities • Administered by SAMHSA (part of HHS) • Data released through ICPSR • 1,477,884 records in 1997, of an estimated 2,207,375 total admissions nationwide

TEDS (cont.) • Each record contains • Age • Sex • Race • Ethnicity • Education level • Marital status • Source of income • State & PMSA • Primary substance abused • and much more…

First Step: Recode • Example recode: Education level • Continuous 0-25 becomes • 1: 8 years or less • 2: 9-11 • 3: 12 • 4: 13-15 • 5: 16 or more
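A recode of this kind is a simple lookup. The sketch below assumes the recode runs from whole years of education on the 0-25 scale to the five levels listed on the slide (a coarsening); the function name `recode_education` is hypothetical.

```python
def recode_education(years):
    """Map whole years of education (0-25) to the five-level code."""
    if not 0 <= years <= 25:
        raise ValueError("years of education must be between 0 and 25")
    if years <= 8:
        return 1   # 8 years or less
    if years <= 11:
        return 2   # 9-11
    if years == 12:
        return 3   # 12
    if years <= 15:
        return 4   # 13-15
    return 5       # 16 or more

print([recode_education(y) for y in (0, 8, 9, 12, 15, 16, 25)])
```

Coarsening merges many distinct values into a few levels, which is why far fewer records remain unique after recoding.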

TEDS Uniques Key • This was determined empirically, after recoding. We ended up choosing: • State and PMSA • Pregnancy • Veteran status • Methadone planned as part of treatment • Race • Ethnicity • Sex • Age

Other Choices • Swapping key: the uniques key, plus primary substance of abuse • Swapping attribute: state, PMSA, Census Region and Census Division • Protected variables: all other TEDS variables

TEDS Results • After recoding, only 0.3% of the records needed to be swapped • Swapping was done between nearby locations, to preserve statistics over natural geographic aggregations

Tables: Magnitude Data

Profit of Software Firms (× $10 million)
Language    Region A  Region B  Region C  Total
C++               11        47        58    116
Java               1        15        33     49
Smalltalk          2        31        20     53
Total             14        93       111    218
(Source: Java Random.nextInt(75))

Tables: Frequency Data

Income Level, Gender = Male
Race      ≤ $10,000   > $10,000 and ≤ $25,000   > $25,000   Total
White            96                        72         161     329
Black            10                         7           6      23
Chinese           1                         1           2       4
Total           107                        80         169     356

Income Level, Gender = Female
Race      ≤ $10,000   > $10,000 and ≤ $25,000   > $25,000   Total
White           186                       127          51     364
Black            11                         7           3      21
Chinese           0                         1           0       1
Total           197                       135          54     386
(Source: 1990 Census)

Traditional Solutions for Tables • Suppress some cells • Publish only the marginal totals • Suppress the sensitive cells, plus others as necessary • Perturb some cells • Controlled rounding • Lots of research here, and good results for 2-way tables • For 3-way and higher, this is surprisingly hard!

Disclosure Risk, Data Utility • Risk • the degree to which confidentiality might be compromised • perhaps examine cell feasibility intervals, or better, distributions of possible cell values • Utility • a measure of the value to a legitimate user • higher when a user can more accurately estimate the magnitude of errors in an analysis based on the released table

Example: Delinquent Children • Table: Number of Delinquent Children by County and Education Level of Head of Household (Source: OMB Statistical Policy Working Paper 22)

Controlled Rounding (Base 3) • Uniform (and known) feasibility interval. • Easy for 2-D tables, sometimes impossible for 3-D • 1,025,908,683 possible original tables.

Suppress Sensitive Cells & Others • Hard to do optimally (NP-complete). • Feasibility intervals easily found with LP. • Users have no way of finding cell value probabilities.

Release Only the Margins • 18,272,363,056 tables have our margins (thanks to De Loera & Sturmfels). • Low risk, low utility. • Easy! • Very commonly done. • Statistical users might estimate internal cells with e.g., iterative proportional fitting. • Or with more powerful methods…
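Iterative proportional fitting alternately rescales rows and columns of a starting table until both sets of margins match. The sketch below is an illustration under assumptions: the function name `ipf` and its parameters are invented here, the starting table is uniform, and the margins are borrowed from the Gender = Male income table earlier in the talk rather than from the delinquent children example.

```python
def ipf(row_totals, col_totals, seed=None, iters=100, tol=1e-9):
    """Fit a two-way table to the given margins by alternating row/column scaling."""
    n_rows, n_cols = len(row_totals), len(col_totals)
    table = ([list(map(float, row)) for row in seed] if seed
             else [[1.0] * n_cols for _ in range(n_rows)])
    for _ in range(iters):
        # Scale each row to its target total.
        for i in range(n_rows):
            s = sum(table[i])
            if s > 0:
                table[i] = [v * row_totals[i] / s for v in table[i]]
        # Scale each column to its target total.
        for j in range(n_cols):
            s = sum(table[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    table[i][j] *= col_totals[j] / s
        # Stop once the row totals also match after the column step.
        if all(abs(sum(table[i]) - row_totals[i]) < tol for i in range(n_rows)):
            break
    return table

rows = [329, 23, 4]      # White, Black, Chinese (male table)
cols = [107, 80, 169]    # the three income levels
estimate = ipf(rows, cols)
for row in estimate:
    print([round(v, 2) for v in row])
```

With a uniform start and only the two one-way margins, the fit reduces to the independence estimate (row total × column total / grand total), so a snooper learns nothing about interactions; a more informative starting table would change that.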

Some New Methods for Disclosure Detection • Use methods from commutative algebra to explore the set of tables having known margins • Originated with the Diaconis-Sturmfels paper (Annals of Statistics, 1998) • Extensions and complements by Dobra and Fienberg • Related to existing ideas in combinatorics

Background: Algebraic Ideals • Let A be a ring (e.g., ℝ or ℤ), and I ⊆ A. Then I is called an ideal if • 0 ∈ I • If f, g ∈ I, then f + g ∈ I • If f ∈ I and h ∈ A, then f h ∈ I • Ex. 1: The ring ℤ of integers, and the set I = {…, -4, -2, 0, 2, 4, …}.

Generators • Let f1,…,fs ∈ A. Then ⟨f1,…,fs⟩ = {h1 f1 + … + hs fs : h1,…,hs ∈ A} is the ideal generated by f1,…,fs • Ex. 1: The ring ℤ of integers, and the set I = {…, -4, -2, 0, 2, 4, …}. • SAT question: What is a generator of I? • I = ⟨2⟩, since I = {2h : h ∈ {…, -1, 0, 1, …}}

Ideal Example 2 • The ring k[x] of polynomials of one variable, and the ideal I = ⟨x^4 - 1, x^6 - 1⟩. • GRE question: What is a minimal generator of I (i.e., no subset also a generator)? • I = ⟨x^2 - 1⟩, since x^2 - 1 is the greatest common divisor of {x^4 - 1, x^6 - 1}.
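The greatest common divisor in this example can be checked with the Euclidean algorithm for polynomials in one variable. This is a minimal sketch assuming rational coefficients (the field k is taken to be ℚ for the illustration); polynomials are coefficient lists with the lowest degree first, and the helper names `trim`, `poly_mod` and `poly_gcd` are invented here.

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients (highest-degree terms that vanish)."""
    while p and p[-1] == 0:
        p = p[:-1]
    return p

def poly_mod(a, b):
    """Remainder of a divided by b (b must be non-zero)."""
    a = trim([Fraction(c) for c in a])
    b = trim([Fraction(c) for c in b])
    while len(a) >= len(b):
        factor = a[-1] / b[-1]
        shift = len(a) - len(b)
        # Subtract factor * x^shift * b to cancel the leading term of a.
        for k, c in enumerate(b):
            a[shift + k] -= factor * c
        a = trim(a)
    return a

def poly_gcd(a, b):
    """Monic greatest common divisor of two non-zero polynomials."""
    a = trim([Fraction(c) for c in a])
    b = trim([Fraction(c) for c in b])
    while b:
        a, b = b, poly_mod(a, b)
    return [c / a[-1] for c in a]

x4_minus_1 = [-1, 0, 0, 0, 1]
x6_minus_1 = [-1, 0, 0, 0, 0, 0, 1]
g = poly_gcd(x6_minus_1, x4_minus_1)
print([int(c) for c in g])   # coefficients of x^2 - 1, lowest degree first
```

In one variable every ideal has a single generator, found this way; with several variables that is no longer true, which is where Gröbner bases come in.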

Why Are Minimal Generators Useful? • Compact representation of the ideal---the initial description may be “verbose”. • Allow easy generation of elements, often in a disciplined order. • Guaranteed to explore the full ideal.

Disclosure Analysis: Marginals • Suppose a “data snooper” knows only this:

Disclosure Analysis • Of course the data collector knows the true counts:

Disclosure Analysis • What are the feasible tables, given the margins? Here is one:

Disclosure Analysis Problems • Both the data collector and the snooper are interested in the largest and smallest feasible values for each cell. • Narrow bounds might constitute disclosure. • Both might also be interested in the distribution of possible cell values. • A tight distribution might constitute disclosure.
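For a two-way table with both one-way margins known, the largest and smallest feasible values of a cell have a closed form (the Fréchet bounds): a cell in a row with total r and a column with total c, in a table with grand total n, lies between max(0, r + c - n) and min(r, c). The sketch below uses the margins of the Gender = Male income table from earlier in the talk as an illustration; the function name `frechet_bounds` is hypothetical.

```python
def frechet_bounds(row_totals, col_totals):
    """(lower, upper) bounds for every cell of a two-way table with given margins."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same grand total")
    return [[(max(0, r + c - n), min(r, c)) for c in col_totals]
            for r in row_totals]

rows = [329, 23, 4]      # White, Black, Chinese
cols = [107, 80, 169]    # the three income levels
bounds = frechet_bounds(rows, cols)
for race, row in zip(("White", "Black", "Chinese"), bounds):
    print(race, row)
```

Every cell in the smallest row is bounded above by that row's total, so small margins translate directly into narrow, potentially disclosive bounds.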

The Bounds Problem • This is usually framed as a continuous problem. Is this the right question?

The Distribution Problem • Given the margins, and a set of priors over cell values, what distribution results?
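For small tables the distribution of a cell can be found by brute force: list every non-negative integer table with the given margins and tabulate the cell of interest. The sketch below assumes the simplest prior, in which every feasible table is equally likely; the 2 × 3 margins and the function names `compositions` and `tables_with_margins` are invented for the illustration.

```python
from collections import Counter

def compositions(total, caps):
    """Yield non-negative integer tuples x with sum(x) == total and x[j] <= caps[j]."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest

def tables_with_margins(row_totals, col_totals):
    """Yield every non-negative integer table with the given row and column totals."""
    if not row_totals:
        if all(c == 0 for c in col_totals):
            yield ()
        return
    for row in compositions(row_totals[0], col_totals):
        remaining = [c - x for c, x in zip(col_totals, row)]
        for rest in tables_with_margins(row_totals[1:], remaining):
            yield (row,) + rest

# Hypothetical margins for a 2 x 3 table.
tables = list(tables_with_margins([3, 2], [2, 2, 1]))
cell_counts = Counter(t[0][0] for t in tables)
for value in sorted(cell_counts):
    print(value, cell_counts[value] / len(tables))
```

Enumeration grows explosively with table size, as the table counts quoted earlier suggest, so it is practical only for very small problems.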

Transform to an Algebraic Problem • Define one indeterminate xij for each cell (i, j) of the table • Then write the move as the polynomial x11 x22 - x12 x21

Ideal Example 3 • Ideal I is the set of monomial differences that take a non-negative table of dimension J × K with fixed margins M to another non-negative table with the same margins. • Putnam Competition question: What is a minimal generator of I?
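In table terms, a monomial difference such as x11 x22 - x12 x21 is a move that adds 1 to cells (1,1) and (2,2) and subtracts 1 from cells (1,2) and (2,1), leaving all margins unchanged. For two-way tables, moves of this form on every pair of rows and pair of columns are known to connect all tables sharing the same margins (Diaconis & Sturmfels). The sketch below explores the reachable set by breadth-first search from a hypothetical 2 × 3 starting table; the function names `neighbours` and `fiber` are invented here.

```python
from collections import deque
from itertools import combinations

def neighbours(table):
    """Tables reachable by one basic move on the corners of a 2 x 2 sub-table."""
    n_rows, n_cols = len(table), len(table[0])
    for i, k in combinations(range(n_rows), 2):
        for j, l in combinations(range(n_cols), 2):
            for sign in (1, -1):
                new = [list(row) for row in table]
                new[i][j] += sign
                new[k][l] += sign
                new[i][l] -= sign
                new[k][j] -= sign
                # Keep the move only if the table stays non-negative.
                if min(new[i][j], new[k][l], new[i][l], new[k][j]) >= 0:
                    yield tuple(tuple(row) for row in new)

def fiber(start):
    """All tables reachable from `start` by repeated basic moves."""
    start = tuple(tuple(row) for row in start)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nb in neighbours(current):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen

reachable = fiber([[2, 1, 0], [0, 1, 1]])   # row totals 3, 2; column totals 2, 2, 1
print(len(reachable))
for t in sorted(reachable):
    print(t)
```

For three-way and higher tables these simple moves are generally not enough, and a larger set of moves has to be computed.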

Solutions: Bounds, Distributions • Upper and lower bounds on cells: • Integer linear programming (max/min, subject to constraints implied by the marginals). • Find a generator of the ideal: • Use Buchberger’s algorithm to find a Gröbner basis. • Apply long division repeatedly; the remainder yields an optimum (Conti & Traverso). • Distribution of cell values: • Use the Gröbner basis to enumerate or sample (Diaconis & Sturmfels, Dobra, Dobra & Fienberg).
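When enumeration is too expensive, a set of connecting moves supports sampling instead: start from any feasible table, repeatedly propose a random move, and reject proposals that would make a cell negative. The sketch below is a simplified illustration in the spirit of the Diaconis & Sturmfels random walk, restricted to a two-way table where the basic 2 × 2 moves suffice; the starting table, the step count, and the function name `random_walk` are all assumptions made here.

```python
import random
from collections import Counter

def random_walk(table, steps, rng):
    """Random walk over tables with fixed margins; records cell (0, 0) at each step."""
    table = [list(row) for row in table]
    n_rows, n_cols = len(table), len(table[0])
    visits = []
    for _ in range(steps):
        # Ordered pairs of rows and columns, so both directions of a move are proposed.
        i, k = rng.sample(range(n_rows), 2)
        j, l = rng.sample(range(n_cols), 2)
        # Proposed move: +1 at (i, j) and (k, l), -1 at (i, l) and (k, j).
        if table[i][l] > 0 and table[k][j] > 0:
            table[i][j] += 1
            table[k][l] += 1
            table[i][l] -= 1
            table[k][j] -= 1
        # Rejected proposals leave the table where it is.
        visits.append(table[0][0])
    return table, visits

start = [[2, 1, 0], [0, 1, 1]]   # hypothetical table: row totals 3, 2; column totals 2, 2, 1
final, visits = random_walk(start, steps=10_000, rng=random.Random(1))
freq = Counter(visits)
for value in sorted(freq):
    print(value, freq[value] / len(visits))
```

Each move and its reverse are proposed with equal probability, so on a connected set of tables the walk targets the uniform distribution; other target distributions would need an extra acceptance step.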

Trouble in Paradise • For larger disclosure problems (dimension ≥ 3), the linear programming bounds may be fractional. • For larger disclosure problems (dimension ≥ 3), the Gröbner basis is very hard to compute.
