1 / 36

Issues Encountered Deploying Differential Privacy

Learn about the challenges faced by the U.S. Census Bureau in implementing differential privacy for the upcoming 2020 Census, including personnel, computing environment, data usage, privacy-loss parameter, and verification.

grante
Télécharger la présentation

Issues Encountered Deploying Differential Privacy

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Issues Encountered Deploying Differential Privacy Simson L. GarfinkelSenior Scientist, Confidentiality and Data AccessU.S. Census BureauIQT Technology Focus Day March 14, 2019 | Marriott Tysons Corner The views in this presentation are those of the author, and not those of the U.S. Census Bureau.

  2. Abstract • When differential privacy was created more than a decade ago,  the motivating example was statistics published by an official statistics agency. In attempting to transition differential privacy from the academy to practice, and in particular for the 2020 Census of Population and Housing, the U.S. Census Bureau has encountered many challenges unanticipated by differential privacy's creators. These challenges include obtaining qualified personnel and a suitable computing environment, the difficulty accounting for all uses of the confidential data, the lack of release mechanisms that align with the needs of data users, the expectation on the part of some data users that they will have access to micro-data, the difficulty in setting the value of the privacy-loss parameter, ε (epsilon), and the lack of  tools and trained individuals to verify the correctness of differential privacy implementations. 

  3. Acknowledgments • This presentation incorporates work by: • Dan Kifer (Scientific Lead) • John Abowd (Chief Scientist) • Tammy Adams, Robert Ashmead, ArefDajani, Jason Devine, Nathan Goldschlag, Michael Hay, Cynthia Hollingsworth, MeritonIbrahimi, Michael Ikeda, Philip Leclerc, AshwinMachanavajjhala, Christian Martindale, Gerome Miklau, Brett Moran, Ned Porter, Anne Ross, William Sexton, Lars Vilhuber, and Pavel Zhuravlev

  4. Outline Motivation The flow of census response data Disclosure avoidance for the 2010 census Disclosure avoidance for the 2020 census Conclusion

  5. Our motivation: The US Constitution • Article 1, Section 2 “The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years, in such Manner as they shall by Law direct.”

  6. “in such Manner as they shall by Law direct.”Public Law 94-171 http://uscode.house.gov/statutes/pl/94/171.pdf

  7. The Census Bureau is also directed to respect confidentiality. 13 U.S. Code § 9 - Information as confidential (a) Neither the Secretary, nor any other officer or employee of the Department of Commerce or bureau or agency thereof, or local government census liaison may…. (2) Make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or … Copies of census reports, which have been so retained, shall be immune from legal process, and shall not, without the consent of the individual or establishment concerned, be admitted as evidence or used for any purpose in any action, suit, or other judicial or administrative proceeding.

  8. Federal Register / Vol. 82, No. 215 / Nov 8, 2017 / Notices Dec. 31, 2018 We will report (GROUP BY block): • P1. RACE/ETHNICITY Universe: Total population • P2. RACE/ETHNICITY Universe: Total population age 18 and over • H1. OCCUPANCY STATUS • P42. GROUP QUARTERS POPULATION Universe: Population in Group Quarters

  9. Disclosure Avoidance for the 2010 Census

  10. “This is the official form for all the people at this address.” “It is quick and easy, and your answers are protected by law.”

  11. 2010 Census of Population and Housing Basic results from the 2010 Census:

  12. 2010 Census High-Level Database Schema

  13. We now know that the disclosure avoidance techniques we used in the 2010 Census were flawed. We released exact population counts at the block, tract and county level. We did not swap 100% of the households We released ≈25 statistics per person, but only collected six pieces of data per person: • Block • Age • Sex • Race • Ethnicity • Relationship to Householder The Census Bureau did the best possible in 2010. The math for protecting a decennial census did not [yet] exist.

  14. We performed a database reconstruction and re-identification attack for all 308,745,538 people in 2010 Census Reconstructed 308,745,538 microdata records. Used 4 commercial databases of the 2010 US population acquired 2009-2011 in support of the 2010 Census • Commercial database had: NAME, ADDRESS, AGE, SEX Linked reconstructed records to the commercial database records • Linked database now has: NAME, ADDRESS, AGE, SEX , ETHNICITY, & RACE • Link rate: 45% Compared linked database to Census Bureau confidential data • Question: How often did the attack get the all variables including race and ethnicity right? • Answer: 38% (17% of US population)

  15. We will use Differential Privacy to address this issue.

  16. We have created a “Disclosure Avoidance System” Microdata Detail File Census Edited File Disclosure Avoidance System • Features of the DAS: • Operates on the edited Census records • Designed to make records that are “safe to tabulate.” • Drops into the 2020 Census data processing pipeline

  17. The Disclosure Avoidance System allows the Census Bureau to enforce global confidentiality protections. Pre-specified tabular summaries: PL94-171, SF1 DAS CEF CUF DRF MDF Census Unedited File Census Edited File Global Confidentiality Protection Process Disclosure Avoidance System Decennial Response File Microdata Detail File Special tabulations and post-census research Privacy-loss Budget, Accuracy Decisions Confidential data Public data

  18. The Disclosure Avoidance System relies on infusing formally private noise. Global Confidentiality Protection Process Disclosure Avoidance System ε • Advantages of noise injection with formal privacy: • Easy to understand • Provable and tunable privacy guarantees • Privacy guarantees do not depend on external data • Protects against database reconstruction attacks • Privacy operations are composable • Disadvantages: • Entire country must be processed at once for best accuracy • Every use of private data must be tallied in the privacy loss budget • It may not be possible to create some tables from Microdata.

  19. Why generate a differentially private MDF? • Key advantages: • Familiar to internal and external stakeholders • Operates with legacy tabulation systems to produce PL-94 and SF-1 tabulations • Consistency among query answers

  20. Algorithmic challenges in creating a differentially private MDF • It’s NP-hard! • Some statistics will not be generated from microdata • We may have multiple sets of microdata: • Each set designed for a different purpose • Each set possibly mutually inconsistent

  21. Business process challenges in creating a differentially private MDF All queries on MDF with targeted accuracy must be known in advance. All uses of confidential data need to be tracked and accounted for. Data quality checks on tables cannot be done by looking at raw data.

  22. Communications challenges in creating a differentially private MDF • Differential privacy is not widely known or understood. • Many data users want highly accurate data reports on small areas. • Users in 2000 and 2010 didn’t know the error introduced by swapping.

  23. Differential Privacy at the US Census Bureau

  24. Differential Privacy: A Survey of Deployments so far US Census Bureau Rest of the World Google Chrome: Windows Process Names, Chrome Homepages, etc. 135 metrics in total. ε=2 .. 7 per metric. Apple: QuickType suggestions, Emoji suggestions, Lookup Hints, Safari Energy Draining Domains, Safari Autoplay Intent Detection, Safari Crashing Domains, Health Type Usage ε=7 (MacOS); ε=14 (iOS)per day. Microsoft: Telemetry from Windows 10 • Longitudinal Employer Household Database (LEHD) • On The Map ε=8.9 • “Protecting Graduate Earnings and Employment Outcomes” ε=3 Sources: Wired, “How One of Apple’s Key Privacy Safeguards Falls Short,” Sept. 15, 2017. https://www.wired.com/story/apple-differential-privacy-shortcomings/ AccessNow, “Differential privacy, part 3,” Nov. 30, 2017. https://www.accessnow.org/differential-privacy-part-3-extraordinary-claims-require-extraordinary-scrutiny/

  25. The Census Bureau’s experience with OnTheMap did not prepare the organization for the challenge. OnTheMap was a new product, designed from the start to be DP on the residential side and to use a pre-DP noise injection on the business side. The decennial Census of Population and Housing, first performed under the direction of Thomas Jefferson in 1790, is the oldest and most expensive statistical undertaking of the U.S. government. Transitioning existing data products has revealed: The limits of today’s formal privacy mechanisms The difficulty of retrofitting legacy statistical products to conform with modern privacy practice

  26. Scientific Issues for the 2020 Census We needed a novel mechanism that: • Had consistent statistics US->States->Counties->Tracts->Blocks • Provided lower error for larger population geographies. Invariants. For the 2018 End-to-End test, policy makers wanted exact counts for: • Number of people on each block • Number of people on each block of voting age • Number of residences & group quarters on each block

  27. We now know that offering invariants was a mistake! Invariants operate by limiting the range of possible neighboring datasets. This significantly limits the universe of possible database reconstructions. Invariants effectively redefine the privacy guarantees of differential privacy in a way that is not acceptable. For the 2020 Census, we expect to have a single invariant: • The number of people in each state (as implemented for the 2018 End-to-End Census Test). • This is a policy constraint, not a mathematical constraint.

  28. Public issue: Setting Epsilon We want the most efficient technology, which is the one in red.

  29. Where social scientists act like MSC = MSB Where computer scientists act like MSC = MSB

  30. Scientific Issues: Quality Metrics What is the measure of “quality” or “utility” in a complex data product? Options: • L1 error between “true” data set and “privatized” data set • Impact on an algorithm that uses the data (e.g. voting rights the statistic for a statutory purpose

  31. Policy issue:What should be accurate? Reporting Option #1 Sex split more important As enumerated Reporting Option #1 Age split more important

  32. Scientific issues for the American Community Survey The American Community Survey replaced the “Long Form” in 2005. ACS uses a complex multi-stage probability sample of the entire US, with ex post weight adjustments. Differential privacy currently does not handle (well): Non-design weights on survey microdata. (maximum weight is data dependent in this case) Post-processing after design weight determination; e.g., skip-logic/conditional responses.

  33. Operational Issues Obtaining qualified personnel and tools Recasting high-sensitivity queries Identifying structural zeros Obtaining a suitable computing environment Accounting for all uses of the confidential data

  34. Issues Faced by Data Users Access to micro-data • Many users, especially internal Census Bureau users, expect access to microdata. Difficulties arising from increased transparency • Many users were not aware of prior disclosure avoidance practices. • The swap rate was never made public. Misunderstandings about randomness and noise injection

  35. Recommendations Repeated discussions with decision makers Controlled vocabulary Integrated communications strategy

  36. Thank you.Simson.L.Garfinkel@census.gov

More Related