Doug Szajda Mike Pohl * Jason Owen Barry Lawson

Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm

Presentation Transcript


  1. Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm • Doug Szajda, Mike Pohl*, Jason Owen, Barry Lawson

  2. Large-Scale Distributed Computations • Easily parallelizable, compute intensive • Divide into independent tasks to be executed on participant PCs • Significant results collected by supervisor

  3. Examples • seti@home: Finding Martians • folding@home: Protein folding • GIMPS (Entropia): Mersenne Prime search • United Devices, IBM, DOD: Smallpox study • DNA sequencing • Graphics • Exhaustive Regression • Genetic Algorithms • Data Mining • Monte Carlo simulation

  4. A Problem • Code is executing in untrusted environments • Data required for task execution may be proprietary • Can we find a way to have participants execute tasks without divulging data?

  5. Related Work (not exhaustive) • Computing with Encrypted Data • Feigenbaum (1985) • Abadi, Feigenbaum, Kilian (1987) • Secure Circuit Evaluation • Abadi and Feigenbaum (1990) • Sander, Young, and Yung (1999)

  6. Related Work (not exhaustive) • Privacy Homomorphisms • Rivest, Adleman, Dertouzos (1978) • Ahituv, Lapid, Neumann (1987) • Brickell and Yacobi (1987) • Multiparty function computation • Yao (1986) • Goldreich, Micali, Wigderson (1987) • Ben-Or, Goldwasser, and Wigderson (1988) • Chaum, Crépeau, and Damgård (1988)

  7. Computing With Encrypted Data • Alice has x, wants Bob to compute f(x), but does not want to divulge x • Alice gives Bob E(x) and f’, tells him to return f’(E(x)) • Alice can determine f(x) from f’(E(x)), but Bob cannot determine x from knowledge of E(x), f’(E(x))

  8. In Present Context • Alice has several x values. Asks Bob to identify those that are significant • Alice doesn’t need f(x), so greater flexibility in definition of f’ (Sufficient Accuracy) • Post-filtering means that some false positives are OK. • Lots of Bobs offering computing services

  9. Adversary (as usual) • Assumed to be intelligent • Can decompile, analyze, modify code • Understands task algorithms and measures used to prevent disclosure of data

  10. The Model • Computation: evaluate f : D -> R • Partition D into subsets Di • Task T(Di): evaluate f(xi) for all xi in Di • Each task assigned filter function Gi • Gi returns indices of interesting xi
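The model on this slide can be sketched in a few lines of Python. This is only an illustration: the names `run_task` and `threshold_filter`, the toy function (squaring) and the threshold value are assumptions made for the example and do not come from the presentation.

```python
from typing import Callable, List, Sequence


def run_task(f: Callable, d_i: Sequence, g_i: Callable) -> List[int]:
    """Task T(D_i): evaluate f on every x in D_i, then let the filter G_i
    return the indices of the interesting elements."""
    values = [f(x) for x in d_i]
    return g_i(values)


def threshold_filter(threshold: float) -> Callable:
    """Hypothetical filter G_i: an index is interesting if its value exceeds threshold."""
    def g(values: Sequence) -> List[int]:
        return [i for i, v in enumerate(values) if v > threshold]
    return g


# Toy computation (assumption): f squares its input, D = {0, ..., 9}.
D = list(range(10))
partition = [D[0:5], D[5:10]]  # the subsets D_1 and D_2
results = [run_task(lambda x: x * x, d, threshold_filter(20)) for d in partition]
print(results)
```

Each participant would receive one `(D_i, f, G_i)` triple and return only the index list, which is what the supervisor collects.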

  11. Basic Approach • Transform Di, f, Gi into Di’, f’, Gi’ • Replace T(Di) with T(Di’) such that • T(Di’) does not leak additional information about values in Di • Identifiers returned by T(Di’) contain those that would be returned by T(Di) • Difference between the two sets of identifiers is reasonably small

  12. Reality • Providing required properties is difficult (impossible for some apps) • Even when possible, implementation is application specific • Bottom line: A potential approach, where few (if any) others exist

  13. An Example: Smith-Waterman Genome Sequence Comparison

  14. Genetic Sequence Alignment • Comparing sequences over alphabet Σ = {A,C,G,T} • Biologists track evolutionary changes by writing sequences with columns aligned (called an alignment) • Ex. CTGTTA aligned with CAGTTA

  15. Sequence Evolution • Deletion: CTGTTA → CTGTA • Insertion: CTGTTA → CGTGTTA • Deletions and insertions are collectively called indels • Substitution: CTGTTA → CAGTTA

  16. Sequence Evolution (cont.) • After several “generations”: CTGTTA → CTATGCTCG • Note: Number of alignments (for pair of realistic length sequences) is huge

  17. Alignment “Types” • Global alignment • Considers entire sequence • Local alignment • Considers substrings • Biologists usually consider local alignments

  18. Measuring Alignments • Scoring function • +1 if symbols match • -1 if not • Gap penalty • g(k) = a + b(k-1) • k is gap length (# consecutive dashes in single sequence) • Alignment score is sum of column scores minus gap penalties
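The scoring rule above can be written out as a short Python sketch. It assumes that a column containing a dash contributes no match or mismatch score and is charged only through the gap penalty, and it uses the values a = 2 and b = 1 purely as sample parameters; the function names are illustrative.

```python
import re


def gap_penalty(k: int, a: int, b: int) -> int:
    """Affine gap penalty g(k) = a + b(k-1) for a gap of length k >= 1."""
    return a + b * (k - 1)


def alignment_score(top: str, bottom: str, a: int, b: int) -> int:
    """Sum of column scores (+1 match, -1 mismatch) minus gap penalties.

    `top` and `bottom` are the two rows of an alignment, with '-' for gaps.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    # Columns without a dash score +1 or -1 (assumption: dash columns score 0 here).
    column_total = sum(
        1 if x == y else -1
        for x, y in zip(top, bottom)
        if x != "-" and y != "-"
    )
    # Every maximal run of dashes within a single row is one gap of length k.
    gap_total = sum(
        gap_penalty(len(run), a, b)
        for row in (top, bottom)
        for run in re.findall(r"-+", row)
    )
    return column_total - gap_total


print(alignment_score("CTGTTA", "CAGTTA", a=2, b=1))  # substitution example
print(alignment_score("CTGTTA", "CTG-TA", a=2, b=1))  # deletion example
```

With these sample parameters, the substitution alignment scores 5 matches minus 1 mismatch, and the deletion alignment scores 5 matches minus one gap of length 1.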

  19. Smith-Waterman • Dynamic programming algorithm guaranteed to produce an optimal alignment • Global: O(n^2); local: O(n^3) • Widely used by biologists • Implemented on commercial volunteer distributed computing platforms
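For reference, a minimal sketch of the local-alignment dynamic program follows. It computes only the best local score (not the alignment itself), uses the +1/-1 scoring and the gap penalty g(k) = a + b(k-1) from the previous slide, and tries every gap length at each cell, which gives the cubic running time. The function name and default parameters are choices made for this example, not the presenters' implementation.

```python
from typing import Sequence


def smith_waterman_score(s: Sequence, t: Sequence, a: int = 2, b: int = 1,
                         match: int = 1, mismatch: int = -1) -> int:
    """Best local alignment score of s and t with gap penalty g(k) = a + b(k-1)."""
    n, m = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # End with a gap of length k in t (consumes k symbols of s).
            up = max(H[i - k][j] - (a + b * (k - 1)) for k in range(1, i + 1))
            # End with a gap of length k in s (consumes k symbols of t).
            left = max(H[i][j - k] - (a + b * (k - 1)) for k in range(1, j + 1))
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("CTGTTA", "CAGTTA"))
```

Because the function only compares elements for equality, it accepts strings as well as lists of integers.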

  20. Using Smith-Waterman • Significance of Smith-Waterman score based on probabilistic considerations • Empirical Evidence: Similarity scores of randomly generated sequences exhibit an extreme value distribution • Significance threshold p chosen so that the probability that a random score exceeds p is small (typically < 0.003)

  21. A Smith-Waterman Task • Pairwise comparison of two sets of sequences, A and B • A : proprietary sequences • B : sequences from public database • Returned: indices of well-matched pairs • Notation: T(A,B,s,g,p)

  22. Our Transformation • Offset sequences: compare relative distances between occurrences of a specific nucleotide (the first entry is the position of the first occurrence) • U: GCACTTACGCCCTTACGACG • F(U,A) = {3,4,8,3} • F(U,C) = {2,2,4,2,1,1,4,3} • F(U,G) = {1,8,8,3} • F(U,T) = {5,1,7,1}
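The offset transformation can be reproduced with a few lines of Python. The function name `offsets` is an assumption for this sketch; the rule it implements (position of the first occurrence, then distances between consecutive occurrences) is inferred from the values listed on the slide.

```python
from typing import List


def offsets(sequence: str, nucleotide: str) -> List[int]:
    """F(sequence, nucleotide): 1-based position of the first occurrence,
    followed by the distance from each occurrence to the next."""
    result = []
    previous = 0  # treat position 0 as the point just before the sequence starts
    for position, symbol in enumerate(sequence, start=1):
        if symbol == nucleotide:
            result.append(position - previous)
            previous = position
    return result


U = "GCACTTACGCCCTTACGACG"
for literal in "ACGT":
    print(literal, offsets(U, literal))
```

A participant who receives only `offsets(U, "C")` learns where every C sits in U, but sees nothing about which of A, G, or T occupies the remaining positions.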

  23. Modified Tasks • U: GCACTTACGCCCTTACGACG F(U,C) = {2,2,4,2,1,1,4,3} • V: GCACTCGCCACTTAGCACG F(V,C) = {2,2,2,2,1,2,5,2} • Apply S-W to F(U,C) and F(V,C) • Scoring function, gap penalty • “Goodness” threshold

  24. Intuition • Similar sequences should have similar offsets • Consider effects of indels, substitutions • False positives can be reduced • Consider multiple nucleotides • I.e., assign A and C info to distinct participants • Good match if both tasks indicate significance

  25. Using Multiple Nucleotide Literals • Maximum method • One task for each of A,C,G,T • Result significant if any of the four says so • Adding method • One task for each of A,C,G,T, results passed to fifth participant • Result significant if sum of four scores indicates significance • Costs reduced in either case
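The two combination rules can be sketched as follows. The per-task scores and both thresholds are made-up numbers for illustration only; how the thresholds are actually chosen is not specified on this slide.

```python
from typing import Dict


def significant_by_maximum(scores: Dict[str, float], task_threshold: float) -> bool:
    """Maximum method: significant if any one of the four per-nucleotide tasks
    reports a score above its significance threshold."""
    return any(score > task_threshold for score in scores.values())


def significant_by_adding(scores: Dict[str, float], sum_threshold: float) -> bool:
    """Adding method: a fifth participant sums the four scores and compares
    the total against a threshold for the sum."""
    return sum(scores.values()) > sum_threshold


# Hypothetical scores returned by the A, C, G and T tasks for one sequence pair.
scores = {"A": 3, "C": 12, "G": 4, "T": 5}
print(significant_by_maximum(scores, task_threshold=10))
print(significant_by_adding(scores, sum_threshold=30))
```

In this made-up case the pair is flagged by the maximum method (the C task alone exceeds 10) but not by the adding method (the total is 24).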

  26. Security?

  27. Recall… • T(Di’) does not leak additional information about values in Di • Identifiers returned by T(Di’) contain those that would be returned by T(Di) • Difference between the two sets of identifiers is reasonably small

  28. Data Privacy? • Property 1 fails: adversary will know all info about a single nucleotide literal • Conditional entropy gives rough estimate of amount of information leaked • Bits leaked: 2N - (N - C∂) log2 3 • N is the sequence length; C∂ is # of occurrences of ∂ in sequence • Ex. N = 600, C∂ = N/4 → about 487 bits (of 1200) leaked (713 bits of uncertainty remain)
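The arithmetic in the example can be checked directly. The sketch below assumes the logarithm is base 2 and reads the formula as follows: a sequence of N symbols over four letters carries 2N bits, and once the positions of one literal are known, each of the remaining N - C positions is still one of three letters, leaving (N - C) log2 3 bits of uncertainty. The function name is illustrative.

```python
import math


def bits_leaked(n: int, count: int) -> float:
    """Rough leakage estimate: 2N - (N - C) * log2(3), where C is the number
    of occurrences of the revealed nucleotide in a sequence of length N."""
    return 2 * n - (n - count) * math.log2(3)


n = 600
count = n // 4  # the revealed literal occupies a quarter of the positions
leaked = bits_leaked(n, count)
remaining = 2 * n - leaked
print(round(leaked), round(remaining))
```

This reproduces the figures on the slide: roughly 487 of 1200 bits leaked, with about 713 bits of uncertainty remaining.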

  29. Analysis • Clearly, our scheme does not provide provable security, but it does suggest two questions: • Can an adversary determine additional symbols (and if so, how many)? • How much information leakage is too much in this context?

  30. “4 out of 5 [Biologists] Agree” • Given only the position of a single nucleotide literal: • No additional elements can be inferred • There is no “biologically useful” information that can be inferred • Given current understanding of the structure and function of the genome

  31. An Extension • Sequences can be “masked” • For each task, choose random binary mask • Remove from sequence all “zeroed” elements • Our experiments suggest mask with “1” in 90% of positions works well
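A minimal sketch of the masking step is shown below. It assumes each position is kept independently with probability 0.9; the slide only says the mask has a "1" in 90% of positions, so an exact 90% count would be an equally valid reading. The function names and the optional `rng` parameter are choices made for this example.

```python
import random
from typing import List, Optional


def random_mask(length: int, keep_probability: float = 0.9,
                rng: Optional[random.Random] = None) -> List[int]:
    """Random binary mask: 1 keeps a position, 0 removes it."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_probability else 0 for _ in range(length)]


def apply_mask(sequence: str, mask: List[int]) -> str:
    """Remove from the sequence every element whose mask bit is 0."""
    return "".join(symbol for symbol, bit in zip(sequence, mask) if bit)


U = "GCACTTACGCCCTTACGACG"
mask = random_mask(len(U), rng=random.Random(0))  # seeded only for repeatability
print(apply_mask(U, mask))
```

Since a fresh mask is drawn for each task, a participant no longer sees the exact positions in the original sequence, only positions in a randomly thinned copy.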

  32. Does it Work? • In general, yes • Strong correlation between our scores and S-W • Not as sensitive as Smith-Waterman • Some weak matches missed • Statistical inference techniques show: • Very few false positives (< 10^-4) • Very few false negatives (often none)

  33. Simulation Results • Well-matched sequences artificially generated • Substring mutated over several generations • Placed at random location into random sequences • Scoring function as given earlier (1, -1) • Gap penalty: g(k) = 2 + 1(k-1)

  34. 10000 comparisons, no mask, maximum method for determining significance • Sequence length 600-800, matching portion length 300, average of 52.5 substitutions and 52.5 indels

  35. 10000 comparisons, no mask, adding method for determining significance • Sequence length 600-800, matching portion length 300, average of 52.5 substitutions and 52.5 indels

  36. 1000 comparisons, no mask, maximum method for determining significance • Sequence length 2000, matching portion length 1000, average of 150 substitutions and 150 indels

  37. 1000 comparisons, 90% mask, maximum method for determining significance • Sequence length 1000-1300, matching portion length 500, average of 86.25 substitutions and 86.25 indels

  38. Conclusions • Introduced notion of sufficient accuracy • Presented a strategy for enhancing data privacy in important real-world application • Present important real-world app that requires privacy and is efficiently parallelizable • These are relatively rare • Potential first entry for benchmark suite of apps for privacy study

  39. In the Future • Solution is less than ideal • Lack of formal privacy model / provable security • Need more testing on real genetic data • But it’s a start • General problem is difficult, this is a potential avenue of attack • Smith-Waterman requires more careful study in this context • Application behavior vs. application configurations