1 / 66

Haplotype assembly

Haplotype assembly. CSCI2820: Medical Bioinformatics Derek Aguiar. Outline. Haplotype assembly workflow Data generation and preparation Error modeling Problem optimizations Algorithms HapCUT Levy et al. 2007 HapCompass Polyploidy and cancer. Haplotype reconstruction. solution space S

karli
Télécharger la présentation

Haplotype assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Haplotype assembly CSCI2820: Medical Bioinformatics Derek Aguiar

  2. Outline • Haplotype assembly workflow • Data generation and preparation • Error modeling • Problem optimizations • Algorithms • HapCUT • Levy et al. 2007 • HapCompass • Polyploidy and cancer

  3. Haplotype reconstruction solution space S k-ploidy n 2’s Sequence reads or genotypes Haplotype assembly Haplotype phasing Haplotypes

  4. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

  5. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

  6. DNA ...A A C G T C A G T C A G C C T... 5.4 thousand bases 150 billion bases 3.2 billion bases

  7. Variation structural variation Single nucleotide variant (SNV)or polymorphism (SNP) del/no del alleles C/T alleles http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna

  8. Genotypes and haplotypes Genotype Haplotype ACGCCA ACGCCA CCTTAA {A,C} {C} {G,T} {C,T} {C,A} {A} TGCGGT CCTTAA GGAATT 000001 101111 2 0 2 2 2 1

  9. High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fractured, amplified, fixated onto an array, bases are added • Single molecule or 3rd generation technologies Source of bias Error signature Short reads: ~50-200bp (454 can get up to 1kb) Generally more error prone than Sanger Extremely fast and parallel

  10. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

  11. http://www.ncbi.nlm.nih.gov/Traces/sra/

  12. Illumina FASTQ format • Four lines per sequence • identifier line • sequence line • + line • base quality line

  13. Illumina File Snippet • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG • CGTGACAATATTTAGCTTTAGATGGAGTTTACCACCCATTTTGGGCTGCATTCCCAAACAACCCGACTCGTCGAAAGCACATCGTGAACGACCGAGGTCGC • + • BBBDDFFFFHGHH@D@GHIGIFIJJJIEEHHHIHIHGHIJJIJFIHII;BFFFBDGIGIGIJJGIAEEDFDBD>BBBA:<AC@@BDCBCD@BB@>BD5399

  14. ID line in detail • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG

  15. Illumina Base QualityEncoding http://en.wikipedia.org/wiki/FASTQ_format

  16. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

  17. sequence read alignment • Report all read alignments between sequence reads ri with lengths li for 1≤i≤n and a genome g of length L

  18. sequence read alignment • The real problem is li<<L for 1≤i≤n • What can we do to overcome this problem?

  19. BLAT • Algorithm • Construct an index of non-overlapping k-mersfor the reference • Use index to quickly find regions of the genome similar to the query • Compute an alignment between read and genomic region

  20. SAM/BAM file format • Used to store read alignments • Tab delimited • samtools program used to manipulate SAM/BAM files • Picard java port

  21. VCF file format • Used to store variant calls • Developed as part of the 1000 Genomes Project

  22. Motivation • Sequence reads are sampled from haploid fragments (fi) gap between sequences covered SNPs f1 C T f2 A C ···········A······C······A······T······G·········· ···········C······G······A······A······T·········· ···········C······G······G······A······T·········· C G T f3

  23. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

  24. errors • Two fundamental sources of errors • Sequencing • A base was miscalled • An insertion or deletion was miscalled • Read alignment • A read was aligned to the wrong location or the alignment itself is wrong

  25. errors • We can explicitly model sequencing errors based on the technology used to generate the reads • Also, we can use the phred quality scores

  26. relationship between quality score and base call

  27. fragments • For now, we assume the organism we are sequencing has ploidy=2, or 2 sets of chromosomes • Thus, each SNP can only have two alleles • Our computational model of an aligned read is termed a fragment • Given n SNPs, the ith fragment fiis an n-dimensional vector with alphabet {0,1,-} where 0 and 1 represent the major and minor alleles and – represents a gap in coverage

  28. formal definition of input • The input is a matrix M with m rows corresponding to m fragments and n columns corresponding to the n SNPs. f1 A C ⋅⋅⋅A⋅⋅⋅⋅C⋅⋅⋅⋅A⋅⋅⋅⋅T⋅⋅⋅ ⋅⋅⋅C⋅⋅⋅⋅G⋅⋅⋅⋅G⋅⋅⋅⋅A⋅⋅⋅ G A f2 SNP-fragment matrix s1 s2 s3 s4 f1 f2

  29. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

  30. The haplotype assembly problem (diploid) • Given: a set aligned sequence reads and variants • Compute: two haplotypes for each chromosome/scaffold

  31. imagine a world without errors How would you assemble the fragments?

  32. fragment conflict • Let fik be the allele or ‘-’ at position k of fragment i, then two fragments conflict if • Informally, two fragments fi and fj are in fragment conflict if they cover a common SNP and have different alleles at that site So, how would you assemble the fragments?

  33. fragment conflict graph Lancia et al., 2001 1 − 0 • The fragment conflict graph is a representation of the fragments which share a SNP but differ in alleles. • The fragment conflict graph GF(VF,EF) has a vertex for every fragment and an edge between fragments that conflict (where d(fi,fj) is the number of fragment conflicts between fi and fj): 1 0 − − 0 0 0 1 − 0 − 1 − 1 1

  34. fragment conflict graph • Find two sets such that there are no fragment conflicts internal to each set (bipartition of GF) • The shores of a bipartition of GF give the haplotype Haplotype A: 1 0 0 1 − 0 1 0 − − 0 0 0 1 − 0 − 1 − 1 1 Haplotype B: 0 1 1

  35. minimum fragment removal • Optimization: Compute a subset of vertices VS in VF such that G(VF-VS) induces a bipartite graph. • Sometimes referred to as the bipartization problem and is NP-hard Lancia et al., 2001 But, exist in real data, sequencing errors do. Hmmmmm. disney.wikia.com

  36. minimum edge removal • Optimization: Compute a subset of edges ES in EF such that G(EF-ES) induces a bipartite graph. • Also referred to as bipartization and NP-hard • But what exactly does an edge removal in GF mean? • More on this later… Lancia et al., 2001

  37. SNP conflict graph • GS(VS,ES) has a vertex for every SNP and an edge between SNPs that exhibit 3 or more haplotypes: Schwartz 2010

  38. minimum error correction • The most used optimization in the literature is the Minimum Error Correction (MEC) • FastHare4,HAPCUT1, HASH2, He et al.3 • Compute a minimum set of error corrections in the fragments (rows) of M such that GF is bipartite. Lancia et al., 2001 1 Bansal, V. and Bafna, V. (2008). Hapcut: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24(16), 153–159. 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008). An mcmc algorithm for haplotype assembly from whole-genome sequence data. Genome Research, 18(8), 1336–1346. 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010). Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 26(12), i183–i190. 4 Panconesi, A. and Sozio, M. (2004). Fast hare: A fast heuristic for single individualsnp haplotype reconstruction. In Proceedings of the 4th International Workshop on Algorithms in Bioinformatics WABI04, volume 3240, pages 266 – 277.

  39. minimum error correction • A single A->T change creates a bipartite GF Lancia et al., 2001 Schwartz 2010

  40. Input v0 v1 CCTA read 1 AGTA read 2 AGGA consensus CGGA read 3

  41. Fragment conflicts • Two fragments conflict if they share a variant, differ in allele • Haplotype assembly = two sets with no internal conflicts (bipartite) 10 01 f1 f3 v0 v1 11 read 1 f2 read 2 read 3

  42. Optimizations • Remove minimum number of • Fragments • Variants • Correct minimum number of errors f1=01 f2=10 f3=11 f3=10 f1 f3 v0 v1 0- 01 01 1- 10 10 f2

  43. Prior Work • Minimum Error Correction (MEC) • Levy et al. 2007, FastHare4,HAPCUT1, HASH2, He et al.3 • Minimum SNP Removal • Minimum Fragment Removal • HapCompass5 1Bansal, V. and Bafna, V. (2008) 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008) 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010) 4 Panconesi, A. and Sozio, M. (2004) 5 Derek Aguiar and SorinIstrail, (2012)

  44. Levy et al. 2007 • Venter diploid genome • First diploid genome assembly • Greedy haplotype assembly algorithm

  45. Levy et al. 2007 • Initialize set S containing all reads • while S has a read in which all variants are not currently assigned • choose a read containing the most uncalled variants in the set of haplotypes • use read to seed haplotype; assign alleles to haplotype and the compliment alleles to the compliment haplotype • iterate: • extend haplotypes using read with strongest signal (|number of variant agreements for haplotype A - number of variant agreements for haplotype B|)

  46. Levy et al. 2007 (continued) • Iterate: • For each variant, assign allele by majority rule of the reads used to determine the alleles • For each read, assign the read to the haplotype with minimum distance

  47. Levy et al. 2007 example v0 v1 v2 v3 v4 read 0 read 1 read 2 read 3 read 4 hap A hap A’

More Related