A n MCMC algorithm for haplotype assembly from whole-genome sequence data

An MCMC algorithm for haplotype assembly from whole-genome sequence data VikasBansal, Aaron L. Halpern, Nelson Axelrod, et al. Genome Research. 2008 Aug Presented by KWOK TszPiu (Bill) 5/9/2013

Introduction • Genetic variation is present in the form of single nucleotide polymorphisms(SNPs), insertions/deletions, inversions, translocations, copy number variations, etc. • The abundance of SNPs in human genome and the development of high-throughput genotyping technologies have made SNPs the marker of choice for understanding human genetic variation.

Introduction • Human genome contains a pair of DNA sequences : one from each parent called haploid sequences or haplotypes • Haplotypes differ at SNP/insertion/deletion… positions • SNPs (single nucleotide polymorphism) are single bpmutations (~0.1%; non-uniform) • SNP positions contain one of two possible alleles … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcCgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …

Haplotypes and Genotypes • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles 021200210 011000110 001100010 genotype + two haplotypes per individual

Gene‐Disease Association Studies • Haplotypes increase power of association

Goal of Haplotype assembly problem • Goal of haplotype assembly(phasing)is to use aligned sequence fragments to reconstruct the two haplotypes

Haplotype assembly problem • For haplotype assembly to be feasible • High sequence coverage • Reads to be long enough to span multiple variant sites • However, even with paired ends, it is not possible to link all variants • A haplotype assembly for a diploid genome is a collection of haplotype segments.

Haplotype assembly problem • In the absence of error in sequenced read, the correct haplotype assembly is unique. • In the real case, the problem become finding the haplotype assembly that optimizes a certain objective function • E.g., minimize the number of conflicts with the sequenced reads. (MEC)

Input • Each sequenced read is mapped to the reference genomic sequence • For a variant, reads with sequence matching the consensus are assigned as 0, otherwise 1 • Each fragment i is represented by a ternary string Xi = {0, 1, -}n • ‘-’ corresponds to the heterozygous loci not covered by the fragment

Error • For the jthvariant site of fragment i, • Quality scores si[j] • Which can be converted to error probability qi[j],

Haplotype • Let H = (h, ) represent the unordered pair of haplotypes • where h is a binary string of length n • is the bitwise complement of h • i.e., h[j] = 1 - h[j]. • h = 100100... • = 011011...

Probability X = input matrix H = H(h,) = pair of haplotype q = error probability matrix • Target: • By Bayes’ rule: • Assume variant calls are independent • δ(Xi[j],h[j]) = 1 if Xi[j] = h[j] and 0 otherwise

Probability X = input matrix H = H(h,) = pair of haplotype q = error probability matrix • Assume each fragment is randomly generated from one of the two haplotypes • Assume fragments are independently generated

MCMC algorithm • States of the Markov chain are the set of possible haplotypes • Transitions of the Markov chain are governed by subsets S of columns of the fragment matrix X.

MCMC algorithm • Γ = {S1, S2, . . . , Sk} is a collection of subsets of columns of X • for each state H, there are k + 1 possible moves to choose from, including the self-loop. • E.g., Γ = {{3, 4, 5, 11}, {1, 2, 3, 4}, … } • Algorithm 1:

MCMC algorithm • What should be Γ? • A natural choice for Γis Γ1= {{1},{2}, . . . , {n}}. • However, the mixing time of it grows exponentially with d, the depth of coverage • Choose by a graph-partitioning approach

A graph-partitioning approach • Node: Variant site • Weight of Edge: +1 between two nodes if the phase suggested by the fragment is consistent with the current haplotype assembly, -1 if not.

A graph-partitioning approach

A graph-partitioning approach • Hence, a cut with low weight corresponds to a subset of columns whose current phase with respect to the rest of the columns is inconsistent with the fragment matrix • Partition the graph G(X) into 2 set of node using min-cut algorithm and add the two subsets to Γ. • Apply the same procedure recursively.

MCMC algorithm • The collection of subsets is dependent upon the haplotype pair H. • Final Algorithm HASH (haplotype assembly for single human) Γ1 = {{1},{2}, . . . , {n}}.

Result • Source: Human genome (HufRef genome assembly) • ~32 million reads, 7.5 coverage • ~1.8 million heterozygous variants • E.g., Chromosome 22 • 24,967 heterozygous variants • 103,356 rows (53,279 links two or more variants) • Partitioned into 607 disjoint haplotypes

Result

Summary • Describe a MCMC algorithm for haplotype assembly that samples haplotypes • Given a list of all heterozygous variants and sequenced reads mapped to a genome assembly • Graph-partitioning approach to make the Markov chain more efficient

Thank you!

A n MCMC algorithm for haplotype assembly from whole-genome sequence data

A n MCMC algorithm for haplotype assembly from whole-genome sequence data

Presentation Transcript

Genome Assembly

Whole-Genome Prokaryote Phylogeny without Sequence Alignment

A New Algorithm for DNA Sequence and Assembly

Genome sequence assembly

Haplotype assembly

Polyploid haplotype assembly

Genome Assembly

Genome Assembly

Genome Assembly

Spooky halloween haplotype assembly

Genome sequence

Whole Genome Assembly Microarray analysis

What do you with a whole genome sequence?

Genome Assembly

Hierarchical Assembly of Genome Sequence

A Hierarchical Clustering Algorithm for Categorical Sequence Data

Genome sequence assembly

Sequence Alignment and Genome Assembly

Whole Genome Sequencing, Assembly and Annotation

Phred / Phrap /Consed Genome/Sequence Assembly

Whole Genome Shotgun Assembly

HapCompass : A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data