590 likes | 725 Vues
This document, presented by Michael Brudno at the PGA Workshop, delves into various algorithms for aligning genomic sequences. It covers essential models, such as the Edit Distance Model, which calculates the optimal alignment of sequences through insertions, deletions, and mutations. The presentation explores local, global, and glocal alignment methods, including CHAOS and LAGAN, providing insights into alignment efficiency and biological significance. This work lays the groundwork for developing improved computational systems for comparative genomic analysis.
E N D
Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004
Conservation Implies Function Gene Exon CNS: Other Conserved
Edit Distance Model (1) Weighted sum of insertions, deletions & mutations to transform one string into another AGGCACA--CA AGGCACACA | |||| || or| || || A--CACATTCA ACACATTCA
Edit Distance Model (2) Given: x, y Define: F(i,j) = Score of best alignment of x1…xi to y1…yj Recurrence: F(i,j) = max ( F(i-1,j) – GAP_PENALTY, F(i,j-1) – GAP_PENALTY, F(i-1,j-1) + SCORE(xi, yj))
Edit Distance Model (3) AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • F(i,j) = Score of best alignment ending at i,j • Time O( n2 ) for two seqs, O( nk ) for k seqs F(i-1,j-1) F(i,j-1) AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC F(i,j) F(i-1,j)
Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Local Alignment • F(i,j) = max (F(i,j), 0) • Return all paths with a position i,j where F(i,j) > C • Time O( n2 ) for two seqs, O( nk ) for k seqs
Heuristic Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC BLAST FASTA
CHAOS: CHAins Of Seeds • Find short matching words (seeds) • Chain them • Rescore chain
CHAOS: Chaining the Seeds seq1 seed • Find seeds at current location in seq1 seq2 location in seq1
CHAOS: Chaining the Seeds seq1 distance cutoff seed • Find seeds at current location in seq1 seq2 location in seq1
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 seq2 location in seq1
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box seq2 Search box location in seq1
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal seq2 Search box location in seq1 Range of search
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain seq2 Search box location in seq1 Range of search
CHAOS: Chaining the Seeds seq1 gap cutoff distance cutoff seed • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain seq2 Search box location in seq1 Range of search Time O(n log n), where n is number of seeds.
CHAOS Scoring • Initial score = # matching bp - gaps • Rapid rescoring: extend all seeds to find optimal location for gaps
Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
z x y Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
LAGAN: 1. FIND Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP
LAGAN: 2. CHAIN Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP
LAGAN: 3. Restricted DP • Find Local Alignments • Chain Local Alignments • Restricted DP
MLAGAN: 1. Progressive Alignment Human Baboon Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Mouse Rat
MLAGAN: 2. Multi-anchoring To anchor the (X/Y), and (Z) alignments: X Z Y Z X/Y Z
Cystic Fibrosis (CFTR), 12 species • Human sequence length: 1.8 Mb • Total genomic sequence: 13 Mb Chicken Zebrafish Cow Pig Chimp Human Dog Rat Fugufish Cat Baboon Mouse
CFTR (cont’d ) % Exons Aligned TIME (sec) MAX MEMORY (Mb) LAGAN Mammals 99.7% 550 90 Chicken & Fishes 96% 862 90 MLAGAN Mammals 99.8% 4547 670 Chicken & Fishes 98%
Automatic computational system for comparative analysis of pairs of genomes http://pipeline.lbl.gov Alignments (all pair combinations): Human Genome (Golden Path Assembly) Mouse assemblies: Arachne, Phusion (2001) MGSC v3 (2002) Rat assemblies: January 2003, February 2003 ---------------------------------------------------------- D. Melanogaster vs D. Pseudoobscura February 2003
Tandem Local/Global Approach • Finding a likely mapping for a contig (BLAT)
Progressive Alignment Scheme Human, Mouse and Rat genomes Pairwise M/R mapping no yes Aligned M&R fragments Unaligned M&R sequences Mapping aligned fragments by union of M&R local BLAT hits on the human genome Map to Human Genome yes no no yes M/H and R/H pairwise alignment Unassigned M&R DNA fragments H/M/R MLAGAN alignment M/R pairwise alignment
Computational Time 23 dual 2.2GHz Intel Xeon node PC cluster. Pair-wise rat/mouse – 4 hours Pair-wise rat/human and mouse/human – 2 hours Multiple human/mouse/rat – 9 hours Total wall time: ~ 15 hours
Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
Evolution at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… SEQUENCE EDITS …AC----CAGTCCACCA… REARRANGEMENTS Inversion Translocation Duplication
Local & Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Global Local AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Glocal Alignment Problem Find least cost transformation of one sequence into another using new operations AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • Sequence edits • Inversions • Translocations • Duplications • Combinations of above AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Shuffle-LAGAN A glocal aligner for long DNA sequences
S-LAGAN: Find Local Alignments • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
S-LAGAN: Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
Building the Homology Map b a d c Chain (using Eppstein Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components) • Penalties: • regular • translocation c) inversion d) inverted translocation
S-LAGAN: Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
S-LAGAN: Global Alignment • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts
S-LAGAN Results (CFTR) Local Glocal
S-LAGAN Results (CFTR) Hum/Mus Hum/Rat
S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals
S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals
S-LAGAN Results (Chr 20) • Human Chr 20 v. homologous Mouse Chr 2. • 270 Segments of conserved synteny • 70 Inversions
S-LAGAN Results (Whole Genome) • Used Berkeley Genome Pipeline • % Human genome aligned with mouse sequence • Evaluation criteria from Waterston, et al (Nature 2002)
Rearrangements in Human v. Mouse • Preliminary conclusions: • Rearrangements come in all sizes • Duplications worse conserved than other rearranged regions • Simple inversions tend to be most common and most conserved