360 likes | 373 Vues
Explore the process of unsplicing mRNAs/ESTs and the limitations of using BLAST. Discover new generation tools like est2genome, est_genome, and Sim4. Introducing BLAT, a BLAST-like Alignment Tool that offers superior speed and sensitivity. Learn about the GUS plugin LoadBLATAlignments and the BLATAlignmentQuality filter.
E N D
Overview • BLAT consensus sequences vs genomic • Load alignments w/ 10% cutoff into GUS • Compute alignment “quality”: • 1 = Very good • 2 = Very good with gaps • 3 = Good • 4 = Not so good • Merge selected alignments into “genes”
“Unsplicing” mRNA onto DNA 5’ UTR CDS 3’ UTR mRNA (e.g. DoTS) DNA
“Unsplicing” mRNA onto DNA 5’ UTR CDS 3’ UTR mRNA (e.g. DoTS) DNA exon 1 exon 2 exon 3 *** DRAMATIZATION ***
Tools for unsplicing mRNAs/ESTs • Why not use BLAST? • i.e., We have a hammer, so does the problem look anything like a nail? • We can, but there are problems: • Exon boundaries will not be exact • 1 in 4 match by chance alone • Small exons may be missed • Issues in correctly resolving pseudogenes, gene families, particularly if clustered
Unsplicing tools: beyond BLAST • New generation of special-purpose tools • est2genome (Birney, Durbin) • est_genome (Mott) • Sim4 (Florea et al.) • Their limitations are BLAST’s strengths: • Require single small target sequence • Find only one match (not so bad if correct) • Run slower than molasses (even sim4)
BLAST-sim4: best of both worlds? • Our first solution to the problem: • Use BLAST to localize mRNA • Follow with sim4 on restricted region • Still very slow on whole genomes • Other approaches: • Spidey (Wheelan, Church, Ostell) • BLAT (Kent)
Introducing BLAT • “BLAST-like Alignment Tool” • Jim Kent (also GigAssembler author) • Two major innovations: • Indexed the target db, not the query seq • Incorporated postprocessing steps to: • Take splicing signals into account • Find small exons that it might otherwise miss • Also performed extensive testing
BLAT Overview • The end result: • Orders of magnitude speed-up (500x) • No loss of sensitivity • Better gene “models” generated • Other variants of BLAT: • Client-server version (e.g., UCSC site) • Protein/translated DNA mode
Introducing GUS plugin LoadBLATAlignments • Process raw BLAT output • Perl modules BLAT::Alignment, BLAT::PSL • Load alignments into GUS • BLATAlignment table (not Similarity) • 10% minimum length cutoff applied • Compute and store summary info. • Alignment quality (requires target seq.) • Poly(A) detection (requires query seq.) • max_query_gap, unaligned_3p_bases, etc.
BLATAlignmentQuality • Very good (formerly “consistent”) • >= 95% identity (average) • max_query_gap <= 5 • both ends consistent • no more than 10bp mismatch unless polyA • not polyA on both ends
BLATAlignmentQuality II • Very good with gaps • same as very good but internal and end mismatches allowed if there is a sufficiently large genomic sequence gap (within 10X mismatch length for ends) • Good • same as very good, but with max_query_gap <= 15 (allow large internal gaps if there is a sufficiently large genomic sequence gap), and inconsistent ends allowed if unaligned_bases <= 50 • Not so good • everything else
Why “very good with gaps” and how we arrived at it? • Align Refseqs to hChr22 and mChr5 • Compare consistent (very good) alignments to annotations at UCSC • False positives: close to 0 • False negatives: ~18% and ~35% • With new quality filter, false negatives reduced to ~15% and ~13%
Why “good” and how we arrived at it? • If a Refseq has very good (consistent) alignment, would the Refseq-containing assembly too? • hChr22: 98/255 (38%) did not • mChr5: 109/271 (40%) did not • Mostly due to minor problems at end(s) • With new filter, false negatives reduced to 25/255 (~10%) and 33/271 (~12%)
Some alignment statistics • hDoTS (08/02) vs hGenome (GP 06/02) • Total DoTS sequences: 859,545 • Alignments loaded: 5,544,300 / 8,975,529
Some alignment statistics II • mDoTS (07/02) vs mGenome (GP 02/02) • Total DoTS sequences: 579,906 • Alignments loaded: 3,208,572/4,663,903
“Gene” creation algorithm • Select BLAT alignments • Parameters: min quality, genomic region • Merge overlapping alignments • Merge nearby alignments with at least one EST sequence in each assembly from common clone • Parameter: max distance (default 20kb) • Merge nearby alignments • Parameter: max distance (default off)
“Gene” creation algorithm II • [ygan@zeus ~/dots2gene]$ ./dotsAlignment2Gene.pl -h • Usage: dotsAlignment2Gene.pl --sp sp --chr chr --start start --end end • --qf qf --xs xs --am am --cm cm --lm lm --mis mis • --of out [--test] [--debug debug] [--help], where... • sp: scientific/common name of species, e.g. human, Mus musculus • chr: chromosome of interest, e.g. 5, 22, X, 3_random • start: start genomic position, default to 1 • end: end genomic position, default to chromosome length • qf: quality filter to select blat alignments for gene creation • xs: exclude blat alignments of singletons • am: merge by genomic alignment overlap • cm: merge by shared clone info between gene seeds within specified distance • lm: merge by alignment proximity (within specified distance) • mis: min intron size for a gene to be kept • out: output format, one of s[ummary], v[erbose] or gff • test: test case using DeiGeorge region • debug: specify level of debug output, can be 1, 2, 3, and 4+
Initial Algorithm Calibration • Human chr22q (~34Mb) as test case • Sanger annotation release 2.3: 832 genes (341 gene, 118 gene_segment, 112 related, 109 predicted, 152 pseudogenes) • Focus on DiGeorge Region • DGCR6 to ZNF74 (~ 1.6Mb) • Contains 24-33 genes based on literature (Sanger: 44 genes with 33 known) *Used DoTS 02/02 release vs Golden Path 12/01 release, and old BlatAlignment table (limited quality classes).
Choosing initial parameters # DiGeorge Chromosome Region (DGCR6 - ZNF74, 1.6Mb) # CBIL Gene Param Num CBIL Num Sanger Num Overlap Avg %overlap qf=4, am, cm=10k 27/50 26/44 28 88.7 vs 71.3 qf=4, am, cm=20k 24/47 26/44 27 81.4 vs 75.5 qf=4, am, cm=50k 20/39 25/44 26 63.8 vs 77.6 qf=6, am, cm=10k 26/69 29/44 30 77.7 vs 75.9 qf=6, am, cm=20k 25/66 28/44 31 69.8 vs 80.5 qf=6, am, cm=50k 17/54 24/44 25 53.0 vs 87.4 # Chr22 (Chr22q ~34M) # CBIL Gene Param Num CBIL Num Sanger Num Overlap Avg %overlap qf=4, am, cm=20k 335/737 352/829 383 70.7 vs 72.4 qf=6, am, cm=20k 327/1074 377/829 399 64.9 vs 81.0 * Sanger annotation in different coordinate system, did approximate translation
Initial parameters • Derived from old alignments • qf=4: “spliced” and “consistent” alignments • xs=off: not exclude alignments by singleton assemblies • am=on: merge by alignment overlap • cm=20K: merge by shared clone • lm=off: merge by genomic location proximity • mis=15: filter putative genes by max “intron” size • Adjusted for new alignment quality categorization • qf=7: “spliced” alignment of quality id 1, 2 or 3
Preliminary results • Applied algorithm to new alignments of hChr22 and mChr5 [4-6 hours each] • Displayed as custom tracks at UCSC genome browser • DiGeorge region CBIL and Sanger genes • Human chr22 CBIL and Sanger genes • Mouse chr5 CBIL genes
Preliminary full genome runs • Excluded singletons • Tried lm parameter at off, 30, 50 • ~24 hours per run, with some stress on database server • Per chromosome statistics for mouse • See next slide
Directions • Assessment of results in selected 14Mb on mouse chr5 (Maja Bucan lab)
Directions II • Quantitative evaluation of results • Correlation coefficients – next slide (Science Kapranov et al. 296 (5569): 916.)
Directions III • Fine tune/revise algorithm parameters • qf: recruit more alignments • cm: widen from 20K to 500K? • lm: turn on (lm<75bp or 0<lm<75bp)? • mis: intron size distribution model? • Evidence suggesting new parameters • See next slide • Conservatively assume UCSC ref genes uniformly samples 1/3 of all genes on chr22
Directions IV • Problem fixes • Handle assemblies on the wrong strand - see next slide • Error rate estimations • Simulate effects of sequence/assembly errors on BLAT