The Good, the Bad, and the Ugly…

Laura Langton
Lab meeting 10/18/06

The MGC Pipeline
1. NSCAN-EST (the good)
2. TARGET SELECTION (the ugly)
3. PRIMER DESIGN (the bad but getting better)
4. RT-PCR / SEQUENCING (fantastic)
5. ANALYSIS
6. SUBMIT SEQUENCES TO RTDB / GENBANK
Return to number 1 and repeat… (or GOTO 10 for all the BASIC programmers in the audience)

Target Selection (“old”)
GOAL: Categorize predictions as known, novel, or partially novel relative to MGC genes; find full ORFs
INPUTS
- RefSeq, EST, mRNA sequences (.fa)
- Experiment hit files (.fa)
- MGC, MGC A/B annotation files
- Genomic sequence files (.fa)
- Predictions (.gtf)
OUTPUTS
- Categorized predictions (too many to name)
- introns truly verified

Target Selection (“new”)
GOAL: Categorize predictions as known, novel, or partial relative to “known transcribed regions”, and identify novel introns to test
INPUTS
- EST, mRNA psl-orient files
- MGC, MGC A/B genes (.gtf)
- Experiment hit files (.fa and .gtf)
- Sequence length file (txt)
- Predictions (.gtf)
OUTPUTS
- categorized predictions (.gtf*, .tx, .ptx)
- intron-verified.text* (introns within transcribed regions)

Basic Steps of Target Selection
1. Build covered regions of known transcribed genes
2. Categorize introns as verified or unverified
3. Categorize predictions as known, partial, or novel
4. Remove pseudogenes
5. Get gtf, tx, ptx of target sets
6. Display tree of results
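Steps 1 and 3 can be illustrated with a small interval sketch. This is not the pipeline's actual code: the function names, the single-chromosome simplification, and the rule "known = every exon base covered, novel = no exon base covered, partial = anything in between" are all assumptions made for illustration. Coordinates are 1-based and inclusive, as in GTF.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, one chromosome/strand


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Step 1 (sketch): merge overlapping or adjacent transcript intervals
    into non-overlapping covered regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(exon: Interval, regions: List[Interval]) -> int:
    """Count the bases of one exon that fall inside the (merged) regions."""
    start, end = exon
    return sum(
        max(0, min(end, r_end) - max(start, r_start) + 1)
        for r_start, r_end in regions
    )


def categorize(exons: List[Interval], regions: List[Interval]) -> str:
    """Step 3 (sketch): hypothetical rule based on exon base coverage.
    Assumes the prediction's exons do not overlap each other."""
    total = sum(end - start + 1 for start, end in exons)
    covered = sum(covered_bases(exon, regions) for exon in exons)
    if covered == 0:
        return "novel"
    if covered == total:
        return "known"
    return "partial"


regions = merge_intervals([(100, 200), (150, 300), (500, 600)])
print(regions)
print(categorize([(120, 180), (520, 560)], regions))
print(categorize([(50, 150)], regions))
print(categorize([(700, 800)], regions))
```

A real run would repeat this per chromosome and strand, and the cutoff between "known" and "partial" would follow whatever guideline is in force for "known transcribed regions".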

EVOLUTION OF TARGET SELECTION
Originals
- Blast!
- Too many environment variables
- Complicated directory structure
- Complicated to re-run
- Leaky
- “new” improved but didn’t give full ORFs
New and Improved (a la Jeltje)
- Faster, more streamlined and more accurate
  - Cluster ESTs with PASA
  - No Blast!
- Get list of verified introns and Full ORFs?

PRIMER DESIGN
GOAL: Design primers that amplify regions of cDNA covering novel splice sites
INPUTS
- predictions (.gtf files)
- intron-verified.txt (from new TS)
- genome sequence
- mispriming libraries (MGCabc, refseqs, mRNAs)
- (RTDB)
- gene ID list
OUTPUTS
- span.list (intermediate; input for primer3)
- r1.out (predictions for which primers were successfully designed)
- pp1.list (id, primer pair, introns covered)

PRIMER DESIGN BASIC STEPS
• Select potential spans to make primers for (intron-verified text from new TS)
• Check RTDB for spans that have already failed 2x, been verified, etc.
• Generate primers (formerly 1000, now 100 for each span)
• Test for mispriming against other known or predicted genes
• Choose (one set of) primers for each prediction
• Create final list of primers in plate format
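The last step, laying primers out in plate format, can be sketched as below. The 96-well layout (rows A to H, columns 1 to 12), the row-by-row fill order, and the names `well_name` and `plate_layout` are assumptions for illustration; the slides only say that a plate lists target, well, and primer sequence.

```python
import string
from typing import List, Tuple


def well_name(index: int, rows: int = 8, cols: int = 12) -> str:
    """Map a 0-based position on a plate to a well label such as 'A1',
    filling row by row (assumed order)."""
    if not 0 <= index < rows * cols:
        raise ValueError("index is outside the plate")
    row, col = divmod(index, cols)
    return f"{string.ascii_uppercase[row]}{col + 1}"


def plate_layout(
    primer_pairs: List[Tuple[str, str, str]], rows: int = 8, cols: int = 12
) -> List[Tuple[int, str, str, str, str]]:
    """Assign each (target, forward, reverse) primer pair to a plate and well.
    Returns (plate number, well, target, forward, reverse) tuples."""
    per_plate = rows * cols
    layout = []
    for i, (target, forward, reverse) in enumerate(primer_pairs):
        plate, position = divmod(i, per_plate)
        layout.append(
            (plate + 1, well_name(position, rows, cols), target, forward, reverse)
        )
    return layout


# Hypothetical primer pairs, just to show the output shape.
pairs = [(f"pred{i}", "ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA") for i in range(3)]
for entry in plate_layout(pairs):
    print(entry)
```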

Evolution of Primer Design
Original Primer Design
- Many cut and paste shell commands
- Slow, generated too many primers (1000 per span)
- One primer pair per prediction?
- starts/stops, UTR, single exon require separate runs
- Mystery criteria for choosing final primers

New and Improved Primer Design (a la Charlie)
- More automated
- More logical and informative output files to track progress
- Faster (generates 100 per span)
- starts/stops, single exon, etc. in same run
- Maximal coverage of each prediction (multiple splice sites per span, multiple spans amplified per experiment)
- Intelligent scoring scheme for choosing final primers based on information gained and probability of success
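One way to read "information gained and probability of success" is as an expected-gain score. The sketch below is purely hypothetical: the slides do not give the real formula, so the score (unverified introns covered times estimated success probability), the `Candidate` fields, and the example numbers are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    primer_id: str
    novel_introns_covered: int    # information gained if the experiment works
    success_probability: float    # estimated chance that the RT-PCR succeeds


def expected_gain(candidate: Candidate) -> float:
    """Hypothetical score: expected number of newly verified introns."""
    return candidate.novel_introns_covered * candidate.success_probability


def choose_best(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate primer pair with the highest expected gain."""
    return max(candidates, key=expected_gain)


candidates = [
    Candidate("pair_a", novel_introns_covered=3, success_probability=0.2),
    Candidate("pair_b", novel_introns_covered=2, success_probability=0.5),
    Candidate("pair_c", novel_introns_covered=1, success_probability=0.9),
]
print(choose_best(candidates).primer_id)
```

Under this scoring, a long span covering many introns only wins if its chance of amplifying is not too low, which is the trade-off the bullet describes.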

RT-PCR-SEQUENCING
- Generate cDNA from pooled RNA
- Amplify (see animation at http://www.sumanasinc.com/webcontent/anisamples/molecularbiology/pcr.html)
- Sequence

EVOLUTION OF RT-PCR
Original
- Farmed out
- Methods not optimized
New and improved
- In-house! Established lab w/robotics
- Implemented touchdown procedure, new enzyme, better success

ANALYSIS
INPUTS
- prediction files
- primer plates (target, well, primer sequence)
- traces
FINAL OUTPUTS
- hit lists (gene id, plate, well, F/R/phrap)
  - confirmed
  - unconfirmed
  - control
- .fa and .gtf of all hits (for addition to RTDB, use in next TS round)

ANALYSIS STEPS
• Rename traces
• Run phred on traces to call bases, assign quality values, and trim
• Assemble F+R reads with phrap
• Align to genome with blat for top 10% alignments
• Run est2genome to get best alignment for each seq
• Choose high quality splice alignments that hit targets (confirmed hits) and controls
  • At least 70% identity
  • At least 20 contiguous ATCG bases
  • At least 80% identity for 10 bp around splice sites
• Get .fa and .gtf of all hits
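The three quality thresholds above can be sketched as a filter over a gapped pairwise alignment. This is a minimal illustration, not the pipeline's code: the input representation (two equal-length aligned strings plus junction positions given as alignment columns), the function names, and the reading of "10 bp around splice sites" as 10 columns on each side of the junction are all assumptions.

```python
import re
from typing import List


def percent_identity(read: str, ref: str) -> float:
    """Percent of aligned columns where read and reference carry the same base."""
    if len(read) != len(ref):
        raise ValueError("aligned strings must have equal length")
    if not read:
        return 0.0
    matches = sum(
        1 for a, b in zip(read.upper(), ref.upper()) if a == b and a != "-"
    )
    return 100.0 * matches / len(read)


def longest_acgt_run(seq: str) -> int:
    """Length of the longest stretch made only of A, C, G, T (no N, no gaps)."""
    runs = re.findall(r"[ACGT]+", seq.upper())
    return max((len(run) for run in runs), default=0)


def passes_filters(read: str, ref: str, junctions: List[int], flank: int = 10) -> bool:
    """Apply the three thresholds from the slide to one alignment."""
    if percent_identity(read, ref) < 70.0:
        return False
    if longest_acgt_run(read) < 20:
        return False
    for junction in junctions:
        lo = max(0, junction - flank)
        hi = min(len(read), junction + flank)
        if percent_identity(read[lo:hi], ref[lo:hi]) < 80.0:
            return False
    return True


ref = "ACGT" * 10
print(passes_filters(ref, ref, junctions=[20]))
```

Keeping each threshold in its own function makes it easy to tighten one criterion (for example the splice-site window) without touching the others.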

Overall Results
Novel regions identified
- 3101 exons (2265 introns)
- 535 non-overlapping clusters
Overall experimental success rates
- Partial: 28% (36% last round)
- Novel: 7% (8.5% last round)
Full ORFs
- 812 Full ORFs submitted
- 210 validated of 553 so far = 38%
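The Full ORF validation figure can be checked with a couple of lines; the function name `validation_rate` is just an illustrative helper.

```python
def validation_rate(validated: int, tested: int) -> float:
    """Percent of tested items that were validated."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return 100.0 * validated / tested


rate = validation_rate(210, 553)
print(f"{rate:.1f}% of tested Full ORFs validated")  # 38.0%
```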

In the End…
MGC: the bad
- muddy, variable guidelines for “known transcribed regions”
- changing personnel
- changing methods
- pain in the #@$#@
MGC: the good
- generated salaries
- built the lab
- generated a pipeline
- forced cleanup of pipeline and propelled it (haltingly) forward
- base methods for future projects
- publication?

The Future Pipeline
- Utilize PASA
- More database centered? Alter database structure?
- Flexibility
  - Different projects = different goals (e.g. alternate splices?)
  - Not always human!
- Keep the user in mind!
- Document for the simple-minded!
