This study explores improvements in Open Reading Frame (ORF) start-site prediction for E. coli, using a hand-built scoring function whose numeric parameters are tuned by a genetic algorithm. The work was motivated by a sequencing error in which a single dropped base produced false ORFs. The predictor was benchmarked against verified ORFs on the E. coli forward strand and reached 94% correct start predictions, compared with about 87% for Glimmer. The score combines a Shine-Dalgarno term with a start-codon preference and several spacer-region penalties. Similar percentages were obtained on the reverse strand, which had been held back for testing.
The Motivation
• A certain genome that shall remain anonymous was sequenced
• One base was missed
[Figure labels: True ORF; Sequencing error (one base dropped)]
What happened then?
[Figure labels: True ORF; True start; True stop; Sequencing error (one base dropped); Out-of-frame stop; Out-of-frame start; False ORF (twice)]
• Software to predict ORFs uses single-frame start-stop analysis
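The failure mode on this slide can be reproduced in a few lines. This is a minimal sketch, not the software discussed in the talk: `first-stop`, `drop-base`, and the toy sequence `true-orf` are hypothetical names invented for illustration, and the code assumes an R5RS-style Scheme.

```scheme
;; Hypothetical sketch, not the software discussed in the talk.
(define STOPS '("TAG" "TAA" "TGA"))

;; Index of the first stop codon in the frame that begins at pos, or #f.
(define (first-stop str pos)
  (cond ((> (+ pos 3) (string-length str)) #f)
        ((member (substring str pos (+ pos 3)) STOPS) pos)
        (else (first-stop str (+ pos 3)))))

;; Remove the base at index i, imitating the sequencing error.
(define (drop-base str i)
  (string-append (substring str 0 i)
                 (substring str (+ i 1) (string-length str))))

;; Toy ORF, invented for illustration: ATG AAA CCC GGG TTT TAA
(define true-orf "ATGAAACCCGGGTTTTAA")

(first-stop true-orf 0)                 ; 15: the true stop is in frame
(first-stop (drop-base true-orf 7) 0)   ; #f: the true stop is now out of frame
```

Because the scan only ever steps three bases at a time, a single dropped base shifts every downstream codon, so the true stop disappears from the frame being read.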
Predict coding region start sites
• Test bed: E. coli forward strand
• Well-studied
• Extensive set of “verified ORFs”
• Save reverse strand for testing
How well is a predictor doing?
• For each verified stop
• Compare predicted start to verified start
• Glimmer was hitting about 87% on the E. coli forward strand
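The benchmark described here can be sketched as follows. The names `start-accuracy`, `toy-verified`, and `naive-predict` are hypothetical, and the data are invented. The sketch assumes the verified set is a list of (stop . start) pairs and that a predictor maps a stop position to a start position.

```scheme
;; Hypothetical sketch of the benchmark: verified is a list of
;; (stop . start) pairs and predict maps a stop position to a start.
(define (start-accuracy predict verified)
  (let loop ((rest verified) (hits 0) (total 0))
    (if (null? rest)
        (if (= total 0) 0 (/ hits total))
        (loop (cdr rest)
              (if (= (predict (car (car rest))) (cdr (car rest)))
                  (+ hits 1)
                  hits)
              (+ total 1)))))

;; Toy data, invented for illustration only.
(define toy-verified '((300 . 90) (900 . 450) (1500 . 1200) (2100 . 1800)))

;; A deliberately naive predictor: always guess 300 bases before the stop.
(define (naive-predict stop) (- stop 300))

(start-accuracy naive-predict toy-verified)   ; one half: two of four starts match
```

Keying the comparison on the stop position means a prediction counts only when it names the same start for the same ORF.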
Early programs
• Long-lost, but here’s a middle-era effort:
(define startscore
  (lambda (thisStart previousStop)
    (- (+ [Shine Dalgarno score]
          [3 points for ATG, 2 for GTG, 1 for TTG])
       (+ (quotient (- thisStart previousStop) 100)
          (if (gcrich (string->list (spacer-region thisStart))) 2 0)))))
Hmm
Later programs
• More sophisticated Shine Dalgarno computations, score stored in parameter s together with start location
• And more:
(define startscore
  (lambda (s n)
    (let ((sr (string->list (spacer-region (car s)))))
      (- (+ (cadr s)
            (if (string=? (substring str (car s) (+ (car s) 3)) "ATG")
                5 ; more pref for ATG
                (if (string=? (substring str (car s) (+ (car s) 3)) "GTG")
                    2
                    1)))
         (+ (* 0.1 (abs (- (length sr) 8))) ; 8 is average spacer length in e.coli verified
            (/ (- (car s) n) 60.0) ; 1 per 60 wasted orf
            (* 3 (/ (gcmajority sr) ; exaggerate the gc thing less
                    (length sr)))
            (if (string=? (substring str (+ (car s) 4) (+ (car s) 6)) "TG") 3 0) ; punish .TG in second codon
            (if (hasstart sr)
                (if (lastmeth sr) -1 2) ; reward ATG in -1th codon
                                        ; punish starts elsewhere in spacer
                0))))))
Rough Translation
(- (+ (* scaling-factor [Shine Dalgarno energy score])
      [5 for ATG, 2 for GTG or 1 for TTG])
   (+ (* 0.1 [divergence of spacer length from norm])
      [1 point per 60 wasted bp before start]
      [a score for gc richness in spacer region]
      [a score for XTG in second codon]
      [a score for having another start in the spacer region]))
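The "rewards minus penalties" shape of the score can be tried out with a cut-down version. This is a sketch under stated assumptions: `codon-score` and `toy-startscore` are hypothetical names, the Shine Dalgarno score is passed in as a plain number, and the gc-richness, XTG, and extra-start terms are left out.

```scheme
;; Hypothetical, simplified sketch: score = rewards - penalties.
;; Only the codon, spacer-length, and wasted-bp terms are modelled.
(define (codon-score codon)
  (cond ((string=? codon "ATG") 5)
        ((string=? codon "GTG") 2)
        (else 1)))

(define (toy-startscore sd-score codon spacer-len wasted-bp)
  (- (+ sd-score (codon-score codon))
     (+ (* 0.1 (abs (- spacer-len 8)))   ; 8 = average spacer length
        (/ wasted-bp 60.0))))            ; 1 point per 60 wasted bp

(toy-startscore 4 "ATG" 8 120)   ; 7.0: rewards 4 + 5, penalty 2.0
(toy-startscore 4 "GTG" 8 0)     ; numerically 6: rewards 4 + 2, no penalty
```

With these weights, an ATG that wastes 120 bp still outscores a GTG that wastes none. Choosing the weights that set such trade-offs is the tuning problem the talk turns to next.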
So many numbers!
• Just plucked from the air
• Nevertheless
• We’re already outpacing Glimmer on E. coli forward strand
• How to fine-tune the numbers?
Genetic Algorithm
(define POPULATION 40) ; size of initial population (and subsequent -- this is constant)
(define MUTPROB 10) ; There is a 1 in MUTPROB chance of mutation each generation
(define EXPDATASIZE 50) ; you would like each generation to work on a data set (taken from ocs) of about EXPDATASIZE
(define DATASETSIZECONTROL (round (/ 42000 EXPDATASIZE))) ; used by makedataset to aim for about EXPDATASIZE orfs
(define MINSDSIZE 3) ; the shortest Shine Dalgarno we are willing to contemplate
(define START-CODONS '("ATG" "GTG" "TTG"))
(define STOP-CODONS '("TAG" "TAA" "TGA"))
(define SD-TARGET (string->list "ATTCCTCC"))
; other global constants modified in the dna
(define BIGNEG -10.0)
(define SPACER-MIN 4)
(define SPACER-MAX 18)
(define SD-REGION-LEN (+ SPACER-MAX (length SD-TARGET)))
(define HALFWINDOWSIZE 4000)
Those numbers placed in a “genome”
(define defaultdna
  (list SPACER-MIN SPACER-MAX HALFWINDOWSIZE BIGNEG
        1.0 5.0 2.0 1.0 -0.1 8.0 60.0 -3.0 -.7 1.0 37.5 14.75))
• And that “genome” can…
Mutate!
(define mutate
  (lambda (dna)
    (let ((loc (random (length dna))))
      (setnth dna loc (mutilate (nth dna loc))))))
• And …
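The slide does not show `nth`, `setnth`, or `mutilate`, so the definitions below are assumptions made only to get a working example. In particular, `mutilate` is guessed to scale a value, and `mutate-at` is a hypothetical variant that takes the position explicitly so the result is reproducible.

```scheme
;; Hypothetical definitions; the talk does not show these helpers.
(define (nth lst n) (list-ref lst n))

;; Return a copy of lst with element n replaced by val.
(define (setnth lst n val)
  (if (= n 0)
      (cons val (cdr lst))
      (cons (car lst) (setnth (cdr lst) (- n 1) val))))

;; Assumed mutation operator: scale the value by a fixed factor.
;; A real run would presumably draw the change at random.
(define (mutilate x) (* x 1.5))

;; Like mutate, but with the position passed in instead of chosen at random.
(define (mutate-at dna loc)
  (setnth dna loc (mutilate (nth dna loc))))

(mutate-at '(4 18 4000) 1)   ; (4 27.0 4000)
```

Building a new list rather than modifying the old one in place means that a parent genome is still intact after its mutated copy has been made.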
Breed!
(define cross
  (lambda (l m)
    (let ((a (random (length l)))
          (b (random (length l))))
      (let ((aa (min a b))
            (bb (max a b)))
        (list (append ((take aa) l) ((take (- bb aa)) (nthcdr aa m)) (nthcdr bb l))
              (append ((take aa) m) ((take (- bb aa)) (nthcdr aa l)) (nthcdr bb m)))))))
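`cross` is a two-point crossover: the two parents swap the segment between positions aa and bb. The sketch below supplies assumed definitions for the curried `take` and for `nthcdr`, which the slide uses but does not show. `cross-at` is a hypothetical variant with the cut points passed in, so the outcome can be checked by hand.

```scheme
;; Assumed curried helper: ((take n) lst) returns the first n elements.
(define (take n)
  (lambda (lst)
    (if (= n 0)
        '()
        (cons (car lst) ((take (- n 1)) (cdr lst))))))

;; Assumed helper: drop the first n elements of lst.
(define (nthcdr n lst)
  (if (= n 0) lst (nthcdr (- n 1) (cdr lst))))

;; Two-point crossover with the cut points aa <= bb given explicitly.
(define (cross-at l m aa bb)
  (list (append ((take aa) l) ((take (- bb aa)) (nthcdr aa m)) (nthcdr bb l))
        (append ((take aa) m) ((take (- bb aa)) (nthcdr aa l)) (nthcdr bb m))))

(cross-at '(1 2 3 4 5) '(a b c d e) 1 3)
;; => ((1 b c 4 5) (a 2 3 d e))
```

Each child keeps a parameter at the same index it had in its parent, which matters here because every position in the “genome” has a fixed meaning.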
Over many Generations
(define generation
  (lambda (pop)
    (let ((newpop (sort (lambda (x y) (> (car x) (car y)))
                        (map (lambda (x) (cons (fitness x) x)) pop))))
      (print-out-information)
      (makenewdataset)
      (generation (map cdr newpop)))))
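The slide shows ranking by fitness but not selection or a stopping condition. One way to close the loop is sketched below, with the caveat that it is an assumption and not the talk's code: keep the better half, refill with varied copies, and stop after a fixed number of generations. All names (`insert-ranked`, `rank`, `first-n`, `evolve`, `toy-fitness`, `toy-vary`) are hypothetical. A hand-written insertion sort replaces `sort`, whose argument order differs between Scheme implementations.

```scheme
;; Hypothetical sketch of one way to close the loop; not the talk's code.
;; Insert x, a (fitness . genome) pair, into a list sorted best-first.
(define (insert-ranked x ranked)
  (cond ((null? ranked) (list x))
        ((> (car x) (car (car ranked))) (cons x ranked))
        (else (cons (car ranked) (insert-ranked x (cdr ranked))))))

;; Pair each genome with its fitness, sort best-first, strip the fitness.
(define (rank fitness pop)
  (let loop ((rest pop) (ranked '()))
    (if (null? rest)
        (map cdr ranked)
        (loop (cdr rest)
              (insert-ranked (cons (fitness (car rest)) (car rest)) ranked)))))

;; First n elements of lst (fewer if lst is shorter).
(define (first-n lst n)
  (if (or (= n 0) (null? lst))
      '()
      (cons (car lst) (first-n (cdr lst) (- n 1)))))

;; Run a fixed number of generations: keep the best half, then refill
;; the population with varied copies of the survivors.
(define (evolve pop fitness vary generations)
  (if (= generations 0)
      (rank fitness pop)
      (let* ((survivors (first-n (rank fitness pop)
                                 (quotient (length pop) 2)))
             (children (map vary survivors)))
        (evolve (append survivors children) fitness vary (- generations 1)))))

;; Toy problem, invented for illustration: one-number genomes, best value 10.
(define (toy-fitness g) (- (abs (- (car g) 10))))
(define (toy-vary g) (list (+ (car g) 1)))

(car (evolve '((0) (3) (5) (6)) toy-fitness toy-vary 5))   ; (10)
```

With an even population size the population stays constant from one generation to the next, in line with the comment on POPULATION. In a real run `vary` would combine crossover and mutation and `fitness` would score start predictions against a sample of verified ORFs.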
Results
• Kept getting better and better until
• 94% predictions correct
• 98% predictions had verified start as runner-up
Run on reverse strand
• Similar percentages!!