A Study of GeneWise with the Drosophila Adh Region

Presentation Transcript


  1. A Study of GeneWise with the Drosophila Adh Region
     Asta Gindulyte, CMSC 838 Presentation
     Authors: Yi Mo, Moira Regelson, and Mike Sievers, Paracel Inc., Pasadena, CA

  2. Motivation
     • Genome annotation
       • Extraction of biologically relevant knowledge from raw genomic sequence data
     • Need faster genome annotation methods
       • DNA sequences are very long (millions of nucleotides)
       • Current methods are computationally too expensive
     • Approach/Solution
       • GeneMatcher2 hardware acceleration of GeneWise
     CMSC 838T – Presentation

  3. Outline
     • Motivation
       • Genome annotation
     • GeneMatcher2
       • Design
       • ASIC hardware
     • Comparison
       • GeneWise algorithm
       • HalfWise algorithm
       • Performance (time, precision)
     • Observations
       • Performance improvement
       • Cost effectiveness

  4. Approach
     • Problem: make GeneWise run faster
       • “Embarrassingly parallel” algorithm
       • Computationally too expensive when run in parallel on PCs
     • Paracel’s solution: hardware acceleration
       • Don’t change the algorithm
       • Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible
       • 6LITE algorithm, now also in Wise2
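To illustrate what "embarrassingly parallel" means here, the following is a minimal sketch, assuming the work splits into one independent comparison per model. The function `score_model`, the toy models, and the motif-counting score are hypothetical stand-ins, not part of GeneWise or GeneMatcher2.

```python
from concurrent.futures import ThreadPoolExecutor


def score_model(model, dna):
    """Hypothetical stand-in for one expensive model-vs-genome comparison.

    It only counts occurrences of a motif; the real comparison is a
    dynamic-programming alignment.
    """
    name, motif = model
    return name, dna.count(motif)


def search_all(models, dna, workers=4):
    """Score every model independently: no task needs another task's result."""
    # Threads keep the sketch short; CPU-bound work in practice would be
    # spread over processes or separate machines instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda m: score_model(m, dna), models))


# Toy data (assumed for illustration only).
models = [("famA", "ATG"), ("famB", "GT"), ("famC", "AG")]
hits = search_all(models, "ATGGTAAGATG")
print(hits)
```

Because the tasks share no state, adding workers needs no change to the per-model logic, which is the property that makes the problem a candidate for both PC clusters and dedicated hardware.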

  5. GeneMatcher Architecture

  6. ASIC Hardware
     • ASIC: application-specific integrated circuit
     • Designed to speed up dynamic programming algorithms
       • (could be used for Smith-Waterman)
     • Each ASIC board has 3072 processors
     • System has up to 9 boards
     • Cost per board around $40K
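The figures on this slide imply an upper bound on processor count and on board cost for a fully populated system. A quick check of the arithmetic, assuming the per-board numbers apply uniformly to all nine boards (the variable names are illustrative):

```python
PROCESSORS_PER_BOARD = 3072   # from the slide
MAX_BOARDS = 9                # from the slide
COST_PER_BOARD_USD = 40_000   # "around $40K" per the slide, so approximate

# Totals for a fully populated system, boards only.
max_processors = PROCESSORS_PER_BOARD * MAX_BOARDS
max_board_cost_usd = COST_PER_BOARD_USD * MAX_BOARDS

print(max_processors)      # 27648
print(max_board_cost_usd)  # 360000
```

The board total covers the ASIC boards only; any host hardware or software would come on top of it.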

  7. GeneWise Algorithm
     • Perform a search of genomic DNA sequence data using a protein HMM
       • Build HMMs from protein families
       • Scan genome using HMM
         • Look for start codon
         • “GT” sequence signals possible 5’ splice site
         • “AG” sequence signals possible 3’ splice site
     • Dynamic programming used in the scanning process
       • Obtain probability of the most likely path in HMM generating the sequence
       • Obtain alignment by backtracking
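The two ideas on this slide, splice-signal scanning and most-likely-path dynamic programming with backtracking, can be sketched as follows. This is a generic Viterbi recursion over a toy two-state HMM, not the GeneWise model itself; the states, probabilities, and function names are all assumptions made for illustration.

```python
import math


def splice_site_candidates(dna):
    """Positions of the 'GT' (possible 5' site) and 'AG' (possible 3' site) signals."""
    pairs = [(i, dna[i:i + 2]) for i in range(len(dna) - 1)]
    return {
        "donor_GT": [i for i, p in pairs if p == "GT"],
        "acceptor_AG": [i for i, p in pairs if p == "AG"],
    }


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most likely path."""
    # score[t][s]: best log-probability of any path ending in state s at step t.
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + log_trans[p][s])
            score[t][s] = score[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev  # remember the choice for backtracking
    # Backtrack from the best final state to recover the path (the "alignment").
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path


def path_log_prob(obs, path, log_start, log_trans, log_emit):
    """Log-probability of one specific state path, used to check viterbi()."""
    total = log_start[path[0]] + log_emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        total += log_trans[path[t - 1]][path[t]] + log_emit[path[t]][obs[t]]
    return total


def logs(d):
    return {k: math.log(v) for k, v in d.items()}


# Toy parameters (assumed, not taken from the presentation).
states = ("exon", "intron")
log_start = logs({"exon": 0.9, "intron": 0.1})
log_trans = {
    "exon": logs({"exon": 0.8, "intron": 0.2}),
    "intron": logs({"exon": 0.2, "intron": 0.8}),
}
log_emit = {
    "exon": logs({"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}),
    "intron": logs({"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}),
}

obs = "GCAT"
best_lp, best_path = viterbi(obs, states, log_start, log_trans, log_emit)
print(best_lp, best_path)
print(splice_site_candidates("AGGTAAGT"))
```

Working in log space avoids numerical underflow on long sequences. The table `score` has one cell per (position, state) pair, which is the regular structure that dynamic-programming hardware is designed to fill quickly.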

  8. GeneWise model on GeneMatcher2

  9. HalfWise Algorithm
     • Reduce cost by running BLAST to select HMMs with possible hits
     • Use these HMMs with GeneWise database search and sequence alignment algorithm
     • May miss some genes due to BLAST misses
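The filter-then-search idea, and why it can miss hits, can be sketched as below. The cheap filter and the expensive scorer are hypothetical stand-ins for BLAST and GeneWise respectively: the filter accepts only exact substring matches, while the scorer tolerates mismatches. The model names and sequences are invented for the example.

```python
def exact_seed_hit(seed, dna):
    """Cheap filter (stand-in for BLAST): exact substring match only."""
    return seed in dna


def best_window_matches(seed, dna):
    """Expensive scorer (stand-in for GeneWise): best count of matching
    positions over every window of len(seed)."""
    k = len(seed)
    return max(
        sum(a == b for a, b in zip(seed, dna[i:i + k]))
        for i in range(len(dna) - k + 1)
    )


def prefilter_then_search(models, dna):
    """Run the expensive scorer only on models the cheap filter selects."""
    selected = {name: seed for name, seed in models.items() if exact_seed_hit(seed, dna)}
    return {name: best_window_matches(seed, dna) for name, seed in selected.items()}


# Toy data: famB occurs in the sequence with one mismatch (TTTACA vs TTTAAA).
models = {"famA": "ATGGCC", "famB": "TTTAAA"}
dna = "CCATGGCCGGTTTACAGG"

filtered = prefilter_then_search(models, dna)
full = {name: best_window_matches(seed, dna) for name, seed in models.items()}
print(filtered)
print(full)
```

In this toy run the full search finds a near-match for `famB`, but the prefiltered search never scores it because the cheap filter rejected it first. That is the trade-off the slide describes: less work, at the risk of missed hits.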

  10. Evaluation
      • Test data set
        • A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region
      • Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence

  11. Evaluation: Speed

  12. Evaluation: Score

  13. Evaluation: Sensitivity and Specificity

  14. Observations
      • Performance improvement
        • The speedup is several orders of magnitude.
        • Makes real target applications possible
        • Accuracy may be improved over the HalfWise algorithm
      • Cost effectiveness
        • The system used costs around $500K
        • $500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower
      • Weaknesses
        • Cannot modify the algorithm
        • Not enough data to assess scalability
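The cost-effectiveness claim can be restated as simple arithmetic. The sketch below assumes PC throughput scales linearly with the number of processors, which the slides do not confirm; the variable names are illustrative.

```python
SYSTEM_COST_USD = 500_000   # "around $500K" per the slide
PC_COST_USD = 1_000         # "$1K each" per the slide
PC_SLOWDOWN = 10            # equal-budget PCs "would run about 10 times slower"

# Number of PC processors the same budget buys.
pcs_for_same_budget = SYSTEM_COST_USD // PC_COST_USD

# Budget a PC cluster would need to match the accelerator's speed,
# ASSUMING linear scaling (an assumption, not a measured result).
pc_budget_for_equal_speed_usd = SYSTEM_COST_USD * PC_SLOWDOWN

print(pcs_for_same_budget)            # 500
print(pc_budget_for_equal_speed_usd)  # 5000000
```

Under that assumption the accelerator is roughly ten times more cost-effective for this workload. The scalability caveat listed under Weaknesses applies to the estimate as well.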
