Course Web Site. http://www.capsl.udel.edu/courses/eleg667/2000/. Slides Handouts Homeworks Reading Assignments Student Roster. Prerequisites. Basic knowledge of algorithms, statistics and molecular biology. However... The first three lectures will summarize the background required.
Introduction: High-Performance Computing and Discovery Informatics
Small Fraction of Yeast Genome CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAATATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGCAATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCCATTCCCATATGCTAACCGCAATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGGCGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGCATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTGTAGAAGTTTCTTTCTTATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTATAATATTAATATACATTTATATAATCTACGGTATTTATATCATCAAAAAAAAGTAGTTTTTTTATTTTATTTTGTTCGTTAATTTTCAATTTCTATGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGATAGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCA
GCAACTTCGACTGGGTAGGTTTCAGTTGGGTGGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACTCTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAGTTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAATCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGGTCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTTGTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATATCAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTGAACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTCTTAAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTTTTCCAAACAATCTTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACTGCCAAATTTTTCTTGCTCATTTATAATGATTGATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGATGTTAAAAAATTATTATTTTCTTCATAAAGAAGCTTTCAAGATATAAGATACGAAATAGGGGTTGATAATTGCATGACAGTAGCTTTAGATCAAAAAGGAAAGCATGGAGGGAAACAGTAAACAGTGAAAATTCTCTTGAGAACCAAAGTAAACCTTCATTGAAGAGCTTCCTTAAAAAATTTAGAATCTCCCATGTCAACGGGTTTCCATACCTCCCCAGCATCATACATCTTTTTTCAAAGAAACTTCAAATGCCTCTTTTATGCAAGGGGCAAAATCCTGAAATGACTTAAACTTAGCAGTTTCGTCTTTTTTCAAAGAGAATGGTTGAAGAAGAATTGTTTTGGACGCTTATTGACAATCTGTTGCATTGATAAAGTACCTACTATCCCAGACTATATTTGTATACAAGTACAAAATTAGGTTTGTTGAAACAACTTTCCGATCATTGGTGCCCGTATCTGATGTTTTTTTAGTAATTTCTTTGTAAATACAGGGAGTTGTTTCGAAAGCTTATGAGAAAAATACATGAATGACAGGTAAAAATATTGGCTCGAAAAAGAGGACAAAAAGAGAAATCATAAATGAGTAAACCCACTTGCTGGACATTATCCAGTAAAGGCTTGGTAGTAACCATAATATTACCCAGGTACGAAACGCTAAGAACCTTGAAAGACTCATAAAACTTCCAGGTTAAGCTATTTTTGAAAATATTCTGAGGTAAAAGCCATTAAGGTCCAGATAACCAAGGGACAATAAACCTATGCTTTTCTTGTCTTCAATTTCAGTATCTTTCCATTT
TGATAATGAGCATGTGATCCGGAAAGCTACTTTATGATGTTTCAAGGCCTGAAGTTTGAATATTTATGTAGTTCAACATCAAATGTGTCTATTTTGTGATGAGGCAACCGTCGACAACCTTATTATCGAAAAAGAACAACAAGTTCACATGCTTGTTACTCTCTATAACTAGAGAGTACTTTTTTTGGAAGCAAGTAAGAATAAGTCAATTTCTACTTACCTCATTAGGGAAAAATTTAATAGCAGTTGTTATAACGACAAATACAGGCCCTAAAAAATTCACTGTATTCAATGGTCTACGAATCGTCAATCGCTTGCGGTTATGGCACGAAGAACAATGCAATAGCTCTTACAAGCCACTACATGACAAGCAACTCATAATTTAAGTGGATAGCTTGTGATAAATTGAATTTTCTCTGTTTAGTACTTGCCGAATAGTTACTTGTTAGTTGCAGATGCTTTTTGATGACAAAGTTATCAATCTCAATATTAAACTTTTTAGGCTTTCAGGTTTAATCTTTCTTTGAGGGTGTATTAATTTTCATACAAATATTTGATTCATTATTCGTTTTACTGTTACATTAGACCTGCTCATTACATGGAGTAACTTAAGTTTTCTCAAACGCTTGATAGCATGATTTGATGTAGTAAAAAAAAAGGCAGAGTTTCCAAAAAAAATTGTTAATCGACAAAGTTAATATTATGGTGGTAGTATCTCAAATATCTGGATAACCAGATCGTACATCTCTGATAAACAATCTTTGCCACTGCTTTATCCTTTTAAATTGTATTGAGTGCTTCAGTCATTGCAAAATTTTACGAGATTTAAAATTTGTGAACCCGACCTTACCGAGAAATGATGAGCTAATTTTTATAGGTCGACCCTTCTGTCGCTTACTGGGTTGATTATCTTGTGCTTTCTTAGTATCTATCACAAAGGAGACAAAATCGTTGATAAAAAGTGCATCAACATTCCCAGCCAGAAAATGCACATCATAAAGACATGTTATTCAAGAGCCACGACCGTCTTCAATTTATCTTTTATAAAAAACCCTTGTTCTACTGACAGGATGGAATAGATATTAAATATACATTTTGCATTTTTTTTTTTTTCTGTATTGAAGATTTGTATATGAAAGATGTTTATACATCAAATGCTTTGAATAAAGCCATCTTAATTTCAATTTCATGCCCTCCTTCACCGTTTTCTGTTGGTCTAGAGGTAGCTTGTTGTGGTCACTAATGAGAACTTAAATAGTTTTCAACTGCTGGTGGTAAATCAATAATTTATGTTCTTAACCTAACATTTGATGACCTTTGATGCGTTGGTTATGTTGAAGACAAATTGCCTCTAATCAGTTCC
Solutions
Protein/Gene Sequences With Unknown Functions
Protein/Gene Sequences With Known Functions
?
• Alignment
• Predictive Modeling
Functions
Solutions: Methods
Protein/Gene sequences with unknown functions
Protein/Gene sequences with known functions/struc
?
• First Principles
  • Physics
  • Chemistry
• Alignment
  • Sequence similarity
  • Structure homology
• Predictive Modeling
  • Biological pathway
  • Structure prediction and fold
  • Virtual cell, etc.
Functions
Sample Bioinformatics Projects
• SSR detection and analysis (A. Castelo)
• SNP detection pipeline (F. Useche)
• Pattvision (Praveen)
• ATGC -- whole genome comparison (J. Cuvillo)
• Ph.D -- Protein Homology Discovery (M. Mostagir)
• Benchmarking Sequence Alignment (R. Kahsay)
• Simulation of metabolic process (R. Khan)
• CORPUS (L. Lacoste)
• DARWIN (F. Peixoto)
SSRs in the Arabidopsis Genome
Parsing the Arabidopsis genome in search of SSRs. In this work we list all occurrences of SSRs, also known as microsatellites, in the Arabidopsis genome. Analyzing the resulting data, we were able to produce a comprehensive study of these structures in Arabidopsis, including the distribution of SSRs along the chromosomes, their distribution in specific regions such as centromeres, exons and introns, and an analysis of the motif distribution according to motif length.
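To make the idea of listing SSR occurrences concrete, here is a minimal Python sketch that scans a DNA string for tandem repeats using a regular-expression backreference. It is not the project's actual parser: the function name `find_ssrs` and the thresholds (motif length 1 to 6, at least 3 repeats, at least 8 bases in total) are assumptions chosen for illustration only.

```python
import re

def find_ssrs(seq, min_motif=1, max_motif=6, min_repeats=3, min_length=8):
    """Return sorted (start, motif, repeat_count) tuples for tandem repeats.

    All thresholds are illustrative assumptions, not values from the slides.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A k-base motif followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are copies of a shorter unit (e.g. "CACA" = 2 x "CA"),
            # so each repeat is reported once under its shortest motif.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            if len(m.group(0)) >= min_length:
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

print(find_ssrs("GGCACACACACACTT"))
```

Grouping the resulting tuples by `len(motif)` would give a motif-length distribution of the kind the study describes. Because `finditer` returns non-overlapping matches, this sketch can miss repeats that overlap an earlier match.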
aSNP: Automated SNP Discovery Pipeline Overview
SNP = Single Nucleotide Polymorphism - New Generation Genetic Marker
Components: User Interface, SNP Database, EST Database, SNP Detection, Assembling Tool, SNP Info
Pipeline handles ~400,000 ESTs!
EST sequence = AACCGCTTCTAGCAGG...
PATTVision - A Visual Data Mining Tool
Selected Sequences / Family of Sequences
TEIRESIAS, Tuppleware
Sequence: MDVLSPGQGNNTTSPPAPFETGGNTTGISDVTVSYQVITSLL
Patterns: L. P. . Q. NN L..P. . Q FET. . NT FET. . . .T
Add Sequence Homologs to initial Sequences
Sequence Database (Protein/DNA)
Query Sequence Database
‘META PATTERN’
Visual Analysis of Motif/Pattern distribution
ATGC: Another Tool for Genome Comparison
[Figure: sequences G A T A T G C T A T G C C G T A; processors P1 P2 P3 P4; high-score alignment coordinates (x1,y1) - (x2,y2); related?]
• Parallel computation of the similarity matrix.
• Best alignments located while the matrix is computed.
• Final alignments reported by Blast or Swat.
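The idea of locating the best alignment while the similarity matrix is being filled can be sketched with a serial Smith-Waterman-style scan. This is an illustration, not ATGC's implementation: the function name `best_local_alignment`, the scoring values (match 2, mismatch -1, gap -2), and the split of the slide's letters into two sequences are all assumptions. The parallel distribution over P1 to P4 is not shown.

```python
def best_local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Fill a local-alignment score matrix row by row, tracking the best cell.

    Returns (score, (x1, y1), (x2, y2)): the best score and the 0-based
    start and end coordinates of that alignment in a and b.
    Scoring parameters are illustrative assumptions.
    """
    best_score, best_start, best_end = 0, None, None
    prev_score = [0] * (len(b) + 1)
    prev_start = [None] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur_score = [0] * (len(b) + 1)
        cur_start = [None] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev_score[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev_score[j] + gap
            left = cur_score[j - 1] + gap
            score = max(0, diag, up, left)
            # Carry the start coordinate along with the score, so no traceback
            # over a stored matrix is needed.
            if score == 0:
                start = None
            elif score == diag:
                start = prev_start[j - 1] if prev_start[j - 1] is not None else (i - 1, j - 1)
            elif score == up:
                start = prev_start[j]
            else:
                start = cur_start[j - 1]
            cur_score[j], cur_start[j] = score, start
            # The best alignment is updated while the matrix is computed.
            if score > best_score:
                best_score, best_start, best_end = score, start, (i - 1, j - 1)
        prev_score, prev_start = cur_score, cur_start
    return best_score, best_start, best_end

print(best_local_alignment("GATATGC", "TATGCCGTA"))
```

Only two rows are kept in memory at a time, which is one reason to track the best cell during the fill instead of afterwards. The reported coordinate pair could then be handed to a separate aligner for the final alignment, as the slide suggests.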
Developing Web Interactive Benchmarking System for Sequence Alignment Program
[Figure: Benchmark Server (Fasta proteins database, CE alignments database, Web Interactive Benchmarking Program) connected through a High Speed Network to User 1, User 2, User 3, User 4]
Sequence alignment / Structural alignment: ASVIE-AAVI VIVI-EPAAG A-SVIE-AAV- VIVI-EPAAG
Remote users
• Download fasta sequences
• Produce a set of sequence alignments
• Submit the resulting alignments
• Benchmarking program evaluates parameters
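One way a benchmarking program might score a submitted sequence alignment against a structural reference alignment is to count how many reference residue pairs the submission reproduces. The slides do not say which parameters the benchmark evaluates, so the metric below, and the names `aligned_pairs` and `fraction_reproduced`, are assumptions for illustration.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs that share a column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            pairs.add((i, j))
        if ca != gap:
            i += 1
        if cb != gap:
            j += 1
    return pairs

def fraction_reproduced(test_rows, reference_rows):
    """Fraction of reference residue pairs also present in the test alignment."""
    reference = aligned_pairs(*reference_rows)
    if not reference:
        return 0.0
    return len(reference & aligned_pairs(*test_rows)) / len(reference)

# Hypothetical submission scored against a hypothetical reference alignment.
print(fraction_reproduced(("ACGT-", "ACTGT"), ("AC-GT", "ACTGT")))
```

Working on index pairs instead of comparing the gapped strings directly means two alignments that place gaps differently but pair the same residues receive the same score.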
Protein Homology Discovery
Mohamed Mostagir, Salim Khan, and Ruomnig Jin
• Goals and motivation for the PHD
• Building a library of protein families
• Useful for functional and structural prediction
• Methodology applied to Bovine database
[Figure: inner components of PHD. Mixed bag of proteins; Genes Database; Proteins Database; Clustering; Open reading frame finder; BLAST; Protein Homology Database; Protein Homologies. Cylinders are used for databases (raw information); blocks are used for operations performed on this information.]
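The PHD diagram names an open reading frame finder as one of its operations. As a rough illustration of what such a component does, the Python sketch below scans the three forward reading frames for ATG-to-stop stretches. The function name `find_orfs`, the `min_codons` threshold, and the restriction to the forward strand are assumptions; the slides do not describe the actual finder.

```python
def find_orfs(dna, min_codons=30):
    """Return sorted (start, end) spans of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    dna[start:end] includes the stop. min_codons counts codons before
    the stop and is an illustrative threshold.
    """
    stops = {"TAA", "TAG", "TGA"}
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in stops:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG", min_codons=3))
```

A fuller finder would also scan the reverse complement. The translated ORFs could then be compared with a tool such as BLAST and grouped into families, which is one plausible reading of the diagram.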
3D Simulation of Biological Systems on Parallel Machines
[Figure: Experiment/Model Cycle. Hypothesis; Experiments; Model; Model Testing; Parameter Optimizations; UDel; TJU]
Model: 3D Partial Differential Equations
• Simulation
• 2D
• 3D
• Control Systems
• Analysis
• Perturbation Analysis
• MCA
• Stability Analysis
• Etc.
Symbols: N is the stoichiometric matrix; v is the reaction velocity vector; [i] is the concentration of molecule i; $D_i$ is the diffusion function for molecule i.
Spatial Division Approach (64 computers)
• For each iteration:
  • Reaction Step: Calculate [N][v]
  • Diffusion Step: Calculate $D_i \nabla^2 [i]$
  • Exchange data between nearest neighbors
Each computer solves many cubes in 1 iteration. Then the computers communicate their results to each other.
Matrix Division Approach (1 computer)
Represent the work as a huge matrix multiply problem. For each iteration, multiply the matrix and the vector to find the concentrations. Then use the concentrations to recalculate the vector and repeat.
Concentration Vector = Banded Sparse Matrix (constant integer matrix) * Velocity Vector
Each computer calculates the multiplication of a small piece of the matrix and vector. Then they pass data between each other.
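The reaction step and diffusion step can be illustrated with a small explicit update on a 1D grid. This sketch assumes the governing equation has the form d[i]/dt = (N v)_i + D_i ∇²[i], which is inferred from the symbol definitions on the slide and not stated there. It also assumes a constant diffusion coefficient per species, no-flux boundaries, and forward Euler time stepping. The function name `step` is hypothetical.

```python
def step(conc, N, velocity, D, dx, dt):
    """Advance every species by one explicit Euler step on a 1D grid.

    conc[s][x]  concentration of species s in grid cell x
    N[s][r]     stoichiometric coefficient of species s in reaction r
    velocity    function mapping per-cell concentrations to reaction velocities
    D[s]        diffusion coefficient of species s (assumed constant)
    """
    n_species, n_cells = len(conc), len(conc[0])
    new = [row[:] for row in conc]
    for x in range(n_cells):
        v = velocity([conc[s][x] for s in range(n_species)])
        for s in range(n_species):
            # Reaction step: row s of N times v.
            reaction = sum(N[s][r] * v[r] for r in range(len(v)))
            # Diffusion step: discrete Laplacian; edges mirror themselves (no flux).
            left = conc[s][x - 1] if x > 0 else conc[s][x]
            right = conc[s][x + 1] if x < n_cells - 1 else conc[s][x]
            laplacian = (left - 2 * conc[s][x] + right) / dx ** 2
            new[s][x] = conc[s][x] + dt * (reaction + D[s] * laplacian)
    return new

# Hypothetical system: A -> B with velocity 0.5 * [A].
N = [[-1], [1]]
conc = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
new = step(conc, N, lambda c: [0.5 * c[0]], D=[0.1, 0.1], dx=1.0, dt=0.1)
print(new)
```

In a spatial division scheme, each machine would own a block of cells and would only need its neighbors' boundary cells to evaluate `left` and `right`, which corresponds to the nearest-neighbor exchange on the slide.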
Computational Genomics: Past, Present and Future
The Past: from Sequence to Genomes
• Gene Finding
• Sequence Diagnostics
• Database Search
• Sequence Clustering
• Functional Annotations
The Present: from genomes to systems
• Whole Genome alignment
• Functional Coupling Of Gene clusters
• Fusion Analysis
• Genome subtraction
• Structural Genomics
• Other developments
The Future: from systems to functions
• Association of function rules
• Querying Biological databases
• Classification Of Biological systems
SIMPLIFIED SYSTEM ARCHITECTURE (metabolites neglected)
[Figure: I* Input (Extracellular); Ep*; R** Receptor; C** Channels; S** Signaling Pathway (Cytosol); T* Transcript Profile (Nucleus); M* Microarray. Components are labeled with the names Rishi/Dan/Jeremy, Boris/Jan, Jeremy, Bob, or ???]
* = ALL VARIABLES MEASURED
** = SOME VARIABLES MEASURED
High-Performance Computing and Bioinformatics
Applications: Computational Biology, Bioinformatics
Discovery Informatics: Data collection & Organization; Information Mining & Discovery; Modeling; Database; Simulation/Emulator; Result presentation/visualization
Core Technology
• EARTH Technology
• Programming tools
• Runtime system
• Hardware
• Compiler
• Technology
• Retargeting
• Tools
Future: BioMEX Model and Testbed
BIO: Transcript expression profiles across perturbation series; mRNA in identified cell type
(a) Intercellular Level: Cardiorespiratory control circuit (Arterial Baroreceptors; NTS; Inter NA/DMN; Cardiac Ganglion; Heart and Peripheral Vasculature; Principal Neuron and other homeostatic control neurons)
(b) Signaling Pathway: receptor; Ca2+ messenger; K+
(c) Gene Network: DNA, mRNA, Protein. Modeled by ODE:
$$\frac{dr}{dt} = f(p) - Vr, \qquad \frac{dp}{dt} = Lr - Up$$
BIOMEX Platform: Emulation with threaded programs p1 p2 p3 p4; Threaded Algorithm for ODE/PDE implementation; BIOMEX Engine
High-speed net; Bioinformatics Server; Visualization Server; Model
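The gene-network ODE on this slide, dr/dt = f(p) - Vr and dp/dt = Lr - Up, can be integrated numerically. The sketch below is a scalar forward Euler illustration, not the BIOMEX engine. Reading r as the mRNA level and p as the protein level, treating V, L and U as scalar rate constants, and the function name `simulate` are all assumptions made for the example.

```python
def simulate(f, V, L, U, r0, p0, dt, steps):
    """Integrate dr/dt = f(p) - V*r and dp/dt = L*r - U*p with forward Euler.

    f is the transcription function; V, L and U are treated as scalar
    rate constants here (an assumption for illustration).
    Returns the final (r, p).
    """
    r, p = r0, p0
    for _ in range(steps):
        dr = f(p) - V * r
        dp = L * r - U * p
        r, p = r + dt * dr, p + dt * dp
    return r, p

# Hypothetical constant transcription rate: the steady state is
# r = f / V and p = L * r / U.
r, p = simulate(lambda p: 2.0, V=1.0, L=3.0, U=0.5, r0=0.0, p0=0.0, dt=0.01, steps=5000)
print(r, p)
```

With a constant f the system settles at r = 2 and p = 12 for these values, which gives a quick sanity check of the integrator. A feedback function such as a Hill-type repression term could be passed as `f` to explore regulated expression.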