Bioinformatics

Bioinformatics Cindy Burklow, Kyle Eli, Clay Harris

What is Bioinformatics? • “Any use of computers to handle biological information.” • Or, more specifically: • “The use of computers to characterize the molecular components of living things.”

What is Bioinformatics? • Biomolecules • “Doing Bioinformatics” • And simulate! • Classical bioinformatics deals primarily with sequence analysis • Polymers • Monomers • Macromolecules • Sequences

What is Bioinformatics? • “Post-genomic” era • Comparative genomics • New technologies to measure gene expression • Large-scale methods for identifying gene function • A shift to finding gene products • Proteomics • Structural Genomics

Bioinformatic Fields • Biophysics • Cheminformatics • Computational Biology • Genomics • Mathematical Biology • Medical informatics/Medinformatics • Pharmacogenomics • Pharmacogenetics • Proteomics

BLAST • Basic Local Alignment Search Tool (BLAST) • Collection of software program tools • Software version 2.1.13 offered by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health • Compares nucleotide or protein sequences to sequence databases • Finds regions of local similarity between sequences • Calculates the statistical significance of matches • Helps infer functional relationships between sequences and identify members of gene families

BLAST • Offers different program tools & databases • Provides a guide to help users decide which BLAST tool to use, based on the nature & size of the input query and the primary goal of the search • A BLAST search comprises four components: Query, Database, Program, Search purpose/goal

BLAST

Ways to interface with BLAST • Standardized application program interface (API) for accessing the NCBI QBlast system • Direct HTTP-encoded requests to the NCBI web server • BLAST utilities allow you to run searches on your own computer • NetBlast has command-line network clients that allow you to submit searches to NCBI
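
As a rough illustration of a direct HTTP-encoded request, the sketch below only builds the URL-encoded text of a search submission and does not contact any server. The endpoint URL and the parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`) are assumptions based on the NCBI BLAST URL API and should be checked against the current NCBI documentation; the query sequence and the helper name `build_put_request` are made up for this example.

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint for the NCBI BLAST URL API; verify before real use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def build_put_request(query, program="blastn", database="nt"):
    """Return the URL-encoded parameters of a search submission (nothing is sent)."""
    params = {
        "CMD": "Put",          # assumed command name for submitting a search
        "PROGRAM": program,    # which BLAST program to run
        "DATABASE": database,  # which sequence database to search
        "QUERY": query,        # the input sequence
    }
    return urlencode(params)

body = build_put_request("ACGTACGTAGCTAGCTAGGCTA")
print(BLAST_URL + "?" + body)
```

The three search parameters mirror the query, database, and program components of a BLAST search; the search goal determines which values to choose.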

A Case Study of High-Throughput Biological Data Processing on Parallel Platforms San Diego Supercomputer Center and Department of Pharmacology, University of California

History • Structure comparison algorithms for protein structures have been under development for more than 20 years • Traditionally uses conventional, functionally driven structure determination • Algorithm classifications, by the unit used to build alignments: Single residues; Fragments of multiple residues; Secondary structure elements • CHALLENGE: Highly redundant datasets require very large computations to gain insight into the meaning of the data

Protein Structures • What is important about protein structures? • Used for protein classification, a better understanding of function, and a clear explanation of distant homologous relationships that cannot be recovered from sequence alone, since sequence is more variable than structure • Comparing a single query against a very large database, the Protein Data Bank (PDB) • Types of comparisons: Sequence-Sequence; Sequence-Structure; Structure-Structure

Scale of Problem • Protein Data Bank of 35,000 chains • Pairwise comparison = ~3 seconds on average • Without considering redundancy or chain size, a complete computation would take on average ((35,000 * 35,000)/2) * 3 seconds ≈ 21,000 processor-days, or about 58 YEARS! • TIME IS A BIG PROBLEM!
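
The estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch that reuses the slide's own (N * N)/2 pair count and the 3-second average; the function name and the 365-day year are choices made for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def all_pairs_cost(n_chains, seconds_per_pair=3.0):
    """Estimate serial run time using the slide's (N * N) / 2 pair count."""
    n_pairs = (n_chains * n_chains) / 2
    total_seconds = n_pairs * seconds_per_pair
    days = total_seconds / SECONDS_PER_DAY
    years = days / 365  # assumes a 365-day year
    return n_pairs, days, years

n_pairs, days, years = all_pairs_cost(35_000)
print(f"{n_pairs:.3e} pairs, {days:,.0f} processor-days, {years:.1f} years")
```

With 35,000 chains this gives roughly 21,000 processor-days, in line with the figure quoted on the slide.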

Problems • Determination & Comparison of 3-D protein structures • Massively parallel computations are needed

Background • Looking for a more efficient way to analyze large data sets • Taking advantage of the redundancy present in data sets • KEY: Data preprocessing step & organization of the data being searched BEFORE being passed to PARALLEL COMPUTERS

Other Issues to Consider • Algorithm should give optimal performance • Scale with the number of processors involved.

Optimization Procedures • Dynamic Programming • Monte-Carlo • Graph Theory • Combinatorial Search

What does CEPAR stand for? • CE: Combinatorial Extension Algorithm • PAR: Parallel Mode

What is Combinatorial Extension Algorithm? • Method of automatically aligning pairs of structures • Compiles an alignment of a given pair of protein chains by considering the chains sectioned into all possible octapeptide fragments, as defined by the backbone α-carbons • Octapeptide pairs that have a high distance-based similarity score are deemed “aligned fragment pairs” (AFPs) & used in the next step of analysis • The CE algorithm then tries to join each AFP to a maximal number of other AFPs in order to create the longest possible alignment path through the two proteins under consideration (with allowance for gaps of up to 30 residues in either protein chain), stitching together a set of AFPs that covers contiguous regions.
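
The fragment-pairing step can be sketched as follows. This is a simplified illustration, not the CE implementation: the similarity measure (mean absolute difference of the distances between α-carbons inside each fragment) and the cutoff value are assumptions, since the slides do not give the exact formula, and the toy coordinates are invented.

```python
import math
from itertools import combinations

FRAG_LEN = 8  # octapeptide fragments, as described on the slide

def internal_distances(ca, start, length=FRAG_LEN):
    """Distances between all pairs of alpha-carbons inside one fragment."""
    frag = ca[start:start + length]
    return [math.dist(p, q) for p, q in combinations(frag, 2)]

def fragment_score(ca_a, i, ca_b, j):
    """Mean absolute difference of internal distances (lower = more similar).

    Assumption: a simple stand-in for CE's distance-based similarity score.
    """
    da = internal_distances(ca_a, i)
    db = internal_distances(ca_b, j)
    return sum(abs(x - y) for x, y in zip(da, db)) / len(da)

def find_afps(ca_a, ca_b, cutoff=1.0):
    """Return (i, j, score) for every fragment pair scoring under the cutoff."""
    afps = []
    for i in range(len(ca_a) - FRAG_LEN + 1):
        for j in range(len(ca_b) - FRAG_LEN + 1):
            score = fragment_score(ca_a, i, ca_b, j)
            if score < cutoff:
                afps.append((i, j, score))
    return afps

# Toy chains: a straight line of 10 "residues" and the same line moved in space.
chain_a = [(3.8 * k, 0.0, 0.0) for k in range(10)]
chain_b = [(3.8 * k + 5.0, 2.0, -1.0) for k in range(10)]
afps = find_afps(chain_a, chain_b)
print(len(afps), "aligned fragment pairs")
```

Because the score uses only distances within each fragment, it does not depend on where the two chains sit in space, which is why no superposition is needed at this stage.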

What is Combinatorial Extension Algorithm? • After possible paths through the two proteins are determined, CE uses additional heuristics to try to improve the final alignment • The 20 best-scoring paths are compiled & the proteins are directly compared based upon the superimposition of the aligned residues. • The path that yields the lowest Root Mean Square Deviation (RMSD) is retained as the “optimal path”. • This path is then refined by dynamic programming applied directly to the structural alignment between the two structures, which tests all possible residue equivalences & the resulting RMSD from their superposition.
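
Selecting the optimal path by RMSD can be sketched as below. The sketch assumes the two coordinate sets are already superimposed, so it skips the superposition that CE performs for each candidate path; the function names and toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between equal-length coordinate lists that are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def best_path(ca_a, ca_b, paths):
    """Pick the candidate path (a list of (i, j) residue pairs) with the lowest RMSD."""
    def path_rmsd(path):
        return rmsd([ca_a[i] for i, _ in path], [ca_b[j] for _, j in path])
    return min(paths, key=path_rmsd)

# Toy data: two short chains and two candidate alignment paths.
chain_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
chain_b = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 1, 0)]
path_1 = [(0, 0), (1, 1), (2, 2)]
path_2 = [(0, 1), (1, 2), (2, 3)]
best = best_path(chain_a, chain_b, [path_1, path_2])
print(best)
```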

Parallel Algorithm • CEPAR uses coarse-grain parallel implementation involving a master/worker strategy suitable for a massively parallel computer architecture. • A parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result.

What does CEPAR do? • Finds pairwise protein structure similarities • Pairwise 3D protein structure comparison • Aligns protein structure from Protein Data Bank • Matches protein structure-to-structure • Runs on a large number of processors

How does CEPAR work? • Optimizes the use of the Combinatorial Extension algorithm for the pairwise alignment of polypeptide chains to manage comparative structural information • Builds a structurally representative set of protein chains & reveals structural similarities in the Protein Data Bank, in a way that scales with this fast-growing source of data

How does CEPAR work? • Only one master processor was used. It was not advantageous to use more than one master processor because of communication issues. • Each worker receives a work assignment from the master, compares the 2 entities contained in the assignment using the CE algorithm, returns the results of the comparison to the master & is then ready to receive another assignment • Workers only need to communicate with the master processor, not with each other • Program written in C++ and uses MPI for communication between master & workers
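
The master/worker loop described above can be sketched in a single process. This is not the CEPAR code, which is written in C++ with MPI: here Python threads and queues stand in for MPI ranks and messages, and `compare` is a hypothetical placeholder for the CE pairwise comparison.

```python
import queue
import threading
from itertools import combinations

def compare(a, b):
    """Placeholder for the CE pairwise comparison (hypothetical stand-in)."""
    return abs(len(a) - len(b))

def worker(tasks, results):
    """Take assignments from the master until a None stop signal arrives."""
    while True:
        assignment = tasks.get()
        if assignment is None:
            break
        a, b = assignment
        results.put((a, b, compare(a, b)))

def run_master(entities, n_workers=4):
    """Master: hand out pairwise assignments and collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    pairs = list(combinations(entities, 2))
    for pair in pairs:
        tasks.put(pair)
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    return sorted(results.get() for _ in pairs)

out = run_master(["MKV", "MKVLA", "MK", "MKVLAAG"])
print(out)
```

As on the slide, workers never talk to each other: every message passes through the master's two queues.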

Computer • “Blue Horizon” – IBM SP parallel computer at the San Diego Supercomputer Center • 1152 Power3+ processors each running at 375MHz • Sun Enterprise 10,000 server & Linux PC cluster • Software can work on any parallel machine or PC cluster with Message Passing Interface (MPI)

Assignments & Problem Formulation • Entity list of N entities where each entity is protein polypeptide chain characterized by amino acid sequence & a set of 3D coordinates • Algorithm for pairwise comparison of entities (CE) • Select Representative Protein Structure • Order of Operations

Representation Criteria Notes • Looking for a similarity criterion between representatives • Alignments not satisfying this criterion are not recorded • Output: List of representatives as well as the entities represented by them & detailed information on alignments satisfying either the representative or the similarity criterion • It is not vector quantization (so as to minimize computer time) • Representatives are randomly chosen instead of calculating the centroid of a cluster • The applied criteria are believed to adequately describe the structural space of the Protein Data Bank

Representation Criteria • Sequence lengths of the two entities: L1 & L2 • Length difference threshold parameter: Lthr • Number of aligned positions: Lali • Alignment length threshold parameter: Athr

Representation Criteria • Gap threshold parameter: Gthr • Number of residues in gaps: Lgap • Final RMSD of the alignment: RMSD < Rthr, where Rthr is the RMSD threshold parameter

Order of Operation • Entity-first (2-step) • Family-first (2-step) • Family-first (1-step)

New problems uncovered… • Running CEPAR in one step produces limited scalability • WHY? At high processor counts, two factors dominate: 1. Number of idle workers 2. Time taken for communication operations • Idle workers are the result of load imbalance at the end of the run, because at this point most of the worker processors have run out of tasks while only a few are still finishing their last assignment. • Resource reservation systems on most public supercomputers reserve a block of processors, making it impossible to release them one by one.

How to deal with Limited Scalability Issue • Idea for production mode: the number of processors assigned should stay below a threshold number • Alternative: use two steps instead of one • Utilizes an early stopping condition, which causes the 1st of the two runs to abort when the accumulated average idle time of the workers exceeds a predefined amount (such as 20% of the total run time). • The remaining part of the calculation is then completed on a smaller number of processors.
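
The early stopping condition can be expressed as a small check. This is a sketch under stated assumptions: "total run time" is read as the time elapsed so far in the first run, and the function name and argument layout are invented for the example.

```python
def should_abort(idle_seconds_per_worker, elapsed_seconds, max_idle_fraction=0.20):
    """True once the average accumulated idle time exceeds the allowed fraction.

    Assumption: the threshold is a fraction of the elapsed run time.
    """
    if not idle_seconds_per_worker or elapsed_seconds <= 0:
        return False
    avg_idle = sum(idle_seconds_per_worker) / len(idle_seconds_per_worker)
    return avg_idle > max_idle_fraction * elapsed_seconds

# Early in the run: average idle time is 20 s against a 40 s allowance.
print(should_abort([10, 20, 30], elapsed_seconds=200))
# Near the end: average idle time is 70 s against a 60 s allowance.
print(should_abort([100, 50, 60], elapsed_seconds=300))
```

Once the check fires, the first run stops and the remaining assignments are handed to a second run on fewer processors.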

Two other problems… • Master processor congestion • Redundancy in assignments • How to avoid congestion… • Improve communications between processors • Implement advance buffering of assignments • Decrease amount of disk I/O • Implement single-CPU optimization techniques

Keys to success • Detecting a match between a representative (rep) & an entity early, to avoid redundancy. • Important to sort reps in decreasing order of their chance of being similar to the given entity. • Estimate this chance by giving priority to those reps whose number of residues is within 10% of the current entity's AND by using similarity in amino acid content based on frequency profiles. • The approach is approximate but provides performance gains over a random or sequential choice of reps.
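
The ordering heuristic can be sketched as a sort key. The slides do not say how frequency profiles are compared, so the distance used here (sum of absolute frequency differences) is an assumption, as are the function names and the toy sequences.

```python
from collections import Counter

def composition(seq):
    """Amino acid frequency profile of a sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

def profile_distance(p, q):
    """Sum of absolute frequency differences (assumed measure; 0 = identical)."""
    return sum(abs(p.get(aa, 0.0) - q.get(aa, 0.0)) for aa in set(p) | set(q))

def order_representatives(entity, reps, length_tolerance=0.10):
    """Sort reps so the ones most likely to match the entity come first."""
    target = composition(entity)

    def key(rep):
        within = abs(len(rep) - len(entity)) <= length_tolerance * len(entity)
        # False sorts before True, so reps with a similar length come first.
        return (not within, profile_distance(target, composition(rep)))

    return sorted(reps, key=key)

entity = "AAAAGGGGLL" * 2
rep_close = "AAAAGGGGLL" * 2 + "A"  # similar length and composition
rep_one_sub = "AAAAGGGGLK" * 2      # same length, slightly different content
rep_unlike = "W" * 20               # same length, unrelated content
rep_long = "AAAAGGGGLL" * 4         # same content, but twice as long
order = order_representatives(entity, [rep_long, rep_unlike, rep_one_sub, rep_close])
print(order)
```

Comparing against the most promising reps first means a match is usually found after fewer CE comparisons, which is where the performance gain comes from.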

MPI Communication • At first, the efficiency of MPI communication appears to play an insignificant role in overall performance, since communication time is a small fraction of the overall CEPAR computation time. However, the time does add up, and efficient MPI communication does help. • Key: Select the appropriate MPI send function for the hardware/software at hand. • Example: IBM’s implementation of MPI’s blocking send function MPI_Send() is not appropriate because this implementation does not buffer the message for large message sizes. • MPI implementations that avoid buffering messages can cause deadlock in some cases. • In CEPAR no deadlocks occur. However, the master processor can be blocked while waiting for some worker processors to finish. The MPI_Bsend() function for buffered sends solves this problem.

Results • The family-first approach outperformed the entity-first approach. • End-of-run load imbalance and allocation of processors were addressed with the two-step approach • Careful selection of MPI implementation • Overall CEPAR performance…

Advantages of CEPAR • Ensures optimal use of high-performance computing resources • Analysis of large amounts of data • Can be used on any distributed-memory platform • Can scale with the number of processors involved • Saves time & computational resources

Summary • Efficient use of resources depends on meticulous design of the algorithm and software, with performance & scalability given a high priority. • Organization of the data being fed to processors • Optimization of the algorithm for distribution of assignments

Proteomics

What is Proteomics? • The study of the proteome. • A proteome is “the set of proteins that can be expressed by the genetic material of an organism.” • In other words, the study of all proteins, the interactions between them, and “their role in physiological and pathophysiological functions”. • Hopefully will directly contribute to a full description of cellular function.

Challenges in Proteomics Research • Limited and variable sample material. • Sample Degradation. • Vast dynamic range. • For example, in human serum the concentration of albumin is 10 billion times greater than the concentration of the signaling protein interleukin-6.

Challenges in Proteomics Research (cont’d) • Plethora of post-translational modifications. • Nearly boundless tissue, developmental and temporal specificity. • Disease and drug perturbations. • “…these difficulties render any comprehensive proteomics project an inherently intimidating and often humbling exercise.”

Five Pillars of Proteomics Research • Mass spectrometry-based. • Proteome-wide biochemical arrays. • Systematic structural biology and imaging techniques. • Proteome informatics. • Clinical applications.

Mass spectrometry-based Proteomics • A primary driving force in proteomics. • Advancements allow the identification of smaller proteins in more complex mixtures. • Initially, research required separation of proteins by two-dimensional gel electrophoresis before using mass spectrometry. • This was limited to the most abundant proteins.

Mass spectrometry-based Proteomics (cont’d) • Now, mass spectrometric analysis is used directly. • Advancements are increasing sensitivity, robustness and data handling. • Plenty of work to do… • Much higher throughput and sensitivity are needed for observing proteome dynamics and cellular response. • More complete sequence coverage. • Process and workflow refinement. • Automated protein identification. • Detection of post-translational modification.

Array-based Proteomics • Array of immobilized proteins on a support surface. • One of the most active areas in biotechnology. • Sensitive, high-throughput. • Wide range of applications. • Diagnostics. • Protein-protein interaction. • Protein expression profiling on a small or large scale. • Target identification and validation in the pharmaceutical industry.
