This document provides an overview of foundational concepts in both computational and molecular biology. It covers essential topics including strings and alphabets, basic graph theory notations, significant algorithms and their complexities, and key elements of DNA and protein structures. It further explores the Central Dogma of molecular biology encompassing transcription and translation processes, alongside methods used in handling NP-complete problems and other graph-related issues. This insight into molecular biology emphasizes the interplay between computational techniques and biological data.
Computational Molecular Biology: Introduction and Preliminaries
Preliminaries in Computer Science • Strings and alphabet • Basic notations in graph theory • Algorithms and Complexity My T. Thai mythai@cise.ufl.edu
Strings • Consist of a sequence of letters: • DNA: four nucleotides A, C, G, T • Proteins: 20-symbol alphabet of amino acids • Given a string s, we have the following notations: • Length: |s| • Substring: ACT is a substring of ATGACTG • Superstring: ATGACTG is a superstring of ACT • Index and interval: s[i] and s[i..j] • Prefix and suffix: s[1..j] and s[i..|s|]
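The string notation above can be sketched in Python. The helper names `substring`, `prefix`, and `suffix` are illustrative assumptions; they translate the slides' 1-based, inclusive indices into Python's 0-based slices.

```python
def substring(s: str, i: int, j: int) -> str:
    """Return s[i..j] using 1-based, inclusive indices as on the slide."""
    if not (1 <= i <= j <= len(s)):
        raise ValueError("need 1 <= i <= j <= |s|")
    return s[i - 1:j]


def prefix(s: str, j: int) -> str:
    """Return the prefix s[1..j]."""
    return substring(s, 1, j)


def suffix(s: str, i: int) -> str:
    """Return the suffix s[i..|s|]."""
    return substring(s, i, len(s))


s = "ATGACTG"
print(len(s))              # |s| = 7
print("ACT" in s)          # ACT is a substring of ATGACTG: True
print(substring(s, 4, 6))  # ACT
print(prefix(s, 3))        # ATG
print(suffix(s, 5))        # CTG
```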
Graphs • G = (V, E) where V is a set of vertices and E is a set of edges • Undirected graph: edges are undirected • Directed graph: edges are directed • Weighted graph G = (V, E, w) where each edge has some weight • Some special graphs: complete graph, bipartite graph, tree, and interval graph • Subgraph, spanning tree, Steiner tree
Interval Graphs • Intersection graph of a set of intervals on the real line • A vertex represents an interval and an edge (u, v) exists if intervals u and v intersect
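As a minimal sketch of this definition, the function below builds the edge set of an interval graph. The function name, the dictionary input, and the choice of closed intervals are assumptions made for illustration.

```python
from itertools import combinations


def interval_graph(intervals):
    """Build the edge set of the intersection graph of closed intervals.

    intervals: dict mapping a vertex name to a (lo, hi) pair.
    Returns a set of frozenset({u, v}) edges.
    """
    edges = set()
    for (u, (a, b)), (v, (c, d)) in combinations(intervals.items(), 2):
        # Two closed intervals intersect iff the larger left end
        # does not exceed the smaller right end.
        if max(a, c) <= min(b, d):
            edges.add(frozenset((u, v)))
    return edges


# Hypothetical intervals, chosen only to illustrate the construction.
example = {"a": (1, 4), "b": (3, 6), "c": (5, 8), "d": (9, 10)}
print(sorted(tuple(sorted(e)) for e in interval_graph(example)))
# [('a', 'b'), ('b', 'c')]
```

Vertex d overlaps no other interval, so it ends up isolated in the graph.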
Some Problems in Graphs • Euler circuit: Given a graph, find a cycle that passes through each edge exactly once • Hamiltonian circuit: Given a graph, find a cycle that passes through each vertex exactly once • Minimum Spanning Tree: Given a weighted undirected graph, find a spanning tree with minimum total weight • Maximum Matching: Given an undirected graph, find a maximum cardinality matching, which is a subset of edges such that no two edges in the subset share an endpoint
P vs. NP • Class P: the set of problems solvable by polynomial-time algorithms • Class NP: the set of problems whose solutions, once found, can be verified in polynomial time • NP-complete (NP-hard) problems: no polynomial-time algorithm for finding an optimal solution is known, and none exists unless P = NP
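To illustrate "verified in polynomial time", here is a hedged sketch of a certificate checker for the Hamiltonian circuit problem. The function name and input format are assumptions, and the graph is assumed to be simple and undirected with at least three vertices. Checking a proposed cycle is fast even though finding one is believed to be hard.

```python
def is_hamiltonian_cycle(vertices, edges, cycle):
    """Check a proposed Hamiltonian cycle in time polynomial in the input size.

    vertices: iterable of vertex names
    edges:    iterable of (u, v) pairs of an undirected simple graph
    cycle:    proposed ordering of the vertices (the certificate)
    """
    vertex_set = set(vertices)
    edge_set = {frozenset(e) for e in edges}
    # The certificate must list every vertex exactly once.
    if len(cycle) != len(vertex_set) or set(cycle) != vertex_set:
        return False
    # Consecutive vertices, wrapping around at the end, must be adjacent.
    n = len(cycle)
    return all(
        frozenset((cycle[k], cycle[(k + 1) % n])) in edge_set for k in range(n)
    )


square = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_hamiltonian_cycle([1, 2, 3, 4], square, [1, 2, 3, 4]))  # True
print(is_hamiltonian_cycle([1, 2, 3, 4], square, [1, 3, 2, 4]))  # False
```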
Some Approaches for NP-complete Problems • Special-case method: Work on the problem with a restricted class of inputs • Exhaustive search: Design an exponential-time algorithm that may perform well in practice • Approximation algorithms: Design a polynomial-time algorithm that is guaranteed to find near-optimal solutions (with a good approximation ratio) • Heuristics: Fast algorithms that produce satisfactory solutions most of the time but without guarantee
Preliminaries in Molecular Biology
DNA and Base Pairs • Double helix consisting of two complementary strands • Has four types of nucleotides: Adenine, Thymine, Guanine, Cytosine • Base pairs: A↔T, C↔G • The two ends of a strand are marked 3’ and 5’ • The entire DNA of a living organism is called its genome
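The base-pairing rule A↔T, C↔G gives the partner strand directly. The sketch below uses illustrative names, not names from the slides. Because the two strands run in opposite 5’ to 3’ directions, the partner strand read 5’ to 3’ is the reverse complement.

```python
# Watson-Crick pairing from the slide: A<->T, C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Pair each base with its partner, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5' -> 3' direction."""
    return complement(strand)[::-1]


print(complement("ATGACTG"))          # TACTGAC
print(reverse_complement("ATGACTG"))  # CAGTCAT
```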
DNA Sequences
DNA Replication • The strands are separated • Each parental strand serves as the template for a new complementary strand
Cell, Chromosome, and DNA
Cell Classification
Chromosomes • Each consists of a DNA molecule associated with proteins: some fold and pack the DNA thread into a more compact structure, and others are required for gene expression, DNA replication, and DNA repair • The human genome is distributed over 24 distinct chromosomes (22 autosomes plus X and Y) • Each somatic cell contains 46 chromosomes • 22 pairs are common to both males and females • The remaining pair is the sex chromosomes: X and Y in males, two Xs in females
Genes • Segments of DNA • Functional and physical unit of heredity passed from parent to offspring • Contain the information for making a specific protein
Proteins • Short strings over the 20-letter amino acid alphabet • Humans: about 100,000 proteins, each a few hundred amino acids long • Bacteria make 500-1500 proteins • Made from genes (fragments of DNA) that are roughly three times longer than the corresponding proteins • Why? Every 3 nucleotides in the DNA alphabet code for one letter in the protein alphabet of amino acids
Central Dogma of Molecular Biology
Transcription
Translation • Translation • mRNA, after being exported from the nucleus to the cytosol, directs the synthesis of the protein by joining together amino acids in the order encoded by the mRNA • Genetic code • Defines a mapping between codons and amino acids • Codon • A triplet of nucleotides that specifies a single amino acid in the corresponding protein • 64 codons and 20 amino acids • Translation is carried out by ribosomes
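The codon-to-amino-acid mapping can be sketched as a lookup table. The code below assumes the standard genetic code, written in the compact T, C, A, G ordering with `*` for stop codons. The function name `translate` and the choice to begin reading at position 0 are simplifying assumptions, not part of the slides.

```python
from itertools import product

# Standard genetic code, one letter per codon in TCAG order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA (or coding DNA) string codon by codon until a stop."""
    seq = mrna.upper().replace("U", "T")
    protein = []
    for k in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[k:k + 3]]
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))                    # 64 codons
print(len(set(AMINO_ACIDS) - {"*"}))       # 20 amino acids
print(translate("AUGGCCUAA"))              # MA
```

Since 64 codons map to 20 amino acids plus stop signals, several codons encode the same amino acid.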
Polymerase Chain Reaction (PCR) • Primer • Nucleic acid strand • Serves as a starting point of DNA replication
Plasmid Vector • Vector • An agent that can carry a DNA fragment into a host cell • Plasmid • Circular and double-stranded DNA • Antibiotic resistance • Autonomous replication • Exists in bacteria
DNA Cloning Using Plasmids as Vectors • (a) DNA recombination • (b) Transformation
DNA Cloning Using Plasmids as Vectors (Cont.) • (c) Selective amplification • (d) Isolation of desired DNA clones
DNA Library Screening • Probe: • Labeled with radioisotope or fluorescence • Used to detect specific DNA sequences by hybridization • Hybridization: • Binding of two nucleic acid chains by base pairing • DNA Library Screening • Determine, for each clone, whether it contains a probe from a given set of probes • Positive clone: contains a probe
Some Computational Problems • Pooling Design • Non-unique probe selection • Sequence Alignment, Multiple Sequence Alignment • DNA sequencing • Genome Rearrangement • Protein Structure Prediction and Recognition • Protein-Protein Interactions • Functional Groups, Modules
Pooling Designs • Problem Definition • Given a set of n clones with at most d positive clones • Identify all positive clones with the minimum number of tests • Pool: a subset of clones • Positive pool: a pool that contains at least one positive clone
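Pooling Designs • Rows are the pools p1, p2, …, pi, …, pt; columns are the clones c1, c2, …, cj, …, cn • M[i, j] = 1 iff the ith pool contains the jth clone • Testing the t pools gives the t × 1 outcome vector V(D) • Decoding Algorithm: Given M and V(D), identify all positive clones
$$
M_{t \times n} =
\begin{pmatrix}
0 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & & & & & & & & & \vdots \\
0 & 0 & \cdots & 0 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\
\vdots & & & & & & & & & \vdots \\
0 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0
\end{pmatrix},
\qquad
V(D)_{t \times 1} =
\begin{pmatrix} 0 \\ 1 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix}
$$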
Challenges • Challenge 1: How to construct the binary matrix M such that: • The unions (Boolean sums) of any two distinct sets of at most d columns differ, so every possible set of positive clones yields a distinct outcome vector • Challenge 2: How to design a decoding algorithm with efficient time complexity [O(tn)]
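A minimal sketch of testing and decoding under stated assumptions: tests are error-free, and M is d-disjunct (no column is covered by the union of d others), which is a stronger property than the distinct-unions condition and is what makes the simple decoder below exact. The names `pool_outcomes` and `naive_decode` and the example matrix are illustrative.

```python
def pool_outcomes(M, positives):
    """Outcome vector: entry i is 1 iff pool i contains a positive clone."""
    return [int(any(row[j] for j in positives)) for row in M]


def naive_decode(M, V):
    """Clear every clone that appears in a negative pool; O(t * n) time.

    The clones left over are reported as positive. This is exact when M is
    d-disjunct and there are at most d positives (an assumption here).
    """
    n = len(M[0])
    candidate = [True] * n
    for row, outcome in zip(M, V):
        if outcome == 0:
            for j in range(n):
                if row[j]:
                    candidate[j] = False
    return [j for j in range(n) if candidate[j]]


# Hypothetical 4 x 6 design: each clone sits in a different pair of pools,
# so no column contains another (1-disjunct).
M = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
V = pool_outcomes(M, positives={3})
print(V)                   # [0, 1, 1, 0]
print(naive_decode(M, V))  # [3]
```

Four tests identify one positive among six clones here, compared with six individual tests.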
Probe Selection • Problem Definition: • Given a biological sample (e.g., blood) and a set of probes • Identify the presence (or absence) of some biological objects (e.g., viruses or bacteria) with the minimum number of probes
Unique Probes vs. Non-unique Probes • Unique probes • Gene-specific probes or signature probes • Difficult to find • Non-unique probes • Hybridize to more than one target • Difficult to decode the results
Probe-Target Matrix • 12 probe candidates • 4 targets (genes) • For a target set S, define P(S) as the set of probes reacting to any gene in S • P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12} • P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12} • Symmetric set difference: P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}, the probes that separate the two target sets
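The symmetric difference on this slide can be checked with Python sets. The two probe sets are copied from the slide. The helper name `separating_probes` is an assumption, and the underlying probe-target matrix is not reproduced here.

```python
def separating_probes(p_a: set, p_b: set) -> set:
    """Probes that react to exactly one of the two target sets."""
    return p_a ^ p_b  # symmetric set difference


p_12 = {1, 2, 3, 4, 7, 8, 9, 10, 12}  # P({1, 2}) from the slide
p_23 = {1, 3, 4, 5, 6, 7, 8, 9, 12}   # P({2, 3}) from the slide

print(sorted(separating_probes(p_12, p_23)))  # [2, 5, 6, 10]
```

A probe in this difference gives different readings for the two target sets, which is what allows them to be told apart.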
Sequence Alignment • Problem Definition: • Given: 2 DNA or protein sequences • Find: the best match between them • What is an alignment: • Given: 2 strings S and S’ • Goal: make S and S’ the same length by inserting spaces into them
Matches, Mismatches and Indels • Match: two aligned, identical characters in an alignment • Mismatch: two aligned, unequal characters • Indel: a character aligned with a space • Example (10 matches, 2 mismatches, 7 indels):
A A C T A C T - C C T A A C A C T - -
- - C T C C T A C C T - - T A C T T T
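The counts in this example can be reproduced with a short column classifier. The two gapped rows below are one reading of the flattened slide text, split into 19 columns each, and the function name is an assumption. A column is counted as an indel whenever either row has a gap.

```python
def classify_columns(top: str, bottom: str, gap: str = "-"):
    """Count (matches, mismatches, indels) over the columns of an alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


top = "AACTACT-CCTAACACT--"
bottom = "--CTCCTACCT--TACTTT"
print(classify_columns(top, bottom))  # (10, 2, 7)
```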
Basic Algorithmic Problem • Find the alignment of the two strings that: • Maximizes m, where m = #matches - #mismatches - #indels • The maximum m defines the similarity of the two strings; an alignment that attains it is called an Optimal Global Alignment • Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
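One standard way to maximize m is dynamic programming over prefixes of the two strings (the Needleman-Wunsch scheme). The slides do not name an algorithm, so this is an assumed technique. The sketch scores +1 per match and -1 per mismatch or indel, as in the definition of m, and returns only the optimal score.

```python
def global_alignment_score(s: str, t: str) -> int:
    """Maximum of (#matches - #mismatches - #indels) over all alignments."""
    # prev[j] = best score aligning the current prefix of s with t[:j].
    # Aligning an empty prefix of s with t[:j] costs j indels.
    prev = [-j for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [-i]  # s[:i] against the empty string: i indels
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (1 if s[i - 1] == t[j - 1] else -1)
            up = prev[j] - 1       # s[i-1] aligned with a space
            left = cur[j - 1] - 1  # t[j-1] aligned with a space
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


print(global_alignment_score("ACT", "ACT"))  # 3
print(global_alignment_score("ACT", "AT"))   # 1
```

The running time is O(|s| · |t|), and keeping only two rows uses O(|t|) space. Recovering an actual alignment would require storing the full table or traceback pointers.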
Multiple Sequence Alignment • Problem Definition: • Similar to the sequence alignment problem but the input has more than 2 strings • Challenges: • NP-hard • Approximation guarantee: factor 2 – 2/k, where k is the number of input sequences • More work is needed to reduce the time and space complexity
DNA Sequencing • Problem Definition: • Given a set of fragments that are contained in a DNA string S • Goal: Determine the string S • NP-complete • Further complicated by repetitive sequences in the genome • Can be cast as a Hamiltonian path or Euler path problem (introduced by Pavel Pevzner)
Genome Rearrangement • Problem Definition: • Given the genomes of 2 different species • Goal: Find a sequence of evolutionary events that turns the first genome into the second • Biological motivation: how close these species are, and how much evolution separates them • E.g.: We usually test new drugs on mice before humans. However, how close is a mouse to a human?
Genome Rearrangement • Can we use the solutions of sequence alignment to solve this problem? • Answer: NO, because: • A genome is a very long string (about 3 billion letters for the human genome) • The sequence alignment model is not appropriate for genome comparison, since the differences are not insertions, deletions, or mutations of single nucleotides but rearrangements of long DNA regions • The basic unit of comparison is the gene
An Example • Sequence alignment cannot meaningfully compare these two strings • However, the second string is simply the first string with the fragment AATGGT…CCC reversed
Main Evolutionary Events • Deletions: A fragment is removed • Duplications: Copies of a fragment are created and inserted at different positions • Transpositions: A fragment is removed and re-inserted at a different position • Inversions: A fragment is removed, reversed, and then reinserted at the same position • Translocations: A pair of fragments are exchanged between the ends of two chromosomes
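Three of these events can be sketched as operations on a sequence of genes. The function names, the 0-based half-open index convention, and the letter-per-gene example are assumptions. The sketch also ignores strand orientation, which a biological inversion would flip as well.

```python
def deletion(genome, i, j):
    """Remove the fragment genome[i:j]."""
    return genome[:i] + genome[j:]


def inversion(genome, i, j):
    """Reverse the fragment genome[i:j] and leave it at the same position."""
    return genome[:i] + genome[i:j][::-1] + genome[j:]


def transposition(genome, i, j, k):
    """Cut genome[i:j] and re-insert it at index k of what remains."""
    fragment = genome[i:j]
    rest = genome[:i] + genome[j:]
    return rest[:k] + fragment + rest[k:]


genome = "ABCDEFG"  # hypothetical gene order, one letter per gene
print(deletion(genome, 2, 5))          # ABFG
print(inversion(genome, 2, 5))         # ABEDCFG
print(transposition(genome, 2, 5, 1))  # ACDEBFG
```

Applying the same inversion twice restores the original order, so a single inversion can undo itself.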
Protein Structure Prediction • Problem Definition: • Given: A sequence of amino acids • Goal: Predict the 3D structure of the protein • Some approaches: • Determine the positions of a protein’s atoms so as to minimize the total free energy • Find similarities to known proteins
Community Structure • Problem Definition: • Given a graph G = (V, E) representing a network • Partition G into a set of subgraphs (the community structure) so that the nodes in each subgraph are highly connected • Biological reason: Genes with similar expression data may have similar functions. Identifying the community structure can help reduce the number of tests • Others: Community structure is also studied in other fields