Computational Molecular Biology

Presentation Transcript


  1. Computational Molecular Biology: Introduction and Preliminaries

  2. Preliminaries in Computer Science • Strings and alphabet • Basic notations in graph theory • Algorithms and Complexity My T. Thai mythai@cise.ufl.edu

  3. Strings • Consist of a sequence of letters: • DNA: four nucleotides A, C, G, T • Proteins: 20-symbol alphabet of amino acids • Given a string s, we use the following notation: • Length: |s| • Substring: ACT is a substring of ATGACTG • Superstring: ATGACTG is a superstring of ACT • Index and interval: s[i] and s[i..j] • Prefix and suffix: s[1..j] and s[i..|s|]
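
As a rough illustration of the notation above, the following Python sketch maps the slide's 1-based, inclusive s[i..j] onto Python's 0-based, half-open slices. The helper names `interval`, `prefix`, and `suffix` are made up for this example and do not come from the slides.

```python
def interval(s, i, j):
    """Return s[i..j] in the slide's 1-based, inclusive notation."""
    return s[i - 1:j]


def prefix(s, j):
    """Return the prefix s[1..j]."""
    return interval(s, 1, j)


def suffix(s, i):
    """Return the suffix s[i..|s|]."""
    return interval(s, i, len(s))


s = "ATGACTG"
print(len(s))                      # length |s|
print("ACT" in s)                  # substring test
print(interval(s, 4, 6))           # s[4..6]
print(prefix(s, 3), suffix(s, 5))  # s[1..3] and s[5..|s|]
```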

  4. Graphs • G = (V, E) where V is a set of vertices and E is a set of edges • Undirected graph: edges are undirected • Directed graph: edges are directed • Weighted graph G = (V, E, w) where each edge has some weight • Some special graphs: complete graph, bipartite graph, tree, and interval graph • Subgraph, spanning tree, Steiner tree

  5. Interval Graphs • Intersection graph of a set of intervals on the real line • Each vertex represents an interval, and an edge (u, v) exists if and only if intervals u and v intersect
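
The definition above can be sketched in a few lines of Python. This is a minimal example that assumes closed intervals given as a dictionary `{name: (lo, hi)}`; the function name `interval_graph` and the sample intervals are illustrative and not taken from the slides.

```python
from itertools import combinations


def interval_graph(intervals):
    """Return the edge set of the intersection graph of closed intervals."""
    edges = set()
    for (u, (a, b)), (v, (c, d)) in combinations(sorted(intervals.items()), 2):
        # Two closed intervals intersect iff the larger left end
        # does not exceed the smaller right end.
        if max(a, c) <= min(b, d):
            edges.add((u, v))
    return edges


intervals = {"a": (1, 4), "b": (3, 6), "c": (5, 8), "d": (9, 10)}
print(sorted(interval_graph(intervals)))
```

The quadratic pairwise check is the simplest choice here; a sweep over sorted endpoints would be faster for large inputs.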

  6. Some Problems in Graphs • Euler circuit: Given a graph, find a cycle that passes through each edge exactly once • Hamiltonian circuit: Given a graph, find a cycle that passes through each vertex exactly once • Minimum Spanning Tree: Given a weighted undirected graph, find a spanning tree with minimum total weight • Maximum Matching: Given an undirected graph, find a maximum cardinality matching, which is a subset of edges such that no two edges in the subset share an endpoint
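
To make one of these problems concrete, here is a hedged sketch of Kruskal's algorithm for the minimum spanning tree problem. The slides do not name a particular algorithm, so Kruskal's method is our choice for illustration; the function name `kruskal` and the `(weight, u, v)` edge format are assumptions, and the input graph is assumed to be connected.

```python
def kruskal(vertices, edges):
    """Return (total_weight, tree_edges) for edges given as (weight, u, v)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the component root, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total, tree = 0, []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two different components
            parent[ru] = rv
            total += w
            tree.append((u, v))
    return total, tree


edges = [(1, "a", "b"), (2, "b", "c"), (3, "a", "c"), (4, "c", "d")]
print(kruskal("abcd", edges))
```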

  7. P vs. NP • Class P: set of problems solvable by polynomial-time algorithms • Class NP: set of problems whose solutions, once found, can be verified in polynomial time • NP-complete (NP-hard) problems: no polynomial-time algorithm for finding an optimal solution is known, and none exists unless P = NP
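
The phrase "verified in polynomial time" can be illustrated with the Hamiltonian circuit problem: finding a circuit is hard, but checking a proposed one is easy. The sketch below assumes an undirected graph given as a vertex list and an edge list; the function name `is_hamiltonian_circuit` is hypothetical.

```python
def is_hamiltonian_circuit(vertices, edges, order):
    """Check in polynomial time whether `order` is a Hamiltonian circuit."""
    edge_set = {frozenset(e) for e in edges}
    # Every vertex must appear exactly once in the proposed circuit.
    if len(order) != len(vertices) or set(order) != set(vertices):
        return False
    # Consecutive vertices (wrapping around) must be joined by an edge.
    return all(
        frozenset((order[k], order[(k + 1) % len(order)])) in edge_set
        for k in range(len(order))
    )


square = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_hamiltonian_circuit([1, 2, 3, 4], square, [1, 2, 3, 4]))
print(is_hamiltonian_circuit([1, 2, 3, 4], square, [1, 3, 2, 4]))
```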

  8. Some approaches for NP-complete Problems • Special-case method: Work on the problem with a restricted class of inputs • Exhaustive search: Design an exponential-time algorithm that may perform well in practice • Approximation algorithms: Design a polynomial-time algorithm that is guaranteed to find near-optimal solutions (with a good approximation ratio) • Heuristics: Fast algorithms that produce satisfactory solutions most of the time but without guarantee

  9. Preliminaries in Molecular Biology

  10. DNA and Base Pairs • Double helix consisting of two complementary strands • Has four types of nucleotides: Adenine, Thymine, Guanine, Cytosine • Base Pairs: A↔T, C↔G • The two ends of a strand are marked 3’ and 5’ • The entire DNA of a living organism is called its genome
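
The base-pairing rule above can be sketched as a small Python function. Because the two strands of the helix run in opposite directions, the partner of a strand read 5’ to 3’ is its reverse complement when also read 5’ to 3’. The names `PAIR` and `reverse_complement` are illustrative.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the partner strand, read 5' to 3', of a strand read 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(strand))


print(reverse_complement("ATGC"))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.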

  11. DNA Sequences

  12. DNA Replication • The two strands are separated • Each parental strand serves as a template for a new complementary strand

  13. Cell, Chromosome, and DNA

  14. Cell Classification

  15. Chromosomes • Each consists of a DNA molecule associated with proteins: some fold and pack the DNA thread into a more compact structure, and others are required for gene expression, DNA replication, and DNA repair • The human genome is distributed over 24 distinct chromosomes (22 autosomes plus X and Y) • Each somatic cell contains 46 chromosomes • 22 pairs common to both males and females • 2 sex chromosomes: X and Y in males, two Xs in females

  16. Genes • Segments of DNA • Functional and physical unit of heredity passed from parent to offspring • Contain the information for making a specific protein

  17. Proteins • Short strings over the 20-letter amino acid alphabet • Human genome: about 100,000 proteins, each a few hundred amino acids long • Bacteria make 500-1500 proteins • Made by genes (fragments of DNA) that are roughly three times longer than the corresponding proteins • Why? Every 3 nucleotides in the DNA alphabet code for one letter in the protein alphabet of amino acids

  18. Central Dogma of Molecular Biology

  19. Transcription

  20. Translation • Translation • mRNA (after being exported from the nucleus to the cytosol) directs the synthesis of the protein by joining amino acids in the order encoded by the mRNA • Genetic code • Defines a mapping from codons to amino acids • Codon • Triplet of nucleotides that specifies a single amino acid in the corresponding protein • 64 codons and 20 amino acids • Translation is carried out by ribosomes
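
The codon-to-amino-acid mapping can be sketched as a lookup table. The example below builds the standard genetic code from its usual compact encoding (bases ordered T, C, A, G; one-letter amino acid codes; `*` for a stop codon) and translates a sequence codon by codon. It is a simplification under stated assumptions: the reading frame starts at position 0, translation stops at the first stop codon, and the names `CODON_TABLE` and `translate` are ours.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an mRNA (or coding-strand DNA) sequence into a protein string."""
    seq = seq.upper().replace("U", "T")   # accept RNA or DNA letters
    protein = []
    for i in range(0, len(seq) - 2, 3):   # step through complete codons
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":             # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE), len(set(CODON_TABLE.values()) - {"*"}))
print(translate("AUGGCCUAA"))
```

The first printed line shows the 64 codons and 20 amino acids mentioned on the slide; the many-to-one mapping is why several codons can encode the same amino acid.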

  21. Polymerase Chain Reaction (PCR) • Primer • Nucleic acid strand • Serves as a starting point of DNA replication

  22. Plasmid Vector • Vector • An agent that can carry a DNA fragment into a host cell • Plasmid • Circular and double-stranded DNA • Antibiotic resistance • Autonomous replication • Exists in bacteria

  23. DNA Cloning Using Plasmids as Vectors • (a) DNA recombination • (b) Transformation

  24. DNA Cloning Using Plasmids as Vectors (Cont) • (c) Selective amplification • (d) Isolation of desired DNA clones

  25. DNA Library Screening • Probe: • Labeled with radioisotope or fluorescence • Used to detect specific DNA sequences by hybridization • Hybridization: • Binding of two nucleic acid chains by base pairing • DNA Library Screening • To determine, for each clone, whether it contains a probe from a given set of probes • Positive clone: a clone that contains a probe

  26. Some Computational Problems • Pooling Design • Non-unique probe selection • Sequence Alignment, Multiple Sequence Alignment • DNA sequencing • Genome Rearrangement • Protein Structure Prediction and Recognition • Protein-Protein Interactions • Functional Groups, Modules

  27. Pooling Designs • Problem Definition • Given a set of n clones with at most d positive clones • Identify all positive clones with the minimum number of tests • Pool: a subset of clones • Positive pool: a pool that contains at least one positive clone

  28. Pooling Designs • M is a t × n binary matrix whose rows are the pools p1, …, pt and whose columns are the clones c1, …, cn • M[i, j] = 1 iff the ith pool contains the jth clone • Testing: applying the t pools to the clones yields a t × 1 outcome vector V(D) • Decoding Algorithm: Given M and V(D), identify all positive clones

  29. Challenges • Challenge 1: How to construct the binary matrix M such that the test outcomes (unions of columns) of any two distinct sets of at most d columns are distinct • Challenge 2: How to design a decoding algorithm with efficient time complexity [O(tn)]
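
One simple way to meet the O(tn) decoding target is to eliminate every clone that appears in a negative pool and report whatever remains. This is only a sketch: it returns exactly the positive clones when the matrix is suitably designed (for example, d-disjunct), which is an assumption we add here rather than something stated on the slides. The function names and the 4 × 6 example matrix are hypothetical.

```python
def run_tests(M, positives):
    """Simulate testing: pool i is positive iff it contains a positive clone."""
    return [int(any(row[j] for j in positives)) for row in M]


def decode(M, outcomes):
    """Return the clones that appear in no negative pool, in O(t * n) time."""
    n = len(M[0])
    candidate = [True] * n
    for row, outcome in zip(M, outcomes):
        if outcome == 0:                  # negative pool: clear all its clones
            for j, bit in enumerate(row):
                if bit:
                    candidate[j] = False
    return [j for j in range(n) if candidate[j]]


# 4 pools, 6 clones; each clone sits in a distinct pair of pools.
M = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
outcomes = run_tests(M, {3})
print(outcomes, decode(M, outcomes))
```

In this toy design, 4 tests identify a single positive among 6 clones, compared with 6 tests when each clone is tested individually.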

  30. Probe Selection • Problem Definition: • Given a biological sample (e.g., blood) and a set of probes • Identify the presence (or absence) of some biological objects (e.g., viruses or bacteria) with the minimum number of probes

  31. Unique Probes vs. Non-unique Probes • Unique probes • Gene-specific probes or signature probes • Difficult to find • Non-unique probes • Hybridize to more than one target • Difficult to decode the results

  32. Probe-Target Matrix • 12 probe candidates • 4 targets (genes) • For a target set S, define P(S) as the set of probes reacting to any gene in S • P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12} • P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12} • Symmetric set difference: P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}; these are the probes that separate the two target sets
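
The symmetric difference on this slide can be checked directly with Python sets. The two probe sets are copied from the slide; the function name `separating_probes` is our own, and the underlying probe-target matrix is not reproduced because it does not survive in the transcript.

```python
def separating_probes(probes_s, probes_t):
    """Probes that react to exactly one of the two target sets."""
    return probes_s ^ probes_t   # set symmetric difference


P_12 = {1, 2, 3, 4, 7, 8, 9, 10, 12}   # P({1, 2}) from the slide
P_23 = {1, 3, 4, 5, 6, 7, 8, 9, 12}    # P({2, 3}) from the slide
print(sorted(separating_probes(P_12, P_23)))
```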

  33. Sequence Alignment • Problem Definition: • Given: 2 DNA or protein sequences • Find: Best match between them • What is an Alignment: • Given: 2 strings S and S’ • Goal: Make S and S’ the same length by inserting spaces into these strings

  34. Matches, Mismatches and Indels • Match: two aligned, identical characters in an alignment • Mismatch: two aligned, unequal characters • Indel: a character aligned with a space • Example alignment (10 matches, 2 mismatches, 7 indels):
      A A C T A C T - C C T A A C A C T - -
      - - C T C C T A C C T - - T A C T T T
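
The counts for the example alignment on this slide can be reproduced with a short sketch. The two rows are copied from the slide with each gap written as a single `-`; the function name `alignment_stats` is illustrative.

```python
def alignment_stats(row1, row2, gap="-"):
    """Count (matches, mismatches, indels) in a gapped pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            indels += 1        # a character aligned with a space
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


top = "AACTACT-CCTAACACT--"
bottom = "--CTCCTACCT--TACTTT"
print(alignment_stats(top, bottom))
```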

  35. Basic Algorithmic Problem • Find the alignment of the two strings that maximizes m, where m = (# matches – # mismatches – # indels) • The maximum m defines the similarity of the two strings; an alignment that achieves it is called an optimal global alignment • Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
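
A standard way to maximize this score is dynamic programming over prefixes of the two strings, in the style of the Needleman-Wunsch algorithm. The slides do not say which algorithm is intended, so the sketch below is one reasonable instance with the slide's scoring (+1 per match, -1 per mismatch, -1 per indel); the function name is hypothetical.

```python
def global_alignment_score(s, t):
    """Return the maximum of (#matches - #mismatches - #indels)."""
    # prev[j] = best score for aligning the current prefix of s with t[:j].
    prev = [-j for j in range(len(t) + 1)]      # empty prefix of s: j indels
    for i in range(1, len(s) + 1):
        curr = [-i]                             # empty prefix of t: i indels
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (1 if s[i - 1] == t[j - 1] else -1)
            curr.append(max(diagonal,           # align s[i] with t[j]
                            prev[j] - 1,        # s[i] aligned with a space
                            curr[j - 1] - 1))   # t[j] aligned with a space
        prev = curr
    return prev[-1]


print(global_alignment_score("ACT", "AT"))
```

The table has (|s| + 1)(|t| + 1) cells, so the running time is O(|s||t|); keeping only two rows reduces the space to O(|t|) at the cost of not recovering the alignment itself.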

  36. Multiple Sequence Alignment • Problem Definition: • Similar to the sequence alignment problem, but the input has more than 2 strings • Challenges: • The problem is NP-hard • Approximation ratio: 2 – 2/k, where k is the number of input sequences • More work is needed to reduce the time and space complexity

  37. DNA Sequencing • Problem Definition: • Given a set of fragments that are contained in a DNA string S • Goal: Determine the string S • NP-complete • Further complicated by the existence of repetitive sequences in the genome • Can be cast as a Hamiltonian path or Euler path problem (introduced by Pavel Pevzner)
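
The Euler path formulation can be sketched on idealized data. In the example below each fragment is a k-mer, each k-mer becomes an edge from its (k-1)-letter prefix to its (k-1)-letter suffix, and an Euler path through these edges spells out S. The sketch assumes error-free fragments, every k-mer of S present exactly once, and that an Euler path exists; real sequencing data violates all three, and the function name `reconstruct` is hypothetical.

```python
from collections import defaultdict


def reconstruct(kmers):
    """Spell a string from its k-mers via an Euler path (Hierholzer's method)."""
    graph = defaultdict(list)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]          # edge: prefix -> suffix
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # An Euler path starts at the node with one more outgoing than incoming edge.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())    # follow an unused edge
        else:
            path.append(stack.pop())        # dead end: emit the node
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


print(reconstruct(["GGC", "ATG", "CGT", "TGG", "GCG"]))
```

Repeated (k-1)-mers give a node several outgoing edges, so more than one Euler path may exist; this is one way the repetitive sequences mentioned above make reconstruction ambiguous.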

  38. Genome Rearrangement • Problem Definition: • Given the genomes of 2 different species • Goal: Find a sequence of evolutionary events that turns the first genome into the second • Biological motivation: how closely related are these species, and how much evolution separates them? • E.g.: We usually test new drugs on mice before humans. However, how close is a mouse to a human?

  39. Genome Rearrangement • Can we use the solutions of sequence alignment to solve this problem? • Answer: No, because: • A genome is a very long string (about 3 billion letters for the human genome) • The sequence alignment model is not appropriate for genome comparison, since the differences are not insertions, deletions, or mutations of single nucleotides but rearrangements of long DNA regions • The basic unit of comparison is the gene

  40. An Example • Comparing these two strings by sequence alignment does not work • However, the second string is the first string after reversing the fragment AATGGT…CCC

  41. Main Evolutionary Events • Deletions: A fragment is removed • Duplications: Many copies of a fragment are created and inserted at different positions • Transpositions: A fragment is removed and re-inserted at a different position • Inversions: A fragment is removed, reversed, and then reinserted at the same position • Translocations: A pair of fragments is exchanged between the ends of two chromosomes
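
Three of these events can be sketched as operations on a single string. The example treats a genome as a plain Python string and a fragment as the half-open slice `s[i:j]`; both are simplifying assumptions for illustration, and the function names are our own. Each letter can be read as a gene rather than a nucleotide.

```python
def delete(s, i, j):
    """Deletion: remove the fragment s[i:j]."""
    return s[:i] + s[j:]


def invert(s, i, j):
    """Inversion: reverse the fragment s[i:j] in place."""
    return s[:i] + s[i:j][::-1] + s[j:]


def transpose(s, i, j, k):
    """Transposition: remove s[i:j] and re-insert it at position k of the rest."""
    fragment, rest = s[i:j], s[:i] + s[j:]
    return rest[:k] + fragment + rest[k:]


genome = "ABCDEF"
print(delete(genome, 1, 3))
print(invert(genome, 1, 4))
print(transpose(genome, 1, 3, 3))
```

Applying the same inversion twice restores the original string, so a single inversion is its own undo step when searching for a sequence of events between two genomes.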

  43. Protein Structure Prediction • Problem Definition: • Given: A sequence of amino acids • Goal: Predict the 3D structure of the protein • Some approaches: • Determine the position of a protein’s atoms so as to minimize the total free energy • Find the similarities to some known proteins

  44. Community Structure • Problem Definition: • Given a graph G = (V, E) representing a network • Partition G into a set of subgraphs (the community structure) so that nodes in each subgraph are highly connected • Biological reason: Genes with similar expression data may have similar functions. Identifying the community structure can help us reduce the number of tests • Others: Community structure is also studied in other fields
