A Course on Bioinformatics

A Course on Bioinformatics Ming Li Bioinformatics Lab Computer Science Dept., UCSB

Chapter 0. Why Study Bioinformatics? • The trend of genetic data growth • Many experts predict “21st century will be a century of biotechnology”. • Genomes: Human, Rice, Mouse, Yeast, 50 species bacteria, biggest digital gold mine ever. 30 billion in year 2005

Protein DNA (Genotype) Chapter 1. Biology Phenotype

Animal CELL Mitochondrion Cytoplasm Nucleolus (rRNA synthesized) Nucleus Plasma membrance Cell coat Chromatin Lots of other stuff/organelles/ribosome

Two kinds of Cells • Prokaryotes – no nucleus (bacteria) • Their genomes are circular • Eukaryotes – have nucleus (animal,plants) • Linear genomes with multiple chromosomes in pairs. When pairing up, they look like Middle: centromere Top: p-arm Bottom: q-arm

Mitosis and Meiosis • Mitosis: homologous chromosomes pair up, duplicate, one set to each cell. • Meiosis: chromosomes split – to haploid (reproductive) cells. • Haploid Number of Chromosomes: • Human: 23 • Rice: 12 • Fruit Fly: 4 • Corn: 10 • Chimpanzee: 24 • House mouse: 20

The DNA Sequences • All are made of H (hydrogen), C (carbon), N (nitrogen), O (oxygen), S (sulfur), P (phosphorus). • A (single chain) DNA sequence looks like: 5’ o o o CH Sugar --- base A,C,G,T,U 2 c c Phosphate (PO ) c c 4 Sugar --- base A nucleotide o o 3’ phosphate Ribose –RNA Deleting circled O,  Deoxyribose --DNA Sugar --- base etc

The 4 bases Thymine Adenine H H H C A-T o N H H H N C C C c C N H N C H N C N C N C o H H Note: this is flat! H o H N H N C C Uracil replaces T in RNA C c C C N H N H N C N C G-C N C o N H Cytosine Guanine H

Human: 3 billion bases, 30k genes. E. coli: 5 million bases, 4k genes T A A T C G cDNA T A reverse transcription A T translation transcription G C Protein mRNA C G (20 amino acids) (A,C,G,U) G C Codon: three nucleotides encode an amino acid. 64 codons 20 amino acids, some w/more codes A T C G T A A T

Genes and Proteins • One gene encodes one* protein. • Like a program, it starts with start codon (e.g. ATG), then each three code one amino acid. Then a stop codon (e.g. TGA) signifies end of the gene. • Sometimes, in the middle of a (eukaryotic) gene, there are introns that are spliced out (as junk) during transcription. Good parts are called exons. This is the task of gene finding.

Glycine (GLY) GG* Alanine(ALA) GC* Valine (VAL) GT* Leucine (LEU) CT* Isoleucine (ILE) AT(*-G) Serine (SER) AGT, AGC Threonine (THR) AC* Aspartic Acid (ASP) GAT,GAC Glutamic Acid(GLU) GAA,GAG Lysine (LYS) AAA, AAG Start: ATG, CTG, GTG Arginine (ARG) CG* Asparagine (ASN) AAT, AAC Glutamine (GLN) CAA, CAG Cysteine (CYS) TGT, TGC Methionine (MET) ATG Phenylalanine (PHE) TTT,TTC Tyrosine (TYR) TAT, TAC Tryptophan (TRP) TGG Histidine (HIS) CAT, CAC Proline (PRO) CC* Stop TGA, TAA, TAG A.A. Coding Table

Bad Genes --> Genetic Diseases • Hemophilia: on X chromosome. • Cystic firosis: on chr. 7, CFTR gene. 4% Caucasians carry the defective gene. (recessive) • Sickle-Cell Anemia: single nucleotide mutation in the first exon of beta-globin gene (removes a cutting site). 1 in 12 African Americans are carriers. (sick for homozygotes) • BRCA1 gene (chr. 17q) – responsible for ½ inherited breast cancer (10% of breast cancer) • Fragile X syndrome (mentally retard) – 1 in 1250 males, 2500 females (dominate, but females have partially expressed good gene). FMR-1 gene: tri-nucleotide repeats >200 causes disease. • P53 gene: chr. 17p, responsible for ½ of all cancers • X-rated: XX, XY, XO(f), XXY(m), XYY(m)

Where to get data? • Larry Miller: GenBank • http://www.ncbi.nlm.nih.gov • Dr. Huandong Sun: • SWISS-PROT: http://www.expasy.ch/sprot • PDB: http://www.pdb.bnl.gov/ • Go to our Bioinformatics Lab website: www.cs.ucsb.edu/~mli/Bioinf/resources/index.html for all the information.

Chapter 2. Research and Industry • Bioinformatics Labs and education programs established in almost all major universities, scattered in Biology, Physics, Computer Science • Research topics: • Gene-finding • Mapping • Sequencing/sequence assembly. • Sequence Comparison: alignment, database search • Mining DNA/proteins, finding motifs • Genome arrangement • DNA arrays • Computational proteomics, protein folding • Phylogeny reconstruction and analysis

The Commercial Market • Current bioinformatics market is worth 300 million / year. Half data, half software. • $2 billion / year bioinformatics market in 5 years, predicted by Oscar Gruss & Son. • Wide range of different demands • Bioinformatics software require sophisticated algorithm support, ranging from probabilistic/ approximation algorithms, data mining, learning algorithms, databases, to GUI design

Celera + many Lion BioScience Netgenics DoubleTwist, eBioinformatics BUSINESS Landscape Universities & Research Labs Pharmaceutical Companies Hospitals, Biotech firms Sell Data & Service Sell large Systems Sell Web Service

Bioinformatics Companies • Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, • Total: ~50.

Challenges to Computer Science • Enormous size of genomics data suitable for asymptotic analysis • Many NP-hard problems defy simple minded/non-professional approaches: multiple alignment, distant homology, motif finding, protein folding, phylogeny, gene relationship in expression data, mining and learning, … • Approaching these problems from computer science perspective will be fruitful.

Algorithmic Techniques in CB • Dynamic programming (alignment, etc) • Divide and conquer (Protein folding) • Approximation algorithms (sequence assembly, phylogeny) • Greedy algorithms (sequence assembly) • Heuristics • Linear programming and relaxation

Some CS jargons: • NP-Complete: easy to guess, hard to find. • Approximation algorithm: If the minimal solution is f, your solution is g>f, then the approximation ratio is g/f. • PTAS (Polynomial time approximation scheme): this is best kind of approximation algorithms: for any e, we can achieve (1+e)optimal in polynomial time (exponent might depend on e).

Chapter 3. DNA Sequencing • Two ways to copy DNA: • 1. Polymerase Chain Reaction (1986): make many copies of a fragment (~500). Needs primers (end segments). Cleave into 2. Anneal primers (5’- 3’, and 3’- 5’). Make two double chains. Repeat: 1,2,4,8,16, … 3’ … 5’ 5’ 3’ 5’ 3’ … 5’ 3’

2. Cloning. Insert a large piece of DNA into a cloning vector (virus, bacteria, yeast -YAC). Then the vector is inserted into a host cell to duplicate naturally. • DNA Sequencing: • Make many copies (single strand) • Cut them into fragments of lengths ~500. • For each fragment of length L, use some process like PCR, generating all lengths 1 ~ L with some fluorescence dye. In old scheme, you generate all fragments end with A, then with C, then G, then T, run them in 4 lanes of electrophoresis gel. In the new scheme, you have 4 colors (of the dye) all fragments in 1 lane. • Then assemble all fragments into the shortest common superstring by GREEDY: repeatedly merge the pair with max overlap until finish.

Shortest Common Superstring • In FOCS 1990, we started formalize and analyze the following learning problem: Infer orginal DNA sequence from fragments. Or: given n strings, find the shortest common superstring. (1994, J. ACM) • The problem was proved to be NPC, 1980. • Open for 10 years: does GREEDY work? (I.e. does it give linear approximation?) • We solved this, proving 3, STOC’91. • Improvements by many people to: 2.89, 2.81, 2.79, 2.75, 2.66, … • Formal Statement: Given S={s1, … sn}, find a shortest s such that for all i, si is a substring of s. • E.g. alf ate half lethal alpha alfalfa  lethalphalfalfate

Theorem. GREEDY achieves 4. Proof. Given S={s1, … ,sm}, construct G: • Nodes are s1, … ,sm • Edges: if then add edge: where pref is the pref length. I.e. |si|=pref+overlap length with sj • |SCS(S)| = length shortest Hamilton cycle in G • (Modified) Greedy restated: find all cycles with minimum weights in G, then open cycle, concatenate to obtain the final superstring. sj pref si pref si sj

This minimum cycle exists C si • Assuming initial Hamilton cycle has w(C) = n • Then merging si with sj is equivalent to breaking into two cycles. We have: w(C1)+ w(C2) <= n • Proof: We merged (si, sj) because they have max overlap. Picture shows: d(si,sj)+d(s’’,s’)<d(si,s’)+d(s’’,sj) • Continue this process end with self-cycles: C1, C2, C3, C4, … Sw(Ci) <= n. sj s’ s’’ … S” sj si C1 S’ si sj s’ s’’ C2

Then we open cycles Let Wi=w(Ci) Li =| longest string in Ci | |open Ci|<= Wi + Li n >= SWi Lemma. S1 and S2 overlap <= w1+w2 S(Li-2Wi) <= n, by lemma, since Li’s must be in final SCS. |S|<S(Li+Wi) =S(Li-2Wi)+S3Wi <= n + 3n =4n. QED w1 w1 w1 w1 s1 w2 w2 w2 s2

Longest Common Subsequence (LCS). V=v1v2 … vn W=w1w2 … wm s(i,j) = length of LCS of V[1..i] and W[1..j] Dynamic Programming: s(i-1,j) s(i,j)=max s(i,j-1) s(i-1,j-1)+1,vi=wj Sequence Alignment s(i,j)=max s(i-1,j) + d(vi,-) s(i,j-1)+d(-,wj) s(i-1,j-1)+d(vi,wj) * No gap penalties, where d, for proteins, is either PAM or BLOSUM. d(-,x)=d(x,-)= -a, d(x,y)=-u. When a=0, u=infinity, it is LCS. Chapter 4. Sequence Comparison

Misc. Concerns • Local sequence alignment, add s[i,j]=0. • Gap penalties. For good reasons, we charge first gap cost a, and then subsequent continuous insertions b<a. • Space efficient sequence alignment. Hirschberg alg. in n2 time, O(n) space. • Multiple alignment of k sequences: nk

BLAST / Psi-BLAST / Gap-BLAST • Popular software, using heuristics. By Altschul, Gish, Miller, Myers, Lipman, 1990. • E(epected)-value: e= dmne -lS, here S is score, m is database length and n is query length. • Meaning: e is number of hits one can “expect” to see just by chance when searching a database of size m. • Basic Strategy: For all 3 a.a. (and closely related) triples, remember their locations in database. Given a query sequence S. For all triples in S, find their location in database, then extend as long as e-value significant. Similar strategy for DNA (7-11 nucleotides).

Here is an example of a BLAST match (E-value 0) between gene 0189 in C. pneumoniae and gene 131 in C. trachomatis. Query: CPn0189 Score (bits) E-value Aligned with CT131 hypothetical protein 1240 0.0 Query: 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTESRKYLVFSL I HKESGLSCSAEELKISW 60 MKR W KI G L + L L LP+ S+ES KYL S+++KE+GL E+L +SW Sbjct: 1 MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQF EIEQLHLSW 60 Query: 61 FGRQTARKIKLTG-EAKDEVFSAEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119 FG QTA+KI++ G ++ E+F+AEK + GSL RLL+Y+ PK + TL+GWSL+I+E S++ Sbjct: 61 FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note: Because of powerpoint character alignment, I inserted some white space in the alignment that are not in the BLAST output – I will explain in class. These white spaces were inserts so that things line up right (the k-th letter goes with the k-th letter).

Chap. 4’. Tools From CS • Suffix Array. Give sequence: she_sells_seashell_on_the_sea_shore Suffix array is an array of pointers pointing to suffices of the sequence … sorted. E.g., all suffices of above are: e, re, ore, hore, shore, _shore, a_shore, ea_shore, sea_shore, and so on. Then these are sorted and their pointers (pointing to a suffix location in sentence) are stored in an array (the suffix array).

Why Suffix Array? • It can be built very fast. • It can answer queries very fast: • How many times ATG appears (their pointers are all jammed together). • What is G-C contents. • Disadvantages: • Can’t do approximate matching • Hard to insert new stuff (need to rebuild the array) dynamically. • Pointers can cost too much space. 3G pointers?

Fast Pattern Matching • Given T (text) and P (pattern), is P in T? • Slow algorithm: for each position in T, check if P is in T. This costs O(|P||T|). • Fast algorithm: no back tracking. P is short, we can calculate a table for P first: if we have matched P=P1 … Pk with T halfway fail to match at Pi with Tj, since P1 … Pi-1 already matched the text, we should know where to start in P and continue to match at Tj+1 … The table will indicate for each position Pi if we fail at i, where to start next. O(|P|2 + |T|). • Knuth-Morris-Pratt, Boyer-Moore: O(|P|+|T|).

Chapter 5: Multiple alignment • To do phylogenetic analysis: • Same protein from different species • Optimal multiple alignment probably implies history • Discover irregularities, such as Systic Fibrosis gene • To find conserved regions • Local multiple alignment reveals conserved regions • Conserved regions usually are key functional regions • These regions are prime targets for drug developments

Definitions … • Given sequences s1, … sn, multiple alignment M puts them in n rows, one sequence per row, with spaces inserted to get supersequences S1, …, Sn, |Si|=L. • SP-Alignment: Minimize sum of Hamming(Si,Sj) over all pairs i, j. • Star-Alignment: find center sequence S, minimize sum of Hamming(S,Si) over all i. • We will concentrate on SP-alignment.

CLUSTAL-W • Standard popular software • It does multiple alignment as follows: • Align 2 • Repeat: keep on adding a new sequence to the alignment until no more, or do tree-like heuristics. • Problem: It is simply a heuristics. • Alternative: dynamic programming nk for k sequences. This is simply too slow. • We need to understand the problem and solve it right.

Making the Problem Simpler! • Multiple alignment is very hard • For k sequences, nk time, by dynamic programming • NP hard in general, not clear how to approximate • Popular practice -- alignment within a band: the p-th letter in one sequence is not more than c places away from the p-th letter in another sequence in the final alignment – the alignment is along a diagonal bandwidth 2c. • Used in final stage of FASTA program.

Literature (for details, see our STOC’00 paper) • NP hardness under various models: Wang-Jiang (JCB), Li-Ma-Wang (STOC99), W. Just • Approximation results: Gusfield (2- 1/L), Bafna, Lawler, Pevzner (CPM94, 2-k/L), star alignment. • Sankoff, Kruskal discussed “within a band” • Pearson showed alignment within a band gives very good results for a lot protein superfamilies. • Altschul and Lipman, Chao-Pearson-Miller, Fickett, Ukkonen, Spouge (survey) all have studied alignment within a band.

SP-Alignment NP hard PTAS for constant band PTAS for constant number of insertion/deletion gaps per sequence on average (for coding regions, this assumption makes a lot of sense) Star-Alignment PTAS in constant band PTAS for constant number of insertion/deletion gaps per sequence on average The following were proved:

We will do only SP-alignment • Notation: in an alignment, a block of inserted “---” is called a gap. If a multiple alignment has c gaps on the average for each sequence, we call it average c-gap alignment. • We first design a PTAS for the average c-gap SP alignment. • Then using the PTAS for the average c-gap SP alignment, we design a PTAS for SP-alignment within a band.

Average c-gap SP Alignment • Key Idea: choose r representative sequences, we find their “correct” alignment in the optimal alignment, by exhaustive search. Then we use this alignment as reference. • Then we align every other sequence against this alignment. • Then choose the best. • All we have to show is that there are r sequences whose letter frequencies in each column of their alignment approximates the complete alignment.

Some over-simplified reasoning • If M is optimal average c-gap SP alignment • In this alignment, many sequences have less than cl gaps. • So if we take r of these sequences, and try every possibly way, one way coincides with M. • Then hopefully, its letter frequencies in each column “more or less” approximates that of M’s • Then we can simply optimally align all the rest of the sequences one by one according to this frequency matrix.

Sampling r sequences Complete Alignment Alignment with r sequences j j We also expect this column has ~ k percent a’s If this column has k percent a’s

AverageSPAlign for L=m to nm { for any r sequences { for all possible alignment M’ of length L and with no more than cl gaps { align all other sequences to M’ //one alignment }}} Output the best alignment.

Basic Idea Dynamically cut seq-uences into segments Each segment satisfies the average c-gap condition. Hence use previous algorithm Then assemble the segments together Divide-Conquer. SP Alignment within c-Band Cutting these sequences into 6 segments, each segment has c-gaps per sequence on average in optimal alignment.

The final Algorithm: diagonalAlign while (not finished) { find a maximum prefix for each sequence (same length) such that AverageSPAlign returns “low”cost. Keep the multiple alignment for this segment } Concatenate the multiple alignments for all segments together to as final alignment.

Discussion • Current PTAS is extremely slow. • However, the design of PTAS might provide useful hints to design fast programs. • It is an interesting project to implement this PTAS (combined with some heuristics) and test this idea.

Chapter 6. Motif Finding • Finding motifs/conserved regions in proteins is important in drug design and proteomics. • The problem is also called local alignment. • Many programs exist -- all based on heuristics • We proved in STOC’99: it is NP-hard. • We provided a polynomial time approximation scheme with guaranteed performance in polynomial time in STOC’99. • Based on this theoretical result, we have implemented the COPIA system.

Given k protein sequences, find a “conserved” region: L K sequences Red regions are conserved regions, or, motifs. The don’t have to be exactly same, they match with higher scores than other regions.

The PTAS (Li, Ma, Wang, STOC’99): • Input: S1, … , Sm, integer L. • Output: t1, … ,tm, |ti|=L (motifs) • For every r length L substring, compute the consensuse t of them. • In all strings, find closest substring to t of length L. From these, find the new consensus s. • Choose the best. Theorem: This algorithm outputs a solution no worse than 1+1/sqrt(r) optimal. Time complexity is (n+m)O(r).

A Course on Bioinformatics

A Course on Bioinformatics

Presentation Transcript

Introductory course in Bioinformatics

Structure Bioinformatics Course – Basel 2004

Interdisciplinary Introductory Course in Bioinformatics

A core Course on Modeling

A core Course on Modeling

A core Course on Modeling

Practically Genomic A hands-on bioinformatics IAP

BIO337 Systems Biology/Bioinformatics (course # 50524)

MNW2 course Introduction to Bioinformatics

Bioinformatics on Cloud Cyberinfrastructure

Bioinformatics Course Day 4

Bioinformatics PhD. Course

Questions on Teaching Bioinformatics

BIOINFORMATICS ON NETWORKS

Hands-on session on Bioinformatics

Incorporating Bioinformatics in an Algorithms Course

Bioinformatics On Genomics

Bioinformatics on Proteomics

Bioinformatics and Comparative Genome Analyses Course

EMBO Practical Course on Metabolomics Bioinformatics for Life Scientists

Course Sequence Analysis for Bioinformatics Master’s

Bioinformatics Market - Global Report on Bioinformatics 2020

Sea Ice

Sea Ice