Bioinformatics

Bioinformatics CSC 391/691; PHY 392; BICM 715

Importance of bioinformatics • A more global perspective in experimental design • The ability to capitalize on the emerging technology of database mining: the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better-characterized organisms.

Amino acids: chemical composition or digital symbols for proteins http://wbiomed.curtin.edu.au/teach/biochem/tutorials/AAs/AA.html Link found on the Research Collaboratory for Structural Biology web site: www.rcsb.org/pdb/education.html See also Table 2.2 (Mount)

Nucleotides: chemical composition or digital symbols for nucleic acids http://ndbserver.rutgers.edu/NDB/archives/NAintro/ http://www.web-books.com/MoBio/Free/Ch3A.htm Link found on the Research Collaboratory for Structural Biology web site: www.rcsb.org/pdb/education.html See also Table 2.1 (Mount)

The Genetic Code: how DNA nucleotides encode protein amino acids http://www.accessexcellence.org/AB/GG/genetic.html
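The mapping from codons to amino acids can be sketched in a few lines of Python. This is a minimal illustration assuming the standard genetic code and a coding-strand DNA sequence read from its first base; the function name `translate` and the example sequence are made up for the illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding strand codon by codon, starting at position 0."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical 12-base sequence: ATG GCC TGG TAA
print(translate("ATGGCCTGGTAA"))
```

Because 64 codons map onto 20 amino acids plus stop, several codons encode the same residue, which is one reason protein-level comparisons can detect similarity that DNA-level comparisons miss.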

Biologists think it's a lot of data, but maybe it's really not He made fun of biologists for complaining that the human genome, which takes up about 3 gigabytes, is "a lot of data". He offered the comparison of the DVD movie "Evita", which is about 12 gigabytes, with the genome of Madonna (3 gigabytes). "The movie contains four times more information than Madonna's genome. And Madonna shares 99% of her DNA with a chimp... And 90% with Craig Venter's dog." More proof that the genome is not a lot of data: About 90-something percent of genetic information is common to all humans. "The unique part of you will fit on a floppy disk." Nathan Myhrvold, former Chief Technology Officer for Microsoft, Keynote Speech at NIH Digital Biology Meeting 2003

Review of Lab 1 • What did you learn about the sites you visited: SGD, SwissProt, EntrezRefSeq, EntrezNeighbor, EntrezProtein, PIR-US • Can you define the term protein function? • Does the term gene function have any meaning? • Questions?

Biologists think it’s a lot of data, and maybe it really is • The genome is not a static, one-time picture • Genome changes over time—mutations and other changes • Genes expressed to make proteins • Set of genes that are expressed changes with cell type • Set of genes that are expressed changes over time and state

Definition of a Biological Database A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.

Sources of sequence data • GenBank at the National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD (nucleotides and proteins) http://www.ncbi.nlm.nih.gov/Entrez • European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England http://www.ebi.ac.uk/embl/index.html • DNA DataBank of Japan (DDBJ) at Mishima, Japan http://www.ddbj.nig.ac.jp/ • Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC (see Barker et al. 1998) http://www-nbrf.georgetown.edu/pirwww/ • The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne http://www.expasy.ch/cgi-bin/sprot-search-de • The Sequence Retrieval System (SRS) at the European Bioinformatics Institute allows both simple and complex concurrent searches of one or more sequence databases. The SRS system may also be used on a local machine to assist in the preparation of local sequence databases. http://srs6.ebi.ac.uk Table 2.5, Mount

Sources of protein structure data • RCSB Protein Data Bank (PDB): www.rcsb.org • BioMagResBank: http://www.bmrb.wisc.edu/ • MMDB: http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

Review of Lab 2 • What did you learn about the RCSB web page? • What are your thoughts about the PDB file format? • Was RasMol easy or hard to use? Is there anything you tried to do, but couldn’t figure out how? • What is the difference between the two glutaredoxin structures (1aaz and 1die)? • MMDB: database of protein structures, ASN.1 format (http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml) • Other questions?
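As one way to think about the PDB file format question, the sketch below slices a single fixed-width ATOM record into fields. The column positions follow the published PDB format for ATOM records; the `Atom` type, the function name, and the sample record (built here with explicit widths) are hypothetical and are not taken from 1aaz or 1die.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice one fixed-width ATOM record (the format counts columns from 1)."""
    return Atom(
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain=line[21],                # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue number
        x=float(line[30:38]),          # columns 31-38: x coordinate
        y=float(line[38:46]),          # columns 39-46: y coordinate
        z=float(line[46:54]),          # columns 47-54: z coordinate
    )


# Hypothetical record, assembled field by field so the columns line up.
sample = (
    "ATOM  " + f"{1:>5}" + " " + " CA " + " " + "LYS" + " " + "A"
    + f"{1:>4}" + "    " + f"{11.104:8.3f}{6.134:8.3f}{-6.504:8.3f}"
)
print(parse_atom_line(sample))
```

The format is positional rather than delimited, so splitting on whitespace can fail when adjacent fields touch; slicing by column is the safer habit.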

Levels of protein structure • Primary structure • Secondary structure • (Super secondary structure) • Tertiary structure • Quaternary structure

Databases of protein structure classification • SCOP: Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). J. Mol. Biol. 247, 536-540. scop@mrc-lmb.cam.ac.uk • CATH: Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) Structure Vol 5. No 8. p.1093-1108. http://www.biochem.ucl.ac.uk/bsm/cath/ • Dali: L. Holm and C. Sander (1996) Science 273:595-602. http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html • VAST: S. H. Bryant and C. Hogue. http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

RNA Structure • Primary structure: sequence of GACU nucleotides • Secondary structure: stem-loop structures • Tertiary structure • http://www.rnabase.org/

DNA structure • Primary structure: sequence of GACT nucleotides • Secondary structure: double helix • Higher levels of structure… nucleosome… chromatin… chromosome

An example of pairwise alignment • (A) ./wwwtmp/lalign/.17728.1.seq Glutaredoxin, T4, 1AAZ.pdb - 87 aa • (B) ./wwwtmp/lalign/.17728.2.seq Unknown protein - 93 aa • using matrix file: BL50, gap penalties: -14/-4 • 27.0% identity in 89 aa overlap; score: 101 E(10,000): 0.0014
Glutar KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDD--EKIAELLTKLGR
       ..::   .. ::  : .: ::   :   .:.: .. . .    ::   ::. :  .. .
Unknow EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM
Glutar DTQIGLTMPQVFAPDGSHIGGFDQLREYF
       .:.   ..:..:. ::..::.. :... .
Unknow QTR-PTSFPRIFV-DGQYIGSLKQFKDLY
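The "27.0% identity in 89 aa overlap" figure can be checked directly from the two aligned rows. The sketch below is a minimal illustration that counts identical, non-gap columns; the row strings are copied from the alignment above, with the long dash in the transcript assumed to stand for two gap characters, and the function name is made up.

```python
def percent_identity(row_a: str, row_b: str) -> tuple[int, int, float]:
    """Return (identical columns, total columns, percent identity)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    columns = len(row_a)
    return identical, columns, 100.0 * identical / columns


# Aligned rows from the slide, with the two display blocks joined end to end.
GLUTAR = (
    "KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDD--EKIAELLTKLGR"
    "DTQIGLTMPQVFAPDGSHIGGFDQLREYF"
)
UNKNOWN = (
    "EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM"
    "QTR-PTSFPRIFV-DGQYIGSLKQFKDLY"
)

identical, columns, pct = percent_identity(GLUTAR, UNKNOWN)
print(f"{identical} identical of {columns} columns = {pct:.1f}%")
```

Note that the denominator here is the full overlap, gaps included; other tools divide by the shorter sequence or by ungapped columns, so identity values are only comparable when the convention is stated.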

Pairwise Sequence Alignment • The alignment of two sequences (either protein or nucleic acid) based on some algorithm • What is the “right answer”? • Align (pairwise) the following words: instruction, insurrection, incision • There is NO unique, precise, and universally applicable method of pairwise alignment
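To make the "right answer" question concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach) applied to two of the words. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions; changing them can change which alignment is reported, which is the point of the slide.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("instruction", "insurrection")
print(best)
print(top)
print(bottom)
```

Several different alignments can share the same optimal score; the traceback above simply reports the first one it finds, preferring a diagonal step.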

Global vs Local Alignment Figure 3.1, Mount
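The difference between global and local alignment shows up in two small changes to the dynamic programming recurrence: a local method never lets a cell drop below zero, and it starts the traceback at the highest-scoring cell rather than the corner. The sketch below illustrates the Smith-Waterman idea under assumed scores (match +2, mismatch -1, gap -2); the test strings are invented.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, segment_a, segment_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                       # restart instead of going negative
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("xxxACGTxxx", "yyACGTyy"))
```

A global method would be forced to pair the unrelated flanking letters as well; the local method reports only the shared core.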

Pairwise Sequence Alignment Websites Table 3.1, Mount

What is multiple sequence alignment? • Multiple sequence alignment is the alignment of more than two nucleotide or protein sequences • Compare pairwise sequence alignment with multiple sequence alignment

Issues with multiple sequence alignment • Try creating a multiple sequence alignment of the three words: • Insurrection • Incision • Instruction

Issues with multiple sequence alignment • What's the right answer? • Computational complexity • What is a reasonable method for obtaining a cumulative score? • Placement and scoring of gaps
in-----cision
insurrec-tion
instr-uc-tion

in----cision
insurrection
instr-uction

in----cision
insurrection
ins-truction

inci----sion
insurrection
ins-truction
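One commonly used answer to the cumulative-score question is the sum-of-pairs score: score every pair of rows in each column and add everything up. The sketch below is a toy version with assumed values (match +1, mismatch -1, residue against gap -2, gap against gap 0). The two candidate alignments of the three words are illustrative paddings, and the exact gap widths are an assumption.

```python
from itertools import combinations


def pair_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of characters taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment: list[str]) -> int:
    """Add pair scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


candidate_a = ["in----cision", "insurrection", "instr-uction"]
candidate_b = ["inci----sion", "insurrection", "ins-truction"]
print(sum_of_pairs(candidate_a), sum_of_pairs(candidate_b))
```

On the complexity point: extending exact dynamic programming from two sequences to N sequences of length L means filling on the order of L^N cells, which is why the heuristic methods on later slides exist.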

Pairwise sequence alignment: LALIGN of OVCA2 and DYR_SCHPO (global)
./wwwt MAAQRPLRVLCLAGFRQSERGFREKTGALRKALRGRAELVCLSGPHPVPDPPGPEGARSD
       :.  .::.:::: :. ::   : .: :...: :   :::   .::  . .   :.  .
dihydr MS--KPLKVLCLHGWIQSGPVFSKKMGSVQKYLSKYAELHFPTGPVVADEEADPNDEEEK
./wwwt FGSCPPEEQPRGWWFSEQEADVFSALEEPAVCRGLEESLGMVAQALNRLGPFDGLLGFSQ
               .  :  :.  :.. :.      .  . .:::  . : ... ::::::.::::
dihydr KRLAALGGEQNGGKFGWFEVEDFKN-----TYGSWDESLECINQYMQEKGPFDGLIGFSQ
./wwwt GAALAALVCALGQAGDPRFPL---P--RFILLVSSFCPRGIGFKESILQRPLSLPSLHVF
       ::...:..  . : :.:  :    :  .:...:..:  .   : . . .  :. ::::.
dihydr GAGIGAMLAQMLQPGQPPNPYVQHPPFKFVVFVGGFRAEKPEF-DHFYNPKLTTPSLHIA
./wwwt GDTDKVIPSQESVQLASQFPGAITLTHSGGHFIPA-------------AAP---------
       : .: ..:  .: ::. .  .: .: : : :..:              .::
dihydr GTSDTLVPLARSKQLVERCENAHVLLHPGQHIVPQQAVYKTGIRDFMFSAPTKEPTKHPR
19.2% sequence identity; score -413

Multiple sequence alignment

What is multiple sequence alignment used for? • Consensus sequences: which residues can be used to identify other members of the family? • Gene and protein families: which residues are functionally important; functional families • Relationships and phylogenies: contains evolutionary “history” of sequences • Data underlying some protein structure prediction algorithms • Genome sequencing: sequence random, overlapping fragments; automation of assembly (in this case, there is a RIGHT answer)
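A consensus sequence can be read off an alignment column by column. The sketch below is one simple rule, assumed for illustration: report the most common residue when it reaches a chosen fraction of the column, otherwise write "x". The short aligned peptides are invented.

```python
from collections import Counter


def consensus(alignment: list[str], min_fraction: float = 0.5) -> str:
    """Return a consensus string; 'x' marks columns without a clear majority."""
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) >= min_fraction:
            result.append(residue)
        else:
            result.append("x")
    return "".join(result)


# Hypothetical aligned family members (equal length, '-' for gaps).
family = [
    "CPYCKA",
    "CGYCRA",
    "CPFCK-",
    "CPYCQA",
]
print(consensus(family))
```

Columns that stay identical across the whole family, such as the two cysteines in this made-up example, are the ones most likely to be worth testing as functional residues.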

Consensus sequences and important functional residues Baxter, et al, Mol Cell Prot 2003

Relationships and phylogenies • Serine-threonine protein phosphatases • Same biochemical function • Clustering clearly shows PP1, PP2a and PP2B families • What is different about these families? Fetrow, Siew, Skolnick, FASEB J, 1999

Possible redox site in PP1 family Only a clustering, not a true phylogenetic tree

Methods to solve computational complexity • Progressive global alignment • Iterative methods • Alignments based on locally conserved patterns • Statistical methods and probabilistic models

Multiple Sequence Alignment: Global Table 4.1, Mount

Multiple Sequence Alignment: Iterative Table 4.1, Mount

Multiple Sequence Alignment: Local Table 4.1, Mount

Methods to solve computational complexity • Progressive global alignment • Start with the most related sequences • Problem: errors in the initial alignments are propagated • Iterative methods • Iterative alignment of subgroups of sequences to find the “best”; then align the subgroups • Alignments based on locally conserved patterns • Block analysis • Statistical methods and probabilistic models • Expectation maximization; Gibbs sampler; hidden Markov models
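The "start with the most related sequences" step of progressive alignment can be sketched as an ordering problem. The code below is only a stand-in: it uses the similarity ratio from Python's `difflib` instead of a real pairwise alignment score, and a greedy ordering instead of a guide tree, so it shows the idea rather than what any particular program does.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a placeholder for a pairwise alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def progressive_order(seqs: list[str]) -> list[str]:
    """Order distinct sequences: closest pair first, then nearest remaining one."""
    if len(seqs) < 2:
        return list(seqs)
    first, second = max(combinations(seqs, 2), key=lambda pair: similarity(*pair))
    order = [first, second]
    remaining = [s for s in seqs if s not in order]
    while remaining:
        nearest = max(remaining, key=lambda s: max(similarity(s, t) for t in order))
        order.append(nearest)
        remaining.remove(nearest)
    return order


print(progressive_order(["instruction", "insurrection", "incision"]))
```

Because each later sequence is aligned against what has already been built, a poor choice in the first pair is never revisited, which is the error-propagation problem named on the slide.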

Profile Methods • Perform a global multiple sequence alignment on a group of sequences • Extract more highly conserved regions • Profile = scoring matrix for these highly conserved regions • Used to search unknown sequences for membership in the family Figures 4.11 (p. 162) and 4.12 (p. 166-167)
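A profile can be sketched as a table of log-odds scores, one column per aligned position. The example below assumes an ungapped DNA alignment, a uniform background frequency, and a pseudocount of 1; all three are simplifying assumptions, and the sequences are invented. It then slides the profile along a query to find the best-matching window.

```python
import math
from collections import Counter


def build_profile(alignment: list[str], alphabet: str = "ACGT", pseudocount: float = 1.0):
    """Return one {letter: log2 odds} dict per column of an ungapped alignment."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            letter: math.log2(((counts[letter] + pseudocount) / total) / background)
            for letter in alphabet
        })
    return profile


def score_window(profile, window: str) -> float:
    """Sum the per-position scores of one window of the same width as the profile."""
    return sum(column[letter] for column, letter in zip(profile, window))


def best_hit(profile, sequence: str):
    """Return (score, start) of the best window; sequence must be at least profile width."""
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


profile = build_profile(["ACGT", "ACGA", "ACGT", "TCGT"])
print(best_hit(profile, "GGGACGTGGG"))
```

The pseudocount keeps letters that never appeared in a column from scoring minus infinity, which softens, but does not remove, the sampling bias discussed on the next slide.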

Limitations of such profiles • Limited by sequences in original msa: • Sequence bias (too many of one type of sequence) • Sequences in msa not representative of entire family

Blocks • Blocks are conserved regions of msa (like profiles) but no gaps allowed • Servers for producing Blocks: • Blocks server • eMotif server • Block libraries for database searching • Blocks (Henikoff and Henikoff) • Prosite (Bairoch) • Prints (Attwood)
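The "no gaps allowed" property suggests a simple first pass for locating candidate blocks: find runs of alignment columns in which no sequence has a gap. The sketch below does only that; it does not measure conservation, and it is not how the Blocks or eMotif servers work. The minimum width and the toy alignment are assumptions.

```python
def ungapped_blocks(alignment: list[str], min_width: int = 3):
    """Return (start_column, rows) for each gap-free run of at least min_width columns."""
    blocks = []
    start = None
    columns = list(zip(*alignment))
    # A trailing all-gap sentinel column closes a run that reaches the end.
    for idx, column in enumerate(columns + [("-",)]):
        if "-" not in column:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_width:
                blocks.append((start, [row[start:idx] for row in alignment]))
            start = None
    return blocks


toy_msa = [
    "ACDE-FGHI",
    "ACDEKFG-I",
    "ACDE-FGHI",
]
print(ungapped_blocks(toy_msa))
```

A real block finder would go on to score each candidate region for conservation and keep only the strongly conserved ones.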

Blocks that might be extracted from an msa Baxter, et al, Mol Cell Prot 2003

Database searching • Identify a new sequence by experimental methods: what is it? • Search databases to find similar sequences • If “enough similarity”, can say that function of new sequence is same as known sequence: function annotation transfer • What is “enough similarity”? • What is “function”? Chapter 7, Mount

Relationships between family members • Sequence relationships between family members • Not all members of family have significant sequence similarity to all others • Can be represented by nodes and edges of a graph Figure: graph with nodes A, B, C, D, E, F, Z
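The graph view can be sketched with an adjacency structure in which an edge means "significant pairwise similarity". In the example below the node labels follow the figure, but the edges are hypothetical, chosen so that A and F belong to the same family without being directly similar.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) similarity pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def family_of(graph, start):
    """Return every node reachable from start through chains of similarity."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical edges: a chain A-B-C-D-E-F; Z has no significant hits.
similar_pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
graph = build_graph(similar_pairs)
print(sorted(family_of(graph, "A")))
print("F" in graph["A"])
```

A single search with A as the query would find only B; reaching F requires following intermediate sequences, which is the idea behind iterative searches such as PSI-BLAST.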

Beware of issues with function annotation transfer • Multiple domains • High sequence identity, but functional residues not conserved • Sequence repeats (low complexity regions) Figure labels: Function A, Function A, New Function B; known serine hydrolase with residues H, S, D; new sequence with residues S, D, L

Methods for database searching • Sequence similarity with query sequence: FASTA, BLAST (Fig 7.5, p. 305) • Profile search: ProfileSearch • Position-specific scoring matrix: MAST • Iterative alignment (combination of sequence searching and profile search): PSI-BLAST • Patterns: Prosite, Blocks, Prints, CDD/Impala Table 7.1, Mount

The problem with speed • Dynamic programming • Guaranteed to find the optimal answer • Too slow for routine database searches (many searches are performed and the databases contain many sequences): the Smith-Waterman dynamic programming algorithm is about 50 times slower than BLAST or FASTA • Faster hardware has made this problem feasible • Heuristic methods • FASTA: searches for short, common patterns shared by the query and database sequences • BLAST: similar, but searches for rarer, more significant patterns
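The speed of the heuristic methods comes largely from looking up short words instead of filling a full dynamic programming table. The sketch below shows only that seeding idea: index every k-letter word of the database sequence, look up the query's words, and count hits per diagonal. It is a toy, not the FASTA or BLAST algorithm; real programs add scored word neighborhoods, extension, and statistics. The word length and sequences are assumptions.

```python
from collections import defaultdict


def index_words(sequence: str, k: int):
    """Map each k-letter word to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def seed_hits(query: str, subject: str, k: int = 3):
    """Return (query_pos, subject_pos) pairs where an exact k-letter word is shared."""
    subject_index = index_words(subject, k)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in subject_index.get(query[i:i + k], [])
    ]


def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal with the most word hits."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return max(counts.items(), key=lambda item: item[1]) if counts else None


hits = seed_hits("ACGT", "TTACGT", k=3)
print(hits, best_diagonal(hits))
```

Only regions around well-supported diagonals would then be examined more carefully, so most of the database is never aligned at all. The price is that a weak but real similarity with no shared word can be missed.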

Searches on DNA vs Protein Sequences • 20-letter alphabet vs 4-letter alphabet • Fivefold larger variety of sequence characters in proteins: easier to detect patterns • Searches with DNA sequences produce fewer significant matches • What if you don’t know reading frame? • Sometimes must do nucleic acid searches (searching for similarities in non-coding regions)
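When the reading frame is unknown, the usual approach is to consider all six: three offsets on the given strand and three on its reverse complement. The sketch below only splits a sequence into codons for each frame; the frame labels ("+1" to "-3") and the example sequence are assumptions made for the illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Return the complete codons of each of the six reading frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            usable = (len(seq) - offset) // 3 * 3   # drop a trailing partial codon
            frames[f"{strand}{offset + 1}"] = [
                seq[i:i + 3] for i in range(offset, offset + usable, 3)
            ]
    return frames


for label, codons in six_frames("ATGAAA").items():
    print(label, codons)
```

Translating each frame and searching a protein database with all six is the idea behind translated searches, which recover protein-level sensitivity from a DNA query.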

Sensitivity vs selectivity • Sensitivity: method’s ability to find most members of the protein family • Selectivity: method’s ability to distinguish true members from non-members • Want a method to have high sensitivity (get all true positives) and high selectivity (not get false positives) • Can be a difficult test with biological data sets: not all true positives are known
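The two terms can be turned into numbers once a set of known family members is available. The slide gives no formulas, so the definitions below are one common formalization and should be read as an assumption: sensitivity as the fraction of known members that were found, and selectivity as the fraction of reported hits that are true members. The identifiers are invented.

```python
def evaluate(predicted: set, known_members: set) -> dict:
    """Compare a method's hits with the known members of a family."""
    true_pos = len(predicted & known_members)
    false_pos = len(predicted - known_members)
    false_neg = len(known_members - predicted)
    found_any = true_pos + false_pos > 0
    has_members = true_pos + false_neg > 0
    return {
        "sensitivity": true_pos / (true_pos + false_neg) if has_members else 0.0,
        "selectivity": true_pos / (true_pos + false_pos) if found_any else 0.0,
    }


# Hypothetical search result: three hits, two of which are real family members.
result = evaluate({"seqA", "seqB", "seqX"}, {"seqA", "seqB", "seqC", "seqD"})
print(result)
```

The caveat on the slide applies directly: a hit counted here as a false positive may be a genuine member that has not yet been annotated, so measured selectivity is a lower bound.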

Scoring matrices commonly used • PAM250: point accepted mutation; Dayhoff, M., Schwartz, R. M., and Orcutt, B. C., Atlas of Protein Sequence and Structure (1978) 5(3):345 • BLOSUM62: blocks amino acid substitution matrices; Henikoff and Henikoff, Amino acid substitution matrices from protein blocks. (1992) Proc. Natl. Acad. Sci. USA 89:10915-10919.

PAM250 • Calculated for families of related proteins (>85% identity) • 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues • A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement • PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e. 250 substitutions in 100 amino acids (longer evolutionary time)
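The step from 1 PAM to 250 PAM can be illustrated by raising a one-step mutation probability matrix to the 250th power and converting the result to log-odds scores. The sketch below uses a made-up three-letter alphabet with equal frequencies, not Dayhoff's data, so the numbers are purely illustrative; the scaling of ten times the base-10 logarithm is the convention usually quoted for PAM scores.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def log_odds(matrix, background):
    """Score = 10 * log10(P(i -> j) / frequency of j); positive means more likely than chance."""
    size = len(matrix)
    return [[10 * math.log10(matrix[i][j] / background[j]) for j in range(size)]
            for i in range(size)]


STEP = 0.01  # 1 PAM: each letter changes with probability 1%, split evenly (toy assumption)
PAM1 = [[1 - STEP if i == j else STEP / 2 for j in range(3)] for i in range(3)]
PAM250 = mat_power(PAM1, 250)
SCORES = log_odds(PAM250, [1 / 3] * 3)
print([round(p, 4) for p in PAM250[0]])
print([round(s, 2) for s in SCORES[0]])
```

Even after 250 PAM the diagonal stays above the chance level, because repeated and reverting changes at the same position mean that 250 substitutions per 100 residues do not alter every residue. That leftover signal is what gives identities a positive score and unlikely replacements a negative one.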
