Genomics and bioinformatics summary

1. Gene finding: computer searches, cDNAs, ESTs, microarrays
2. Use BLAST to find homologous sequences
3. Multiple sequence alignments (MSAs)
4. Trees quantify sequence and evolutionary relationships
5. Protein sequences are evolutionary clocks
6. Some public databases and protein sequence analysis tools

Finding genes -- computer searches
  Computer searches locate most genes in prokaryotes, Archaea, and yeast, but only ~1/3 of human genes are identified correctly.
  Criteria:
    Protein start and stop signals, splicing signals . . .
    Codon bias
    Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast . . .)
  Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs) . . .
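
The simplest of the criteria above, start and stop signals, can be sketched as an open reading frame (ORF) scan. This is a minimal illustration, not how any particular gene finder works: it assumes an intron-free, prokaryote-like sequence, and the function names, the `min_codons` parameter, and the toy sequence are all hypothetical. Real gene finders add codon bias, splice-site models, and genome comparisons.

```python
# Minimal ORF scan. Assumption: no introns, ATG is the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for ATG...stop runs.

    Returns (strand, start, end, orf_sequence) tuples; start and end are
    0-based offsets on the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


# Toy sequence (made up for illustration): one short ORF on the + strand.
toy = "CCATGAAATTTGGGTAACC"
for orf in find_orfs(toy):
    print(orf)
```

A scan like this fails on exactly the hard problems listed above: small genes fall under the length cutoff, and spliced genes do not form one continuous reading frame.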

Finding genes -- cDNA synthesis
  Synthesizing “cDNA” (complementary DNA):
    1. Extract RNA.
    2. Hybridize poly-T primer.
    3. Synthesize DNA strand 1 using reverse transcriptase.
    4. Fragment RNA strand using RNaseH.
    5. Synthesize DNA strand 2 using DNA polymerase.
  Sequences of random cDNAs provide ESTs (Expressed Sequence Tags).

Microarrays quantify expressed genes by hybridization
  Label cDNAs from the experimental condition with one fluorophore and cDNAs from a reference condition with the other (one red, one green).
  Mix red and green DNA and hybridize to a “microarray”.
  Each spot is a different synthetic oligonucleotide complementary to a specific gene.
  Color key:
    Red = genes enriched in reference
    Yellow (green + red) = genes present at similar levels in both
    Green = genes enriched in experiment
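
The comparison of the two channels is commonly summarized as a log2 ratio per spot. The sketch below shows that calculation under stated assumptions: the function names, the intensity floor, and the two-fold threshold are hypothetical choices for illustration, and real analyses also correct for background and normalize the two channels.

```python
import math


def log2_ratio(experiment, reference, floor=1.0):
    """log2 of experiment/reference intensity for one spot.

    `floor` (an assumed value) keeps very dim spots from producing
    division by zero or extreme ratios.
    """
    return math.log2(max(experiment, floor) / max(reference, floor))


def classify(experiment, reference, threshold=1.0):
    """Label a spot; threshold=1.0 means a two-fold change (assumption)."""
    ratio = log2_ratio(experiment, reference)
    if ratio >= threshold:
        return "enriched in experiment"
    if ratio <= -threshold:
        return "enriched in reference"
    return "similar in both"


# Made-up intensities for three spots.
for exp, ref in [(4000, 1000), (500, 2000), (1000, 1100)]:
    print(exp, ref, classify(exp, ref))
```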

“Cluster analysis” identifies patterns of gene expression
  [Figure: expression heat map with axes labeled Genes and Conditions]
  Similar patterns of expression are placed next to each other.
  Groups of genes with similar patterns form a hierarchical “tree”.
  For example, the two major branches of the tree comprise activated genes (left, green) or repressed genes (right, red).
  Genes with similar expression patterns (e.g. A-E) often function together.
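
One way to build such a tree is agglomerative clustering on a correlation-based distance. The sketch below is an illustration only: the slide does not say which method produced the figure, so average linkage with distance = 1 - Pearson correlation is an assumption, and the gene names and expression values are made up.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def cluster(profiles):
    """Average-linkage agglomerative clustering.

    `profiles` maps gene name -> list of expression values (one per
    condition). Returns the tree as nested tuples of gene names.
    """
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def distance(a, b):
        return mean(1 - pearson(profiles[g], profiles[h])
                    for g in a[1] for h in b[1])

    while len(clusters) > 1:
        # Merge the closest pair of clusters into one new branch.
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Made-up log-ratio profiles across four conditions.
profiles = {
    "geneA": [2.0, 1.5, -0.5, -1.0],
    "geneB": [1.8, 1.6, -0.4, -1.2],
    "geneC": [-1.5, -1.0, 0.8, 1.9],
    "geneD": [-1.7, -0.9, 1.0, 1.6],
}
print(cluster(profiles))
```

In this toy data the two top-level branches separate the genes that go up in the first conditions from those that go down, mirroring the activated and repressed branches described above.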

“Tiling” microarrays can find transcribed sequences
  Microarray coding capacity: ~16 M bases.
  Each spot has a different synthetic oligonucleotide complementary to a different segment of the genome (e.g. every 100 bp).
  Spots that hybridize reveal transcribed regions.
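
Turning hybridizing spots into transcribed regions amounts to merging runs of adjacent positive probes. A minimal sketch follows, assuming probes spaced every 100 bp as in the example above; the probe length of 25 bases, the signal threshold, and the intensity values are hypothetical.

```python
def transcribed_regions(signals, step=100, probe_len=25, threshold=2.0):
    """Merge runs of adjacent hybridizing probes into genomic intervals.

    signals[i] is the intensity of the probe that starts at position
    i * step. Returns (start, end) tuples, end exclusive.
    """
    regions = []
    current = None
    for i, signal in enumerate(signals):
        start = i * step
        end = start + probe_len
        if signal >= threshold:
            if current is None:
                current = [start, end]   # open a new region
            else:
                current[1] = end         # extend the open region
        elif current is not None:
            regions.append(tuple(current))
            current = None
    if current is not None:
        regions.append(tuple(current))
    return regions


# Made-up probe intensities along a short stretch of genome.
print(transcribed_regions([0.1, 3.0, 4.2, 2.5, 0.3, 0.2, 5.0]))
```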

Find similar sequences (homologs) with BLAST
  The most related human protein identified by a BLAST search of the human genome using the sequence of M. tuberculosis PknB Ser/Thr protein kinase is . . . ELKL motif kinase 1.
  Query = the part of the PknB sequence that matches ELKL-1.
  Subject (Sbjct) = ELKL-1.
  Expect = expectation value = the number of hits of this quality expected by chance in a database of this size (5e-24 = 5 x 10^-24; is this a big number or small?)
  Identities = # of exact amino acid matches in the alignment.
  Positives = # of identities plus conservative changes, as defined by the residues that tend to replace each other in homologous proteins.
  NP_004945.2 = sequence ID for ELKL-1.
  >ref|NP_004945.2| ELKL motif kinase 1 [Homo sapiens]
            Length = 691
   Score = 108 bits (270), Expect = 5e-24
   Identities = 87/296 (29%), Positives = 135/296 (45%), Gaps = 21/296 (7%)
  Query:  11 YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA 70
             Y L + +G G  ++V LAR +   ++VAVK++        S    FR E +    LNHP
  Sbjct:  20 YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN 78
  Query:  71 IVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSH 130
             IV +++  E E       Y+VMEY  G  +  D +   G M  K A         A+ + H
  Sbjct:  79 IVKLFEVIETEKTL----YLVMEYASGGEVFDYLVAHGRMKEKEARAKFRQIVSAVQYCH 134
  Query: 131 QNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARG 190
             Q  I+HRD+K  N+++ A   +K+ DFG +      GN +       G+  Y +PE  +G
  Sbjct: 135 QKFIVHRDLKAENLLLDADMNIKIADFGFSNEFT-FGNKLD---TFCGSPPYAAPELFQG 190
  Query: 191 DSVDA-RSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHE-GLSADLD 248
                D    DV+SLG +LY +++G  PF G +      + +RE  +    R     +S D +
  Sbjct: 191 KKYDGPEVDVWSLGVILYTLVSGSLPFDGQN-----LKELRERVLRGKYRIPFYMSTDCE 245
  Query: 249 AVVLKALAKNPENRYQTAAEMRADLVRVHNGEPPEAPKV-----LTDAERTSLLSS 299
              ++ K L  NP  R      M+   + V + +    P V       D  RT L+ S
  Sbjct: 246 NLLKKFLILNPSKRGTLEQIMKDRWMNVGHEDDELKPYVEPLPDYKDPRRTELMVS 301
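
The Identities and Gaps figures can be recounted directly from the aligned sequences. The sketch below does this for the first 60 columns of the alignment above (Query 11-70 against Sbjct 20-78); the function name is hypothetical. Counting Positives would also need a substitution matrix such as BLOSUM62, which is left out here.

```python
def alignment_stats(query, sbjct):
    """Count identities and gap columns in a pairwise alignment.

    Both strings must be the same length, with '-' marking gaps.
    Returns (identities, gaps, alignment_length).
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, sbjct))
    return identities, gaps, len(query)


# First block of the PknB / ELKL motif kinase 1 alignment.
QUERY = "YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA"
SBJCT = "YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN"

identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```

Applied to the full alignment, the same count should reproduce the 87/296 identities and 21/296 gaps reported in the header. The percentages are computed here by rounding down, which is consistent with the 29%, 45%, and 7% shown in the report.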

Ser/Thr protein kinases diverge rapidly
  Multiple Sequence Alignment (MSA) of the N-terminal ~90 residues of M. tuberculosis PknB (bottom) and Ser/Thr protein kinases of known structure.
  The histogram at the bottom shows % identity at each position.
  Only a few residues are absolutely conserved (functional sites!).
  The MSA defines the beginning of the kinase domain.
  Insertions often occur in loops.
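
A per-position % identity histogram like the one described can be computed column by column from the MSA. The sketch below uses one possible definition (the share of sequences carrying the most common residue in each column, with gaps never counted as matches); the slide does not state which definition the figure uses, and the four aligned fragments are made up.

```python
from collections import Counter


def column_identity(msa):
    """Percent of sequences sharing the most common residue, per column.

    `msa` is a list of equal-length aligned strings with '-' for gaps.
    """
    n = len(msa)
    percentages = []
    for column in zip(*msa):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        percentages.append(100.0 * top / n)
    return percentages


# Made-up aligned fragments, for illustration only.
msa = ["GKGAFG",
       "GEGSFG",
       "GQG-YG",
       "GRGTFG"]

identity = column_identity(msa)
conserved = [i for i, pct in enumerate(identity) if pct == 100.0]
print(identity)
print("absolutely conserved columns:", conserved)
```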

Histones evolve slowly
  [Figure panels: Tree; MSA = Multiple Sequence Alignment]
  Core H3 proteins (that have the same function) are nearly identical in eukaryotes (left).
  Archaeal H3s and specialized H3 proteins that bind at centromeres show much more divergence (bottom sequences and tree branches, right).

Protein sequences are evolutionary clocks
  Assuming that organisms diverged from a common ancestor and sequence changes accumulate at constant rates, the number of changes in homologous proteins gives information about the time that each sequence has been evolving independently.
  [Figure: average rate of change of proteins of different function, ranging from slow to fast]
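
Under the constant-rate assumption, the arithmetic can be sketched as follows. This is an illustration, not a method taken from the slides: it applies the standard Poisson correction d = -ln(1 - p) for multiple changes at the same site, then divides by twice the rate because both lineages accumulate changes after the split. The rate and the fraction of differing sites in the example are hypothetical numbers.

```python
import math


def poisson_distance(p):
    """Estimated substitutions per site from the observed fraction p of
    differing sites, correcting for multiple hits at one site."""
    return -math.log(1.0 - p)


def divergence_time(p, rate):
    """Time since two sequences shared an ancestor.

    `rate` is substitutions per site per year along ONE lineage; the
    factor of 2 accounts for both lineages changing independently.
    """
    return poisson_distance(p) / (2.0 * rate)


# Hypothetical example: 10% of aligned sites differ, and the protein
# changes at 1e-9 substitutions per site per year.
print(f"{divergence_time(0.10, 1e-9):.3g} years")
```

A slowly changing protein (small `rate`) is useful for dating ancient splits, while a fast-changing one saturates over long times and is better suited to recent divergences.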

Tree of life (Sequences = biological clocks)
  A tree derived by clustering sequences of a typical protein family (pterin-4a-hydroxylase) recapitulates the tree of life.
  Evolutionary relationships are seen at the molecular level in virtually every shared protein and RNA!

Some web sites for bioinformatics
  Nucleic acid sequences: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide
  Protein sequences: http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
  Structure coordinates (Protein Data Bank): http://www.rcsb.org/pdb/
  Programs:
    BLAST sequence similarity calculation: http://www.ncbi.nlm.nih.gov/BLAST/
    BLAST bacterial genomes: http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi
    PHD secondary structure predictor and motif search: http://www.embl-heidelberg.de/predictprotein/predictprotein.html
    PHYRE fold predictor: http://www.sbg.bio.ic.ac.uk/~phyre/
    Multicoil coiled-coil prediction: http://multicoil.lcs.mit.edu/cgi-bin/multicoil/
    Many nucleic acid and protein sequence-analysis tools: http://au.expasy.org/
    Predict transmembrane helices: http://www.cbs.dtu.dk/services/THMM-2.0/
    Predict signal sequences: http://www.cbs.dtu.dk/services/SignalP/

Genomics and bioinformatics summary
  1. Gene finding: computer searches, cDNAs, ESTs, microarrays
  2. Use BLAST to find homologous sequences
  3. Multiple sequence alignments (MSAs)
  4. Trees quantify sequence and evolutionary relationships
  5. Protein sequences are evolutionary clocks
  6. Lots of public databases and protein sequence analysis tools
