Introduction in Bioinformatics. Dr. Chris Evelo Department of Bioinformatics –BiGCaT Maastricht University. A translational product path: Small Molecules. Drug Design. Choose a protein target? But which one?. Cells are protein factories.
Introduction in Bioinformatics Dr. Chris Evelo Department of Bioinformatics –BiGCaT Maastricht University
Cells are protein factories Differences in protein production (= gene expression regulation) determine the cell type, its function, its health.
Figure 3-15. The transfer of information from DNA to protein. The transfer proceeds by means of an RNA intermediate called messenger RNA (mRNA). In procaryotic cells the process is simpler than in eucaryotic cells. In eucaryotes the coding regions of the DNA (in the exons, shown in color) are separated by noncoding regions (the introns). As indicated, these introns must be removed by an enzymatically catalyzed RNA-splicing reaction to form the mRNA. Alberts et al. Molecular Biology of the Cell, 3rd edn.
Step 1: transcriptional control Binding of transcription factors determines expression
Two steps… • Find the regulated proteins • Find out how they are regulated
Find the regulated proteins? Different conditions show different levels of gene expression for specific genes
What about the human genome? Copied chromosomal sequences to hard discs. So now you can read it (although I still prefer a good novel) If you are good at it (and care to read it 6 times over) you can even predict genes But even if you are among the best you can’t predict protein structure or function
And this week... Tweets for #cgc2011 • Ion Torrent did EHEC in a day, soon can do Human Genome in 2hr (incl sample prep), for $5000. • Illumina: $4000 for a full human genome. • Noblegen: 96 human genomes in 17 hours. • Complete Genomics: 55x coverage (covering 98% of genome >10x) for $5000. They did 1500 so far.
Here is the challenge Take a 5 minute break… Think of something useful to do with a human genome. Describe what other info you need to make it work.
About proteins and mRNA • Biochemists and physiologists spent over a century describing proteins, their function, structure and sequence (see: UniProt) • Molecular biologists spent decades finding huge amounts of expressed mRNA sequences (ESTs), tried to relate them to function, and failed, cluttering up the databases with things like “EST found in very rare tumor so and so” (could be myoglobin mRNA) (see: GenBank, EMBL)
UniProt a combined database • SwissProt (EU) and PIR (US): highly expert curated • trEMBL (translated EMBL): automatically translated from RNA
UniGene an historic database Clusters of mRNA (ESTs). Basis for transcript info in RefSeq and ENSEMBL.
Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10
Using the information • Take the EST sequences and cluster them to full mRNA sequences (UniGene!) • Build the full coding sequences from this (RefSeq and Ensembl) • Translate that into hypothetical proteins (UniProt/trEMBL) • Check for known proteins (UniProt/SwissProt) • Use to find microarray reporter sequences for known and hypothetical proteins • BLAST it against the genome to find the location.
DNA sequence useful? Yes, • if you know from population genetics or animal experiments about loci (QTLs) important for traits. Your gene might be in such a locus (check OMIM, RGD) • to find regulatory sequences • to compare genomes (e.g. tumor and healthy). This week's oncology conference in the US: “It is unethical not to sequence a tumor before treatment”
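The "translate that into hypothetical proteins" step can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a made-up coding sequence (`hypothetical_cds` is not from the slides); it is not how UniProt/trEMBL actually generates its entries, only the core idea of reading codons until a stop.

```python
# Standard genetic code, written compactly with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence (DNA or mRNA letters) until the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented example sequence, for illustration only.
hypothetical_cds = "ATGGCCAAGTGGTAA"
print(translate(hypothetical_cds))
```

A real pipeline would also have to pick the correct reading frame and strand, which this sketch takes as given.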
Two steps… • Find the regulated proteins • Find out how they are regulated
Changes in gene expression • Comparison of gene expression shows important pathways and receptors which can be influenced • Different gene expression e.g. • Between healthy and sick conditions • At different stages of disease progression • At different stages of healing • As a response to successful treatment • Between more and less vulnerable individuals
Gene expression DNA → mRNA → protein • Changes in mRNA (transcriptomics) • Differential expression libraries • Gene expression microarrays • Changes in protein levels (proteomics) • 2D electrophoresis • antibody arrays • GC-MS and HPLC-MS • Epigenetic changes (e.g. DNA methylation) • Changes in regulatory proteins (e.g. ChIP) • Changes in activity
mRNA processing • Genes contain: • Expressed regions (exons) • Non-expressed regions (introns) • During RNA splicing introns are removed and exons connected • A poly-adenosine (poly-A) tail is added • Complete mRNAs leave the nucleus • mRNAs are “attacked” by miRNAs
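The splicing and poly-A steps listed above can be mimicked on a string. This is a toy Python sketch under stated assumptions: the pre-mRNA sequence and the exon coordinates are invented, and the tail length of 30 is only a convenient default.

```python
# Invented pre-mRNA: exon 1 + intron + exon 2 (not a real gene).
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUGA"

# Hypothetical exon coordinates as (start, end) pairs, end exclusive.
exons = [(0, 6), (18, 24)]


def splice(sequence, exon_coords):
    """Remove introns by joining the exon slices in order."""
    return "".join(sequence[start:end] for start, end in exon_coords)


def add_poly_a(mrna, length=30):
    """Append a poly-A tail of the given length."""
    return mrna + "A" * length


mature = add_poly_a(splice(pre_mrna, exons))
print(mature)
```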
Figure 9-87. Control of the poly-A tail length affects both mRNA stability and mRNA translation. (A) Most translated mRNAs have poly-A tails that exceed a minimum length of about 30 As. The tails on selected mRNAs can be either elongated or rapidly cleaved in the cytosol, and this will have an effect on the translation of these mRNAs. (B) A model proposed to explain the observed stimulation of translation by an increase in poly-A tail length. The large ribosomal subunits, on finishing a protein chain, may be directly recycled from near the 3' end of an mRNA molecule back to the 5' end to start a new protein by special poly-A-binding proteins (red). Alberts et al. Molecular Biology of the Cell, 3rd edn.
Layout of a microarray experiment • Get the cells • Isolate RNA • Incorporate fluorescent dye • Hybridize • Laser read out • Analyze image
When not to use microarrays • Expression changes of single known genes (cheaper alternatives) • Visible tissue changes (e.g. inflammation, collagen). Arrays would just be expensive microscopes! Useful at early stages.
Getting the cells Critical aspects • We need controls (but controls can be pooled) • Cell isolation must be fast (mRNA should be kept) • About 5 µg total RNA needed (with amplification) • Microdissection possible • Tissue changes will result in RNA changes
Understanding Array data • Typical procedure • Annotate the reporters with something useful (UniProt!) • Sort based on fold change • Search for your favorite genes/proteins • Throw away 95% of the array the European Nutrigenomics Organisation
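The "sort based on fold change" step can be sketched in a few lines of Python. The reporter names and intensities below are invented for illustration, and the sketch assumes one control and one treated value per reporter; a real analysis would normalize the data and use replicates.

```python
import math

# Hypothetical reporter intensities as (control, treated); values are invented.
intensities = {
    "reporter_A": (200.0, 800.0),
    "reporter_B": (500.0, 520.0),
    "reporter_C": (900.0, 150.0),
}


def log2_fold_change(control, treated):
    """Return log2(treated / control); positive means higher in the treated sample."""
    return math.log2(treated / control)


# Rank reporters by the size of the change, largest first, regardless of direction.
ranked = sorted(
    ((name, log2_fold_change(c, t)) for name, (c, t) in intensities.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, lfc in ranked:
    print(f"{name}\t{lfc:+.2f}")
```

Working on the log2 scale makes a fourfold increase (+2) and a fourfold decrease (-2) equally large, which a plain ratio does not.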
Secondary Analyses • Gene clustering: order the genes according to behavior • Pathway and function finding: use pathways and Gene Ontology
Understanding Array data • “Advanced” procedures • Gene clustering or principal component analysis • Get groups of genes with parallel expression patterns • Useful for diagnosis • Not adding much to understanding (unless combined)
Functional Mapping Annotation/coupling
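Finding "groups of genes with parallel expression patterns" can be illustrated with Pearson correlation. The Python sketch below uses invented profiles and a simple greedy grouping with a hypothetical threshold of 0.9; it stands in for proper clustering methods rather than reproducing any specific tool.

```python
from statistics import mean

# Hypothetical expression profiles over four conditions (invented values).
profiles = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.1, 4.0, 6.2, 8.0],
    "gene_3": [4.0, 3.0, 2.0, 1.0],
    "gene_4": [8.0, 6.1, 4.0, 2.2],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def group_by_correlation(data, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose first member it correlates with."""
    groups = []
    for name, values in data.items():
        for group in groups:
            if pearson(data[group[0]], values) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


groups = group_by_correlation(profiles)
print(groups)
```

Correlation ignores absolute level, so gene_2 groups with gene_1 even though its values are roughly twice as high.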
That was Step 1… • Find the regulated proteins • Find out how they are regulated
Finding the TF binding sites Sequence determines binding of transcription factors
TF binding site motifs
Conserved GCNF binding site If it is important it should be conserved
human  TTGGACCTTGAACTTATGTATCATGTGGAGA-AGAGCCAATTTAACAAACTAGGAAGATG
         :||||:|||||||:|||::||||:||::| |||||||||:|:|||:|||||:||
rat    --AGACCATGAACTTCTGTGCCATGGGGCAACAGAGCCAATGTCACATACTAGAAA----
(position ruler on the slide: human 390 to 440, rat 360 to 400)
Result of rVista (Transfac Pro) analysis
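How conserved the aligned region is can be put into a number. The Python sketch below takes the two aligned sequences from the slide and counts identical bases over the columns where neither sequence has a gap; this choice of denominator is an assumption, and other tools may count gaps differently.

```python
# Aligned sequences as shown on the slide ("-" marks a gap).
human = "TTGGACCTTGAACTTATGTATCATGTGGAGA-AGAGCCAATTTAACAAACTAGGAAGATG"
rat = "--AGACCATGAACTTCTGTGCCATGGGGCAACAGAGCCAATGTCACATACTAGAAA----"


def percent_identity(seq_a, seq_b, gap="-"):
    """Return (compared columns, identical columns, percent identity), skipping gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    return compared, identical, 100.0 * identical / compared


compared, identical, identity = percent_identity(human, rat)
print(f"{identical}/{compared} identical columns ({identity:.1f}%)")
```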
ChIP technology Immunoprecipitation of DNA with crosslinked TFs. Detect DNA with PCR or arrays
SNPs: sequence variations
SNP in TF binding site?
ClustalW alignment (relevant part shown only), arrow = SNP location:
HUMAN           CAAGGTTTTTTGGAGGCTT--TTT-GTAAATTGTGA-----TAGGAACTTTGGACCTTG- 395
CHIMP           CAAGGTTTTTTGGAGGCTT--TTTTGTAAATTGTGA-----TAG-AACTTTGGACCTTGC 396
RHESUS_MACAQUE  CAAGGTTTCTTGGAGGCTT--TGT-GCAAATTGTGA-----TAACCACTTTGGACCTTC- 395
RAT             CAAGGTGTTTTG----TTT--TGAAGGGAATT-----------AAAAGAACAGACCATG- 362
MOUSE           CAAGGT-TTTTG----TTT--TAAAGGGACTTTTAAATTGTCTAAAATATCAGTAGACC- 379
STICKLEBACK     TCACGC--TACG----TTT--CTGAGTAAGCTGT--------CGCTTCTACGGAGTCAAG 277
TETRAODON       CGAGGAGTCCCGCTG-TTT--CTTTGTAGCCACTTTAGTACTTTACGGTTGGGGCCAAGC 274
ZEBRAFISH       TTATATCATGCATCACTCAAGTTAAATGTGTTTTTGTCATATTACCGATGCTGTTTCAGG 315
* *
HUMAN           AACTTATGTATC----ATGTGG-AGAAGAGCCAATTTAACAAACTAGGAAGATGAAAAGG 450
CHIMP           AACTTATGTATC----ATGTGG-AGAAGAGCCAATTTAACAAACTAGGAAGATGAAAAGG 451
RHESUS_MACAQUE  AACTTATGTATCTATCATGTGG-AAAAGAGCCAATTTAGCAAACTAGGAACATGAAAAGG 454
RAT             AACTTCTGTGCC----ATGGGGCAACAGAGCCAATGTCACATACTAG------AAAAAGA 412
MOUSE           ATCATCTGTGCC----ATGGGG-GACAGAGCCAATTTCA--------------------- 413
STICKLEBACK     GCGCTCAGGGTCT--CACTCCCCTTCTCAGCCACTTTATGACTTTGCCTTGGGGGGCCGA 335
TETRAODON       CTCCGCGACTCCGCCCCCTGGCCTGCTGGGACATGGGAGA----TGGTTTCTGCCAAGGA 330
ZEBRAFISH       GCCTGAAAGAGGGCACAAGGGCTGTTTGGTGTGCTGTATTTCATTATATTT--GAGCTGC 373
▲
Binding site motif: T[AC][TC]GT[AG][CT]C, in IUPAC codes: T M Y GT R Y C
Of course that was just part of step 2… • Find the regulated proteins • Find out how they are regulated • We found the transcription factor • We need the whole path up to the receptor • It might help if part of that path itself showed up in gene expression studies.
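The motif can be searched for directly in the aligned sequences. The Python sketch below converts the IUPAC consensus TMYGTRYC from the slide into a regular expression and scans four rows of the second alignment block with gaps removed. The choice of rows is only for brevity; the helper names are hypothetical.

```python
import re

# IUPAC ambiguity codes needed for this motif (M = A/C, Y = C/T, R = A/G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "Y": "[CT]", "R": "[AG]"}


def iupac_to_regex(consensus):
    """Turn an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus))


motif = iupac_to_regex("TMYGTRYC")

# Rows copied from the second block of the alignment above (subset of species).
aligned_block = {
    "HUMAN": "AACTTATGTATC----ATGTGG-AGAAGAGCCAATTTAACAAACTAGGAAGATGAAAAGG",
    "RAT": "AACTTCTGTGCC----ATGGGGCAACAGAGCCAATGTCACATACTAG------AAAAAGA",
    "MOUSE": "ATCATCTGTGCC----ATGGGG-GACAGAGCCAATTTCA---------------------",
    "STICKLEBACK": "GCGCTCAGGGTCT--CACTCCCCTTCTCAGCCACTTTATGACTTTGCCTTGGGGGGCCGA",
}

hits = {}
for species, row in aligned_block.items():
    ungapped = row.replace("-", "")  # search the real sequence, not the gapped row
    match = motif.search(ungapped)
    hits[species] = match.group(0) if match else None
    print(species, hits[species])
```

A SNP that changes one of the fixed positions (T, GT or the final C) would stop the pattern from matching, which is one quick way to check whether a variant might disrupt the site.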