Genomics and Personalized Care in Health Systems Lecture 1: Introduction

Leming Zhou, PhD, Department of Health Information Management, School of Health and Rehabilitation Sciences, The University of Pittsburgh

Textbooks • Jonathan Pevsner, Bioinformatics and Functional Genomics, Second Edition, Wiley-Blackwell, 2009. • E-book: Genes and Disease, searchable and freely available at http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/gnd/gnd.pdf or http://www.ncbi.nlm.nih.gov/disease/

Course Description • This course provides a general introduction to genomics, gene structure and annotation, and gene-disease association. • Other topics, such as RNA and protein structure and microarray experiments, will also be briefly covered. • Students will learn gene structure and become familiar with various genome analysis tools by working on novel gene annotation projects.

Course Objectives (1/2) • Explain eukaryotic gene structure and the central dogma of molecular biology • Demonstrate the skills of annotating eukaryotic genes using online tools • Demonstrate the skills of performing sequence similarity searches using BLAST • Demonstrate the skills of collecting evidence from the UCSC Genome Browser • Describe major DNA and protein databases and the methods of extracting data from them

Course Objectives (2/2) • Explain major gene finding methods, their advantages and disadvantages • Describe different types of genetic diseases and the relationship between genetic variations and diseases • Demonstrate the skills of determining protein and RNA secondary structures using online tools • Explain basic ideas behind microarray and DNA sequencing technologies

Method of Presentation • Lectures • In-Class Laboratory Sessions • Student Projects and Presentations • Term Paper (graduate students)

Course Outline (Tentative)

Basic Concepts

DNA (1/3) • DNA (deoxyribonucleic acid) is a helical molecule comprising a sequence of four nucleotides (bases) • Adenine (A) – purine; Thymine (T) – pyrimidine • Guanine (G) – purine; Cytosine (C) – pyrimidine • [Figure: chemical structures of thymine, adenine, cytosine, and guanine]

DNA (2/3) • A always pairs with T, and G always pairs with C

DNA (3/3) • A DNA sequence can be either single-stranded or double-stranded • DNA sequences have an orientation: from 5’ to 3’ or from 3’ to 5’ (chemical conventions)
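The pairing and orientation rules on the DNA slides can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `reverse_complement` and the example sequence are chosen here for demonstration.

```python
# Watson-Crick base pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand_5_to_3: str) -> str:
    """Return the complementary strand, also written 5' to 3'."""
    # Reversing restores the 5'-to-3' reading direction of the opposite strand.
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3.upper()))


print(reverse_complement("ATGGCT"))  # AGCCAT
```

Applying the function twice returns the original strand, which reflects the fact that the two strands of a double helix determine each other.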

Nucleotides

RNA • RNA (ribonucleic acid) is usually a single-stranded molecule • It comprises four nucleotides • A, C, G, and U (Uracil) • Synthesized in the 5’ to 3’ direction by copying one of the two strands of a DNA molecule • Different types of RNAs • Messenger RNA (mRNA) • Transfer RNA (tRNA) • Ribosomal RNA (rRNA) • … • [Figure: chemical structure of uracil]
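As a rough sketch of how an RNA copy relates to the two DNA strands, the following Python uses the standard terms "coding strand" and "template strand", which do not appear on the slide. The function names and the example sequences are illustrative assumptions.

```python
def transcribe_coding_strand(coding_5_to_3: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_5_to_3.upper().replace("T", "U")


def transcribe_template_strand(template_5_to_3: str) -> str:
    """Build the mRNA (5' to 3') that is complementary to the template strand."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    # Reverse so the complementary RNA is written in its own 5'-to-3' direction.
    return "".join(pairing[base] for base in reversed(template_5_to_3.upper()))


print(transcribe_coding_strand("ATGGCT"))    # AUGGCU
print(transcribe_template_strand("AGCCAT"))  # AUGGCU
```

Both calls give the same mRNA because "AGCCAT" is the reverse complement of "ATGGCT", so the two inputs describe the two strands of one DNA molecule.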

Protein • A molecule comprising a long chain of amino acids connected by peptide bonds • There are 20 standard amino acids encoded by the universal genetic code • [Figure source: Molecular Biology of the Cell, Alberts et al., 2002]

Cell Types • Prokaryotes: a group of organisms that lack a nuclear membrane, such as blue-green algae and common bacteria (Escherichia coli). They comprise two major taxa: Archaea and Bacteria • Eukaryotes: unicellular and multicellular organisms, such as yeast, fruit fly, mouse, plants, and human

Gene • A stretch of DNA containing the information necessary for coding a protein/polypeptide • Promoter region • Transcription Factor Binding Site • Translation Start Site • Exon: coding (informative) regions of the DNA • Intron: noninformative regions between exons • Untranslated region (UTR) • Codons

Eukaryotic Gene Structure http://www.nslij-genetics.org/pic/dna-rna-protein.jpg

Eukaryotes • In eukaryotes, transcription is complex: • Many genes contain alternating exons and introns • Introns are spliced out of mRNA • mRNA then leaves the nucleus to be translated by ribosomes • Genomic DNA: entire gene including exons and introns • The same genomic DNA can produce different proteins by alternative splicing of exons • Complementary DNA (cDNA): spliced sequence containing only exons • cDNA can be manufactured by capturing mRNA and performing reverse transcription
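Splicing and alternative splicing can be illustrated with a toy example. The sketch below assumes exon positions are already known and given as 0-based, end-exclusive intervals; the gene sequence and coordinates are made up for illustration and do not come from a real gene.

```python
def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) in the order given."""
    return "".join(genomic[start:end] for start, end in exons)


# Made-up toy gene: exon, intron, exon, intron, exon (6 bases each).
genomic = "ATGGCC" + "GTAAGT" + "AAACCC" + "GTTTAG" + "GGGTAA"
exons = [(0, 6), (12, 18), (24, 30)]

full_length = splice(genomic, exons)                  # all three exons
exon_skipped = splice(genomic, [exons[0], exons[2]])  # exon 2 left out

print(full_length)   # ATGGCCAAACCCGGGTAA
print(exon_skipped)  # ATGGCCGGGTAA
```

The two outputs correspond to two cDNA-like sequences from the same genomic DNA, which is the sense in which one gene can yield different products.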

Central Dogma of Molecular Biology • DNA → RNA → Protein • Transcription: DNA → RNA • Translation: RNA → protein

DNA Transcription • RNA molecules synthesized by RNA polymerase • RNA polymerase binds to promoter region on DNA • Promoter region contains start site • Transcription ends at termination signal site • Primary transcript: direct coding of RNA from DNA • RNA splicing: introns removed to make the mRNA • mRNA: contains the sequence of codons that code for a protein • Splicing and alternative splicing • Post-transcriptional modification

mRNA Translation • Ribosomes are made of protein and rRNA • mRNA passes through the ribosome • Initiation factors: proteins that catalyze the start of translation • tRNA brings the different amino acids to the ribosome complex so that they can be attached to the growing amino acid chain • When a STOP codon is encountered, the ribosome releases the mRNA and synthesis ends • An open reading frame (ORF): a contiguous sequence of DNA starting at a start codon and ending at a STOP codon • Video: http://www.youtube.com/watch?v=5bLEDd-PSTQ
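The ORF definition above can be turned into a short search routine. This is a simplified sketch: it scans only the forward strand, treats ATG as the only start codon, and uses the standard genetic code. The names `find_orfs`, `translate`, and `min_codons` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) intervals of ORFs on the forward strand.

    Intervals are 0-based and end-exclusive, and include the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def translate(orf: str) -> str:
    """Translate an ORF into one-letter amino acids, excluding the stop codon."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 3, 3))


dna = "CCATGGCTAAATGACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end], translate(dna[start:end]))
```

Real gene finders also scan the reverse complement and, in eukaryotes, must account for introns, so an ORF scan like this is only a first approximation.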

Chromosomes • A chromosome is a long and tightly wound DNA string (visible under a microscope) • Chromosomes can be linear or circular • Prokaryotes usually have a single chromosome, often a circular DNA molecule • Eukaryotic chromosomes appear in pairs (diploid), with one member of each pair inherited from each parent • Homologous chromosomes carry the same genes • Some genes are the same in both parents • Some genes appear in different forms called alleles, e.g., the human blood type gene has three alleles: A, B, and O • All genes are present in all cells, but a given cell type expresses only a small portion of the genes

Chromosomal Location

Genome • The genome is formed by one or more chromosomes • A genome is the entire set of DNA contained in a cell • A diploid human cell has 46 chromosomes (23 pairs) • The haploid human genome is about 3 billion base pairs long

Genome Sequences Retrieved on 1/8/2012 http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

Genome Sequence Sizes • DNA sequence size is measured in base pairs (bp)
Organism                                Genome size (bp)
Phage phiX174                                      5,386
HIV                                                9,193
SARS                                              29,751
Haemophilus influenzae (bacteria)              1,830,000
Escherichia coli K12                           4,600,000
Saccharomyces cerevisiae (yeast)              12,500,000
Drosophila melanogaster (fruit fly)          180,000,000
Arabidopsis thaliana (thale cress)           125,000,000
Homo sapiens (human)                       3,000,000,000

The Whole Picture

Genomics • The definition of genomics varies from person to person • Genomics involves large data sets (whole genome sequences) and high-throughput methods (DNA sequencing technologies) • Genetics research, by contrast, focuses on one gene or a small set of genes • Genomics may or may not include other specific research areas, such as proteomics, transcriptomics, variomics, metabolomics, etc. • In this course, genomics includes DNA sequence analysis, genomic variations, gene expression, and proteomics.

Topics in This Course • Molecular Biology Databases • Sequence Alignment • BLAST Search • Genome Browser • Gene Finding Methods • Genomic Variations and Disease • Protein and RNA Secondary Structure • High-throughput Technologies

Molecular Biology Databases

Important Databases • Genome: NCBI; European Molecular Biology Laboratory (EMBL); DNA Data Bank of Japan (DDBJ); GO (Gene Ontology) Consortium of databases: FlyBase, Mouse Genome Database (MGD) • Protein: Protein Data Bank (PDB); EMBL-EBI (European Bioinformatics Institute); UniProt, ExPASy, Swiss-Prot • KEGG: Kyoto Encyclopedia of Genes and Genomes

NCBI (www.ncbi.nlm.nih.gov) • NCBI: National Center for Biotechnology Information • Established in 1988 as a national resource for molecular biology information • NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information • Databases: GenBank, dbSNP, RefSeq, etc.; PubMed, OMIM, MMDB, UniGene; the Taxonomy Browser • Tools: BLAST, Cn3D, etc. • Entrez is NCBI’s search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data

PDB (www.pdb.org) • The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids • Understanding the shape of a molecule helps to understand how it works • The PDB was established in 1971 at Brookhaven National Laboratory and originally contained 7 structures • In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the PDB • PDB provides sequences, atomic coordinates, derived geometric data, secondary structure, annotations, and literature references

KEGG • KEGG: Kyoto Encyclopedia of Genes and Genomes • Contains pathway information as well as gene, genome, disease, and drug data (as of 1/10/2011): • KEGG PATHWAY: 126,336 pathways generated from 379 reference pathways • KEGG GENES: 6,121,933 genes in 139 eukaryotes + 1,144 bacteria + 94 archaea • KEGG GENOME: 1,508 organisms • KEGG DISEASE: 375 diseases • KEGG DRUG: 9,316 drugs

Sequence Alignment

Sequence Similarity • Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. • Identity: The extent to which two sequences are invariant. • Conservation: Changes at a specific position of a DNA or amino acid sequence that preserve the properties of the original residue. • The distance between two sequences, based on an evolutionary model, estimates how long ago the two sequences shared a common ancestor
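To make "identity plus conservation" concrete, here is a minimal sketch that scores two already-aligned protein fragments. The amino acid grouping in `GROUPS` is a hypothetical, simplified grouping by side-chain property chosen for this example; real tools use substitution matrices instead. The choice to divide by the full alignment length is also an assumption.

```python
# Hypothetical, simplified grouping of amino acids by side-chain property.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(a: str, b: str, gap: str = "-") -> tuple[float, float]:
    """Return (identity, similarity) as fractions of the alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = conserved = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gap columns count toward the length but never match
        if x == y:
            identical += 1
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            conserved += 1  # different residues with similar properties
    return identical / len(a), (identical + conserved) / len(a)


print(identity_and_similarity("GTWYAMAK", "GTWYSLAM"))  # (0.625, 0.75)
```

In this example five of eight positions are identical, and the M/L pair counts as a conservative change under the assumed grouping, so similarity is higher than identity.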

Sequence Alignment • Sequence alignment is the procedure of comparing two or more DNA or protein sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. • Given two sequences A and B, an alignment is a pair of sequences A’ and B’ such that: 1. A’ is obtained from A by inserting gap characters ‘-’ 2. B’ is obtained from B by inserting gap characters ‘-’ 3. A’ and B’ have the same length: |A’|=|B’| 4. No position has gap characters in both A’ and B’ • Example:
A  = ATGGCT
B  = TGCTA
A’ = ATGGCT-
B’ = -TG-CTA
• Goal: given two sequences, find the “best” alignment according to some scoring function
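The four conditions in the definition, and the idea of a scoring function, can be checked directly in code. In this sketch the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values given on the slide.

```python
def is_valid_alignment(a: str, b: str, a_aln: str, b_aln: str, gap: str = "-") -> bool:
    """Check the four conditions that define an alignment of a and b."""
    return (
        a_aln.replace(gap, "") == a          # 1. a_aln is a with gaps inserted
        and b_aln.replace(gap, "") == b      # 2. b_aln is b with gaps inserted
        and len(a_aln) == len(b_aln)         # 3. same length
        and all(not (x == gap and y == gap)  # 4. no column with two gaps
                for x, y in zip(a_aln, b_aln))
    )


def alignment_score(a_aln: str, b_aln: str, match: int = 1, mismatch: int = -1,
                    gap_penalty: int = -2, gap: str = "-") -> int:
    """Sum a simple per-column score (assumed scoring scheme)."""
    total = 0
    for x, y in zip(a_aln, b_aln):
        if x == gap or y == gap:
            total += gap_penalty
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(is_valid_alignment("ATGGCT", "TGCTA", "ATGGCT-", "-TG-CTA"))  # True
print(alignment_score("ATGGCT-", "-TG-CTA"))                        # -2
```

The example alignment has four matching columns and three gap columns, which gives 4 - 6 = -2 under the assumed scores.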

Types of Sequence Alignment • Pairwise alignment – compare two sequences • Multiple alignment – compare three or more sequences • Each of the above can be: • Local alignment – compare similar parts of the sequences • Global alignment – compare the whole sequences • Different types of alignment rest on different assumptions and use different methods

Global Alignment vs. Local Alignment • Local alignment: finds continuous or gapped high-scoring regions which do not span the entire length of the sequences being aligned • Global alignment: finds the optimal full-length alignment between the two sequences being aligned
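The difference between the two modes shows up in how the dynamic programming table is initialized and where the score is read. The following score-only sketch uses an assumed scoring scheme (match +1, mismatch -1, gap -2); the function name and the `local` flag are illustrative.

```python
def best_alignment_score(a: str, b: str, local: bool, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global or local alignment (score only, no traceback)."""
    cols = len(b) + 1
    # Global: leading gaps are penalized. Local: an alignment may start anywhere.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)     # a local alignment can restart at zero
                best = max(best, cell)  # and can end at any cell
            curr[j] = cell
        prev = curr
    # Global alignments must end in the bottom-right cell.
    return best if local else prev[-1]


print(best_alignment_score("ATGGCT", "TGCTA", local=False))  # -2
print(best_alignment_score("ATGGCT", "TGCTA", local=True))   # 3
```

With these scores the best local alignment is the shared substring GCT (score 3), while the best global alignment must also pay for the unmatched ends and internal gap.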

Pairwise Alignment • The process of lining up two sequences to achieve maximal levels of identity/similarity for the purpose of assessing the degree of similarity and the possibility of homology. • It is used to decide if two genes are structurally or functionally related • It is used to identify domains or motifs that are shared between proteins • It is used in the analysis of genomes

An Example of Pairwise Alignment
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 LAC
 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 LAC
 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 LAC
• In Ssearch format, a symbol between the two sequences marks each aligned column: • Bar: identical; One dot: somewhat similar; Two dots: very similar • Dots within a sequence: gaps

Multiple Sequence Alignment • Multiple sequence alignment is an alignment of three or more sequences such that each column of the alignment is an attempt to represent the evolutionary changes in one sequence position, including substitutions, insertions, and deletions. • It is believed that over time the functional components embedded within the sequences are conserved in order to retain function • One of the most important elements of sequences is the phylogenetic information that similarities represent • The sequence similarities give insight into the evolution of families of protein or DNA sequences

An Example of Multiple Sequence Alignment
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA
fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST
fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
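One simple way to see which positions of a multiple alignment are conserved is to measure, column by column, how often the most common residue occurs. The sketch below is an illustration only: the measure, the function name, and the three short rows (fragments in the style of the example, with "." as the gap character) are assumptions chosen for demonstration.

```python
from collections import Counter


def column_conservation(rows: list[str], gap: str = ".") -> list[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    conservation = []
    for column in zip(*rows):
        counts = Counter(residue for residue in column if residue != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        conservation.append(top / len(column))
    return conservation


rows = ["SAD.APM", "SKDNTPM", "SS.TAPM"]
for position, value in enumerate(column_conservation(rows), start=1):
    print(position, round(value, 2))
```

Columns where every sequence has the same residue score 1.0; columns with gaps or several different residues score lower.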

Evolutionary Basis of Sequence Comparison • The simplest molecular mechanisms of evolution are substitution, insertion, and deletion • If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions • Residues that are aligned with a gap in the other sequence represent insertions or deletions

Homology • Homology: Similarity attributed to descent from a common ancestor. • There are two types of homologs: orthologs and paralogs • Orthologs: • Homologous sequences in different species that arose from a common ancestral gene during speciation • May or may not be responsible for a similar function • Members of a gene family in various organisms • Paralogs: • Homologous sequences within a single species that arose by gene duplication • Members of a gene family within a species • Genes either are homologous, or they are not. There are no degrees of homology

BLAST Search

Similarity Search • Find statistically significant matches to a protein or DNA sequence of interest. • Obtain information on inferred function of the gene • Sequence alignment algorithms • Dynamic Programming • Needleman-Wunsch Global Alignment (1970) • Smith-Waterman Local Alignment (1981) • Guaranteed to find the best alignment • Slow, especially when searching against a large database
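A compact sketch of Needleman-Wunsch global alignment with traceback is shown below. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -2) chosen for illustration. The table has (len(a)+1) x (len(b)+1) cells, which is why this exact approach becomes slow against a large database.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    a_aln, b_aln = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + pair:
            a_aln.append(a[i - 1]); b_aln.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            a_aln.append(a[i - 1]); b_aln.append("-"); i -= 1
        else:
            a_aln.append("-"); b_aln.append(b[j - 1]); j -= 1
    return table[n][m], "".join(reversed(a_aln)), "".join(reversed(b_aln))


score, a_aln, b_aln = needleman_wunsch("ATGGCT", "TGCTA")
print(score)
print(a_aln)
print(b_aln)
```

When several alignments share the best score, the traceback order decides which one is returned, so the output may differ from another equally good alignment of the same two sequences.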

FASTA and BLAST Sequence Alignment Heuristics • FASTA and BLAST are heuristic approximations to Smith-Waterman • They are fast, and their results are comparable to those of the Smith-Waterman algorithm • FASTA and BLAST also calculate the statistical significance of the alignments in the search results
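Much of the speed of these heuristics comes from first locating short exact word matches ("seeds") and only examining the regions around them. The sketch below illustrates only that seeding idea in a heavily simplified form; it is not the actual FASTA or BLAST implementation, it omits the extension and scoring steps, and the function names and word size `k` are assumptions.

```python
from collections import defaultdict


def index_words(subject: str, k: int) -> dict[str, list[int]]:
    """Map every length-k word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


print(find_seeds("TGCTA", "ATGGCT", k=3))  # [(1, 3)]
```

Because the subject index is built once and looked up by word, the search avoids filling a full dynamic programming table for every database sequence, at the cost of possibly missing alignments that contain no exact word match.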

BLAST • Basic Local Alignment Search Tool: A sequence comparison algorithm, optimized for speed, used to search sequence databases for optimal local alignments to a query. • Expect value (E) • The number of matches expected to occur randomly with a given score. • The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. • The lower the E value, the more significant the match. • The Expect value can be any positive real number.
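The Expect value is commonly described with the Karlin-Altschul formula E = K · m · n · e^(−λS), where S is the alignment score, m the query length, n the database length, and K and λ are constants that depend on the scoring system. The sketch below evaluates that formula; the values used for `k` and `lam` are illustrative placeholders, not parameters taken from BLAST.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder constants for illustration only.
LAM, K = 0.3, 0.1

for s in (30, 40, 50):
    print(s, expect_value(s, query_len=100, db_len=1_000_000, lam=LAM, k=K))
```

The formula shows why E falls exponentially as the score rises and grows in proportion to the database size, so the same score is less significant in a larger database.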
