Genome Analysis Lecture 14
Introduction
• A major application of bioinformatics is the analysis of the full genomes of organisms that have been sequenced
• Traditional genetics has focused on understanding the role of a particular gene or protein in a biological process
• The availability of genome sequences provides the sequences of all the genes of an organism
• Important genes influencing metabolism, cellular differentiation and development, and disease processes in animals can be identified, and the relevant genes manipulated
• The challenge is to identify those genes that are predicted to have a particular biological function
Genomic Sequences
• The availability of genome sequences facilitates the discovery and utilization of sequence polymorphisms used to trace genes among individuals in a population
• Some types of genetic variation are best understood at the genome-wide level
• The availability of genome sequences provides an opportunity to explore genetic variability both between organisms and within the individual organism
• Web resources - ch_10_t_1.html
Prokaryotic Genomes
• The genomes of 31 prokaryotic organisms have been sequenced
• Organisms were selected on the basis of three criteria:
  • They had been subjected to a good deal of biological analysis and thus were model prokaryotic organisms
  • They were important human pathogens (for example, Mycobacterium tuberculosis and Mycoplasma pneumoniae)
  • They were of phylogenetic interest
• Sequences were annotated as they were sequenced
Gene Structure Varies
Steps of Genome Analysis
• Assemble the genome sequence
• Identify repetitive sequences and mask them out
• Gene prediction: train a model for each genome
• Look for EST and cDNA sequences
• Genome annotation
• Microarray analysis
• Metabolic pathways and regulation
• Protein 2D gel electrophoresis
• Functional genomics
• Gene location/gene map
• Self-comparison of proteome
• Comparative genomics
• Identify clusters of functionally related genes
• Evolutionary modeling
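The repeat-masking step in the list above can be sketched in a few lines of Python. This is only an illustration: the function names and the example intervals are hypothetical, and in practice the repeat coordinates would come from a dedicated repeat-finding program. The sketch shows the two common conventions, soft masking (lower-casing the repeat bases) and hard masking (replacing them with N).

```python
def soft_mask(sequence, repeat_intervals):
    """Lower-case bases inside each (start, end) interval.

    Intervals are 0-based and end-exclusive (an assumption of this sketch).
    """
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


def hard_mask(sequence, repeat_intervals, mask_char="N"):
    """Replace bases inside each (start, end) interval with mask_char."""
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Hypothetical example: one repeat covering positions 2, 3 and 4.
example = "ACGTACGTAC"
soft = soft_mask(example, [(2, 5)])
hard = hard_mask(example, [(2, 5)])
```

Soft masking keeps the original bases recoverable, which is useful when a later step (such as gene prediction) should ignore repeats for seeding but still see the sequence.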
Comparative Genomics
• Includes a comparison of gene number, gene content and gene location in both prokaryotic and eukaryotic groups of organisms
• The availability of genome sequences makes possible a comparison of all the proteins (the proteome) encoded by one organism with those of another
• Genes in two organisms that are so similar that they must have the same function and evolutionary history are orthologs
• Two or more proteins in the same proteome that share a high degree of similarity because they share the same set of domains are likely to be paralogs
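One common heuristic for proposing ortholog pairs from an all-against-all proteome comparison is the reciprocal best hit: protein a in organism A and protein b in organism B are paired when each is the other's top-scoring match. The slides do not name this method, so treat the sketch below as an assumption. The protein names and similarity scores are hypothetical; in practice the scores would come from a sequence-similarity search.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between fly and worm proteins.
fly_vs_worm = {
    ("flyA", "wormX"): 250,
    ("flyA", "wormY"): 90,
    ("flyB", "wormY"): 120,
    ("flyC", "wormX"): 200,
}
worm_vs_fly = {
    ("wormX", "flyA"): 250,
    ("wormX", "flyC"): 200,
    ("wormY", "flyB"): 120,
    ("wormY", "flyA"): 90,
}
pairs = reciprocal_best_hits(fly_vs_worm, worm_vs_fly)
```

Here flyC also matches wormX, but wormX prefers flyA, so flyC is left unpaired. A second protein within one proteome that hits the same partner is the kind of case where a paralog relationship would be investigated.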
Comparative Genomics of Eukaryotes
• Drosophila has a core proteome only twice the size of that of yeast
• The complexity apparent in metazoans is not achieved by sheer number of genes
• Despite the large differences between fly and worm in terms of development and morphology, they use a core proteome of similar size
• Comparative analysis of the predicted proteins encoded by these genomes suggests that nearly 30% of fly genes have putative orthologs in the worm
• There are some signs that the Drosophila proteome is more similar to mammalian proteomes than are those of worm or yeast
• Some of the human disease genes absent in Drosophila reflect clear differences in physiology between the two organisms, for example the hemoglobins
• The population of multidomain proteins is larger and more diverse in the fly than in the worm
• The fly genome sequencing effort has revealed a number of previously unknown counterparts to human genes involved in cancer and neurological disorders
Functional Classifications of Genes
• Classify annotated genes by function
• An early classification scheme for E. coli genes, based on sequence similarity, included categories for enzymes, transport elements, regulators, membranes, structural elements, protein factors, leader peptides and carriers
• Another classification scheme is based on biochemical activity
• Can also classify proteins that physically interact in a structure or biochemical pathway
Physical Mapping Databases
• Access to maps produced by multiple groups is available at NCBI, which attempts to integrate several genetic and physical maps with DNA and protein-sequencing information: http://www.ncbi.nlm.nih.gov/Entrez/
• The Genome Data Base (GDB) is limited to human data and contains no sequence data: http://gdbwww.gdb.org
• The Whitehead Institute is the primary source of two genome-wide physical maps, an STS content map of more than 10,000 markers assigned to YACs and a radiation hybrid map of 12,000 markers: http://www.genome.wi.mit.edu
Structural Genomics
• Full understanding of the biological role of the proteins identified in genomes will require knowledge of their structure and function
• Structural genomics of single proteins, combined with protein structure prediction, may contribute substantially to efficient structural characterization of large macromolecular assemblies
• The structure of most proteins will be modeled, not determined by experiment
• Enough protein structures will need to be determined that most of the remaining sequences are related to at least one known structure at higher than 30% sequence identity
• The focus on proteins will be moving from structural genomics to functional genomics
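The 30% sequence identity threshold mentioned above can be made concrete with a small percent-identity calculation over a pairwise alignment. This is a sketch under stated assumptions: the two sequences are already aligned to equal length, gaps are written as "-", and columns containing a gap are excluded from the denominator. Other definitions of percent identity use a different denominator, and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def can_be_modeled(aligned_target, aligned_template, threshold=30.0):
    """True if identity to the template is above the threshold from the slide."""
    return percent_identity(aligned_target, aligned_template) > threshold


# Hypothetical aligned fragments: 5 identical residues in 6 gap-free columns.
identity = percent_identity("ACDE-FG", "ACDQWFG")
```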
Human Genome Project Facts
• Since it began in 1990, the HGP is estimated to have cost $3 billion
• A rough draft of the human genome was completed in June 2000. The final draft is expected sometime in 2003
• For the HGP, researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few samples were processed as DNA resources. Neither the donors nor the scientists know whose DNA is being sequenced
• The genome from Celera was based on DNA samples from 5 donors who identified themselves only by race and sex
• 97% of the DNA in the human genome consists of non-coding sequences
• Human DNA is 98 percent identical to chimpanzee DNA
• The average amount of difference between any two humans is 0.2 percent
• Humans have approximately 30,000 genes, the roundworm has 19,098 genes, the fruit fly has 13,602 genes and yeast has 6,034 genes
More HGP Facts
• The human genome is the largest genome to be extensively sequenced
• The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate
• Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage
• Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominoid lineage
• Segmental duplication is much more frequent in humans than in yeast, fly or worm
• The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males
• More than 1.4 million single nucleotide polymorphisms (SNPs) have been identified
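Two of the landscape features listed above, GC content and CpG islands, are simple to compute from sequence. The sketch below calculates the GC fraction of a sequence and the ratio of observed to expected CpG dinucleotides. The observed/expected formula, (number of CpG × length) / (number of C × number of G), is a widely used convention that the slides do not state, so it is an assumption here, as are the function names.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(seq) / (c_count * g_count)


def windows(seq, size, step):
    """Yield (start, subsequence) for sliding windows along seq."""
    for start in range(0, max(len(seq) - size + 1, 0), step):
        yield start, seq[start:start + size]
```

Scanning a chromosome with `windows` and flagging windows where both values are high is the usual shape of a CpG island search. The specific cut-offs differ between published definitions, so none are hard-coded here.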
Background to the HGP
• The HGP arose from two insights that emerged in the 1980s:
  • The ability to take a global view of genomes could greatly accelerate biomedical research
  • Creation of a global view would require a communal effort
• Sequencing of bacterial viruses and the human mitochondrion between 1977 and 1982 proved the feasibility of assembling small sequence fragments into complete genomes
• The program to create a human genetic map, to make it possible to locate disease genes based solely on their inheritance patterns
• The programs to create physical maps of clones covering the yeast and worm genomes, to allow isolation of genes and regions based solely on their chromosomal position
• The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery (ESTs)
Timeline of Large-Scale Genomic Analysis
Technology for Large-Scale Sequencing
• Laboratory innovations included four-color fluorescence-based sequence detection, improved fluorescent dyes, dye-labeled terminators, polymerases specifically designed for sequencing, cycle sequencing and capillary gel electrophoresis
• There were important advances in the development of software packages for the analysis of sequence data
• PHRED makes it possible to monitor raw data quality and also assists in determining whether two similar sequences truly overlap
• PHRAP systematically assembles the sequence data using the base-quality scores from PHRED
• Another key innovation for scaling up sequencing was the development by several centers of automated methods for sample preparation. This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems
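The base-quality scores that PHRED produces are on a logarithmic scale: a quality Q corresponds to an estimated error probability P through Q = -10 · log10(P). The short sketch below converts between the two and sums the per-base probabilities to estimate the number of errors in a read. The function names and the example quality values are hypothetical, not part of any PHRED interface.

```python
import math


def phred_to_error(quality):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def error_to_phred(probability):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(probability)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read (sum of per-base P)."""
    return sum(phred_to_error(q) for q in qualities)


# Hypothetical read with three bases of quality 10, 20 and 30.
read_qualities = [10, 20, 30]
errors = expected_errors(read_qualities)
```

On this scale Q20 means roughly one error in 100 base calls and Q30 one in 1,000, which is why an assembler can weight apparent discrepancies between overlapping reads by their quality.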
Human Sequence in the High Throughput Sequence Division of GenBank
Genome Browser: http://genome.ucsc.edu/
Classes of Interspersed Repeats
Gene Content of Human Genome
• Genes (or at least their coding regions) comprise only a tiny fraction of human DNA, but they represent the major biological function of the genome and the main focus of interest by biologists
• Human genes tend to have small exons (encoding an average of only 50 codons) separated by long introns (some exceeding 10 kb)
• This creates a signal-to-noise problem, with the result that computer programs for direct gene prediction have only limited accuracy
• Computational prediction of human genes must rely largely on the availability of cDNA sequences or on sequence conservation with genes and proteins from other organisms
• This approach is adequate for strongly conserved genes (such as histones or ubiquitin), but may be less sensitive to rapidly evolving genes (including many crucial to speciation, sex determination and fertilization)
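A little arithmetic shows the scale of the signal-to-noise problem described above. An average exon of 50 codons is 150 bp, so in a gene whose introns are each 10 kb the coding sequence is a small fraction of the locus. The gene in this sketch (8 exons, 7 introns of 10 kb) is hypothetical and chosen only to illustrate the calculation. Only the 50-codon and 10 kb figures come from the slide.

```python
BASES_PER_CODON = 3


def exon_length_bp(codons):
    """Length in base pairs of an exon encoding the given number of codons."""
    return codons * BASES_PER_CODON


def coding_fraction(n_exons, codons_per_exon, intron_length_bp):
    """Fraction of a gene's span that is exon, with n_exons - 1 equal introns."""
    exon_bp = n_exons * exon_length_bp(codons_per_exon)
    intron_bp = max(n_exons - 1, 0) * intron_length_bp
    total = exon_bp + intron_bp
    return exon_bp / total if total else 0.0


# Hypothetical gene: 8 exons of 50 codons separated by 7 introns of 10 kb.
fraction = coding_fraction(n_exons=8, codons_per_exon=50, intron_length_bp=10_000)
```

For this hypothetical gene the exons total 1,200 bp in a span of 71,200 bp, under 2% of the locus, so a gene finder must pick short coding segments out of a much longer non-coding background.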
Characteristics of Human Genes
Applications to Medicine
• A key application of human genome research has been the ability to find disease genes of unknown biochemical function by positional cloning
• This method involves mapping the chromosomal region containing the gene by linkage analysis in affected families and then scouring the region to find the gene itself
• The human genomic sequence in public databases allows rapid identification in silico of candidate genes, followed by mutation screening of relevant candidates, aided by information on gene structure
• For a Mendelian disorder, a gene search can now often be carried out in a matter of months with only a modestly sized team
Drug Targets
• A recent compendium lists 483 drug targets as accounting for virtually all drugs on the market
• Only a minority of human genes may be drug targets. It has been predicted that the number will exceed several thousand, and this prospect has led to a massive expansion of genomic research in pharmaceutical research and development
• Serotonin receptors: mood disorders and schizophrenia
• Leukotriene pathway: asthma
• Amyloid precursor protein: Alzheimer's disease
Next Steps
• Finishing the human sequence
• Developing the Integrated Gene Index (IGI) and Integrated Protein Index (IPI)
• Large-scale identification of regulatory regions
• Sequencing of additional large genomes
• Completing the catalogue of human variation
• From sequence to function
Future Technology Development
• Functional genomics: aims to understand how genes are regulated and what they do, largely through massively parallel studies of gene expression in a variety of tissues
• Proteomics: promises to make the identity of each protein known and to elucidate protein-protein interactions
• Bioinformatics: enhances the ability of researchers to collect, manipulate and analyze data more quickly and in new ways