Discover the merging of mathematics and biology in computational management of biological information. Explore applications such as sequence assembly, gene prediction, and genome annotation. Unravel the functions of genes and study genome evolution with Structural Genomics techniques.
Bioinformatics ICES 2006 Introduction Revised 29/09/06
Interesting book • Bioinformatics (Sequence And Genome Analysis) • Type : Paperback • Publisher : Cold Spring Harbor Laboratory Press • Publication date : 05/09/2001 • Weight : 1606 gr. • Pages : 560 • Format : 27.99 x 20.52 x 2.92 cm • Number of books : 1 • ISBN : 0879696087
Introduction BIOINFORMATICS • What is Bioinformatics • What is annotation • What is a high throughput measurement • What is Systems Biology • What is the future of molecular biology and bioinformatics • What is the impact of bioinformatics on industry
What is Bioinformatics Computational management of all kinds of biological information (computational biology) • Organization of biological information (databases) • Analyzing biological data • Heterogeneous research field with many subfields • Alignments • Phylogeny • Protein structure modeling,…
What is Bioinformatics • The merger of mathematics and biology is not new • Phylogeny, molecular modeling, population genetics • Acquired new attention since 1994, with the invention of the term “bioinformatics” • First usage of “bioinformatics”: Trends Biotechnol 1993, Ann N Y Acad Sci 1993
Bioinformatics: driving force • Development of new technologies in molecular biology • Genetic entities are analyzed simultaneously at high throughput level (genomics, transcriptomics, translatomics, interactomics, metabolomics) • Information flow poses challenges • IT management • Data integration/data mining • Impact on Molecular Biology • Changes the way of biological thinking, as will be illustrated • Gave rise to different disciplines at the intersection between bioinformatics and molecular biology
Subfields in bioinformatics Structural genomics Functional genomics Comparative genomics Molecular modeling (not in this course)
Structural Genomics/Annotation Comparative Genomics/ evolutionary biology Functional genomics/ Systems Biology
Structural Genomics • Input: raw sequence data • Applications: sequence assembly; gene, promoter, and splice site prediction • Biological goal: annotation
Structural Genomics Genome Assembly Distinct methods to sequence a genome Based on a physical map (top down): • Cut genome into pieces • Subclone sequence in BACs • Automated laboratory procedure to screen for overlapping fragments (contigs) and to produce physical map • Identify unique overlapping clones and subclone • Sequence and assemble • Method used for complex genomes e.g. Human Genome Consortium
Structural Genomics Top down sequencing 1. Genome fragmentation 2. BAC library 3. Physical map 4. Subclone library
Structural Genomics Top down sequencing 5. Genome assembly
Structural Genomics Genome Assembly Shotgun sequencing (bottom up) • Fragment genome into long (10000 bp) and short (2000 bp) pieces • Generate plasmid libraries • Derive sequence from overlaps in large numbers of random sequences (500 bp from each end to create overlaps) • Assemble the sequences without the guide of a physical map. Contigs are assembled in the computer based on an alignment of all possible sequence pairs • Method used by Celera Genomics (C. Venter)
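The overlap-based assembly step can be illustrated with a toy greedy assembler. This is a minimal sketch, not the procedure used by any real assembler: it assumes error-free reads from a single strand, uses exact suffix/prefix matches instead of alignments, and the function names `overlap` and `greedy_assemble` as well as the `min_len` threshold are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Stops when no pair overlaps by at least `min_len`; what remains
    is the list of contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        # Compare all ordered pairs, as in "all possible sequence pairs".
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real genomes contain repeats and sequencing errors, so the greedy choice can merge the wrong reads; the sketch only shows why overlapping random fragments are enough to reconstruct a sequence without a physical map.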
Shot Gun Sequencing 1. Genome fragmentation 2. Library 3. Sequences 4. Genome assembly Structural Genomics
Structural Genomics Annotation Bioinformatics research as a whole aims at • unraveling functions of novel genes • studying the evolution of genomes As genomes are collected they need to be annotated. This means that we have to • identify the location of the genes on the genome (structural annotation) • assign a function to each of the potential genes (functional annotation)
Structural Genomics Structural annotation
Structural Genomics Structural annotation Ab initio gene prediction (cont.) • Feature extraction step: statistically significant features are extracted from the training set • Model construction: based on the extracted features, a model (HMM, SVM, neural networks) is constructed to be used for the prediction in the next step • Prediction: genes are predicted according to the model obtained
Structural Genomics Structural annotation Ab initio gene prediction Uses sequence properties only • Codon usage, splice site recognition
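How codon usage can serve as a sequence property for ab initio prediction can be sketched as follows. This is an illustrative toy, not the model of any real gene finder: codon frequencies are estimated from a training set of known coding sequences, and a candidate sequence is scored by its log-odds against a uniform background of 1/64 per codon. The function names, the uniform background, and the `pseudo` frequency for unseen codons are assumptions made for the example.

```python
from collections import Counter
from math import log


def codon_counts(seq: str) -> Counter:
    """Count codons in reading frame 0, ignoring a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(training_cds):
    """Feature extraction: relative codon frequencies over a training set."""
    total = Counter()
    for cds in training_cds:
        total.update(codon_counts(cds))
    n = sum(total.values())
    return {codon: count / n for codon, count in total.items()}


def coding_score(seq: str, freqs, pseudo: float = 1e-3) -> float:
    """Prediction: log-odds of `seq` under the codon model versus a
    uniform background (1/64 per codon). Positive means codon usage
    resembles the training set."""
    score = 0.0
    for codon, count in codon_counts(seq).items():
        score += count * log(freqs.get(codon, pseudo) / (1 / 64))
    return score


# Toy training set (made up): one short coding sequence.
freqs = codon_frequencies(["ATGGCCGCCGCCTAA"])
print(coding_score("GCCGCC", freqs))  # codons frequent in training
print(coding_score("TTTTTT", freqs))  # codons never seen in training
```

Gene finders combine such content scores with signal models (for example splice sites) inside an HMM or another classifier; the sketch covers only the codon usage part.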
Structural Genomics Chromosome mapping • Construction of chromosome maps (LocusLink) • Study of chromosome rearrangements
Structural Genomics Case study 1 Sequencing projects
Sequence data • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome • http://www.ensembl.org/index.html • http://www.cbs.dtu.dk/services/GenomeAtlas/ • http://www.ncbi.nlm.nih.gov/Genomes/index.html
Structural Genomics Structural annotation: location of the genes, the introns, the exons, splice sites, the promoters, the repeated elements • Human genome • Nuclear genome: 3300 Mb, ~80 000 genes • Mitochondrial genome: 16.6 kb, 37 genes (two rRNA genes, 22 tRNA genes, 13 polypeptide-encoding genes) • Nuclear genome: genes and gene-related sequences (~25%), extragenic DNA (~75%) • Genes and gene-related sequences: coding DNA (~10%), noncoding DNA (~90%: pseudogenes, gene fragments, introns, untranslated sequences, etc.) • Extragenic DNA: unique or low copy number (~60%), moderate to highly repetitive (~40%: tandemly repeated or clustered repeats, interspersed repeats)
… also other organisms… • Timeline: 1998, 2000, 2002 • 2002: Rat & Rice • Genome of SARS, April 2003 (3 weeks!)
Chimpanzee genome The human and chimp genomes are about 98.8% identical. Where do the dramatic behavioural and phenotypic differences that have arisen since their divergence 7 million years ago come from? Blood samples from a single chimp, called Clint, provided 98% of the genome data. • Genes involved in smell and hearing are significantly different between humans and chimpanzees • Changes in regulatory binding sites might have contributed to the divergence between both species
Chimpanzee genome Donaldson et al., 2006 Genome Biol
Structural Genomics/Annotation Comparative Genomics/ evolutionary biology Functional genomics/ Systems Biology
Comparative Genomics • Input: annotated sequences • Applications: Blast, ClustalW, tree construction • Biological goal: explaining evolution, metagenomics
Comparative Genomics Comparison of sequences between genomes Based on sequence alignment tools • Aid in gene prediction: extrinsic gene prediction • Homology based prediction of gene function • Study of protein families (evolutionary modeling, duplications, see later in course) • Phylogenetic footprinting (see later in course)
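As an illustration of what a sequence alignment tool computes, the sketch below implements a basic global alignment (Needleman-Wunsch dynamic programming) and a percent-identity measure. It is not how Blast or ClustalW work internally; the scoring values (match +1, mismatch -1, linear gap -2) and the function names are arbitrary choices for the example.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Global alignment of `a` and `b`; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(x: str, y: str) -> float:
    """Identical, non-gap columns as a percentage of the alignment length."""
    if not x:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * same / len(x)


x, y, s = needleman_wunsch("GATTACA", "GATACA")
print(x)
print(y)
print(s, round(percent_identity(x, y), 1))
```

Homology-based function prediction and extrinsic gene prediction both rest on scores of this kind: a high-scoring alignment to a characterized sequence is taken as evidence of shared ancestry.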
Extrinsic gene prediction Fielden et al. 2002
Comparative Genomics Extrinsic gene prediction Introns and splicing
Comparative Genomics Extrinsic gene prediction
Comparative Genomics Homology based function prediction • Primary sequence • Homologs in related organisms • Families of proteins • Multiple sequence alignment • Features characteristic for the protein family
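One simple way to derive features characteristic for a protein family from a multiple sequence alignment is to look for columns that are identical in every member. The sketch below assumes the alignment is already computed and given as equal-length strings with "-" for gaps; the sequences and the function name `conserved_columns` are made up for illustration.

```python
def conserved_columns(alignment):
    """Return the 0-based indices of columns where every sequence has
    the same residue and that residue is not a gap."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Toy alignment of three hypothetical family members.
alignment = [
    "MKV-LHG",
    "MRVALHG",
    "MKVSLHA",
]
cols = conserved_columns(alignment)
print(cols)
print("".join(alignment[0][i] for i in cols))
```

Family databases use softer measures than strict identity (profiles, position-specific scores), but the idea is the same: positions preserved across homologs are candidates for functionally important residues.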
Comparative Genomics Study of evolution • Ancestral gene: Function 1 • Gene duplication: Copy 1 (Function 1), Copy 2 (Function 1) • Over time: Copy 1 (Function 1), Copy 2 (New function!)
Comparative Genomics Study of evolution • Figure: vertebrate tree with birds (reptiles), mammals, amphibians, ray-finned fish • Ray-finned fish: 1 genome duplication? • Vertebrates: 1, 2, 3 genome duplications?
Comparative Genomics Case study 1 Metagenomics
Metagenomics Many species are difficult to study in isolation because they fail to grow in laboratory culture, depend on other organisms for critical processes, or have become extinct. Metagenomics: DNA can be isolated directly from living or dead cells in various contexts and directly sequenced (shot gun sequencing)
Sargasso Sea • Boston (04/16/04): This spring, J. Craig Venter is sailing around the French Polynesian Islands scooping up bucketfuls (figuratively) of seawater in an ambitious voyage to sample microbial genomes found in the world's oceans. His 95-foot yacht, Sorcerer II, has been outfitted with all manner of technical equipment to accommodate the task, as well as a few surfboards should that opportunity arise. • 5% of NCBI consists of Sargasso sequences (go to NCBI and type "sargasso") • Posed challenges to genome assembly • Allows building environmental fingerprints • 70,000 entirely novel genes, from an estimated 1,800 genomic species, including 148 novel bacterial phylotypes.
Comparative genomics Case study 2 Genome evolution (Yves Van de Peer)
Gene duplications • Ancestral gene: Function 1 • Gene duplication: Copy 1 (Function 1), Copy 2 (Function 1) • Over time: Copy 1 (Function 1), Copy 2 (New function!)
Sub-, neo-, and ‘nonfunctionalization’ • Duplicated genes: regulatory regions and protein coding domain • Neofunctionalization (Ohno) • Subfunctionalization of the coding region: loss of one subfunction, gene preservation by subfunctionalization • Gene loss by nonfunctionalization • Van de Peer et al.
Genome scale duplications (over time) • Before duplication: A 1 2 3 4 5 6 7 8 9 10 11 • After duplication: A 1 2 3 4 5 6 7 8 9 10 11 and B 1 2 3 4 5 6 7 8 9 10 11 • Gene loss, rearrangements, translocation, etc. … • Result: A 1 3 4 6 7 10 11 (plus a separate 2) and B 1 2 4 6 7 8 9 11 • Retained homologs (anchor points)
Gene duplication (Van de Peer et al.) • Figure: phylogenetic tree, labeled RARa, with sequences from frog (283822, 2119679), chicken (2119682), human (4160009), mouse (133484), and zebrafish (215026, 704370); scale bar 0.050; branch support values between 90 and 100
Impact of duplication on evolution • Figure: vertebrate tree with birds (reptiles), mammals, amphibians, ray-finned fish • Ray-finned fish: 1 genome duplication? • Vertebrates: 1, 2, 3 genome duplications? • Van de Peer et al.
Impact of duplication on evolution Duplicated genes and diversity of fishes: is there a correlation?
Bioinformatics & Genome Evolution • Map-based approach • Gene Homology Matrix • Start from genome annotations • Represent chromosomes as sorted gene lists • Identify all homologous gene pairs between and within chromosomes (all-against-all BLAST) • Score pairs of homologs in matrix • Duplicated regions appear as diagonals • Test significance of a cluster • Figure: gene homology matrix of Segment A (1 3 4 6 7 10 11) against Segment B (1 2 4 6 7 8 9 11)
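The gene homology matrix idea can be sketched on the two toy segments from the figure. This is a simplified illustration, not the published method: homology is assumed to be given (here, two genes are homologous when they carry the same family number, whereas in practice the pairs come from an all-against-all BLAST), the diagonal is found as the longest chain of anchor points that increases along both segments, and the significance test is left out. All function names are made up for the example.

```python
def homology_matrix(seg_a, seg_b, homologous):
    """matrix[i][j] is 1 when gene i of segment A is homologous to gene j of segment B."""
    return [[1 if homologous(x, y) else 0 for y in seg_b] for x in seg_a]


def longest_collinear_chain(matrix):
    """Longest chain of anchor points (i, j) increasing in both i and j.

    Such a chain is the 'diagonal' left by a duplicated region after
    gene loss has removed some of the homologs.
    """
    # Anchors come out sorted by row, then by column.
    anchors = [(i, j) for i, row in enumerate(matrix)
               for j, hit in enumerate(row) if hit]
    if not anchors:
        return []
    best = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)  # predecessor index in that chain
    for k, (i, j) in enumerate(anchors):
        for p in range(k):
            pi, pj = anchors[p]
            if pi < i and pj < j and best[p] + 1 > best[k]:
                best[k] = best[p] + 1
                prev[k] = p
    k = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(anchors[k])
        k = prev[k]
    return chain[::-1]


# Sorted gene lists after duplication and gene loss (family numbers).
segment_a = [1, 3, 4, 6, 7, 10, 11]
segment_b = [1, 2, 4, 6, 7, 8, 9, 11]

matrix = homology_matrix(segment_a, segment_b, lambda x, y: x == y)
chain = longest_collinear_chain(matrix)
print(chain)
print([segment_a[i] for i, _ in chain])  # retained homologs (anchor points)
```

A real analysis would also handle inverted diagonals, limit the gap allowed between consecutive anchors, and estimate how likely a chain of that length is to arise by chance.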