EnsEMBL: Opening up the whole Genome. Philip Lijnzaad, lijnzaad@ebi.ac.uk
Overview • what • how (science, hardware, software) • results • families and descriptions • tour • people
What is EnsEMBL • Automatic Annotation of complete Human Genome • genes • other: markers, SNPs, homologies, etc. • completely open • data, software, discussions • portable, downloadable • ‘the Linux of the Human Genome’
From ... TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTT TTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAA CAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCC AGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAA AGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATG GGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTAT TTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGT GGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACA AGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTG TTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTT CTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAAT AACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACC CAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAG CAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCA CCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG
… to: MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFS SPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTG TRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRR SDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITR DVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGV VKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYE AVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITE SPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCE SGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPD GPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC
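The two slides above contrast raw genomic DNA with a predicted peptide. Most of the work lies in finding the exons, which is what the rest of the pipeline does and is not shown here. The final step, translating an already spliced, in-frame coding sequence into protein, can be sketched as follows. This is a minimal illustration using the standard genetic code; the function name `translate` and the choice to emit `X` for codons containing `N` are assumptions, not EnsEMBL code.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a spliced, in-frame coding sequence into a peptide.

    Codons containing anything other than A, C, G or T (for example the
    N runs that mark gaps in draft sequence) become 'X'. Translation
    stops at the first stop codon.
    """
    cds = cds.upper()
    peptide = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("ATGCATTCTTAA"))  # made-up toy sequence
```

The toy input is invented for the example and is not taken from the genomic sequence on the slide.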
Take: • Draft human genome • clones and contigs from public databases • not finished • errors • gaps • Golden Path • assembly of all contigs into (nearly) complete chromosomes
Then: • Get rid of repeats • Targeted searches • pmatch to ‘find back’ known proteins from SWISSPROT, SP-TrEMBL and RefSeq • GeneWise and EST2Genome to build the genes • fill in coding sequences and UTRs
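"Get rid of repeats" usually means masking them so that later searches ignore them. A minimal sketch of that idea follows, assuming some repeat finder has already reported intervals; the slides do not say which tool is used, and the function name `mask_repeats` and the 0-based, end-exclusive coordinates are choices made for this illustration.

```python
from typing import Iterable, Tuple


def mask_repeats(sequence: str, repeats: Iterable[Tuple[int, int]]) -> str:
    """Replace every repeat interval with N so downstream searches skip it.

    Intervals are (start, end) pairs, 0-based and end-exclusive (an
    assumption of this sketch). Intervals may overlap; coordinates that
    fall outside the sequence are clipped.
    """
    masked = list(sequence)
    for start, end in repeats:
        for i in range(max(0, start), min(len(masked), end)):
            masked[i] = "N"
    return "".join(masked)


print(mask_repeats("ACGTACGTAC", [(2, 5)]))
```

Masking with `N` keeps all coordinates unchanged, so features found afterwards still map directly onto the original contig.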
And then: • Similarity searches • GenScan on raw contigs • its peptides are searched against protein, mRNA and EST databases • genes are built using GeneWise on promising regions • additionally, exons can be used • All predictions supported by evidence!
Add cross references: • HUGO (HGNC) • SwissProt/Trembl, RefSeq • EMBL • OMIM • LocusLink • InterPro
Add yet more • GeneTribe families • Gene descriptions • Markers • SNPs • external annotations (EMBL) • mouse traces • ...
Hardware • 360 Alphas: DS10, dual EV6 processors, 1 GByte memory • 200 other nodes • 10 days to do a complete blast + gene build • ~ 30 million jobs • ~ 30 GB
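To get a feel for these figures, the throughput they imply can be worked out directly. The sketch below assumes that all 560 nodes are busy for the full 10 days and that each node counts as one unit of compute; neither assumption is stated on the slide, so the per-job figure is only a rough upper bound.

```python
JOBS = 30_000_000        # ~30 million jobs (from the slide)
DAYS = 10                # one complete blast + gene build
NODES = 360 + 200        # Alphas plus other nodes

SECONDS_PER_DAY = 86_400

jobs_per_day = JOBS / DAYS
jobs_per_second = JOBS / (DAYS * SECONDS_PER_DAY)
# Assumption: every node is fully occupied for the whole run.
node_seconds_per_job = NODES * DAYS * SECONDS_PER_DAY / JOBS

print(f"{jobs_per_day:,.0f} jobs per day")
print(f"{jobs_per_second:.1f} jobs per second across the farm")
print(f"{node_seconds_per_job:.1f} node-seconds per job at most")
```

Under these assumptions the farm completes about 3 million jobs per day, roughly 35 per second, with an average of about 16 node-seconds available per job.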
Software • Digital Unix • Apache • relational database (MySQL) • mostly perl, some C and Java • BioPerl, BioJava, BioCORBA • LSF • AltaVista
Software (2) • Wiki Web • CVS (~100 Mb) • Code review, data review • Testing conventions • Interfaces • VirtualContigs • CORBA/Java
IDs • for genes, transcripts, exons, peptides, families • ENSXnnnnnnnnnnn (e.g. ENSG00000067369) • X denotes which type: • G = gene • T = transcript • E = exon • P = peptide (translation) • F = family
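The identifier scheme above is regular enough to validate with a short pattern. The sketch below assumes exactly 11 digits, as in the example ENSG00000067369; the names `StableId` and `parse_stable_id` are hypothetical and not part of the EnsEMBL API.

```python
import re
from typing import NamedTuple

# Type letters as listed on the slide.
ID_TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# Assumption: always 11 digits, matching the slide's example.
ID_PATTERN = re.compile(r"ENS([GTEPF])(\d{11})")


class StableId(NamedTuple):
    kind: str
    number: int


def parse_stable_id(text: str) -> StableId:
    """Split an identifier such as ENSG00000067369 into type and number."""
    match = ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised identifier: {text!r}")
    return StableId(ID_TYPES[match.group(1)], int(match.group(2)))


print(parse_stable_id("ENSG00000067369"))
```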
IDs (2) • IDs should be stable • difficult, because underlying data keeps being refined! • ID mapping • version numbers
Results • Latest release: 1.1 (17 July) • Web code version: 1.1.1 (1 Aug.) • April 2001 dataset • 4,318,661,441 basepairs • 143,479 exons • 23,931 transcripts • 21,921 genes (‘confirmed’)
Errors • Missing data • Misassembly • Misidentification (pseudo-gene, paralog) • Sequencing errors • in Human Genome Data • in supporting databases • Bugs • GenScan tuning • GeneWise tuning
Gene Families • Cluster EnsEMBL peptides together with SwissProt and SPTrembl • vertebrate • GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)
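Markov Clustering alternates two operations on a column-stochastic matrix: expansion (squaring, which spreads random-walk flow) and inflation (entry-wise powering, which strengthens strong flows and weakens weak ones). The following is a textbook-style sketch of that loop in plain Python, not the GeneTRIBE implementation. The function names, the default inflation of 2.0 and the idea that edge weights come from pairwise peptide similarity are all assumptions made for illustration.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster the nodes of a weighted, undirected graph given as a square matrix."""
    n = len(adjacency)
    # Self-loops are a customary addition; they keep every column non-empty.
    matrix = normalise_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        # Expansion: square the matrix (two steps of the random walk).
        expanded = [
            [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise each entry to a power, then renormalise columns.
        inflated = normalise_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(
            abs(inflated[i][j] - matrix[i][j]) for i in range(n) for j in range(n)
        )
        matrix = inflated
        if change < tolerance:
            break
    # Each row that still carries flow is an attractor; the columns with
    # non-zero entries in that row are the members of its cluster.
    clusters = {
        frozenset(j for j, value in enumerate(row) if value > 1e-6)
        for row in matrix
    }
    return sorted(sorted(cluster) for cluster in clusters if cluster)


# Toy graph: two triangles with no edges between them.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(two_triangles))
```

A dense list-of-lists matrix is only workable for toy inputs; anything at the scale of a proteome would need sparse matrices and pruning of small entries.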
Family descriptions • distill consensus descriptions • using SwissProt DE-lines • may not work => unknown • Transfer peptide’s family assignment to gene • resolve conflicts: choose family that has best description • unknown < hypothetical < fragment < cDNA
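The conflict rule above can be read as a ranking of description quality, with "unknown" worst and anything not on the list best. A minimal sketch of that reading follows; the keyword matching, the function names and the family identifiers in the example are hypothetical, and the slides do not spell out the exact rule.

```python
# Keywords that mark a poor description, ordered from worst to least bad,
# following the slide's "unknown < hypothetical < fragment < cDNA".
POOR_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_rank(description: str) -> int:
    """Return a quality rank: higher is better.

    A description gets the rank of the worst keyword it contains; one
    with no such keyword ranks above all of them (an assumption).
    """
    text = description.lower()
    for rank, keyword in enumerate(POOR_KEYWORDS):
        if keyword in text:
            return rank
    return len(POOR_KEYWORDS)


def best_family(families: dict) -> str:
    """Pick the family ID whose description ranks highest.

    `families` maps family ID to description. Ties go to the first
    candidate encountered.
    """
    return max(families, key=lambda family_id: description_rank(families[family_id]))


candidates = {
    "ENSF00000000001": "UNKNOWN",
    "ENSF00000000002": "HYPOTHETICAL PROTEIN",
    "ENSF00000000003": "ZINC FINGER PROTEIN",
}
print(best_family(candidates))
```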
Family statistics: • 13,811 families • 7284 ‘unknown’ description • 128,828 members • 21,894 ENS genes • 23,867 ENS peptides
Family statistics (2) • 6759 families with 1 member • 3457 with 2-10 members • 215 with 10-100 members • 4 with > 100 members • max is 483 (zinc finger)
Gene descriptions • Use SwissProt DE-line if known • use Family if not • Statistics: • 18053 descriptions • 13202 from SwissProt • 4851 from family description • 3868 still UNKNOWNs
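The fallback rule and the counts on this slide can be checked with a few lines. The sketch below assumes a missing description is represented as `None` or the literal string "UNKNOWN"; the function name `describe_gene` is hypothetical. The counts are internally consistent: 13202 + 4851 = 18053 described genes, and adding the 3868 unknowns gives 21921, the gene total quoted under Results.

```python
from typing import Optional, Tuple


def describe_gene(swissprot_de: Optional[str],
                  family_description: Optional[str]) -> Tuple[str, str]:
    """Return (description, source) using the slide's fallback order."""
    if swissprot_de:
        return swissprot_de, "SwissProt"
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description, "family"
    return "UNKNOWN", "none"


# Counts quoted on the slides.
FROM_SWISSPROT = 13202
FROM_FAMILY = 4851
STILL_UNKNOWN = 3868

described = FROM_SWISSPROT + FROM_FAMILY
total_genes = described + STILL_UNKNOWN

print(describe_gene(None, "ZINC FINGER PROTEIN"))
print(described, total_genes)
```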
Entry points • http://www.ensembl.org • ID search • text search • OMIM disease • Browse chromosomes • BLAST
Recent developments • HelpDesk • DAS • Adding annotations from anywhere • Apollo • Genome viewer • Expression data • SAGE
Future • Better genes! • Alignments • Other genomes • Comparative Genomics • CORBA/Java • More protein-structural links • Scop profiles • IGI/IPI • Entity infrastructure
Links • http://www.ensembl.org • dev.ensembl.org • http://www.ensembl.org/genome/central • http://genome.ucsc.edu • http://compbio.ornl.gov/channel • http://ncbi.nlm.gov/genome/guide/human • http://www.biodas.org • http://www.bio{perl,xml,corba,python,java}.org
Acknowledgements • Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more
Join! • http://www.ensembl.org • mailing lists • ensembl-dev@ebi.ac.uk • ensembl-announce@ebi.ac.uk • (see http://www.ensembl.org/Dev/Lists )