Ensembl Compara Perl API

Stephen Fitzgerald
EBI - Wellcome Trust Genome Campus, UK
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/

What is Ensembl Compara?
• A single database which contains precalculated comparative genomics data
• Access via the Perl API and MySQL
• A production system for generating that database (not in this presentation)

Compara data
Raw genomic sequence
• Whole genome alignments (tBLAT, BlastZ-net, PECAN)
• Syntenic regions (based on BlastZ-net)
Protein sequences
• Raw protein alignments
• Protein family clusters
• Protein trees
• Gene orthology / paralogy predictions
46 species in Ensembl release-52

Compara database & the Ensembl core databases
• Since there is minimal primary data inside Compara, the external links with the core DBs must be re-established to gain full access to the data
• Example: compara_52 must be linked with the Ensembl core_52 databases
• Proper registry configuration is critical; calling load_registry_from_db is probably the best choice here

The Compara Perl API
• Written in object-oriented Perl
• Used to retrieve data from and store data into the ensembl-compara database
• Generalized to extend to non-Ensembl genomic data (Uniprot)
• Follows the same 'Data Object' & 'Object Adaptor' DBAdaptor design as the other Ensembl APIs

Compara object model overview
• Primary data: NCBITaxon, GenomeDB, Member, DnaFrag
• Analysis: MethodLinkSpeciesSet
• Results: GenomicAlignBlock, SyntenyRegion, ProteinTree, Homology, Family
• Results (contained objects): GenomicAlign, DnaFragRegion, AlignedMember, Attribute

Primary data
• GenomeDB: relates to a particular Ensembl core DB
  • name(), assembly(), genebuild(), taxon()
  • fetch_by_name_assembly(), fetch_by_registry_name(), fetch_by_Slice(), fetch_all()
• DnaFrag: represents a "top level" SeqRegion
  • name(), length(), genome_db(), slice(), coord_system_name()
  • fetch_by_Slice(), fetch_by_GenomeDB_and_name()
• Member: lists all Ensembl genes + SwissProt + SPTrEMBL
  • source_name(), stable_id(), genome_db(), taxon(), sequence(), get_all_peptide_Members(), get_longest_peptide_Member(), gene_member()
  • fetch_by_source_stable_id()

Analysis
• MethodLinkSpeciesSet provides a handle to isolate specific data from the shared tables (homology, genomic_align_block)
• MethodLink: each individual analysis in Compara is tagged with a unique name called a method_link_type
  • BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY, ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES
• SpeciesSet: the set of species, as (a ref. to) an array of GenomeDBs
• fetch_by_method_link_type_GenomeDBs(), fetch_by_method_link_type_registry_aliases()
• name(), method_link_type(), species_set(), source()

Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomeDB
1. Find out the versions of the human and mouse genomes in the database
2. Print the names of all the GenomeDBs in the database
DnaFrag
1. Get the DnaFrag for chromosome 1 of the macaque genome (using a genome_db object as an argument)
2. Get the DnaFrag for chromosome X of the mouse genome (using a core slice object as an argument)
MethodLinkSpeciesSet
1. Find out how many analyses are stored in the database
2. Get the name of the MethodLinkSpeciesSet corresponding to the BlastZ-net analysis for human and mouse
3. Get the names of all the species using the mlss corresponding to the Pecan analyses

GenomeDB example code
    use strict;
    use Bio::EnsEMBL::Registry;

    # Load every database found on the public server into the registry
    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $genome_db_adaptor = $reg->get_adaptor(
        "Multi", "compara", "GenomeDB");
    my $genome_db = $genome_db_adaptor->
        fetch_by_registry_name("human");

    print "Name :", $genome_db->name, "\n";
    print "Assembly :", $genome_db->assembly, "\n";
    print "GeneBuild :", $genome_db->genebuild, "\n";

GenomeDB example code
    $> perl genome_db1.pl
    Homo sapiens NCBI36 2006-08-Ensembl
    Mus musculus NCBIM36 2006-04-Ensembl

DnaFrag example code
    use strict;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $genome_db_adaptor = $reg->get_adaptor(
        "Multi", "compara", "GenomeDB");
    my $genome_db = $genome_db_adaptor->
        fetch_by_registry_name("human");

    # A DnaFrag is looked up by its GenomeDB and sequence region name
    my $dnafrag_adaptor = $reg->get_adaptor(
        "Multi", "compara", "DnaFrag");
    my $dnafrag = $dnafrag_adaptor->
        fetch_by_GenomeDB_and_name($genome_db, "13");

    print "Name :", $dnafrag->name, "\n";
    print "Length :", $dnafrag->length, "\n";
    print "CoordSystem :", $dnafrag->coord_system_name, "\n";

DnaFrag example code
    $> perl test1.pl
    Name :13
    Length :114142980
    CoordSystem :chromosome

MethodLinkSpeciesSet example code
    use strict;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $mlssa = $reg->get_adaptor("Multi", "compara",
        "MethodLinkSpeciesSet");

    # Select one analysis by its method_link_type and species aliases
    my $mlss = $mlssa->
        fetch_by_method_link_type_registry_aliases(
            "BLASTZ_NET", ["human", "mouse"]);

    print $mlss->name, "\n";
    print "type: ", $mlss->method_link_type, "\n";

    my $species_set = $mlss->species_set();
    foreach my $this_genome_db (@$species_set) {
        print $this_genome_db->name(), "\n";
    }

MethodLinkSpeciesSet example code
    $> perl method_link_species_set.pl
    H.sap-M.mus blastz-net (on H.sap)

Genomic Alignments
• BlastZ-Net
  • used to compare closely related pairs of species
  • BlastZ-raw -> BlastZ-chain -> BlastZ-net
• Translated BLAT
  • used to compare more distant pairs of species
• Pecan
  • multiple global alignments
  • all vs all coding exons wublastp -> Mercator -> Pecan on each syntenic block

GenomicAlignBlock
• GenomicAlignBlock
  • represents a genomic alignment
  • contains 1 GenomicAlign per sequence
  • fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice)
  • Methods: method_link_species_set(), score(), length(), perc_id(), get_all_GenomicAligns(), get_SimpleAlign()
• GenomicAlign
  • dnafrag(), genome_db(), get_Slice(), dnafrag_start(), dnafrag_end(), dnafrag_strand(), aligned_sequence()

GenomicAlignBlock
    $all_GAlign  = $GABlock->get_all_GenomicAligns()   returns an array ref
    $Simplealign = $GABlock->get_SimpleAlign()         returns an object
• $Simplealign: a BioPerl object which contains the whole alignment; it can be printed in various formats using BioPerl modules
• $Galign ($ga below): an object which represents only one of the sequences in the alignment
    Hsap.X.1223-1230: ACCTTC-A <- $ga
    Cfam.X.1390-1395: ACC--CGA <- $ga
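To make the relation between an aligned sequence and its original sequence concrete, here is a minimal sketch in plain Perl that works on the two gapped strings shown above. It does not use the Compara API: the helper names ungapped() and column_identity() are hypothetical, and the identity figure is a simple per-column count, which is an assumption and not necessarily how perc_id() is computed.

```perl
use strict;
use warnings;

# Hypothetical helper: remove gap characters to recover the unaligned sequence.
sub ungapped {
    my ($aligned) = @_;
    (my $seq = $aligned) =~ tr/-//d;
    return $seq;
}

# Hypothetical helper: percentage of alignment columns in which both
# sequences carry the same base (gap columns never count as identical).
sub column_identity {
    my ($aln1, $aln2) = @_;
    die "aligned sequences differ in length\n"
        unless length($aln1) == length($aln2);
    my $cols = length($aln1);
    my $same = 0;
    for my $i (0 .. $cols - 1) {
        my $x = substr($aln1, $i, 1);
        my $y = substr($aln2, $i, 1);
        $same++ if $x ne '-' && $x eq $y;
    }
    return $cols ? 100 * $same / $cols : 0;
}

# The two aligned strings from the slide
my ($hsap, $cfam) = ("ACCTTC-A", "ACC--CGA");

print ungapped($hsap), "\n";    # ACCTTCA
print ungapped($cfam), "\n";    # ACCCGA
printf "%.1f%% identical columns\n", column_identity($hsap, $cfam);    # 62.5%
```

With the real API, the gapped string would come from $ga->aligned_sequence() on each GenomicAlign.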

Synteny
• Based on BlastZ-net alignments
• SyntenyRegionAdaptor
  • fetch_all_by_MethodLinkSpeciesSet_Slice(), fetch_all_by_MethodLinkSpeciesSet_DnaFrag()
  • Methods: get_all_DnaFragRegions(), method_link_species_set()
• DnaFragRegion
  • slice(), dnafrag(), dnafrag_start(), dnafrag_end(), dnafrag_strand()

Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomicAlignBlock
1. Fetch all the BLASTZ_NET alignments between the first 130K nucleotides of the human chromosome X and the mouse genome.
2. Print the exact location of the alignment blocks.
3. Compare the original and the aligned sequences.
4. Find the BLASTZ_NET alignments between human gene BRCA2 and the mouse genome.
5. Print the BLASTZ_NET alignments between the rat gene ECSIT and the mouse genome.
6. Print the PECAN multiple alignments between the rat gene ECSIT and 11 other amniote vertebrates.
7. Print the constrained-element alignments within the rat ECSIT locus (use the constrained elements generated from the 12-way alignments).
Synteny
1. Get the human-mouse syntenic map for human chromosome X.

GenomicAlignBlock example code
    [...]
    my $slice_adaptor = $reg->get_adaptor(
        "human", "core", "Slice");
    my $slice = $slice_adaptor->
        fetch_by_region("chromosome", "12", 1e4, 2e4);

    my $gaba = $reg->get_adaptor("Multi", "compara",
        "GenomicAlignBlock");
    my $genomic_align_blocks = $gaba->
        fetch_all_by_MethodLinkSpeciesSet_Slice(
            $method_link_species_set, $slice);

    # Each block holds one GenomicAlign per aligned sequence
    foreach my $this_gab (@$genomic_align_blocks) {
        my $all_gas = $this_gab->get_all_GenomicAligns();
        foreach my $this_ga (@$all_gas) {
            print $this_ga->genome_db->name(), ":",
                $this_ga->get_Slice()->name(), "\n";
            print $this_ga->aligned_sequence(), "\n";
        }
        print "\n";
    }

GenomicAlignBlock example code
    $> perl gab.pl
    Mus musculus:chromosome:NCBIM37:6:121449987:121450302:-1
    CCTCTTAATAAACATTATTGTCAA[…]
    Homo sapiens:chromosome:NCBI36:12:19128:19507:1
    CCTCTTAATAAGCACACATATCCT[..]
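The slice names printed above follow the pattern coord_system:version:seq_region:start:end:strand. As a small illustration, the sketch below splits such a name into its fields and derives the region length. The helper parse_slice_name() is hypothetical and is not part of the Ensembl API; in a real script the same values would more naturally be read from the Slice object itself.

```perl
use strict;
use warnings;

# Hypothetical helper: split a slice name of the form
# coord_system:version:seq_region:start:end:strand into named fields.
sub parse_slice_name {
    my ($name) = @_;
    my @fields = split /:/, $name;
    die "unexpected slice name: $name\n" unless @fields == 6;
    my %slice;
    @slice{qw(coord_system version seq_region start end strand)} = @fields;
    # Coordinates are inclusive, hence the +1
    $slice{length} = $slice{end} - $slice{start} + 1;
    return \%slice;
}

# The two slice names from the example output
for my $name ("chromosome:NCBIM37:6:121449987:121450302:-1",
              "chromosome:NCBI36:12:19128:19507:1") {
    my $s = parse_slice_name($name);
    printf "%s %s: %d bp on strand %d\n",
        $s->{version}, $s->{seq_region}, $s->{length}, $s->{strand};
}
# prints:
# NCBIM37 6: 316 bp on strand -1
# NCBI36 12: 380 bp on strand 1
```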

Synteny example code
    [...]
    my $synteny_region_adaptor = $reg->get_adaptor(
        "Multi", "compara", "SyntenyRegion");
    my $synteny_regions = $synteny_region_adaptor->
        fetch_all_by_MethodLinkSpeciesSet_Slice(
            $human_mouse_synteny_method_link_species_set,
            $human_slice);

    # Each synteny region holds one DnaFragRegion per species
    foreach my $this_synteny_region (@$synteny_regions) {
        my $these_dnafrag_regions =
            $this_synteny_region->get_all_DnaFragRegions();
        foreach my $this_dnafrag_region (@$these_dnafrag_regions) {
            print $this_dnafrag_region->dnafrag->genome_db->name,
                ": ", $this_dnafrag_region->slice->name, "\n";
        }
        print "\n";
    }

Homology
• (e! 38):
  • Orthologue predictions based on 'best reciprocal blast hits'
  • Paralogues for a selected set of species
  • No global view of the evolutionary history of the gene considered
• e! 39+:
  • Orthologues and paralogues are inferred from protein trees
  • Phylogeny: orthology/paralogy in one go

BSR: Blast Score Ratio. When 2 proteins P1 and P2 are compared,
    BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
The default threshold used in the initial clustering step is 0.33.
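The ratio can be worked through with a few lines of plain Perl. This is only a sketch: the function name blast_score_ratio() and the scores are made up for illustration, and treating a BSR equal to the threshold as passing is an assumption, since the slide does not say whether the comparison is strict.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
sub blast_score_ratio {
    my ($score_p1p2, $self_p1, $self_p2) = @_;
    my $denominator = max($self_p1, $self_p2);
    die "self-scores must be positive\n" unless $denominator > 0;
    return $score_p1p2 / $denominator;
}

my $threshold = 0.33;    # default threshold quoted above

# Illustrative scores only: P1 vs P2 scores 150, self-scores are 400 and 300
my $bsr = blast_score_ratio(150, 400, 300);

# Assumption: a pair is kept when its BSR is at or above the threshold
printf "BSR = %.3f, %s\n", $bsr, $bsr >= $threshold ? "kept" : "discarded";
# prints: BSR = 0.375, kept
```

Dividing by the larger self-score keeps the ratio between 0 and 1 when the pairwise score does not exceed either self-score, and penalises pairs in which one protein is much longer than the other.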

Homology types

Homology
• Homology object
  • contains 1 pair of Member/Attribute per gene/protein
  • fetch_all_by_Member(), fetch_all_by_MethodLinkSpeciesSet(), fetch_all_by_Member_MethodLinkSpeciesSet()
  • Methods: method_link_species_set(), description(), subtype(), perc_id(), get_all_Member_Attribute(), get_SimpleAlign()

Family
• Compara computes gene family clusters
• Runs on all Ensembl transcripts plus all Uniprot/SWISSPROT and Uniprot/SPTREMBL metazoan proteins
• The algorithm is based on:
  • All vs all blastp
  • MCL clustering
  • Muscle multiple aligner
• Results are stored in the family and family_member tables

Family
• Family object
  • contains 1 pair of Member/Attribute per gene/protein
  • fetch_all_by_Member()
  • Methods: method_link_species_set(), description(), description_score(), get_all_Member_Attribute(), get_SimpleAlign()

Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
Members
1. Find the Member corresponding to SwissProt protein O93279
2. Find the Member for the human gene BRCA2
3. Find all the peptide Members corresponding to the human gene CTDP1
Homology
1. Get all the predicted homologues for the human gene BRCA2
2. Get all the mouse orthologues predicted for the human gene CTDP1
Family
1. Get the family predicted for the human gene BRCA2
2. Get the alignments corresponding to the family of the human gene HBEGF

Member example code
    use strict;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $member_adaptor = $reg->get_adaptor(
        "Multi", "compara", "Member");

    # A Member is identified by its source name and stable ID
    my $member = $member_adaptor->
        fetch_by_source_stable_id(
            "ENSEMBLGENE", "ENSG00000000971");

    print "All proteins:\n";
    my $all_peptide_members = $member->
        get_all_peptide_Members();
    foreach my $this_peptide (@$all_peptide_members) {
        print $this_peptide->stable_id(), "\n";
    }

Member example code
    $> perl test2.pl
    All proteins:
    ENSP00000356399
    ENSP00000356398
    ENSP00000352658

Homology example code
    [...]
    my $ma = $reg->get_adaptor(
        "Multi", "compara", "Member");
    my $member = $ma->fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

    my $homology_adaptor = $reg->get_adaptor(
        "Multi", "compara", "Homology");
    my $homologies = $homology_adaptor->
        fetch_all_by_Member($member);

    foreach my $this_homology (@$homologies) {
        print $this_homology->description, "\n";
        my $member_attributes = $this_homology->
            get_all_Member_Attribute();
        # Each entry is a [Member, Attribute] pair
        foreach my $this_mem_attr (@$member_attributes) {
            my ($this_member, $this_attribute) = @$this_mem_attr;
            print $this_member->genome_db->name, " ",
                $this_member->source_name, " ",
                $this_member->stable_id, "\n";
        }
        print "\n";
    }

Family example code
    [...]
    my $ma = $reg->get_adaptor(
        "Multi", "compara", "Member");
    my $member = $ma->fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

    my $family_adaptor = $reg->get_adaptor(
        "Multi", "compara", "Family");
    my $families = $family_adaptor->
        fetch_all_by_Member($member);

    foreach my $this_family (@$families) {
        print $this_family->description, "\n";
        my $member_attributes = $this_family->
            get_all_Member_Attribute();
        # Each entry is a [Member, Attribute] pair
        foreach my $this_mem_attr (@$member_attributes) {
            my ($this_member, $this_attribute) = @$this_mem_attr;
            print $this_member->taxon->binomial, " ",
                $this_member->source_name, " ",
                $this_member->stable_id, "\n";
        }
        print "\n";
    }

Getting More Information
• perldoc: viewer for inline API documentation
  • shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
  • shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
  • online at: http://www.ensembl.org/
• Tutorial document:
  • cvs: ensembl-compara/docs/ComparaTutorial.pdf
• ensembl-dev mailing list:
  • ensembl-dev@ebi.ac.uk
• Exercise solutions:
  • http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/solutions.html

Ensembl-dev mailing list and HelpDesk
• The ensembl-dev mailing list is great for questions about the API and the DB
• HelpDesk is very helpful
• Give detailed info on what you are trying to do
• Check that you have the modules installed ($PERL5LIB pointing to them)

Ensembl Team
• Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
• Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri
• BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
• Distributed Annotation System (DAS): Eugene Kulesha
• Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
• Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
• Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
• Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Browen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
• Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
• Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
• VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
• Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
• Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino

A special case of ortholog
