Ensembl Funcgen Perl API
Nathan Johnson, njohnson@ebi.ac.uk
EBI - Wellcome Trust Genome Campus, UK

What is Ensembl Funcgen/eFG?
• A local data storage and analysis platform
OR
• An Ensembl functional genomics database providing epigenomic and regulatory annotations
OR
• Both
 
                
eFG Dataflow
[Diagram. Labels shown: Experimental Data, Tab2MAGE, MAGE-ML, Import API, FuncGen DB, Analysis Pipeline, Annotated Features, Export API, DAS, GFF, Web API]
eFG data
• Technology
  • Arrays/Chips/Probes, e.g. tiling arrays
  • Short reads, e.g. Solexa, SOLiD etc.
• Experimental
  • Experimental meta data
  • Raw & normalised data
• Processed
  • Peak calls, e.g. Mpeak, TileMap, ChIPOTLE, Nessie
  • Combinatorial analysis, e.g. Regulatory Build
  • Externally curated, e.g. cisRED, miRanda, VISTA
eFG data
• Ensembl v50, July '08:
  • >60 data sets (ChIP-chip, wiggle, bed, custom)
  • 3 species
  • 9 cell types
  • 24 histone modifications, DHSS, CTCF, RNA PolII …
• Regulatory Build v3 (features per class):
  • Gene Associated: 1584
  • Gene Associated - Cell type specific: 5614
  • Non-Gene Associated: 799
  • Non-Gene Associated - Cell type specific: 520
  • Promoter Associated: 12022
  • Promoter Associated - Cell type specific: 1619
  • Unclassified: 24814
  • Unclassified - Cell type specific: 127633
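The class counts in the Regulatory Build v3 list above can be totalled with a few lines of Perl. This is only a sketch: the label "generic" for the classes without a cell-type-specific suffix is an assumption made here, not a term from the slides.

```perl
use strict;
use warnings;

# Counts copied from the Regulatory Build v3 list above:
# [ class, generic count, cell-type-specific count ].
my @classes = (
    [ 'Gene Associated',      1584,   5614 ],
    [ 'Non-Gene Associated',   799,    520 ],
    [ 'Promoter Associated', 12022,   1619 ],
    [ 'Unclassified',        24814, 127633 ],
);

# Sum the generic and cell-type-specific columns over all rows.
sub totals {
    my ($generic, $specific) = (0, 0);
    for my $row (@_) {
        $generic  += $row->[1];
        $specific += $row->[2];
    }
    return ($generic, $specific);
}

my ($generic, $specific) = totals(@classes);
printf "Generic: %d\nCell type specific: %d\nAll: %d\n",
    $generic, $specific, $generic + $specific;
```

This prints 39219 generic features, 135386 cell-type-specific features and 174605 features overall.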
eFG Display
[Browser screenshot. Tracks labelled: cisRED, miRanda, Vista, Regulatory Features, CTCF data, Methylation data]
How eFG fits in
• ensembl-functgenomics API
  • Object-oriented Perl
  • Follows the Object/ObjectAdaptor paradigm
• Fully integrated with the wider Ensembl family of MySQL DBs
• Multi-assembly: eFG stores a registry of core coordinate information, which allows data to be stored using different core DBs and different genome assemblies.
• Minimal maintenance: designed to aid incremental updates to local installations. Patch and update rather than blow away and recreate.
• Fully automated data import API and analysis pipeline
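The Object/ObjectAdaptor paradigm can be illustrated without a database. The following is a minimal sketch, not the eFG implementation: the packages Demo::Array and Demo::ArrayAdaptor are hypothetical, and an in-memory list stands in for the MySQL tables. Data objects carry state and accessors; the adaptor is the only code that knows how objects are stored and fetched.

```perl
use strict;
use warnings;

# Hypothetical data object: holds state and exposes read-only accessors.
package Demo::Array;

sub new {
    my ($class, %args) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub name   { return $_[0]->{name} }
sub vendor { return $_[0]->{vendor} }

# Hypothetical adaptor: an in-memory list replaces the database here.
package Demo::ArrayAdaptor;

sub new {
    my ($class, @rows) = @_;
    my $self = { rows => [@rows] };
    return bless $self, $class;
}

# Returns an array ref of objects, as the fetch_all* methods in this talk do.
sub fetch_all {
    my ($self) = @_;
    return [ map { Demo::Array->new(%$_) } @{ $self->{rows} } ];
}

# Returns a single object, or undef when nothing matches.
sub fetch_by_name_vendor {
    my ($self, $name, $vendor) = @_;
    my ($hit) = grep { $_->name eq $name && $_->vendor eq $vendor }
                @{ $self->fetch_all };
    return $hit;
}

package main;

my $adaptor = Demo::ArrayAdaptor->new(
    { name => '2005-05-10_HG17Tiling_Set', vendor => 'NIMBLEGEN' },
    { name => 'ENCODE3.1.1',               vendor => 'SANGER'    },
);
my $array = $adaptor->fetch_by_name_vendor('ENCODE3.1.1', 'SANGER');
print $array->name, "\t", $array->vendor, "\n";
```

Keeping storage logic in the adaptor means the data objects stay the same whether they come from a local installation or a public server.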
eFG Schema
[Schema diagram. Table groups labelled: Experimental, Array, Sets, Features]
Features & Sets
Features: Probe > Annotated; External > Regulatory.
Sets: an abstract concept for manipulating collections of data:
• Logical association/combination
• Access and administration
• Supporting/Product
Set classes:
• ResultSet - Chips/Channels > Replicates
• ExperimentalSet - Feature-only import
• FeatureSet - e.g. peak calls > AnnotatedFeatures
• DataSet - Combines supporting Sets and a product FeatureSet
eFG data flow
[Diagram. Labels shown: Raw Data, Result, Feature, Experimental, External, External DB, HitList, ResultSet1-3, SupportingSet1-2, DataSet1-4, Product FeatureSet, Combined FeatureSet, Export API, GFF]
Technology data
• Array: a definitive collection of chips.
  • name(), format(), vendor(), description(), type().
  • fetch_by_name_vendor(), fetch_all_by_type().
• ArrayChip: an individual chip in an array collection.
  • name(), design_id().
  • fetch_all_by_array_design_ids(), fetch_all_by_array_id(), fetch_all_by_ExperimentalChip().
• Probe: a unique probe sequence within a given array or set of arrays.
  • name(), class(), length().
  • fetch_all_by_Array(), fetch_all_by_ArrayChip(), fetch_all_by_array_probe_probeset_name().
• ProbeFeature: an alignment of a Probe against the genome.
  • start(), end(), strand(), mismatches(), cigarline(), analysis().
  • fetch_all_by_Probe(), fetch_all_by_Slice_ExperimentalChips().
DBAdaptor example code

    use strict;
    use warnings;
    use Bio::EnsEMBL::DBSQL::DBAdaptor;
    use Bio::EnsEMBL::Funcgen::DBSQL::DBAdaptor;

    # Core database: supplies sequence and coordinate systems.
    my $dna_db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
        -user    => 'anonymous',
        -host    => 'ensembldb.ensembl.org',
        -species => 'Homo_sapiens',
        -dbname  => 'homo_sapiens_core_37_35j',
        -group   => 'core',
    );

    # Funcgen database, linked to the core database through -dnadb.
    my $efg_db = Bio::EnsEMBL::Funcgen::DBSQL::DBAdaptor->new(
        -user    => 'anonymous',
        -host    => 'ensembldb.ensembl.org',
        -species => 'Homo_sapiens',
        -dbname  => 'homo_sapiens_funcgen_48_36j',
        -group   => 'funcgen',
        -dnadb   => $dna_db,
    );
Array example code

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # Load every database on the public server into the registry.
    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    my $efg_db        = $reg->get_DBAdaptor('Human', 'funcgen');
    my $array_adaptor = $efg_db->get_ArrayAdaptor;
    my @arrays        = @{ $array_adaptor->fetch_all };

    foreach my $array (@arrays) {
        print "\nArray:\t" . $array->name . "\n";
        print "Type:\t" . $array->type . "\n";
        print "Vendor:\t" . $array->vendor . "\n";
    }

Output:

    Array:  2005-05-10_HG17Tiling_Set
    Type:   OLIGO
    Vendor: NIMBLEGEN

    Array:  ENCODE3.1.1
    Type:   PCR
    Vendor: SANGER
ArrayChip example code

    # Fetch one array by name and vendor, then list its chips.
    my $array = $array_adaptor->fetch_by_name_vendor(
        '2005-05-10_HG17Tiling_Set', 'NIMBLEGEN');
    my @achips = @{ $array->get_ArrayChips };

    foreach my $ac (@achips) {
        print "ArrayChip:" . $ac->name . "\tDesignID:" . $ac->design_id . "\n";
    }

Output:

    ArrayChip:2005-05-10_HG17Tiling_Set31 DesignID:2061
    ArrayChip:2005-05-10_HG17Tiling_Set24 DesignID:2054
    ArrayChip:2005-05-10_HG17Tiling_Set12 DesignID:2042
    ArrayChip:2005-05-10_HG17Tiling_Set03 DesignID:2033
    ArrayChip:2005-05-10_HG17Tiling_Set04 DesignID:2034
    ArrayChip:2005-05-10_HG17Tiling_Set29 DesignID:2059
    ArrayChip:2005-05-10_HG17Tiling_Set13 DesignID:2043
    ArrayChip:2005-05-10_HG17Tiling_Set34 DesignID:2064
    ArrayChip:2005-05-10_HG17Tiling_Set07 DesignID:2037
    ArrayChip:2005-05-10_HG17Tiling_Set17 DesignID:2047
    ArrayChip:2005-05-10_HG17Tiling_Set23 DesignID:2053
    ArrayChip:2005-05-10_HG17Tiling_Set36 DesignID:2066
    ArrayChip:2005-05-10_HG17Tiling_Set08 DesignID:2038
Probe example code

    my $probe_adaptor    = $efg_db->get_ProbeAdaptor;
    my $pfeature_adaptor = $efg_db->get_ProbeFeatureAdaptor;

    # Look up a probe by array name and probe name.
    my $probe = $probe_adaptor->fetch_by_array_probe_probeset_name(
        '2005-05-10_HG17Tiling_Set', 'chr22P38797630');
    print "Got " . $probe->class . " probe " . $probe->get_probename . "\n";

    # Each ProbeFeature is one genomic alignment of the probe.
    my @pfeatures = @{ $pfeature_adaptor->fetch_all_by_Probe($probe) };
    print "Found " . scalar(@pfeatures) . " ProbeFeatures\n";

    foreach my $pfeature (@pfeatures) {
        print "ProbeFeature found at:\t" . $pfeature->feature_Slice->name . "\n";
    }

Output:

    Got EXPERIMENTAL probe chr22P38797630
    Found 1 ProbeFeatures
    ProbeFeature found at:  chromosome:NCBI36:22:38803076:38803125:1
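The slice name printed for the ProbeFeature packs six colon-separated fields: coordinate system, assembly version, sequence region, start, end and strand. As a small sketch, the helper below (parse_slice_name is a hypothetical name, not an API method) splits such a name and derives the feature length from the inclusive coordinates.

```perl
use strict;
use warnings;

# Split a slice name of the form
#   coord_system:version:seq_region:start:end:strand
# into named fields and add the inclusive length.
sub parse_slice_name {
    my ($name) = @_;
    my @fields = split /:/, $name;
    die "Unexpected slice name: $name\n" unless @fields == 6;
    my %slice;
    @slice{qw(coord_system version seq_region start end strand)} = @fields;
    $slice{length} = $slice{end} - $slice{start} + 1;    # 1-based, inclusive
    return \%slice;
}

my $slice = parse_slice_name('chromosome:NCBI36:22:38803076:38803125:1');
printf "%s %s:%d-%d (%d bp)\n",
    @{$slice}{qw(coord_system seq_region start end length)};
```

For the probe above this prints `chromosome 22:38803076-38803125 (50 bp)`.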
ExperimentalData1
• Experiment provides a natural container for experimental meta data.
  • name(), group(), mage_xml(), primary_design_type(), description(), get_ExperimentalChips().
  • fetch_by_name(), fetch_all_by_group(), get_all_experiment_names().
• ExperimentalChip represents a unique physical instance of an ArrayChip.
  • unique_id(), cell_type(), feature_type(), biological_replicate(), technical_replicate().
  • fetch_all_by_experiment(), fetch_by_unique_id_vendor().
• Channel represents a control or experimental channel from an ExperimentalChip.
  • dye(), type(), sample_id().
  • fetch_all_by_ExperimentalChip(), fetch_all_type_experimental_chip_id().
ExperimentalData1 example code

    my $exp_adaptor = $efg_db->get_ExperimentAdaptor;
    my $exp         = $exp_adaptor->fetch_by_name('ctcf_ren');

    # Count the chips that belong to this experiment.
    my $num_chips = scalar(@{ $exp->get_ExperimentalChips });
    print $exp->name . ' ' . $exp->primary_design_type .
        " experiment contains $num_chips ExperimentalChips\n";

Output:

    ctcf_ren binding_site_identification experiment contains 36 ExperimentalChips
ExperimentalData2
• ResultSet provides easy access to discrete sets of experimental data, e.g. replicates.
  • name(), cell_type(), feature_type(), display_label(), get_ExperimentalChips(), get_ResultFeatures_by_Slice().
  • fetch_all_by_name(), fetch_all_by_name_Analysis(), fetch_all_by_FeatureType(), fetch_all_by_Experiment().
• ResultFeature is a special lightweight Feature optimised for display and analysis purposes.
  • start(), end(), score().
  • ResultSet::get_ResultFeatures_by_Slice().
ExperimentalData2 example code

    my $resultset_adaptor = $efg_db->get_ResultSetAdaptor;
    my $slice_adaptor     = $efg_db->get_SliceAdaptor;

    # fetch_all_by_name returns an array ref; take the first ResultSet.
    my ($result_set) = @{ $resultset_adaptor->fetch_all_by_name('ctcf_ren_BR1') };
    my $slice = $slice_adaptor->fetch_by_region('chromosome', 'X');

    my @result_features = @{ $result_set->get_ResultFeatures_by_Slice($slice) };
    print "Chromosome X has " . scalar(@result_features) . " results\n";

    foreach my $rf (@result_features) {
        print "Locus:\t" . $rf->start . '-' . $rf->end .
            "\tScore:" . $rf->score . "\n";
    }

Output:

    Chromosome X has 582133 results
    Locus: 429-478 Score:-0.1095
    Locus: 529-578 Score:-0.1155
    Locus: 629-678 Score:0.0135
    Locus: 729-778 Score:-0.1735
    Locus: 829-878 Score:0.256
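Once results are in memory, simple filtering needs no further API calls. The sketch below assumes each result is held as a plain `[start, end, score]` array ref, which is an illustration chosen here rather than the documented ResultFeature layout; the values are the five results printed above, and results_above is a hypothetical helper.

```perl
use strict;
use warnings;

# Each result is stored as [start, end, score]; the values are the first
# five chromosome X results printed in the example output above.
my @results = (
    [ 429, 478, -0.1095 ],
    [ 529, 578, -0.1155 ],
    [ 629, 678,  0.0135 ],
    [ 729, 778, -0.1735 ],
    [ 829, 878,  0.256  ],
);

# Return the results whose score is strictly greater than $min_score.
sub results_above {
    my ($min_score, @features) = @_;
    return grep { $_->[2] > $min_score } @features;
}

for my $rf (results_above(0, @results)) {
    printf "Locus:\t%d-%d\tScore:%s\n", @$rf;
}
```

With a threshold of 0, only the loci 629-678 and 829-878 are reported.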
More Sets
• Experimental(Sub)Set is a special placeholder set which facilitates feature import without any underlying data.
  • name(), cell_type(), feature_type(), format(), get_subsets(), ExperimentalSubSet->name().
  • fetch_by_name(), fetch_all_by_Experiment(), fetch_all_by_CellType(), fetch_all_by_FeatureType().
• FeatureSet is a generic set for containing features of various types, e.g. AnnotatedFeatures, ExternalFeatures, RegulatoryFeatures.
  • name(), cell_type(), feature_type(), analysis(), get_Features_by_Slice().
  • fetch_by_name(), fetch_all_by_type(), fetch_all_by_CellType(), fetch_all_by_FeatureType().
More Sets
• DataSet is the top-level container which associates underlying data, or 'supporting sets', with a product FeatureSet, i.e. the results of an analysis based on the underlying data. Supporting sets can be any other type of 'Set'.
  • name(), cell_type(), feature_type(), product_FeatureSet(), get_supporting_sets().
  • fetch_by_name(), fetch_all_by_supporting_set(), fetch_all_by_product_FeatureSet().
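The relationship between a DataSet, its supporting sets and its product FeatureSet can be pictured as a nested data structure. This is a sketch with plain hashes standing in for the API objects; the set and analysis names are taken from the example output elsewhere in this talk, while the 'ResultSet' type of the supporting set and the helper describe_data_set are assumptions made for illustration.

```perl
use strict;
use warnings;

# Plain hashes stand in for the API's Set objects.
my %data_set = (
    name            => 'Nessie_NG_STD_2_ctcf_ren_BR1',
    supporting_sets => [
        {
            type     => 'ResultSet',          # assumed type
            name     => 'ctcf_ren_BR1_TR1',
            analysis => 'VSN_GLOG',
        },
    ],
    product_feature_set => {
        type     => 'FeatureSet',
        name     => 'Nessie_NG_STD_2_ctcf_ren_BR1',
        analysis => 'Nessie_NG_STD_2',
    },
);

# List every set in a DataSet as "role<TAB>type<TAB>name<TAB>analysis".
sub describe_data_set {
    my ($ds) = @_;
    my @lines = map { join "\t", 'supporting', @{$_}{qw(type name analysis)} }
                @{ $ds->{supporting_sets} };
    push @lines, join "\t", 'product',
        @{ $ds->{product_feature_set} }{qw(type name analysis)};
    return @lines;
}

print "$_\n" for describe_data_set(\%data_set);
```

The supporting sets hold the underlying data, and the single product FeatureSet holds the result of analysing them.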
Set example code 1

    my $dataset_adaptor = $efg_db->get_DataSetAdaptor;
    my $data_set = $dataset_adaptor->fetch_by_name('Nessie_NG_STD_2_ctcf_ren_BR1');

    # The underlying data used by the analysis.
    my @supporting_sets = @{ $data_set->get_supporting_sets };

    foreach my $sset (@supporting_sets) {
        print 'Supporting set: ' . $sset->name . "\n";
        print 'Produced by analysis ' . $sset->analysis->logic_name . "\n";
    }

    # The FeatureSet produced from the supporting sets.
    my $pfset = $data_set->product_FeatureSet;
    print "\nProduct FeatureSet is " . $pfset->name . "\n";
    print 'Produced by analysis ' . $pfset->analysis->logic_name . "\n";

Output:

    Supporting set: ctcf_ren_BR1_TR1
    Produced by analysis VSN_GLOG

    Product FeatureSet is Nessie_NG_STD_2_ctcf_ren_BR1
    Produced by analysis Nessie_NG_STD_2
Set example code 2

    my $featureset_adaptor = $efg_db->get_FeatureSetAdaptor;

    # List all externally curated FeatureSets.
    my @ext_fsets = @{ $featureset_adaptor->fetch_all_by_type('external') };

    foreach my $ext_fset (@ext_fsets) {
        print "External FeatureSet:\t" . $ext_fset->name . "\n";
    }

Output:

    External FeatureSet: miRanda miRNA
    External FeatureSet: cisRED group motifs
    External FeatureSet: cisRED search regions
    External FeatureSet: VISTA enhancer set
Features
• ProbeFeature represents an individual alignment of a probe sequence.
  • probe(), probeset(), probelength(), get_result_by_ResultSet().
  • fetch_all_by_Probe(), fetch_all_by_Slice_ExperimentalChips().
• AnnotatedFeature represents any feature based on experimental information, i.e. ResultSet or ExperimentalSet data.
  • cell_type(), feature_type(), score(), display_label().
• ExternalFeature represents an individual feature from an externally curated set.
  • cell_type(), feature_type(), display_label().
Features
• RegulatoryFeature represents a feature generated by the Regulatory Build, a combinatorial analysis based on DNase1 hypersensitive sites (HSSs), CTCF and histone modifications.
  • feature_type(), bound_start(), bound_end(), regulatory_attributes(), display_label(), stable_id().
  • fetch_all_by_Slice(), fetch_by_stable_id().
Features example code 1

    my $featureset_adaptor = $efg_db->get_FeatureSetAdaptor;
    my $feature_set = $featureset_adaptor->fetch_by_name('miRanda miRNA');

    # get_Features_by_Slice returns an array ref, so dereference it.
    my @features = @{ $feature_set->get_Features_by_Slice($slice) };

    foreach my $feat (@features) {
        print $feat->display_label . "\t" . $feat->feature_Slice->name . "\n";
    }

Output:

    ENST00000390665:mmu-miR-712 chromosome:NCBI36:X:214111:214131:-1
    ENST00000390665:mmu-miR-673-5p chromosome:NCBI36:X:214115:214136:-1
    ENST00000390665:hsa-miR-22 chromosome:NCBI36:X:214125:214146:-1
    ENST00000390665:hsa-miR-887 chromosome:NCBI36:X:214138:214159:-1
    ENST00000390665:mmu-miR-696 chromosome:NCBI36:X:214149:214165:-1
    ENST00000390665:hsa-miR-328 chromosome:NCBI36:X:214178:214200:-1
    ENST00000390665:mmu-miR-669b chromosome:NCBI36:X:214228:214250:-1
    ENST00000390665:hsa-miR-197 chromosome:NCBI36:X:214264:214285:-1
    ENST00000390665:hsa-miR-220b chromosome:NCBI36:X:214265:214286:-1
    ENST00000390665:hsa-miR-636 chromosome:NCBI36:X:214341:214362:-1
    ENST00000390665:mmu-miR-689 chromosome:NCBI36:X:214424:214445:-1
Features example code 2

    my $regfeat_adaptor = $efg_db->get_RegulatoryFeatureAdaptor;

    # fetch_all_by_Slice returns an array ref, so dereference it.
    my @reg_feats = @{ $regfeat_adaptor->fetch_all_by_Slice($slice) };

    foreach my $reg_feat (@reg_feats) {
        print $reg_feat->stable_id . ' ' . $reg_feat->feature_type->name . "\n";

        # The attribute features that support this regulatory feature.
        foreach my $attr_feat (@{ $reg_feat->regulatory_attributes }) {
            print 'AttributeFeature ' . $attr_feat->feature_type->name . "\n";
        }
    }

Output:

    ENSR00000175296 Promoter Associated - Cell type specific
    AttributeFeature H3K4me3
    AttributeFeature H3K4me3
    AttributeFeature DNase1
    AttributeFeature DNase1
    AttributeFeature H3K4me3
    ENSR00000092125 Unclassified - Cell type specific
    AttributeFeature DNase1
eFG Environments
• eFG environments provide useful functions, configuration and administration utilities:
  • efg
  • efg_pipeline
• Coming soon…
  • Array mapping environment:
    • Affy, Illumina, Codelink, Agilent, Nimblegen.
    • Genomic & transcript mapping pipelines.
eFG Import
• efg environment
• Arrays:
  • Nimblegen
  • Sanger ENCODE
• Simple:
  • GFF
  • BED
  • Wiggle
• External:
  • cisRED
  • miRanda
  • VISTA
  • REDfly
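The simple formats differ in their coordinate conventions, which matters when preparing data for import. BED starts are 0-based with an exclusive end, while GFF coordinates are 1-based and inclusive. The converter below is a sketch that does not reproduce the eFG importer; the function name bed_to_gff, the input record, the source and feature type, and the GFF3-style `Name=` attribute are all illustrative assumptions.

```perl
use strict;
use warnings;

# Convert one BED line (chrom, start, end, name, score, strand) to a
# nine-column GFF line. BED starts are 0-based and ends are exclusive;
# GFF coordinates are 1-based and inclusive, so only the start shifts.
sub bed_to_gff {
    my ($bed_line, $source, $type) = @_;
    chomp $bed_line;
    my ($chrom, $start, $end, $name, $score, $strand) = split /\t/, $bed_line;
    $name   = 'feature' unless defined $name;
    $score  = '.'       unless defined $score;
    $strand = '.'       unless defined $strand;
    return join "\t", $chrom, $source, $type, $start + 1, $end,
                      $score, $strand, '.', "Name=$name";
}

# Hypothetical input record, source and feature type.
print bed_to_gff("chrX\t214110\t214131\tpeak1\t0.9\t-", 'demo', 'peak'), "\n";
```

The BED interval 214110-214131 becomes the GFF interval 214111-214131, covering the same 21 bases.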
eFG Import
• ChIP-chip
  • Normalisation: VSN; TukeyBiweight.
  • Bio::MAGE/Tab2Mage
  • ResultSet nomenclature:
    EXP1
      EXP1_BR1
        EXP1_BR1_TR1
        EXP1_BR1_TR2
• ChIP-Seq
  • Pre/Post analysis
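The ResultSet naming scheme above can be split back into its parts with a regular expression. This sketch assumes that BR and TR stand for biological and technical replicate, matching the ExperimentalChip accessors biological_replicate() and technical_replicate(); parse_result_set_name is a hypothetical helper, not part of the API.

```perl
use strict;
use warnings;

# Split a ResultSet name into experiment name and optional replicate
# numbers. The experiment name may itself contain underscores.
sub parse_result_set_name {
    my ($name) = @_;
    my ($exp, $br, $tr) = $name =~ /^(.+?)(?:_BR(\d+))?(?:_TR(\d+))?$/
        or die "Cannot parse ResultSet name: $name\n";
    return {
        experiment           => $exp,
        biological_replicate => $br,    # undef when absent
        technical_replicate  => $tr,    # undef when absent
    };
}

for my $name (qw(EXP1 EXP1_BR1 EXP1_BR1_TR1 EXP1_BR1_TR2)) {
    my $p = parse_result_set_name($name);
    printf "%-13s experiment=%s BR=%s TR=%s\n", $name, $p->{experiment},
        $p->{biological_replicate} // '-', $p->{technical_replicate} // '-';
}
```

The same pattern handles names such as ctcf_ren_BR1_TR1, where the experiment name is ctcf_ren.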
eFG Analysis
• efg_pipeline environment
• Pipeline: uses the Ensembl gene build pipeline technology.
• Analysis Runnables:
  • ACME
  • ChIPOTLE
  • Splitter
  • TileMap
  • Nessie (unpublished)
  • SWEmbl (unpublished)
• Regulatory Build
eFG Analysis
[Figure. Tracks labelled: DNase1, DNase1, CTCF, H3K36me3, H3K4me3, H3K4me3, H3K27me3]
• Regulatory Build - feature construction:
  • Anchor/Focus sets: DNase1; CTCF.
  • Attribute sets: histone modifications; transcription factors.
• Regulatory Annotation - patterns associated with:
  • Promoter regions
  • Gene regions
  • Non-gene regions
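The idea of anchoring on focus features and decorating them with attribute features can be sketched as an interval overlap test. This is a deliberately simplified illustration and not the Regulatory Build algorithm: the coordinates are invented, and the helpers overlaps and attach_attributes are hypothetical.

```perl
use strict;
use warnings;

# True when two closed intervals [start, end] share at least one base.
sub overlaps {
    my ($x, $y) = @_;
    return $x->{start} <= $y->{end} && $y->{start} <= $x->{end};
}

# For each focus feature, collect the types of overlapping attribute features.
sub attach_attributes {
    my ($focus, $attributes) = @_;
    my @built;
    for my $f (@$focus) {
        my @types = map { $_->{type} } grep { overlaps($f, $_) } @$attributes;
        push @built, {
            type       => $f->{type},
            start      => $f->{start},
            end        => $f->{end},
            attributes => \@types,
        };
    }
    return @built;
}

# Invented coordinates, for illustration only.
my @focus = (
    { type => 'DNase1', start => 1000, end => 1200 },
    { type => 'CTCF',   start => 5000, end => 5150 },
);
my @attributes = (
    { type => 'H3K4me3',  start => 900,  end => 1500 },
    { type => 'H3K36me3', start => 3000, end => 4000 },
    { type => 'H3K27me3', start => 5100, end => 6000 },
);

for my $rf (attach_attributes(\@focus, \@attributes)) {
    my $attrs = join(',', @{ $rf->{attributes} }) || 'none';
    printf "%s %d-%d: %s\n", $rf->{type}, $rf->{start}, $rf->{end}, $attrs;
}
```

Here the DNase1 site picks up H3K4me3 and the CTCF site picks up H3K27me3, while the H3K36me3 region overlaps no focus feature and is left out.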
Getting More Information
• Workshop material: http://www.ebi.ac.uk/~njohnson/courses/15.09.2008-GI-Hinxton
• perldoc: viewer for inline API documentation.
    shell> perldoc Bio::EnsEMBL::Funcgen::RegulatoryFeature
  Online at: http://www.ensembl.org/info/software/Pdoc/
• eFG schema description, online at: http://www.ensembl.org/info/using/api/funcgen/funcgen_schema.html
• eFG installation document, online at: http://www.ensembl.org/info/using/api/funcgen/efg_introduction.html
• ensembl-dev mailing list: ensembl-dev@ebi.ac.uk