Introduction to DAS / State of the Union

Tim Hubbard (th@sanger.ac.uk)
DAS developer workshop, 10th March 2009
Wellcome Trust Genome Campus

Distributed Annotation System, or How I Learnt to Stop Worrying and Love Data Federation
Credit: Andreas Prlić

Distributed Annotation System
• Origins:
  • XML client/server specification (http://biodas.org/)
  • Lincoln Stein, Sean Eddy, Robin Dowell and LaDeana Hillier
  • acedb based prototype server
  • Java based prototype client
  • Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R. & Stein, L. (2001) BioMedCentral Bioinformatics 2.
• Genome campus adoption
  • Initially via Ensembl becoming a DAS client (now also a DAS server)
  • Code: Dazzle and Proserver servers; Bio::Das::Lite and biojava client libraries
  • Hosts DAS registry (http://www.dasregistry.org/)

DAS in a nutshell
• Standardized set of web services
  • Reference servers (the sequence)
  • Annotation servers (features: chr:start-end)
  • Alignment servers (chr:start-end matches chr:start-end)
  • Identifier based servers (ref item X rather than coordinate)
• Standardization allows clients to connect to different DAS sources without additional programming
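As a rough illustration of the annotation-server case above, the sketch below builds a DAS 1.5-style "features" request for one segment and parses a DASGFF-style reply. The host das.example.org, the source name "demo", the helper names and the sample feature are all made up for illustration; only the general URL shape (.../das/SOURCE/features?segment=chr:start,end) and the SEGMENT/FEATURE element layout follow the DAS specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server, source, chrom, start, end):
    """Build a DAS 'features' request URL for one segment (chr:start,end)."""
    # Keep ':' and ',' unescaped so the segment reads as in the DAS spec.
    query = urlencode({"segment": f"{chrom}:{start},{end}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Illustrative reply in the DASGFF layout; the feature itself is invented.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feat1" label="example feature">
        <TYPE id="exon">exon</TYPE>
        <METHOD id="demo">demo</METHOD>
        <START>1200</START>
        <END>1350</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return one dict per FEATURE element, tagged with its segment id."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


if __name__ == "__main__":
    print(features_url("http://das.example.org", "demo", "1", 1000, 2000))
    print(parse_features(SAMPLE_RESPONSE))
```

Because every annotation source answers the same request with the same document layout, a client written this way only needs a different server and source name to read another source.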

Data integration
• Complete genomes provide the framework to pull all biological data together such that each piece says something about biology as a whole
• Biology is too complex for any organisation to have a monopoly of ideas or data
• The more organisations provide data or analysis separately, the harder it becomes for anyone to make use of the results

Utility of bioinformatics
[Figure: plot labels are "Scientific impact", "Too little bioinformatics", "Too many databases" and "Too diverse interfaces"]

Split data and presentation
• Databases responsible for curating data and serving it as primitive datatypes defined by open standards (high cost)
• Different front ends or components of front ends compete for users (development of each low cost), cf. browsers

Data Services

Campus DAS systems
[Diagram: labels recovered from the figure, grouped by its headings]
• Clients: e! contigview, e! geneview, Apollo, otterlace, Pfam, 3D structure, epigenome
• Servers: Dazzle, Proserver, LDAS
• Sources: Ensembl, Pfam, UniProt, PubMed, COSMIC
• Coordinate systems: Genome Coordinates, CDS Coordinates, Protein Coordinates, Stable Identifiers, Sequence Alignments
• Registry

Rise of Federation Technologies
• DAS for features
• BioMart for data mining
  • BioMart server is a DAS server
• New international genome data projects
  • routinely using the F word
  • frequently the D and B words too
  • e.g. International Cancer Genome Consortium

DAS infrastructure status
• Lots of progress
  • Servers: Dazzle, Proserver, MyDas, Bio::Das::Lite
  • Clients: Ensembl, Vega, Dasty, SPICE, Pfam, Jalview, Pepper, IGB
  • >500 sources in DAS registry (http://www.dasregistry.org/)
  • Broadly adopted by large scale projects: Ensembl, biosapiens, efamily, ZF-models, eProtein, ENCODE annotation
  • Extensions in 1.53E: stylesheets, semantic zooming, ontology support, timestamps, interactions
  • Planned 1.6: incorporating some features of DAS2 specification
  • Better adoption of DAS in US
• Opportunities
  • Searching, writeback
  • Source ranking, credit, social networking
  • Inter-client communications protocol
  • Async delivery/caching; servers built on servers/workflows
  • Alternative entry points from servers? Next left/right? Date of addition?

2008 the year of…
• Open access to publications
  • PMC, ukPMC, Zotero, Papers, MyNCBI, Citeulike, Connotea, 2collab and HubMed
  • All WT funded publications open in 6 months
  • All NIH funded publications open in 12 months
• DAS for publications?
  • Text is just a new coordinate system
• Links to Social Networks?
  • Google OpenSocial
  • Still waiting…

2009 the year of…
• Massive datasets
  • Track likely to be 50 million Solexa transcriptome reads
• Need:
  • Better ways for users to create tracks for large datasets

Problems of large user data (credits to Jim Kent, UCSC)
• Easy to generate 1 GB files with next gen sequencing
  • 25 million tag mappings at 40 bytes each
  • Potential to translate into histograms with 1 floating point number every 12 bases
• Slow to load into the MySQL database backend of a local DAS server; many users will not want to set up DAS servers
• Too large to upload to remote DAS server services (e.g. Ensembl) to create a track
• Most users only look at 5-50 sites, less than 1% of the genome
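The sizes on this slide can be checked with a few lines of arithmetic. The sketch below reproduces the 1 GB figure from the tag count and record size given above, then estimates the size of a whole-genome histogram with one value per 12 bases. The genome length (about 3.1 billion bases, roughly human) and the 4-byte single-precision float are assumptions, not figures from the talk, and bin_coverage is a hypothetical helper showing one simple way to bin tag start positions.

```python
TAG_COUNT = 25_000_000      # tag mappings (from the slide)
BYTES_PER_TAG = 40          # bytes per mapping (from the slide)
raw_bytes = TAG_COUNT * BYTES_PER_TAG

GENOME_BASES = 3_100_000_000  # assumption: approximate human genome length
BIN_SIZE = 12                 # one value every 12 bases (from the slide)
BYTES_PER_FLOAT = 4           # assumption: single-precision floats
histogram_bytes = GENOME_BASES // BIN_SIZE * BYTES_PER_FLOAT


def bin_coverage(tag_starts, region_start, region_end, bin_size=BIN_SIZE):
    """Count tag start positions per bin over the half-open region [start, end)."""
    n_bins = -(-(region_end - region_start) // bin_size)  # ceiling division
    counts = [0.0] * n_bins
    for pos in tag_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // bin_size] += 1.0
    return counts


if __name__ == "__main__":
    print(f"raw tag mappings: {raw_bytes / 1e9:.2f} GB")
    print(f"whole-genome histogram: {histogram_bytes / 1e9:.2f} GB")
    print(bin_coverage([0, 5, 12, 30], 0, 36))
```

Under these assumptions both representations come out at around a gigabyte for a whole genome, which is why loading or uploading the full dataset is the bottleneck even though a user typically views only a small fraction of it.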

Jim Kent’s idea
• User runs program to convert their data into a single indexed file (BigWig & BigBed)
• Place on their website
• UCSC browser fetches parts of file on demand using HTTP(S) “byte range” queries
• Relationship to DAS?
  • Potential to create DAS server plugin to serve BigWig/BigBed files as DAS servers
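The "byte range" mechanism mentioned above is the standard HTTP Range header: the client asks for a slice of a remote file and a cooperating server answers with status 206 (Partial Content). The sketch below shows the request side using only the Python standard library. It is not the BigWig/BigBed reader itself: the function names are hypothetical, the index lookup that would decide which offsets to fetch is left out, and serve_range is just a local stand-in for a server so the slicing rule can be seen without a network.

```python
import urllib.request


def range_header(offset, length):
    """Format an HTTP Range header value; both ends of the range are inclusive."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def fetch_range(url, offset, length):
    """Fetch `length` bytes starting at `offset` from a remote file."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range; 200 would mean it
        # ignored the header and is sending the whole file.
        if response.status != 206:
            raise RuntimeError("server did not honour the Range request")
        return response.read()


def serve_range(data, header):
    """Local stand-in for a server: return the slice a Range header asks for."""
    if not header.startswith("bytes="):
        raise ValueError("unsupported range unit")
    first, last = (int(part) for part in header[len("bytes="):].split("-"))
    return data[first:last + 1]


if __name__ == "__main__":
    payload = b"abcdefghij"
    print(range_header(2, 4))
    print(serve_range(payload, range_header(2, 4)))
```

With an index at a known position in the file, a client needs only a handful of such requests to draw one region, so the data file can stay on the user's own web server.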

Acknowledgements
• Ewan Birney, Tony Cox, Thomas Down, Rob Finn, Stefan Graf, David Jackson, Andreas Kahari, Eugene Kulesha, Henning Hermjakob, Roger Pettett, Matt Pocock, James Smith, Jim Stalker, Janet Thornton, Jonathan Warren, Andy Jenkinson, Andreas Prlić
• Ensembl/Sanger Web team
• efamily, biosapiens, eProtein
• Zebrafish analysis (ZF-models)
• Anacode/Acedb (otterlace/Zmap)

2009 the year of…
• Massive datasets
  • Track likely to be 50 million Solexa transcriptome reads
• Private datasets
  • EGA requires registration and logins
  • Even summary data currently not public
• Need:
  • Better ways for users to create tracks for large datasets
  • Federated access controls for patient data

Todo: tiling array DAS stylesheet magic (Eugene Kulesha)
