1 / 43

Exploring Williams-Beuren Syndrome

Exploring Williams-Beuren Syndrome. Professor Carole Goble http://www.mygrid.org.uk. Acknowledgements. myGrid is an EPSRC funded UK eScience Program Pilot Project. Particular thanks to the other members of the Taverna project, http://taverna.sf.net. Roadmap. my Grid in a nutshell

lalasa
Télécharger la présentation

Exploring Williams-Beuren Syndrome

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Exploring Williams-Beuren Syndrome Professor Carole Goble http://www.mygrid.org.uk GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  2. Acknowledgements myGrid is an EPSRC funded UK eScience Program Pilot Project Particular thanks to the other members of the Taverna project, http://taverna.sf.net GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  3. Roadmap • myGrid in a nutshell • Gene characterisation in Williams-Beuren Syndrome. • Semantic Aspects • Information model • Service discovery • Data Management - LSID • Metadata management for provenance – RDF • Lessons learnt and opportunities GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  4. In a nutshell • Bioinformatics toolkit • Open (Web) Services • myGrid components • External domain services • No control or influence over service providers • Open to third party metadata • Open extensible architecture • Assemble your own components • Designed to work together • Toolkit • Axis/Apache based • RDF and DAML+OIL/OWL • Jena, OilEd, Instance Store & FaCT Haystack Provenance Browser Semantic Discovery Pedro View UDDI registry Gateway & CHEF Portal Taverna WfDE Freefluo WfEE Event Notification LSID Info. Model Soaplab Gowlab mIR GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  5. Williams-Beuren Syndrome • Microdeletion of 155 Mbases on Chromosome 7 • Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK • Characterise an unknown gene • Annotation pipelines and Gene expression analysis Services from USA, Japan, various sites in UK GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  6. Physical Map CTA-315H11 CTB-51J22 GTF2IRD2P Gap FKBP6T POM121 GTF2IP NOLR1 NCF1P PMS2L STAG3 Block B Block A Block C Williams-Beuren Syndrome Microdeletion A-cen B-cen C-cen C-mid B-mid A-mid B-tel A-tel C-tel WBSCR1/E1f4H WBSCR5/LAB GTF2IRD1 WBSCR21 WBSCR22 WBSCR18 WBSCR14 GTF2IRD2 POM121 NOLR1 BAZ1B BCL7B FKBP6 GTF2I CLDN3 CLDN4 CYLN2 STX1A LIMK1 NCF1 TBL2 RFC2 FZD9 ELN ~1.5 Mb 7q11.23 Patient deletions * * WBS SVAS Chr 7 ~155 Mb GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  7. Filling a genomic gap Two major steps: • Extend into the gap: Similarity searches; RepeatMasker, BLAST • Characterise the new sequence: NIX, Interpro, etc… • Numerous web-based services (i.e. BLAST, RepeatMasker) • Cutting and pasting between screens • Large number of steps • Frequently repeated – info now rapidly added to public databases • Don’t always get results • Time consuming • Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive • Mundane • Much knowledge remains undocumented • Bioinformatician does the analysis GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  8. Point, click, cut, paste ID MURA_BACSU STANDARD; PRT; 429 AA. DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE DE (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE DE ENOLPYRUVYL TRANSFERASE) (EPT). GN MURA OR MURZ. OS BACILLUS SUBTILIS. OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE; OC BACILLUS. KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE. FT ACT_SITE 116 116 BINDS PEP (BY SIMILARITY). FT CONFLICT 374 374 S -> A (IN REF. 3). SQ SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32; MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  9. WBS Workflows: Query nucleotide sequence RepeatMasker ncbiBlastWrapper Pink: Outputs/inputs of a service Purple: Taylor-made services Green: Emboss soaplab services Yellow: Manchester soaplab services Grey: Unknowns GenBank Accession No URL inc GB identifier Translation/sequence file. Good for records and publications prettyseq GenBank Entry Amino Acid translation Sort for appropriate Sequences only Identifies PEST seq epestfind 6 ORFs Seqret Identifies FingerPRINTS pscan MW, length, charge, pI, etc Nucleotide seq (Fasta) pepstats sixpack ORFs transeq Predicts Coiled-coil regions RepeatMasker pepcoil tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr Coding sequence GenScan ncbiBlastWrapper Restriction enzyme map restrict SignalP TargetP PSORTII Predicts cellular location CpG Island locations and % cpgreport InterPro PFAM Prosite Smart Identifies functional and structural domains/motifs RepeatMasker Repetative elements Hydrophobic regions Pepwindow? Octanol? Blastn Vs nr, est databases. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004 ncbiBlastWrapper

  10. Collections of Tasks Building Domain Tasks Workflow Service Providers Enactment Bioinformaticians Storage Scientists Description Provenance Service Discovery Data Management Finding Querying Annotation providers GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  11. Registry Bioinformaticians Taverna WfDE Querying/sharing/ federating/registering Query & Retrieve Workflow Execution Feta Annotation/description FreeFluo WfEE invoking Annotation providers Store data/ knowledge Interface Description Pedro Annotation tool mIR Others Service Providers WSDL Vocabulary Soap- lab Haystack Provenance Browser Ontology Store Data descriptions Scientists GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  12. WBS task • Wrap services as web services • Register them • Build a workflow using the services • Evolve the workflow • Run it over and over again in case data has changed • Record results & provenance • Inspect and compare results & provenance • Event notification, portal, 3rd party annotation… GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  13. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  14. User Results Benchmark: Two iterations of workflows (1 day run) • Reduced gap by 267 693 bp at its centrmeric end • Correctly located all seven known genes in this region • Identified 33 of the 36 known exons residing in this location Manually: takes two days (+) including analysis Now: takes 30 mins to produce results and half a day for analysis. • Less boring. Less prone to mistakes. • Once notification installed won’t even have to initiate it. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  15. Where is the semantics GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  16. Information Model v2 • Scientific data and the life-science identifier • Types • Identifier Types • Values and Documents • Provenance information • Annotation and Argumentation • Resources and Identifiers • People, teams and organizations • Representing the e-science process • Experimental methods for e-science XML messages between services conform to the IMv2 GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  17. Semantic discovery • The User does the choosing of services • A common ontology is used to annotate and query any myGrid object including services. • Ontology is built using DAML+OIL and reasoning • Deployed as a static RDF graph • Discover workflows and services described in the registry via Taverna. • Look for all workflows that accept an input of semantic type nucleotide sequence. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  18. Controlling contents of metadata and data Ontologies Describing & Linking Provenance records Resource annotations Change & event Notification topics Role of Ontologies Service matching and provisioning Composing and validating workflows and service compositions & negotiations Service & resource registration & discovery Help Knowledge-based guidance and recommendation Schema mediation GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  19. Observations GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  20. Services http://pedro.man.ac.uk • Practically all the services are remote and third party • Services are changeable and unreliable • Redundant services are essential • WSDL in the wild is poor • Automated annotation GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  21. Can you guess what it is yet? GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  22. Model of services operation name, description input output task method resource application service name, description authororganisation parameter name, description semantic type format transport type collection type collection format workflow WSDL operation WSDL service Soaplab service bioMoby service GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  23. SHIM Services • Services that enable domain services to fit together • Outnumber domain services • Libraries • Candidates for automatic selection, composition and substitution Main Bioinformatics Applications Main Bioinformatics Services Main Bioinformatics Application Main Bioinformatics Application SHIM Services GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  24. Results management • Automated workflows produce lots of heterogeneous data • These are just some of the results from one workflow run for Williams Disease GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  25. Amplification One input Many outputs GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  26. Dealing with results • FreeFluo agnostic about the data flowing through it. • Taverna includes a DataThing class, which can be tagged with terms from ontologies, free text descriptions and MIME types, and which may contain arbitrary collection structures. • Using the metadata hints we can locate and launch pluggable view components. • Hybrid typing scheme allows for a ‘best effort’ approach to data typing. • Life science types are intractable for reasonable effort or completeness. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  27. Intermediate Results GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  28. Intermediate Results • Workflows change the way the bioinformatican works • Before: analyse results as go along • After: all results in one go • So linking intermediate results important GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  29. Life Science IDs http://www.i3c.org/wgr/ta/resources/lsid/docs/ • LSID provides a uniform naming scheme. • LSID Resolver guarantees to resolve to same data object. • LSID Authority dishes them out. • Also returns metadata of object. • Used throughout myGrid as an object naming device. • myGrid Repository acts an LSID Authority • LSID allows universal access to results for collaboration, as well as for review. • RDF+LSID explains the context of results, and provides guidance for further investigations. I3C / IBM / EBI proposal for a Life Science Identifier Pioneered by myGrid GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  30. Process Provenance GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  31. Link v Data Representation • Data management questions refer to relationships rather than internal content • What are the origins of this data? • Which service produced this data? • Which data is this derived from? • Who was this data produced for? • ?What is this data telling me? • Data analysis questions delegated to external services. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  32. Representing links urn:lsid:taverna.sf.net:datathing:45fg6 urn:lsid:taverna.sf.net:datathing:23ty3 • Identify each resource • Life science identifier: URI with associated data and metadata retrieval protocols. • Understanding that underlying data will not change GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  33. Representing links II http://www.mygrid.org.uk/ontology#derived_from urn:lsid:taverna.sf.net:datathing:45fg6 urn:lsid:taverna.sf.net:datathing:23ty3 • Identify link type • Again use URI • Allows us to use RDF infrastructure • Repositories • Ontologies GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  34. Organisation level provenance Process level provenance Service Project runBye.g. BLAST @ NCBI Experiment design Process Workflow design componentProcesse.g. web service invocation of BLAST @ NCBI partOf Event instanceOf componentEvente.g. completion of a web service invocation at 12.04pm Workflow run Data/ knowledge level provenance knowledge statementse.g. similar protein sequence to run for User can add templates to each workflow process to determine links between data items. Data item Person Organisation Data item Data item GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004 data derivation e.g. output data derived from input data

  35. ..masked_sequence_of .. nucleotide_sequence project ..part_of organisation >gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAG GAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTC AAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCT GTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG experiment definition rdf:type ..part_of group urn:lsid:taverna:datathing:13 ..part_of ..author workflow definition ..works_for ..invocation_of ..author person ..BLAST_Report workflow invocation ..similar_sequences_to ..run_for ..run_during service description rdf:type 19747251 AC005089.3 831 Homo sapiens BAC clone CTA-315H11 from 7, complete sequence 15145617 AC073846.6 815 Homo sapiens BAC clone RP11-622P13 from 7, complete sequence 15384807 AL365366.20 46.1 Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence 7717376 AL163282.2 44.1 Homo sapiens chromosome 21 segment HS21C082 16304790 AL133523.5 44.1 Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence 34367431 BX648272.1 44.1 Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119) 5629923 AC007298.17 44.1 Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence 34533695 AK126986.1 44.1 Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486 20377057 AC069363.10 44.1 Homo sapiens chromosome 17, clone RP11-104J23, complete sequence 4191263 AL031674.1 44.1 Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence 17977487 AC093690.5 44.1 Homo sapiens BAC clone RP11-731I19 from 2, complete sequence 17048246 AC012568.7 44.1 Homo sapiens chromosome 15, clone RP11-342M21, complete sequence 14485328 AL355339.7 44.1 Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence 5757554 AC007074.2 44.1 Homo sapiens PAC clone RP3-368G6 from X, complete sequence 4176355 AC005509.1 44.1 Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence 2829108 AF042090.1 44.1 Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence urn:lsid:taverna:datathing:15 service invocation ..described_by ..created_by ..filtered_version_of A B Provenance tracking • Automated generation of this web of links preferable • Workflow enactor generates • LSIDs • Data derivation links • Knowledge links • Process links • Organisation links Relationship BLAST report has with other items in the repository Other classes of information related to BLAST report GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  36. Storage • LSID has no protocol for storage • Taverna/ Freefluo implements its own data/ metadata storage protocol Publish interface Taverna/ Freefluo Metadata Store data Data store metadata GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  37. Retrieval • LSID protocol used to retrieve data and metadata • Query handled separately RDF aware client LSID aware client Query LSID interface Publish interface Taverna/ Freefluo Metadata Store Metadata Store data Data store Data store metadata GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  38. IBM’s BioHaystack GenBank record Portion of the Web of provenance Managing collection of sequences for review GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  39. Observations • Managed the transition from generic middleware development to practical day to day useful services • Real users (plural) fundamental to that • End to end support for an entire scenario • Bury the semantics • Show stoppers for practical adoption are not technical showstoppers • Can I incorporate my favourite service? • Can I manage the results? • By tapping into (defacto) standards and communities we can leverage others results and tools – LSID, Haystack, Pedro. GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  40. Acknowledgements myGrid is an EPSRC funded UK eScience Program Pilot Project Particular thanks to the other members of the Taverna project, http://taverna.sf.net GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  41. myGrid People Core • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pockock Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe. Users • Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK • Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK Postgraduates • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire Industrial • Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM) • Robin McEntire (GSK) Collaborators • Keith Decker GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  42. http://www.mygrid.org.uk GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

  43. Summary • myGrid offers service based middleware components • Open source and freely downloadable • Open Grid Service Architecture-compliant • Allows the scientist to be at the centre of the Grid -- Personalisation • Generic middleware that suits the creation of bioinformatics applications • Inclusion of rich semantics to facilitate the scientific process Available from http://www.mygrid.org.uk GGF11 Semantic Grid Applications Workshop, Hilton Hawaiian Village Beach Resort & Spa, Honolulu, Thursday June 10, 2004

More Related