An Identity Crisis in the Life Sciences
Jun Zhao, Carole Goble and Robert Stevens, The University of Manchester, UK
Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi, our users, and the EPSRC.
UK e-Science project: middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.
Bioinformatics workflows
• Data pipelines: collect data, compute data
• Frequently updated public resources
• Open world
• The same data product can be obtained in different experiment contexts
• Bioinformatician users
• Taverna workflow workbench
[Figure: a workflow producing a collected metabolic pathway and two computed BLAST reports]
Data generated by services/workflows
[Figure: provenance graph. Data nodes include urn:data1, urn:data2, urn:data:3, urn:data12, urn:data:f1, urn:data:f2 and sequence hits (urn:hit1…, urn:hit2…., urn:hit50…..). Invocation nodes: urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5. Task: Find similar sequence. Concepts: SwissProt_seq, Sequence_hit, Blast_report. Properties: [instanceOf], [input], [output], [performsTask], [contains], [hasHits], [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom], [type], [hasName]. Other labels: Missed sequence, New sequence, DatumCollection, LSDatum. Legend: properties, concepts, services, data, literals.]
Fusion between different data models using shared concepts and shared data
[Figure: the provenance graph of the previous slide overlaid with a second model. Second-model nodes include urn:run5, urn:run7, urn:williamsA, urn:williamsB, urn:genbank1…, urn:genbank2…, urn:genbank50…, urn:data2 and urn:data:3. Second-model concepts: DNA_sequence, Blast_service, Blast_report. Second-model properties: instanceOf, inputOf, outputOf, runOf, createdBy, createdFrom, contains_similiar_seq_to. Shared identifiers and sources shown: LSID, GenBank, UniProt.]
Add assertions, add rules, reason over assertions.
Putting Provenance to Use
• Single workflow
  • audit trail
  • recipe
• Multiple workflow runs (versions)
  • Aggregation - gathering
  • Integration - merging
  • Comparison - differencing
Any idea?
• 30350027
• 30350027
• gi:30350027
Life Science Identifier
A ruddy great lump of RDF
URIs for Data
urn:lsid:mygrid.ac.uk:data:49841:1
• Life Science Identifier
• Protocol for allocation and resolution
• Adopted by a range of data providers
• LSIDs from the data providers' databases for the data we collect during workflow execution
• LSIDs for the data products we compute during workflow execution
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
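
The example identifier above follows the LSID URN layout: `urn:lsid:` followed by an authority, a namespace, an object identifier and an optional revision. A minimal Python sketch of splitting one into its parts; `parse_lsid` and the `LSID` tuple are illustrative names chosen here, not part of Taverna or of any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its components (hypothetical helper)."""
    parts = text.split(":")
    # Expected shape: urn:lsid:authority:namespace:object[:revision]
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

Shortened identifiers such as those drawn in the figures of later slides would be rejected by this parser, since they omit one of the required components.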
Having a BLAST in every workflow!
[Workflow diagram: Seq, database and score feed BLAST, which produces a BlastReport; BLAST_simplifer turns the report into a list of Sequences; GenBank_retrieve fetches a GenBank Report.]
Computed Collections and Collected data items
[Figure: three BLAST Reports, each passed through BLAST simplifer to give a SEQ collection (listOf) of sequences drawn from Sequence1 to Sequence4; the same sequences recur across the collections.]
Data Co-references
[Figure: two BLAST Reports, each passed through BLAST simplifer to give a SEQ collection (listOf) of sequences drawn from Sequence1 to Sequence4. Labels: equivalent data, corresponding data, context of the workflow.]
Aggregation of repeated runs
[Figure: RDF graphs for Run1 and Run2. Nodes: urn:lsid:tav:57b6, urn:lsid:tav:57b13, urn:lsid:tav:57b14 and urn:lsid:tav:ic531, urn:lsid:tav:ic537, urn:lsid:tav:ic538. Types: DNASeq, BLASTReport. Properties: rdf:type, derivedFrom, refersTo. The refersTo edges point to the DNASeq AC005089.]
External Duplicates
[Figure: one sequence known under several identifiers: gi:15145617, ac073846, urn:lsid:myg:ac073846 and mmu:11423. Labels: different providers, a replica, different tool providers.]
LSID Assignment Process
[Architecture diagram: Taverna LSID Authority; workflow enactor; external domain service; data service (BAKLAVA); provenance service (KAVE) receiving wfEvents; data storage group with MySQL relational stores, customized DBs and a Jena/Sesame RDF store.]
• Equivalent data in repeated runs
• Duplicate ids for these data
Provenance from two repeated runs
[Figure: Run1 graph over urn:lsid:tav:brpt1, urn:lsid:tav:seqcollection1 and urn:lsid:tav:seq1; Run2 graph over urn:lsid:tav:brpt2, urn:lsid:tav:seqcollection2 and urn:lsid:tav:seq2. Properties in both: my:derivedFrom, my:hasElement.]
No convergence
Execution duplicates
[Figure: the workflow (BLAST, BlastReport, BLAST_simplifer, a list of Seq, GenBank_retrieve) run twice. Sequence1 carries urn:gb:seq1 in both runs; the two BLAST reports are urn:lsid:tav:brpt1 and urn:lsid:tav:brpt2.]
But hidden!!
Execution duplicates
[Figure: the same workflow (BLAST, BlastReport, BLAST_simplifer, a list of Seq, GenBank_retrieve). Two SEQ1 collections, urn:tav:seqc1 and urn:tav:seqc2, each a listOf Sequence1, Sequence2 and Sequence3. Sequence1 appears as urn:tav:seq1 and urn:tav:seq2, alongside urn:gb:seq1.]
Managing identity co-reference
• Identity co-reference:
  • Identifying duplicate identities that refer to the same object while keeping their context
• An approach:
  • An IDSet entity
    • Identity equivalence for collected data
    • Identity correspondence for computed data
  • An identity service
    • Identity normalisation and cleansing activity
IDSet entity
• IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}}
[Figure: a Sequence (urn:gb:seq1) and a BLASTreport (urn:lsid:tav:brpt1). Labels: query by its content, query by its identity, IDSet created by another organization. IDSet1 and IDSet3 are joined by a merge.]
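
One way to picture an IDSet is as a set of identifiers taken to denote the same object, with a merge that unions two sets, for example when one of them was created by another organization. The sketch below is an assumption about the data structure, not the authors' implementation: the `IDSet` class, its `kind` field and `merge` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class IDSet:
    """Identifiers believed to denote the same object (illustrative only)."""
    # Assumed values: "equivalence" for collected data,
    # "correspondence" for computed data.
    kind: str
    members: Set[str] = field(default_factory=set)

    def merge(self, other: "IDSet") -> "IDSet":
        """Return a new IDSet holding the members of both inputs."""
        if self.kind != other.kind:
            raise ValueError("cannot merge IDSets of different kinds")
        return IDSet(self.kind, self.members | other.members)


ours = IDSet("correspondence", {"urn:tav:brpt1"})
theirs = IDSet("correspondence", {"urn:tav:brpt3"})
merged = ours.merge(theirs)
print(sorted(merged.members))
```

Returning a new object leaves both inputs untouched, so each original identifier can still be traced back to the set it came from.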
Extended Architecture
[Architecture diagram: as in the LSID assignment process (Taverna LSID Authority, workflow enactor, external domain service, data service BAKLAVA, provenance service receiving wfEvents, MySQL relational stores, customized DBs, Jena/Sesame RDF store), with KAVE extended to KAVE+ by an identity service and an identity store backed by a MySQL relational store.]
Identifying collected product
[Figure: KAVE+ passes urn:gb:seq1 to the identity service, which consults the identity store and its IDSet.]
• Receive an identity
• Look for or create its IDSet
• Store the id and the IDSet
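
The steps named on this slide can be sketched as a small in-memory service. This is a hedged illustration only: `IdentityService`, `resolve` and the `IDSet<n>` naming are made up here, and plain dicts stand in for the identity store.

```python
from typing import Dict, Set


class IdentityService:
    """Toy identity service for collected data items (hypothetical)."""

    def __init__(self) -> None:
        self._idset_of: Dict[str, str] = {}      # identity -> IDSet name
        self._members: Dict[str, Set[str]] = {}  # IDSet name -> identities

    def resolve(self, identity: str) -> str:
        # Receive an identity, then look for its IDSet.
        idset = self._idset_of.get(identity)
        if idset is None:
            # No IDSet yet: create one.
            idset = f"IDSet{len(self._members) + 1}"
            self._members[idset] = set()
        # Store the id and the IDSet.
        self._members[idset].add(identity)
        self._idset_of[identity] = idset
        return idset


service = IdentityService()
first = service.resolve("urn:gb:seq1")   # seen in a first run
second = service.resolve("urn:gb:seq1")  # seen again in a repeated run
print(first, second)
```

Both calls return the same IDSet name, which is what lets repeated runs converge on one node for the collected sequence.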
Identifying a collection product
[Figure: KAVE+ passes urn:lsid:seqc2, the collection SEQ2 (a listOf Seq1, Seq2 and Seq3), to the identity service; the identity store holds an IDSet containing urn:lsid:seqc1 and urn:lsid:seqc2.]
• Receive an identity
• Look for equivalent collection
• Look for or create its IDSet
• Store the id and the IDSet
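
For a collection, the extra step is finding an equivalent collection already known to the store. The sketch below rests on one loud assumption: two collections are treated as equivalent when they contain the same element identities, ignoring order. The slides do not state the equivalence rule. `CollectionIdentityService` and its methods are hypothetical, and `urn:gb:seq2` and `urn:gb:seq3` are made-up element identifiers.

```python
from typing import Dict, FrozenSet, Iterable, Set


class CollectionIdentityService:
    """Toy identity service for computed collections (hypothetical)."""

    def __init__(self) -> None:
        self._idset_by_elements: Dict[FrozenSet[str], str] = {}
        self._members: Dict[str, Set[str]] = {}

    def resolve(self, collection_id: str, element_ids: Iterable[str]) -> str:
        # Look for an equivalent collection: same elements, any order (assumed rule).
        key = frozenset(element_ids)
        idset = self._idset_by_elements.get(key)
        if idset is None:
            # None found: create a new IDSet for this element set.
            idset = f"IDSet{len(self._members) + 1}"
            self._idset_by_elements[key] = idset
            self._members[idset] = set()
        # Store the collection id in the IDSet.
        self._members[idset].add(collection_id)
        return idset

    def members(self, idset: str) -> Set[str]:
        return set(self._members[idset])


service = CollectionIdentityService()
elements = ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq3"]
a = service.resolve("urn:lsid:seqc1", elements)
b = service.resolve("urn:lsid:seqc2", list(reversed(elements)))
print(a == b, sorted(service.members(a)))
```

An order-sensitive rule would use a tuple instead of a `frozenset` as the lookup key; which is appropriate depends on whether list order carries meaning in the workflow.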
Putting the Identity Service to Use
[Figure: layers labelled Identity Management, Provenance Normalization, Provenance Aggregation and Provenance Integration, shown over the graphs of Run1 and Run2 with nodes b1, s1, c1 and b2, s2, c2.]
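
Normalization can be pictured as rewriting every identifier in a run's provenance triples to a canonical representative of its IDSet, after which the graphs of repeated runs share nodes and can be aggregated by set union. In the sketch below the node labels come from the figure, but the `derivedFrom` edges and the `canonical` mapping are assumptions made for illustration; in practice the mapping would come from the identity service.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def normalize(graph: Set[Triple], canonical: Dict[str, str]) -> Set[Triple]:
    """Rewrite subjects and objects to their canonical identifiers."""
    return {(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in graph}


# Assumed edges: report derived from sequence, collection derived from report.
run1 = {("b1", "derivedFrom", "s1"), ("c1", "derivedFrom", "b1")}
run2 = {("b2", "derivedFrom", "s2"), ("c2", "derivedFrom", "b2")}

# Assumed output of identity management: Run2 ids map to Run1 ids.
canonical = {"s2": "s1", "b2": "b1", "c2": "c1"}

before = run1 | run2
after = normalize(run1, canonical) | normalize(run2, canonical)
print(len(before), len(after))
```

Without normalization the union has four disjoint triples; with it the two runs collapse onto the same two triples. Keeping the original triples alongside the normalized ones would preserve the per-run context.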
Discussion
• Scalability issues:
• Normalizing provenance graphs
• Building IDSet for collections with multiple hierarchies
• Open world data type-free context
• Use experimental context more effectively; workflows are not independently executed.
• Granularity of identity
• Identity aware operations in workflow
• Multiple naming schemes
• Migration duplicates
• Compacting data results
Conclusion
• Combining provenance depends on finding points of commonality, such as data identity.
• Duplicate identities will occur in an open world
• Hard to achieve uniqueness without community commitment
• Different types of equivalent objects
• How much can be avoided?
• And how much has to be repaired?