
caGrid Version 0.5 Reference Implementation <caTIES>. caBIG Architecture Workspace Face to Face, Georgetown University, August 16th-18th, 2005. Rebecca Crowley and Kevin Mitchell, crowelyrs@upmc.edu, mitchellkj@upmc.edu



Presentation Transcript


  1. caGrid Version 0.5 Reference Implementation <caTIES>, caBIG Architecture Workspace Face to Face, Georgetown University, August 16th-18th, 2005. Rebecca Crowley and Kevin Mitchell, crowelyrs@upmc.edu, mitchellkj@upmc.edu

  2. Outline
  • Project History and demonstration of existing application
  • Data Model
  • Project Architecture
  • Process of getting to “Silver” level compliance
  • Functionality Exposed to Grid
  • Process of Grid Enablement
  • Lessons Learned / Technical Difficulties / Wish List
  • Acknowledgements

  3. Project History
  • caTIES is a text processing system that creates de-identified structured data from unstructured free-text pathology reports and makes those reports accessible to researchers.
    • Information about tumor, stage, and prognostic factors
    • Index to fixed tissue; source of annotation for frozen or processed tissue
  • caTIES de-identifies the entire corpus of reports and creates concept codes using the NCI Metathesaurus, storing the results in a MySQL datastore.
  • Deployed to an adopter at the University of Pennsylvania; the intention is to create a network of institutions that can share data and tissue.
  • Used OGSA-DAI and OGSI to facilitate data sharing between Pitt and Penn; IRB protocols to provide data, creating a culture for data sharing.

  4. Demonstration of existing web-service based application

  5. Data Model
  • Evolving data model
  • Rich annotations derived from the Pathology Report, not all of which appear in the Phase I model
  • Each phase of the caTIES project adds additional use cases and expands the scope of the data model
    • Phase I: Basic mechanisms for document search and retrieval based on NCI Metathesaurus concepts
    • Phase II: Fill requests for tissue that utilize the HB model; add queries based on temporality
    • Phase III: Retrieval of data or tissues based on more finely grained structured information extracted from documents

  6. Phase I model

  7. Early Phase II model

  8. Small part of Phase III model

  9. Reference Implementation Goals (from our perspective)
  • Achieve Silver compliance using the caCORE SDK
  • Replace the existing OGSA-DAI web-services based method for data sharing with caGrid
  • Utilize grid and local security mechanisms
  • Learn more about how caGrid would be used for other facets of this project
    • Analytic services for our coder processing resources
    • Communication between applications
    • Requirements for the next iteration of caGrid

  10. Process of getting to “Silver” level compliance
  • ROUND ONE
  • Developed the original model by ‘extracting’ the two classes needed to reproduce Phase I functionality
    • In particular, we left out the ‘identified’ side of our model
  • Generated a caCORE-like API using the caCORE SDK (a minimal client sketch follows below)
    • Created the data model in EA (MySQL)
    • Automatically created the ORM using the caCORE SDK
  • Created the semantic annotations file using the semantic-connector script
    • Manually annotated our own models prior to sending them to NCICB
    • Learned a lot about what we can and cannot expect from these annotations
    • Developers have to be the ones to catch the mistakes; it may be best if we know these annotations well
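For context, a caCORE-like API of this kind is consumed from a client through an ApplicationService handle using query-by-example. The sketch below assumes that pattern; the bootstrap call and package names differ between caCORE SDK releases, the service URL is a placeholder, and PathologyReport here is a hypothetical stand-in for a generated Phase I domain class, not the actual caTIES bean.

```java
// Sketch of a client call against a caCORE-SDK-generated API (query-by-example).
// The service URL is a placeholder and PathologyReport is a hypothetical stand-in
// for a generated Phase I domain bean; the bootstrap call and packages vary by SDK release.
import gov.nih.nci.system.applicationservice.ApplicationService;
import java.util.List;

public class GeneratedApiSketch {

    // Hypothetical stand-in for a generated domain class.
    public static class PathologyReport {
        private Integer accessionYear;
        public Integer getAccessionYear() { return accessionYear; }
        public void setAccessionYear(Integer accessionYear) { this.accessionYear = accessionYear; }
    }

    public static void main(String[] args) throws Exception {
        // Handle to the remote service exposed by the generated system (placeholder URL)
        ApplicationService appService =
                ApplicationService.getRemoteInstance("http://localhost:8080/caties/server/HTTPServer");

        // Query-by-example: populate only the attributes to match on
        PathologyReport example = new PathologyReport();
        example.setAccessionYear(2004);

        List results = appService.search(PathologyReport.class, example);
        System.out.println("Matching reports: " + results.size());
    }
}
```

A smoke test of this shape is also what the later lesson (slide 20) recommends running before any semantic annotation work.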

  11. First Round

  12. Process of getting to “Silver” level compliance
  • But there was a problem: the object model had never been formally reviewed, because there was no formal process to do so when we went through ROUND ONE. Started anew on August 8th.
  • Email and teleconference with key NCICB staff for our WS (Ian Fore) and VCDE (George Komatsoulis). Issued a new object model. NCICB requested 13 additional changes. Came to agreement over most of these changes over the course of 5 days, ~15 emails, and one teleconference.
    • New objects required for concept, referent, application, and execution
    • Removal of a relationship; clarification of naming and definitions
    • The most contentious issue was the use of ordered concept code lists, a kind of document transformation that we find very useful
  • In general, we found that some slight contortions were needed to apply ‘domain-modeling’ principles when trying to model documents
  • Regenerated the data model and dependencies; created a new database and API
  • Moved data from the backend generated in the first attempt to the final backend

  13. Process of getting to “Silver” level compliance
  • Ran the semantic-connector script over the XMI to produce an Excel file
  • Sent annotations to NCICB for review and creation of 2 additional concepts
  • Created annotated XMI and sent it to NCICB
  • Loaded to caDSR Stage – a small human error in the Excel file yielded a model with three classes appearing as one

  14. Phase I model

  15. Process of getting to “Silver” level compliance
  • Ran the semantic-connector script again
  • Sent the annotated XMI to NCICB
  • Loaded to caDSR Stage – reviewed and approved
  • Loaded to caDSR Production
  • Metadata extract generated by caDSR and returned to us

  16. Process of getting to “Silver” level compliance

  17. Functionality Exposed to the Grid
  • Data service, exposing all objects in the current Phase I model to the grid, but very likely tightly restricting access until processes are in place to administer users for research on the caGrid
  • Example queries we will or would like to support (Phases I and II); the first is sketched in code below:
    • Return the total number of cases across all caTIES nodes across the entire Grid of pediatric PNET in patients <18 years old (public)
    • Return de-identified reports for women age 30-50 with Atypical Ductal Hyperplasia on breast biopsy followed within 1-5 years by DCIS or Infiltrating Ductal Carcinoma on any procedure (users with IRB authorization to access de-identified data)
    • Return de-identified accession numbers for patients with high grade prostatic intraepithelial neoplasia (HGPIN) but no prostate cancer – and then order these blocks (users with IRB authorization to access de-identified data and get tissue, approved for materials transfer from the institution)
    • Return actual identified accession #’s for cases at my institution of pilocytic astrocytomas from 1989-1992 (honest broker)
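To show the shape of the first query above (a grid-wide count of pediatric PNET cases in patients under 18), here is a deliberately simplified sketch. The CaTiesGridClient interface, the stub nodes, and the concept-code string are all hypothetical illustrations; they do not represent the actual caGrid 0.5 data-service API, which also handles discovery, security, and query transport.

```java
// Illustrative only: CaTiesGridClient is a hypothetical facade, not the real
// caGrid 0.5 data-service interface. The sketch shows the federated shape of
// the query: ask each caTIES node the same question and sum the counts.
import java.util.Arrays;
import java.util.List;

public class FederatedCountSketch {

    /** Hypothetical per-node query facade. */
    interface CaTiesGridClient {
        /** Count de-identified reports coded with the given concept for patients under maxAgeYears. */
        long countReportsByConcept(String conceptCode, int maxAgeYears);
    }

    /** Fan the query out to every node on the grid and aggregate the results. */
    static long totalAcrossGrid(List<CaTiesGridClient> nodes, String conceptCode, int maxAgeYears) {
        long total = 0;
        for (CaTiesGridClient node : nodes) {
            total += node.countReportsByConcept(conceptCode, maxAgeYears);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stub nodes standing in for, e.g., the Pitt and Penn caTIES deployments
        List<CaTiesGridClient> nodes = Arrays.asList(
                (concept, age) -> 12L,
                (concept, age) -> 7L);

        // "PNET-CONCEPT-CODE" is a placeholder, not a real Metathesaurus code
        System.out.println(totalAcrossGrid(nodes, "PNET-CONCEPT-CODE", 18));
    }
}
```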

  18. Security Requirements
  • All data and communications must be secure
  • Single sign-on for honest brokers and others who will need to access our ‘identified’ datastore (linkage file)
  • Preference for one integrated set of security tools
  • Attribute- and operation-level security (see the sketch below)
    • Access to patient- or report-level data can only be granted to individuals known to have an IRB protocol to do research
    • Access to generic operations (get a histogram of X type of cases) can be universal to any user
    • Access to patient- or report-level data can only be granted to individuals who have an active IRB protocol
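Locally, checks like these map onto the NCICB Common Security Module (CSM) slated for deployment on the next slide. The sketch below assumes the standard CSM client calls (SecurityServiceProvider.getAuthorizationManager and AuthorizationManager.checkPermission); the application context name, user, protection element, and privilege strings are hypothetical, and exact signatures may vary by CSM release.

```java
// Sketch of an operation-level authorization check with the NCICB Common
// Security Module (CSM). The context name "caties", the user, the protection
// element "PathologyReport", and the privilege "READ" are all hypothetical.
import gov.nih.nci.security.AuthorizationManager;
import gov.nih.nci.security.SecurityServiceProvider;

public class CsmCheckSketch {
    public static void main(String[] args) throws Exception {
        // Look up the authorization manager for this application's CSM context
        AuthorizationManager authz =
                SecurityServiceProvider.getAuthorizationManager("caties");

        // Has this user been granted report-level read access (e.g., via an active IRB protocol)?
        boolean canReadReports = authz.checkPermission("jresearcher", "PathologyReport", "READ");
        System.out.println("Report-level access granted: " + canReadReports);
    }
}
```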

  19. Process of Grid Enablement
  • Request public access via HTTP for 1upmc-spn01
  • Deploy CSM for local application security
  • Deploy and configure grid services (including grid-level security)
  • Contribute to requirements for the next iteration of caGrid security based on IRB requirements (Pitt and Penn, hopefully others)
    • Virtual organizations
    • Delegation
    • Single sign-on
    • Attribute management
  • Gradually open up access as user management policies, processes, and technology mature

  20. Lessons Learned - Modeling
  • Modeling
    • Many-to-many relationships in the object domain require correlation tables in the data model. But when the correlation itself carries data, a relationship object must exist (e.g., Application -> Execution <- PathologyReport): Execution.startTime and Execution.endTime necessitate object-domain accessors (see the sketch below)
    • Useful to use multiple model views to focus on key model aspects. The Phase I model could be vetted while the overall model expanded to Phase II
    • Importing semantically annotated XMI back into the model clobbers diagrams and can negatively affect the data model
    • A good idea to always generate the application and test it with a simple call before proceeding to semantic activity
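The first lesson is easier to see as code: when the link between Application and PathologyReport carries its own data (the start and end times of a processing run), the association is promoted to a first-class Execution object with its own accessors, backed by a correlation table in the data model. The sketch below is illustrative; only Execution.startTime and Execution.endTime come from the slide, and the other attribute names are hypothetical.

```java
// Sketch of promoting an association with data to a first-class object, as
// described above: Application -> Execution <- PathologyReport, where the
// Execution itself carries startTime/endTime. Attribute names other than
// startTime/endTime are illustrative.
import java.util.Date;

class Application {
    private String name;                       // hypothetical attribute
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

class PathologyReport {
    private String accessionNumber;            // hypothetical attribute
    public String getAccessionNumber() { return accessionNumber; }
    public void setAccessionNumber(String accessionNumber) { this.accessionNumber = accessionNumber; }
}

/** Relationship object: the many-to-many link itself owns the timing data. */
class Execution {
    private Application application;
    private PathologyReport pathologyReport;
    private Date startTime;
    private Date endTime;

    public Application getApplication() { return application; }
    public void setApplication(Application application) { this.application = application; }
    public PathologyReport getPathologyReport() { return pathologyReport; }
    public void setPathologyReport(PathologyReport pathologyReport) { this.pathologyReport = pathologyReport; }
    public Date getStartTime() { return startTime; }
    public void setStartTime(Date startTime) { this.startTime = startTime; }
    public Date getEndTime() { return endTime; }
    public void setEndTime(Date endTime) { this.endTime = endTime; }
}
```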

  21. Lessons Learned – Semantic Annotation
  • Semantic Connection
  • A one-way trip is less troublesome
    • Better to treat the import and recovery of annotated XMI into EA as read-only
  • Start each semantic-connector run with only Documentation and Description tags
    • This eliminates stray tags from a previous annotation which may not be overwritten in the EVS Semantic Connector Report
  • Check at multiple points in the process; human errors creep in at every step
    • UML Model, Excel file modified by deliverable, Excel file returned from NCICB, caDSR Stage
  • The vocabulary development process adds some unanticipated complexities to annotation

  22. Lessons Learned – Modeling and Semantic Annotation
  • Difficulties with maintenance and iterative model-building
    • Manual merge of Excel files is very error-prone
    • Many artifacts with partial representations
    • High effort cost to changing things later in the process
    • Need to develop work processes that minimize wasted effort

  23. Questions we are still trying to answer
  • How do we map our models to data standards when the name of the class in the OM and the object class of the standard CDE are different?
    • Participant.gender
    • Execution.startTime
    • What would manual mapping do to the semantics?
  • What are we trying to achieve with semantic annotation?
    • Query? Aggregation across sources? Inference?
    • How should resources be balanced with the effort required?
  • When will real researchers be accessing caGrid data services? How will we deal with the process of credentialing and granting permissions to researchers?
  • How should we handle local application vs. grid-level security in the short term, and in the long term?

  24. Future Work
  • Before the end of August
    • Secure the Silver API with CSM
    • Complete installation of caGrid
    • Understand CSM as well as caGrid credentialing and user management, and restrict access
    • Make the node public
    • Register the service
  • Eventually
    • Create text processing components as analytic services
    • Integrate the API with the GUI
    • Expand the model to Phases II and III

  25. Wish List
  • Better communication
    • Semantic Annotation
    • End point of reference implementations
    • Security policies and processes
    • Communication between reference implementations
    • Could we start to use a list-serv to ask and address problems?
  • Better tooling
    • Semantic annotation
    • Single artifact

  26. Acknowledgements
  NCICB: Ian Fore, George Komatsoulis, Ram Chilukuri, Avinash Shanbhag, Manav Kher, Tara Akhavan, Mike Connolly, Kevin Fitzpatrick, Christophe Ludet, Nicole Thomas, Himanso Sahni, William Sanchez, Ruowei Wu, Nafis Zebarjani
  BAH: Brian Davis, Greg Eley, Bal Harshawardhan, Arumani Manisundaram, Mark Adams
  Ardais: David Aronow – Metadata Mentor
  Washington University: Rakesh Nagarajan – Architecture Mentor
  caTIES/UPMC: Kevin Mitchell, Linda Schmandt, Girish Chavan, Adi Nemlekar, Jon Tobias
  UPCI: Ronald Herberman, Michael Becich
  caTIES/Penn: Michael Feldman, David Fenstermacher, Tara McSherry, John Quigley, Vishal Nayak
