This case study explores the creation of a semantic search application within the pharmaceutical domain, led by Tom Reamy, Chief Knowledge Architect at KAPS Group. The project utilized agile methodologies to automate and enhance the annotation of scientific documents. Key objectives included discovering new entities, improving user self-service capabilities, and integrating advanced semantic technologies. The results demonstrated a high level of precision and recall in identifying relevant drug names, diseases, and clinical trial data. The successful implementation satisfied client needs and showcased the potential of semantic search in healthcare.
Developing a Semantic Search Application: A Pharma Case Study • Tom Reamy, Chief Knowledge Architect, KAPS Group • http://www.kapsgroup.com • Program Chair – Text Analytics World • Taxonomy Boot Camp: Washington DC, 2013
KAPS Group: General • Knowledge Architecture Professional Services – Network of Consultants • Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics • Strategy – IM & KM - Text Analytics, Social Media, Integration • Services: • Taxonomy/Text Analytics development, consulting, customization • Text Analytics Fast Start – Audit, Evaluation, Pilot • Social Media: Text based applications – design & development • Clients: • Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc. • Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies • Presentations, Articles, White Papers – http://www.kapsgroup.com
Project • Agile Methodology • Goal – evaluate the ability of semantic technologies to: • Replace manual annotation of scientific documents – automated or semi-automated • Discover new entities and relationships • Provide users with self-service capabilities • Goal – feasibility and effort level
Components – Technology, Resources • Cambridge Semantics, Linguamatics, SAS Enterprise Content Categorization • Initial integration – passing results as XML • Content – scientific journal articles • Taxonomy – MeSH – select small subset • Access to a “customer” – critical for success
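The "passing results as XML" integration step can be pictured with a minimal sketch. The element and attribute names below (`document`, `entity`, `type`, `score`) are hypothetical; the slides do not show the actual exchange format used between the tools.

```python
import xml.etree.ElementTree as ET


def annotations_to_xml(doc_id, annotations):
    """Serialize entity annotations for one document as an XML string.

    The schema here is an assumption made for illustration only.
    """
    root = ET.Element("document", attrib={"id": doc_id})
    for ann in annotations:
        entity = ET.SubElement(
            root,
            "entity",
            attrib={"type": ann["type"], "score": f"{ann['score']:.2f}"},
        )
        entity.text = ann["text"]
    return ET.tostring(root, encoding="unicode")


def xml_to_annotations(xml_text):
    """Read the same hypothetical format back into plain dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {"type": e.get("type"), "text": e.text, "score": float(e.get("score"))}
        for e in root.findall("entity")
    ]
```

A downstream tool only needs the reader half, which is why a plain XML hand-off is a reasonable first integration before anything tighter is built.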
Three rounds - Iterations • Visualization – faceted search, sort by date, author, journal • Cambridge Semantics • Round 1 – PDF from their database • Needed to create additional structure and metadata • No such thing as unstructured content • Round 2 & 3 – XML with full metadata from PubMed • Entity Recognition – Species, Document Type, Study Type, Drug Names, Disease Names, Adverse Events
Components & Approach • Rules or sample documents? • Need more precision and granularity than documents can do • Training sets – not as easy as thought • First Rules – text indicators to define sections of the document • Objectives, Abstract, Purpose, Aim – all the “same” section • Separate logic of the rules from the text • Stable rules, changing text • Scores – relevancy with thresholds • Not just frequency of words
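The idea of stable rules over changing text can be sketched in a few lines. This is an illustration, not the project's implementation: the vocabulary, the canonical section names, and the function names are all assumptions. The heading variants live in a data table, and the matching logic never changes when a variant is added.

```python
import re

# Hypothetical vocabulary: heading variants that denote the "same" section.
SECTION_INDICATORS = {
    "objective": ["objectives", "objective", "abstract", "purpose", "aim", "aims"],
    "methods": ["methods", "materials and methods"],
    "results": ["results", "findings"],
}


def build_heading_patterns(indicators):
    """Compile one heading regex per canonical section from its variants."""
    patterns = {}
    for section, variants in indicators.items():
        # Longest variants first so multi-word headings are tried before short ones.
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        patterns[section] = re.compile(
            rf"^\s*(?:{alternation})\s*:?\s*$", re.IGNORECASE
        )
    return patterns


def label_sections(lines, patterns):
    """Return (canonical_section, line) pairs; text before any heading gets None."""
    current = None
    labeled = []
    for line in lines:
        matched = next((s for s, p in patterns.items() if p.match(line)), None)
        if matched is not None:
            current = matched  # heading line: switch section, do not emit it
            continue
        labeled.append((current, line))
    return labeled
```

Adding "Background" or "Goal" as another variant is then an edit to `SECTION_INDICATORS` only, which is the maintenance benefit the slide points at.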
Document Type Rules • (START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]", _/article:"[Objective]", • _/article:"[Results]", _/article:"[Discussion]", (OR, • _/article:"clinical trial*", _/article:"humans", • (NOT, (DIST_5, (OR,_/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"), • Clinical Trial Rule: • If the article has sections like Abstract or Methods • AND has phrases around “clinical trials / Humans” and not words like “animals” within 5 words of “clinical trial” words – count it and add up a relevancy score
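The plain-language reading of the Clinical Trial Rule can be approximated in Python. This is a rough sketch, not the categorization engine's behavior: it assumes START_2000 means "within the first 2000 words" and DIST_5 means "within 5 words", following the slide's own gloss, and every function and constant name is hypothetical.

```python
import re

SECTION_MARKERS = {"abstract", "methods", "objective", "results", "discussion"}
EXCLUSION_TERMS = {"approved", "safe", "use", "animals"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def trial_hit_positions(tokens):
    """Token indexes where 'clinical trial*' starts or 'humans' occurs."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == "humans":
            hits.append(i)
        elif (
            tok == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")
        ):
            hits.append(i)
    return hits


def clinical_trial_score(text, window=2000, distance=5):
    """Count qualifying trial phrases in the first `window` words.

    Returns 0 when no section marker is present. A hit is discarded when an
    exclusion term appears within `distance` tokens on either side of it.
    """
    tokens = tokenize(text)[:window]
    if not SECTION_MARKERS.intersection(tokens):
        return 0
    score = 0
    for pos in trial_hit_positions(tokens):
        nearby = tokens[max(0, pos - distance): pos + distance + 1]
        if EXCLUSION_TERMS.isdisjoint(nearby):
            score += 1
    return score
```

The returned count would then be compared with a threshold to decide whether the document is tagged as a clinical trial; where to set that threshold is a tuning decision the slides do not specify.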
Rules for Drug Names and Diseases • Primary issue – major mentions, not every mention • Combination of noun phrase extraction and categorization • Results – virtually 100% • Taxonomy of drug names and diseases • Capture general diseases like thrombosis and specific types like deep vein, cerebral, and cardiac • Combine text about arthritis and synonyms with text like “Journal of Rheumatology”
Rules for Drug Names and Diseases • (OR, _/article/title:"[clonidine]", • (AND, _/article/mesh:"[clonidine]",_/article/abstract:"[clonidine]"), • (MINOC_2, _/article/abstract:"[clonidine]"), • (START_500, (MINOC_2,"[clonidine]"))) • Means – any variation of drug name in title – high score • Any variation in MeSH keywords AND in abstract – high score • Any variation in abstract at least 2x – good score • Any variation in first 500 words at least 2x – suspect
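The four tiers described on this slide translate into a small decision function. The sketch below assumes each article is a dictionary with `title`, `mesh`, `abstract`, and `body` strings, and that a drug's variants come from a taxonomy entry; those field names, the variant list, and the label strings are illustrative, not taken from the project.

```python
import re


def count_mentions(text, variants):
    """Count whole-word, case-insensitive occurrences of any variant."""
    if not variants:
        return 0
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def first_words(text, n):
    """Return the first n whitespace-separated words of the text."""
    return " ".join(text.split()[:n])


def classify_mention(article, variants):
    """Grade how prominently a drug is mentioned, following the slide's tiers."""
    if count_mentions(article.get("title", ""), variants) >= 1:
        return "high"
    in_abstract = count_mentions(article.get("abstract", ""), variants)
    if count_mentions(article.get("mesh", ""), variants) >= 1 and in_abstract >= 1:
        return "high"
    if in_abstract >= 2:
        return "good"
    if count_mentions(first_words(article.get("body", ""), 500), variants) >= 2:
        return "suspect"
    return "none"


# Illustrative variant list; a real one would come from the drug taxonomy.
CLONIDINE = ["clonidine", "catapres"]
```

Checking the tiers from strongest to weakest evidence is what separates a major mention from a passing one: a single hit deep in the body never qualifies, while one hit in the title does.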
Rules for Drug Names and Diseases • Results: • Wide Range by type -- 70-100% recall and precision • Focus mostly on precision – difficult to test recall • One deep dive area indicated that 90%+ scores for both precision and recall could be built with moderate level of effort • Not linear effort – 30% accuracy does not mean 1/3 done
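For reference, precision and recall for one document can be computed as below. This is a generic sketch with made-up entity sets, not project data. It also shows why recall is the harder number to test: precision needs only a review of what the rules returned, while recall needs a complete gold list of what they should have returned.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predicted entities.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold entities
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative values only: 2 of 3 predictions are right, 2 of 4 gold items found.
predicted = {"clonidine", "aspirin", "warfarin"}
gold = {"clonidine", "aspirin", "heparin", "insulin"}
precision, recall = precision_recall(predicted, gold)
```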
Iteration 3 • Complete treatment of disease state: • Indication (disease you want to treat) • Concomitant disease • Adverse or side effects • Use XML metadata – some variant of “adverse” • Any combination of words associated with a disease (depression) and any of the words that indicated an adverse event or effect
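The co-occurrence idea in the last bullet can be sketched at sentence level. The term lists are hypothetical, and using the sentence as the co-occurrence scope is an assumption; the slides do not say what window the actual rules used.

```python
import re

# Hypothetical vocabularies for illustration.
DISEASE_TERMS = {"depression": ["depression", "depressive", "depressed"]}
ADVERSE_INDICATORS = ["adverse", "side effect", "induced"]


def adverse_event_candidates(text):
    """Return (disease, sentence) pairs where a disease term and an
    adverse-event indicator occur in the same sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    found = []
    for sentence in sentences:
        lower = sentence.lower()
        if not any(indicator in lower for indicator in ADVERSE_INDICATORS):
            continue
        for disease, variants in DISEASE_TERMS.items():
            if any(variant in lower for variant in variants):
                found.append((disease, sentence.strip()))
    return found
```

A mention that passes this check is only a candidate for the adverse-effect role; telling it apart from an indication or a concomitant disease would still rely on the XML metadata mentioned above.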
Conclusion • Project was a success! • Useful results – as defined by the customer • Reasonable and doable effort level – both for initial development and maintenance • Essential Success Factors • Rules not documents, training sets (starting point) • Full platform for disambiguation of noun phrase extraction, major-minor mention • Separation of logic and text • Semantic Search works! • If you do it smart!
Questions? Tom Reamy • tomr@kapsgroup.com • KAPS Group – Knowledge Architecture Professional Services • http://www.kapsgroup.com • www.TextAnalyticsWorld.com • March 17-19, San Francisco