
Adapting an Algorithm to a Corpus

Peter Nelson, Carleton College. J. Starren, M.D., Ph.D.; L. Rasmussen. Project purpose: in the context of a GWAS on hypothyroidism, a particular natural language processing algorithm is used to identify contextual features.


Presentation Transcript


  1. Adapting an Algorithm to a Corpus Peter Nelson Carleton College J. Starren, M.D., Ph.D. L. Rasmussen

  2. Project Purpose • In the context of a GWAS on hypothyroidism • A particular natural language processing algorithm used to identify contextual features • Discover and evaluate automatic and semi-automatic methods of adapting that algorithm to a corpus of medical records

  3. Project Motivation • PMRP • eMERGE • Hypothyroidism GWAS • Phenotyping

  4. Project Motivation - PMRP • Marshfield Clinic PMRP • ~ 20,000 people from central WI • EHR and blood samples • Studies in the fields of: • Population Genetics • Genetic Epidemiology • Pharmacogenetics • Leverage genetic data to improve care

  5. Project Motivation - eMERGE • eMERGE Network • Organized by NHGRI • Members • Marshfield Clinic • Vanderbilt • Northwestern • Mayo Clinic • Group Health Cooperative • Genome Wide Association Studies

  6. What is a GWAS? Why Do One? • “[A GWAS] involves rapidly scanning markers across the… genomes of many people to find genetic variations associated with a particular disease.” • “[R]esearchers can use the information to develop better strategies to detect, treat and prevent the disease.” • “…common, complex diseases, such as asthma, cancer, diabetes….” NHGRI website (http://www.genome.gov/20019523)

  7. Hypothyroidism GWAS • Insufficient hormone production by the thyroid gland can cause fatigue, weight gain, and other symptoms • Diagnosable and treatable • About 3% of the American population has the clinical condition • Different causes

  8. Hypothyroidism GWAS • eMERGE Study • Identify patients with presumptive Hashimoto’s disease-induced hypothyroidism (Cases) • Identify patients with normal thyroid function (Controls) • Genotype cases and controls (by testing for hundreds of thousands of SNPs) • Genome-wide association analysis

  9. Phenotyping in a GWAS • Doctors design an algorithm for phenotyping based on the presence or absence of key procedures, medicines, and conditions in a patient’s medical history • EHR is used as a resource • Coded fields • Unmarked text • Images

  10. Manual vs Electronic Phenotyping • Manual phenotyping by chart abstractors • Accurate (Gold standard) • Far too expensive (~20,000 medical records to process) • Electronic phenotyping by computers • Methods • Query database of coded fields • Natural language processing on free text • OCR and Image Processing on other resources • Comparatively cheap • Sample must be validated by chart abstractors

  11. Natural Language Processing • What is it? • What problems must be solved? • How can they be solved?

  12. Natural Language Processing • Search for concepts in free text of EHR • Simple keyword search insufficient • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”
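The point of slide 12 can be illustrated with a minimal sketch. The function name `keyword_hits` is hypothetical, and the sentences are three of the examples quoted on the slide. A plain keyword search returns every sentence that mentions a concept, whether or not the patient actually has it.

```python
import re

# Three of the example sentences from the slide.
SENTENCES = [
    "There was no evidence of polyps or ulceration.",
    "Rule out H. pylori, gastritis and gastropathy.",
    "The indication for this procedure is family history of colon cancer.",
]

def keyword_hits(concept, sentences):
    """Return every sentence that mentions the concept, ignoring context."""
    pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# The search reports a hit for "polyps" even though the sentence denies them.
for hit in keyword_hits("polyps", SENTENCES):
    print(hit)
```

Each hit still needs its context checked before it can count as evidence for or against a phenotype.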

  13. Natural Language Processing • Search for concepts in free text of EHR • Negated • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”

  14. Natural Language Processing • Search for concepts in free text of EHR • Hypothetical • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”

  15. Natural Language Processing • Search for concepts in free text of EHR • Family History • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”

  16. NegEx • Simple • Performs well • Against gold standard • Against MedLEE • Against straight statistical methods • Recently extended • Hypothetical & Family History • “ConText”

  17. NegEx “There was no evidence of polyps or ulceration.”

  18. NegEx “There was no evidence of polyps or ulceration.”

  19. NegEx “There was no evidence of polyps or ulceration.” [Slide graphic: dotted span closed by a bar on the right]

  20. NegEx “There was no evidence of polyps or ulceration.” [Slide graphic: dotted span closed by a bar on the right]

  21. NegEx “Rule out H. pylori, gastritis, and gastropathy.”

  22. NegEx “Rule out H. pylori, gastritis, and gastropathy.” [Slide graphic: dotted span closed by a bar on the right]

  23. NegEx “Quantitative PCR testing for BK Virus is negative.”

  24. NegEx “Quantitative PCR testing for BK Virus is negative.” [Slide graphic: dotted span opened by a bar on the left]

  25. NegEx • “No evidence of spread of cancer to the lungs.” • “No residua of healed fractures can be seen otherwise.”

  26. NegEx • “No evidence of spread of cancer to the lungs.” [Slide graphic: dotted span closed by a bar on the right] • “No residua of healed fractures can be seen otherwise.” [Slide graphic: dotted span closed by a bar on the right]

  27. NegEx • “No evidence of spread of cancer to the lungs.” • “No residua of healed fractures can be seen otherwise.”

  28. NegEx • NegEx, and therefore ConText, require carefully tuned lists of triggers and pseudotriggers. • How big must a list be to perform well?
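A rough sketch of how trigger and pseudotrigger lists might drive a NegEx-style check follows. It is not the NegEx implementation. The lists are tiny illustrative samples, the pseudotrigger pair `("no", "increase")` is chosen only as an example, and the scope rule (forward to the end of the sentence for pre-triggers, backward to the start for post-triggers) is a simplifying assumption.

```python
import re

# Illustrative lists only; tuned NegEx/ConText lists are much longer.
PRE_TRIGGERS = {"no", "denies", "denied", "not", "without"}  # scope runs forward
POST_TRIGGERS = {"negative", "resolved"}                     # scope runs backward
PSEUDOTRIGGERS = {("no", "increase")}  # example pair: trigger word + next word

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_negated(sentence, concept):
    """True if any token of `concept` falls inside some trigger's scope."""
    tokens = tokenize(sentence)
    concept_tokens = set(tokenize(concept))
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (tok, nxt) in PSEUDOTRIGGERS:
            continue  # looks like a trigger but is not treated as negating
        if tok in PRE_TRIGGERS:
            scope = tokens[i + 1:]   # everything after the trigger
        elif tok in POST_TRIGGERS:
            scope = tokens[:i]       # everything before the trigger
        else:
            continue
        if concept_tokens & set(scope):
            return True
    return False

print(is_negated("There was no evidence of polyps or ulceration.", "polyps"))
print(is_negated("Quantitative PCR testing for BK Virus is negative.", "BK Virus"))
```

In this sketch every entry in the lists changes the answer for some sentence, which is why the size and tuning of the lists matter.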

  29. Scenarios • Annotated training set used to populate lists • Large unmarked training set used to extend existing lists

  30. Using Annotated Data • NegEx/ConText creators provide annotated excerpts from medical records • Look for associations between words and negation to populate list of triggers • Look for associations between words near triggers and false positives to populate list of pseudotriggers

  31. Identifying Triggers • Create a confusion matrix for each word • Sort words by some statistic based on these confusion matrices • Select or reject top candidate as a trigger • Repeat on yet unexplained sentences until stopping condition met
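The loop on slide 31 could be sketched roughly as below. Several things here are assumptions: the toy sentences are invented for illustration, the manual "select or reject" step is replaced by always accepting the top candidate, F-measure stands in for "some statistic", and the stopping condition is a hypothetical `min_score` threshold.

```python
def confusion(word, data):
    """Count tp, fp, fn, tn for 'word present' as a predictor of negation."""
    tp = fp = fn = tn = 0
    for tokens, negated in data:
        present = word in tokens
        if present and negated:
            tp += 1
        elif present:
            fp += 1
        elif negated:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def find_triggers(data, score=f_measure, min_score=0.5):
    """Greedily pick the best-scoring word, then rescore on unexplained sentences."""
    triggers = []
    remaining = list(data)
    while any(negated for _, negated in remaining):
        vocab = sorted({w for tokens, _ in remaining for w in tokens})
        best = max(vocab, key=lambda w: score(*confusion(w, remaining)))
        if score(*confusion(best, remaining)) < min_score:
            break  # hypothetical stopping condition
        triggers.append(best)
        # Drop the negated sentences that the new trigger already explains.
        remaining = [(t, n) for t, n in remaining if not (n and best in t)]
    return triggers

# Invented toy data: (tokens, sentence contains a negated concept).
TOY = [
    ("no evidence of polyps".split(), True),
    ("no fever".split(), True),
    ("patient denies nausea".split(), True),
    ("patient reports nausea".split(), False),
    ("evidence of fever".split(), False),
]

print(find_triggers(TOY))
```

Rescoring on the remaining sentences is what lets a rarer trigger such as "denies" rise to the top once the sentences explained by "no" are set aside.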

  32. Identifying Triggers • Statistical measures used • Log-likelihood ratio • Precision (PPV) • Recall (Sensitivity) • F-measure
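For reference, the four measures can be computed from a word's confusion matrix as sketched below. The slides do not give the formula used for the log-likelihood ratio; this sketch assumes the common G² statistic for a 2x2 contingency table, which may differ from what the author used.

```python
import math

def precision(tp, fp, fn, tn):
    """Fraction of sentences containing the word that are negated (PPV)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fp, fn, tn):
    """Fraction of negated sentences that contain the word (sensitivity)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp, fn, tn), recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r) if p + r else 0.0

def log_likelihood_ratio(tp, fp, fn, tn):
    """G^2 = 2 * sum(observed * ln(observed / expected)) over the four cells."""
    n = tp + fp + fn + tn
    cells = [
        (tp, tp + fp, tp + fn),  # (observed, row total, column total)
        (fp, tp + fp, fp + tn),
        (fn, fn + tn, tp + fn),
        (tn, fn + tn, fp + tn),
    ]
    total = 0.0
    for observed, row, col in cells:
        if observed:  # a zero cell contributes nothing
            total += observed * math.log(observed * n / (row * col))
    return 2 * total

print(log_likelihood_ratio(2, 0, 0, 2))
```

One caveat with this form of the statistic: it is also large for words strongly associated with the absence of negation, so a ranking built on it would need a separate check of the direction of the association.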

  33. Log-Likelihood Ratio • Triggers: { }

  34. Log-Likelihood Ratio • Triggers: { no }

  35. Log-Likelihood Ratio • Triggers: { no }

  36. Log-Likelihood Ratio • Triggers: { no, denies }

  37. Log-Likelihood Ratio • Triggers: { no, denies }

  38. Log-Likelihood Ratio • Triggers: { no, denies, not }

  39. Log-Likelihood Ratio • Triggers: { no, denies, not }

  40. Log-Likelihood Ratio • Triggers: { no, denies, not, denied }

  41. Log-Likelihood Ratio • Triggers: { no, denies, not, denied }

  42. Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without }

  43. Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without }

  44. Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative }

  45. Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative }

  46. Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative, resolved (post) }

  47. Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative, resolved (post) }

  48. Other Measures • Precision (PPV) • 271 words tie at 100% • Poor metric • Recall (sensitivity) • Finds all the same triggers as LLR • Also finds “any”, “the”, and “for” • Imprecise metric • F-measure • Identical results to LLR • Good metric

  49. Identifying Pseudotriggers • Use an analogous method to find words that predict false positives • Limit to words next to triggers • Filter out candidates with low precision • Sort by LLR
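The pseudotrigger steps could be sketched along these lines. The toy data is invented, the `min_precision` threshold is a hypothetical parameter, a "false positive" is taken to mean a sentence where a trigger fires but the annotation says nothing is negated, and the LLR is assumed to be the G² statistic for a 2x2 table.

```python
import math

def llr(tp, fp, fn, tn):
    """G^2 statistic for a 2x2 table (assumed form of the log-likelihood ratio)."""
    n = tp + fp + fn + tn
    cells = [(tp, tp + fp, tp + fn), (fp, tp + fp, fp + tn),
             (fn, fn + tn, tp + fn), (tn, fn + tn, fp + tn)]
    return 2 * sum(o * math.log(o * n / (r * c)) for o, r, c in cells if o)

def neighbours(tokens, triggers):
    """Words directly before or after any trigger in the sentence."""
    out = set()
    for i, tok in enumerate(tokens):
        if tok in triggers:
            if i > 0:
                out.add(tokens[i - 1])
            if i + 1 < len(tokens):
                out.add(tokens[i + 1])
    return out - triggers

def find_pseudotriggers(data, triggers, min_precision=0.9):
    """Rank trigger-adjacent words by how strongly they predict false positives."""
    # Keep only sentences where a trigger fired; a false positive is one
    # where the annotation says nothing is negated.
    fired = [(neighbours(t, triggers), not negated)
             for t, negated in data if triggers & set(t)]
    vocab = set().union(*(near for near, _ in fired))
    scored = []
    for w in vocab:
        tp = sum(1 for near, is_fp in fired if w in near and is_fp)
        fp = sum(1 for near, is_fp in fired if w in near and not is_fp)
        fn = sum(1 for near, is_fp in fired if w not in near and is_fp)
        tn = len(fired) - tp - fp - fn
        if tp and tp / (tp + fp) >= min_precision:  # drop low-precision words
            scored.append((llr(tp, fp, fn, tn), w))
    return [w for _, w in sorted(scored, reverse=True)]

# Invented toy data: (tokens, sentence contains a negated concept).
TOY = [
    ("no increase in pain".split(), False),
    ("no increase in swelling".split(), False),
    ("no fever".split(), True),
    ("no polyps".split(), True),
    ("no nausea".split(), True),
]

print(find_pseudotriggers(TOY, {"no"}))
```

Applying the precision filter before sorting keeps words that merely co-occur with triggers often from outranking words that reliably signal a false positive.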
