Adapting an Algorithm to a Corpus

Peter Nelson, Carleton College • J. Starren, M.D., Ph.D. • L. Rasmussen

Project Purpose • In the context of a GWAS on hypothyroidism • A particular natural language processing algorithm used to identify contextual features • Discover and evaluate automatic and semi-automatic methods of adapting that algorithm to a corpus of medical records

Project Motivation • PMRP • eMERGE • Hypothyroidism GWAS • Phenotyping

Project Motivation - PMRP • Marshfield Clinic PMRP • ~ 20,000 people from central WI • EHR and blood samples • Studies in the fields of: • Population Genetics • Genetic Epidemiology • Pharmacogenetics • Leverage genetic data to improve care

Project Motivation - eMERGE • eMERGE Network • Organized by NHGRI • Members • Marshfield Clinic • Vanderbilt • Northwestern • Mayo Clinic • Group Health Cooperative • Genome Wide Association Studies

What is a GWAS? Why Do One? • “[A GWAS] involves rapidly scanning markers across the… genomes of many people to find genetic variations associated with a particular disease.” • “[R]esearchers can use the information to develop better strategies to detect, treat and prevent the disease.” • “…common, complex diseases, such as asthma, cancer, diabetes….” NHGRI website (http://www.genome.gov/20019523)

Hypothyroidism GWAS • Insufficient hormone production by thyroid gland can cause fatigue, weight gain, and other symptoms. • Diagnosable and treatable • About 3% of American population have clinical condition • Different Causes

Hypothyroidism GWAS • eMERGE Study • Identify patients with presumptive Hashimoto’s disease induced hypothyroidism (Cases) • Identify patients with normal thyroid function (Controls) • Genotype cases and controls (by testing for 100,000s of SNPs) • Genome-wide association analysis

Phenotyping in a GWAS • Doctors design an algorithm for phenotyping based on the presence or absence of key procedures, medicines, and conditions in a patient’s medical history • EHR is used as a resource • Coded fields • Unmarked text • Images

Manual vs Electronic Phenotyping • Manual phenotyping by chart abstractors • Accurate (Gold standard) • Far too expensive (~20,000 medical records to process) • Electronic phenotyping by computers • Methods • Query database of coded fields • Natural language processing on free text • OCR and Image Processing on other resources • Comparatively cheap • Sample must be validated by chart abstractors

Natural Language Processing • What is it? • What problems must be solved? • How can they be solved?

Natural Language Processing • Search for concepts in free text of EHR • Simple keyword search insufficient • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”

Natural Language Processing • Search for concepts in free text of EHR • Negated • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”

Natural Language Processing • Search for concepts in free text of EHR • Hypothetical • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”

Natural Language Processing • Search for concepts in free text of EHR • Family History • “There was no evidence of polyps or ulceration.” • “Rule out H. pylori, gastritis and gastropathy.” • “She should return to the Emergency Department if she experiences nausea or vomiting.” • “Patient should avoid any tests which involve the use of iodinated contrast material” • “The indication for this procedure is family history of colon cancer.”
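The example sentences on these slides can be run through a naive keyword search to show why it is insufficient. This is only a small illustrative sketch: the concept/sentence pairing and the helper name keyword_hit are assumptions made for the example, not part of the project.

```python
# Concepts paired with example sentences taken from the slides.
EXAMPLES = [
    ("polyps", "There was no evidence of polyps or ulceration."),
    ("gastritis", "Rule out H. pylori, gastritis and gastropathy."),
    ("nausea", "She should return to the Emergency Department "
               "if she experiences nausea or vomiting."),
    ("colon cancer", "The indication for this procedure is "
                     "family history of colon cancer."),
]


def keyword_hit(concept, sentence):
    """Naive search: report the concept whenever its text appears."""
    return concept.lower() in sentence.lower()


# Every sentence "matches", although none of them asserts that the
# patient currently has the condition.
hits = [keyword_hit(concept, sentence) for concept, sentence in EXAMPLES]
print(hits)
```

Every entry comes back as a hit, even though the concepts are negated, hypothetical, or part of a family history. Telling these contexts apart is the job of NegEx and ConText.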

NegEx • Simple • Performs well • Against gold standard • Against MedLEE • Against straight statistical methods • Recently extended • Hypothetical & Family History • “ConText”

NegEx “There was no evidence of polyps or ulceration.”

NegEx “There was no evidence of polyps or ulceration.” [scope marker: negation runs forward from the trigger to the end of the sentence]

NegEx “Rule out H. pylori, gastritis, and gastropathy.”

NegEx “Rule out H. pylori, gastritis, and gastropathy.” [scope marker: negation runs forward from the trigger to the end of the sentence]

NegEx “Quantitative PCR testing for BK Virus is negative.”

NegEx “Quantitative PCR testing for BK Virus is negative.” [scope marker: negation runs backward from the trigger to the start of the sentence]

NegEx • “No evidence of spread of cancer to the lungs.” • “No residua of healed fractures can be seen otherwise.”

NegEx ………………………………………………..| • “No evidence of spread of cancerto the lungs.” …………………………………………………………| • “No residua of healed fractures can be seen otherwise.”

NegEx • “No evidence of spread of cancer to the lungs.” • “No residua of healed fractures can be seen otherwise.”

NegEx • NegEx, and therefore ConText, requires carefully tuned lists of triggers and pseudotriggers. • How big must a list be to perform well?
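The trigger/pseudotrigger idea can be sketched in a few lines. This is a deliberately simplified illustration, not the published NegEx implementation. The short lists, the function names, and the six-token window are all assumptions chosen so the slide examples work.

```python
import re

# Illustrative lists only; real NegEx lists are far longer and tuned.
PRE_TRIGGERS = ["no", "rule out", "denies", "without"]  # scope runs forward
POST_TRIGGERS = ["is negative"]                         # scope runs backward
PSEUDO_TRIGGERS = ["no increase"]  # contains a trigger but does not negate


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def find_phrase(tokens, phrase):
    """Return every index at which the token list `phrase` starts."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def is_negated(sentence, concept, window=6):
    """True if a trigger's scope (at most `window` tokens) covers the concept."""
    tokens = tokenize(sentence)
    concept_toks = tokenize(concept)
    # Positions covered by a pseudotrigger cannot start a real trigger.
    blocked = set()
    for phrase in PSEUDO_TRIGGERS:
        p = tokenize(phrase)
        for i in find_phrase(tokens, p):
            blocked.update(range(i, i + len(p)))
    for start in find_phrase(tokens, concept_toks):
        end = start + len(concept_toks)
        for phrase in PRE_TRIGGERS:
            t = tokenize(phrase)
            for i in find_phrase(tokens, t):
                t_end = i + len(t)
                if i not in blocked and t_end <= start < t_end + window:
                    return True
        for phrase in POST_TRIGGERS:
            t = tokenize(phrase)
            for i in find_phrase(tokens, t):
                if i not in blocked and end <= i < end + window:
                    return True
    return False


print(is_negated("There was no evidence of polyps or ulceration.", "ulceration"))
print(is_negated("Quantitative PCR testing for BK Virus is negative.", "BK Virus"))
print(is_negated("No increase in nodule size.", "nodule"))
```

The first two calls report a negated concept. The third does not, because “no increase” is treated as a pseudotrigger. The quality of the result depends entirely on what is in the three lists, which is the adaptation problem this project addresses.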

Scenarios • Annotated training set used to populate lists • Large unmarked training set used to extend existing lists

Using Annotated Data • NegEx/ConText creators provide annotated excerpts from medical records • Look for associations between words and negation to populate list of triggers • Look for associations between words near triggers and false positives to populate list of pseudotriggers

Identifying Triggers • Create a confusion matrix for each word • Sort words by some statistic based on these confusion matrices • Select or reject top candidate as a trigger • Repeat on yet unexplained sentences until stopping condition met
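The loop described on this slide might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code. The data layout (a list of word lists paired with a negated flag), the accept callback standing in for the human “select or reject” step, and the stopping conditions (max_triggers, min_score, no negated sentences left) are all hypothetical. F-measure is used as the default scoring statistic only to keep the example short.

```python
from collections import Counter


def confusion_matrices(sentences):
    """Map each word to (tp, fp, fn, tn) over (words, is_negated) pairs."""
    n_pos = sum(1 for _, negated in sentences if negated)
    n_neg = len(sentences) - n_pos
    in_pos, in_neg = Counter(), Counter()
    for words, negated in sentences:
        (in_pos if negated else in_neg).update(set(words))
    matrices = {}
    for word in set(in_pos) | set(in_neg):
        tp, fp = in_pos[word], in_neg[word]
        matrices[word] = (tp, fp, n_pos - tp, n_neg - fp)
    return matrices


def f_measure(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def select_triggers(sentences, score=f_measure, accept=lambda word: True,
                    max_triggers=10, min_score=0.0):
    remaining = [(set(words), negated) for words, negated in sentences]
    triggers, rejected = [], set()
    # Stop when enough triggers are found or no negated sentence is unexplained.
    while len(triggers) < max_triggers and any(neg for _, neg in remaining):
        matrices = confusion_matrices(remaining)
        candidates = sorted(
            ((score(*m), w) for w, m in matrices.items() if w not in rejected),
            reverse=True,
        )
        if not candidates or candidates[0][0] <= min_score:
            break
        best = candidates[0][1]
        if accept(best):
            triggers.append(best)
            # Sentences containing the new trigger count as explained.
            remaining = [(ws, neg) for ws, neg in remaining if best not in ws]
        else:
            rejected.add(best)
    return triggers


# Tiny made-up training set: (words, sentence contains a negated concept).
DATA = [
    ("no evidence of polyps".split(), True),
    ("patient denies nausea".split(), True),
    ("no fever".split(), True),
    ("patient has fever".split(), False),
    ("evidence of gastritis".split(), False),
    ("patient reports nausea".split(), False),
]
print(select_triggers(DATA))
```

On this toy data the loop picks “no” first. It then re-scores the remaining unexplained sentences and picks “denies”. Passing a different accept function would model a semi-automatic workflow in which a person vets each candidate.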

Identifying Triggers • Statistical measures used • Log-likelihood ratio • Precision (PPV) • Recall (Sensitivity) • F-measure
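Each of these measures can be computed from a word's confusion matrix (tp, fp, fn, tn). The sketch below is illustrative. The slides do not say which log-likelihood formulation was used, so Dunning's G² statistic is assumed here, and the counts in the usage lines are made up.

```python
import math


def precision(tp, fp, fn, tn):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fp, fn, tn):
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def log_likelihood_ratio(tp, fp, fn, tn):
    """G^2 = 2 * sum(observed * ln(observed / expected)) over the four cells."""
    total = tp + fp + fn + tn
    cells = [
        (tp, tp + fp, tp + fn),  # (observed, row total, column total)
        (fp, tp + fp, fp + tn),
        (fn, fn + tn, tp + fn),
        (tn, fn + tn, fp + tn),
    ]
    g = 0.0
    for observed, row_total, col_total in cells:
        if observed:  # a zero cell contributes nothing
            expected = row_total * col_total / total
            g += observed * math.log(observed / expected)
    return 2 * g


# Hypothetical counts: a word seen once, in a negated sentence, versus a
# frequent word that co-occurs with negation most of the time.
rare = (1, 0, 99, 900)
common = (80, 5, 20, 895)
for name, m in (("rare", rare), ("common", common)):
    print(name, precision(*m), recall(*m), f_measure(*m), log_likelihood_ratio(*m))
```

With these counts the rare word has perfect precision but a far lower LLR and F-measure than the common one. Note that G² as written is symmetric: a word strongly associated with non-negated sentences also scores high, so a real ranking would need to check the direction of the association.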

Log-Likelihood Ratio • Triggers: { }

Log-Likelihood Ratio • Triggers: { no }

Log-Likelihood Ratio • Triggers: { no, denies }

Log-Likelihood Ratio • Triggers: { no, denies, not }

Log-Likelihood Ratio • Triggers: { no, denies, not, denied }

Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without }

Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative }

Log-Likelihood Ratio • Triggers: { no, denies, not, denied, without, negative, resolved (post) }

Other Measures • Precision (PPV) • 271 words tie at 100% • Poor metric • Recall (sensitivity) • Catches all the same ones as LLR • Also finds “any”, “the”, and “for” • Imprecise metric • F-measure • Identical results to LLR • Good metric

Identifying Pseudotriggers • Use analogous method to find words that predict false positives • Limit to words next to triggers • Filter out prospects with low precision • Sort by LLR
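The pseudotrigger steps above can be sketched as follows. The input format (tokens, index of the trigger, whether the negation was a false positive), the 0.9 precision cut-off, the function names, and the toy data are all assumptions for illustration. LLR is again taken to be Dunning's G², which the slides do not specify. Candidates are reported as the two-word phrase formed by a trigger and its neighbor.

```python
import math
from collections import Counter


def llr(tp, fp, fn, tn):
    """Assumed G^2 log-likelihood ratio over a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    cells = [(tp, tp + fp, tp + fn), (fp, tp + fp, fp + tn),
             (fn, fn + tn, tp + fn), (tn, fn + tn, fp + tn)]
    return 2 * sum(o * math.log(o * total / (r * c)) for o, r, c in cells if o)


def pseudotrigger_candidates(instances, min_precision=0.9):
    """instances: list of (tokens, trigger_index, is_false_positive)."""
    n_bad = sum(1 for _, _, bad in instances if bad)
    n_ok = len(instances) - n_bad
    in_bad, in_ok = Counter(), Counter()
    for tokens, idx, bad in instances:
        phrases = set()  # only words directly next to the trigger
        if idx > 0:
            phrases.add(f"{tokens[idx - 1]} {tokens[idx]}")
        if idx + 1 < len(tokens):
            phrases.add(f"{tokens[idx]} {tokens[idx + 1]}")
        (in_bad if bad else in_ok).update(phrases)
    scored = []
    for phrase, tp in in_bad.items():
        fp = in_ok[phrase]
        if tp / (tp + fp) >= min_precision:  # drop low-precision prospects
            scored.append((llr(tp, fp, n_bad - tp, n_ok - fp), phrase))
    scored.sort(key=lambda item: (-item[0], item[1]))  # highest LLR first
    return [phrase for _, phrase in scored]


# Made-up trigger occurrences; True marks a negation that was a false positive.
INSTANCES = [
    ("no increase in size".split(), 0, True),
    ("no increase in pain".split(), 0, True),
    ("gram negative rods".split(), 1, True),
    ("no evidence of polyps".split(), 0, False),
    ("no evidence of fever".split(), 0, False),
    ("test is negative".split(), 2, False),
]
print(pseudotrigger_candidates(INSTANCES))
```

The precision filter matters here. Because G² is symmetric, a phrase such as “no evidence”, which predicts correct negations, would otherwise rank highly as well.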
