
Emotive Language and Disease Outbreak Reports

AACL 2008. Mike Conway (mike@nii.ac.jp), National Institute of Informatics, Tokyo.


Presentation Transcript


  1. Emotive Language and Disease Outbreak Reports AACL 2008 Mike Conway - mike@nii.ac.jp National Institute of Informatics, Tokyo

  2. Talk Outline:
     • What are disease outbreak reports and why are they important?
     • BioCaster: A system for automatically identifying disease outbreak reports
     • The BioCaster disease outbreak report corpus
     • Emotive language and disease outbreak reports

  3. Disease Outbreak Reports
     • Identifying disease outbreaks at an early stage (e.g. avian flu, Ebola)
     • ProMED: A human-curated alerting system
     • Towards identifying disease outbreaks automatically from general news reports
     • Currently the system works in English only (to be expanded to Japanese, Thai and Vietnamese)

  4. Example ProMED Report
     Avian influenza situation in Turkey
     Laboratory tests conducted in Turkey have confirmed detection of the H5 subtype of avian influenza virus in samples from an additional 10 patients. Five of these cases were announced by the Ministry of Health yesterday and an additional five were announced today. Most patients are children and all have been hospitalized for treatment and evaluation. Of the five patients announced on Sunday, three are from Ankara Province and include two brothers, aged five and two years, and a 65-year-old man. All three patients are hospitalized in Ankara. The additional two cases, a nine-year-old girl and her three-year-old brother, are from the Dogubeyazit district in Agri Province, and are hospitalized in the city of Van. The five cases announced today are from Kastamonu, Corum, and Samsun provinces, bordering the Black Sea in the north-central part of the country, and from Van Province. This brings the total number of cases in Turkey, confirmed by laboratory tests there, to 14. Of these patients, two have died …

  5. Ambiguity / Confusion
     • Vista attacked by 13-year old virus
     • Zika virus in Micronesia (Yap)
     • South Sudan hit by Ebola-like fever
     • Philadelphia gripped by baseball fever
     • Bird flu outbreak drill spooks Manitoba town
     • Undiagnosed disease in Java
     [Figure: example headlines shown against a + / - scale]

  6. Global Health Monitor www.biocaster.nii.ac.jp

  7. The BioCaster Corpus
     • 500 annotated documents
     • 238 Relevant (broken down into three subcategories: alert, publish and check)
     • 262 Reject (that is, not disease outbreak reports)
     • 7 different domains (health, society, business, sport, politics, science and technology)
     • Currently annotated for named entities (but more annotation to come)
     • The number of documents in the corpus has now been expanded to 1000

  8. Annotated Report
     <DOC id="000101" language="en-us" source="WHO" domain="health" subdomain="disease" date="2007/3/2" relevancy="publish"><NAME cl="DISEASE">Avian influenza</NAME> situation in <NAME cl="LOCATION">Vietnam</NAME> update 21 <NAME cl="TIME">16 June 2005</NAME><NAME scl="ORGANIZATION">WHO</NAME> is aware of media reports that <NAME cl="PERSON" case="true" number="many">six additional patients</NAME><NAME scl="CONDITION">infected</NAME> with <NAME scl="DISEASE">H5N1 avian influenza</NAME> are undergoing treatment in a <NAME cl="LOCATION">Hanoi</NAME> hospital and that <NAME cl="PERSON" case="true" number="one">a health care worker</NAME> at the same hospital may also be <NAME scl="CONDITION">infected</NAME>. While these reports have not yet been officially confirmed by national authorities, they appear to be accurate. <NAME cl="ORGANIZATION">WHO</NAME> is seeking confirmation and further information from the <NAME scl="ORGANIZATION">Ministry of Health</NAME>.</DOC>
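The annotation format on this slide can be read with a standard XML parser. The following is a minimal sketch, assuming the `id` attribute is closed with a straight quote and that `cl` and `scl` both carry the entity class (the slide uses both spellings). The function name `extract_entities` and the shortened excerpt are illustrative, not part of BioCaster.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the annotated report shown on the slide.
SAMPLE = (
    '<DOC id="000101" language="en-us" source="WHO" domain="health" '
    'subdomain="disease" date="2007/3/2" relevancy="publish">'
    '<NAME cl="DISEASE">Avian influenza</NAME> situation in '
    '<NAME cl="LOCATION">Vietnam</NAME> update 21 '
    '<NAME cl="TIME">16 June 2005</NAME>'
    '<NAME scl="ORGANIZATION">WHO</NAME> is aware of media reports that '
    '<NAME cl="PERSON" case="true" number="many">six additional patients</NAME>'
    '</DOC>'
)


def extract_entities(xml_text):
    """Return (relevancy, [(entity_class, text), ...]) for one document."""
    doc = ET.fromstring(xml_text)
    entities = []
    for name in doc.iter("NAME"):
        # Assumption: `cl` and `scl` are alternative spellings of the class.
        label = name.get("cl") or name.get("scl")
        entities.append((label, "".join(name.itertext()).strip()))
    return doc.get("relevancy"), entities


relevancy, entities = extract_entities(SAMPLE)
print(relevancy)
for label, text in entities:
    print(label, "->", text)
```

The `relevancy` attribute on `DOC` holds the document-level label (alert, publish, check, or reject) that the later classification experiments collapse into Relevant versus Reject.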

  9. Four Results:
     • Frequency
        • Negative affect words occur more frequently in disease outbreak reports than in other kinds of news article, at a statistically significant level.
     • Classification Experiments
        • Negative affect words alone are not useful for distinguishing between disease outbreak reports and other kinds of news article
        • Easily computable non-lexical features (word length, sentence length, etc.) are not useful for distinguishing between disease outbreak reports and other kinds of news article
        • Using statistically significant keywords increases classification accuracy compared to a standard “bag-of-words” type representation
        • Non-topical function words alone do not perform well compared to a “bag-of-words” representation

  10. Negative Affect Resources
     • MPQA (Multi-Perspective Question Answering) wordlist, University of Pittsburgh (Wilson, 2001)
     • 4,911 negative affect words, including:
       abandoned, abandonment, abandon, abase, abasement, abash, abate, abdicate, aberration, aberration, abhor, abhor, abhorred, abhorrence, abhorrent, abhorrently, abhors, abhors, abject, abjectly, abjure, abnormal, abolish, abominable, abominably, abominate, abomination, abrade, abrasive, abrupt, abscond, absence, absentee, absent-minded, absurd, absurdity, absurdly, absurdness, abuse, abuse, abuse, abuses, abuses, abusive, abysmal, abysmally, abyss, accidental, accost, accountable, accursed, accusation, accusation, accusations, accusations, accuse, accuses, accusing, accusingly, acerbate, acerbic, acerbically, ache, acrid, acridly, acridness, acrimonious, acrimoniously, acrimony, adamant, adamantly, addict, addiction, admonish, admonisher, admonishingly…

  11. Negative Affect Word Frequency
     • 44% of sentences in standard news (WSJ) contain evaluative/subjective words (Wiebe et al. 2001)
     • Example sentence: “Kaggwa attributed the increase to the stigma associated with the disease, poverty and inaccessibility of some health centres that offer diagnosis and treatment services” (BioCaster Corpus)
     • Sample of 78,472 words taken from relevant and reject categories
     • Relevant category: 5.94% negative
     • Reject category: 4.19% negative
     • The difference is statistically significant (P < 0.05)
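The percentages on this slide are the share of word tokens that appear in the negative affect word list. A minimal sketch of that count follows. The six-word lexicon is a stand-in chosen for illustration (it is not the 4,911-word MPQA list, and treating "stigma" and "poverty" as negative is an assumption), and the tokenizer is a simple regular expression rather than whatever was used in the study.

```python
import re

# Illustrative stand-in lexicon; NOT the real 4,911-word list.
NEGATIVE_WORDS = {"abuse", "ache", "addiction", "abnormal", "stigma", "poverty"}


def negative_ratio(text, lexicon):
    """Fraction of word tokens in `text` that appear in `lexicon`."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


sentence = (
    "Kaggwa attributed the increase to the stigma associated with the "
    "disease, poverty and inaccessibility of some health centres that "
    "offer diagnosis and treatment services"
)
ratio = negative_ratio(sentence, NEGATIVE_WORDS)
print(f"{ratio:.2%} of tokens are in the lexicon")
```

Applied to each category of the corpus with the full word list, the same ratio would give figures comparable to the 5.94% and 4.19% reported above.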

  12. Most Frequent Negative Words in Relevant Category

  13. Classification Experiments
     • Naïve Bayes classification algorithm used for the binary classification task (Relevant or Reject)
     • Weka data mining toolkit used
     • Two baselines:
        • One Rule: assigning every document to the reject class yields a classification accuracy of 52.4%
        • Bag-of-words text representation (i.e. all 18,899 word types in the BioCaster corpus) achieves an accuracy of 89.6%
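The 52.4% baseline follows directly from the corpus split (262 reject documents out of 500). The sketch below reproduces that figure and shows a small multinomial Naïve Bayes classifier over bag-of-words counts. It is a plain-Python illustration with add-one smoothing, not Weka's implementation, and the four training documents are invented for the example.

```python
import math
from collections import Counter, defaultdict


def majority_baseline(n_relevant, n_reject):
    """Accuracy obtained by always predicting the larger class."""
    return max(n_relevant, n_reject) / (n_relevant + n_reject)


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration (not from the corpus).
train_docs = [
    "ebola outbreak kills three in village".split(),
    "avian influenza cases confirmed in hospital".split(),
    "team wins baseball final".split(),
    "software virus attacks new operating system".split(),
]
train_labels = ["relevant", "relevant", "reject", "reject"]

model = NaiveBayes().fit(train_docs, train_labels)
baseline = majority_baseline(238, 262)
prediction = model.predict("influenza outbreak confirmed".split())
print(baseline, prediction)
```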

  14. Classification: Negative Affect Words
     • The 4,911 negative affect words provided a classification accuracy of only 74.4% (compared to bag-of-words performance of 89.6%)

  15. Classification: Non-Lexical Features
     • Easily computable non-lexical features failed to distinguish between the relevant and reject categories:
        • Word length: 56.2%
        • Sentence length: 54.8%
        • Punctuation density: 55.2%
     • Naïve Bayes classifier used
     • Baseline: assigning every document to the reject class would yield an accuracy of 52.4%
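The three features above can each be computed per document in a few lines. The slide does not define them precisely, so the sketch below assumes mean characters per word, mean words per sentence, and punctuation characters divided by total characters. The function name `non_lexical_features` is hypothetical.

```python
import re
import string


def non_lexical_features(text):
    """Compute three simple surface features for one document."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    if not words or not sentences:
        return {
            "mean_word_length": 0.0,
            "mean_sentence_length": 0.0,
            "punctuation_density": 0.0,
        }
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "punctuation_density": punctuation / len(text),
    }


features = non_lexical_features(
    "Of these patients, two have died. "
    "All three patients are hospitalized in Ankara."
)
print(features)
```

Each feature yields a single number per document, which helps explain why a classifier built on any one of them stays close to the 52.4% baseline.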

  16. Classification: Keywords
     • Using statistically significant keywords increases classification accuracy compared to a standard “bag-of-words” (i.e. all the unigrams) type representation.
     • Keywords calculated using the Chi-squared method (using AntConc; also available in WordSmith Tools)
     • All 950 statistically significant keywords yielded classification accuracy: 94.4%
     • “Bag-of-words” features (i.e. all 18,899 word types in the BioCaster corpus) yielded accuracy: 89.6%
     • Statistically significant difference (P < 0.05) using the corrected resampled t-test (Bouckaert and Frank 2004).
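Chi-squared keyness compares how often a word occurs in one corpus against a reference corpus using a 2x2 contingency table. The sketch below uses the plain Pearson statistic without continuity correction; AntConc and WordSmith Tools may differ in such details, and the word counts in the example are hypothetical.

```python
def chi_squared_keyness(freq_target, size_target, freq_ref, size_ref):
    """Pearson chi-squared for a 2x2 table: word vs. all other words,
    target corpus vs. reference corpus."""
    a, b = freq_target, freq_ref
    c, d = size_target - freq_target, size_ref - freq_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Critical value of chi-squared with 1 degree of freedom at p = 0.05.
CRITICAL_P05 = 3.841

# Hypothetical counts: a word seen 120 times in 40,000 relevant-category
# tokens and 10 times in 40,000 reject-category tokens.
score = chi_squared_keyness(120, 40_000, 10, 40_000)
is_keyword = score > CRITICAL_P05
print(round(score, 2), is_keyword)
```

Running this test over every word type and keeping those above the threshold is one way to arrive at a reduced feature set such as the 950 keywords reported here.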

  17. Classification: Keywords list

  18. Classification: Function Words
     • 319 function words (IR research group at Glasgow University)
     • Function words alone yielded a classification accuracy of 74.8%
     • Of the 950 statistically significant keywords, 55 (5.7%) were function words (including never, but, and, back, up, with, where, and yet)
     • To test how important function words are to the classification task, we removed them from the 950 statistically significant features. This resulted in a classification accuracy of 93.8%, compared to 94.4% using all 950 features. The difference was not statistically significant, indicating that function words do not contribute strongly to classification accuracy.
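The significance comparisons in these experiments use the corrected resampled t-test cited on the keywords slide. That statistic is commonly written as the mean accuracy difference divided by sqrt((1/k + n_test/n_train) * variance), with k - 1 degrees of freedom over k resampled runs. The sketch below computes it; the per-run differences and the 450/50 split are hypothetical, since the slides do not report them.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for paired accuracy differences.

    diffs   -- per-run accuracy differences between two feature sets
    n_train -- number of training instances in each run
    n_test  -- number of test instances in each run
    """
    k = len(diffs)
    var = variance(diffs)  # sample variance (k - 1 denominator)
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * var)


# Hypothetical differences from five runs on a 450/50 train/test split.
diffs = [0.04, 0.06, 0.05, 0.03, 0.07]
t_stat = corrected_resampled_t(diffs, n_train=450, n_test=50)
print(round(t_stat, 3))
```

The extra `n_test / n_train` term widens the variance estimate to account for overlapping training sets across runs, so the test is more conservative than a standard paired t-test on the same differences.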

  19. Classification Results

  20. Conclusions
     • Infectious disease outbreak reports contain a greater number of negative affect words than other kinds of news reports.
     • Negative affect words alone are not useful for this classification task.
     • Easily computable non-lexical features are not useful.
     • Using statistically significant keywords increases classification accuracy.
     • Non-topical function words alone do not perform well.
     • Statistically significant keywords comprise function words, negative affect words, and disease and health words.

  21. Thank You Contact: mike@nii.ac.jp

  22. BioCaster Funders:
     • Japan Society for the Promotion of Science
     • Research Organization of Information and Systems
     • National Institute of Informatics, Japan
     Collaborating Organizations:
     • National Institute of Infectious Diseases, Japan
     • Vietnam National University
     • Kasetsart University, Thailand
     • Okayama University, Japan

  23. References
     Bouckaert, R. & Frank, E. (2004). Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. Advances in Knowledge Discovery and Data Mining, Springer.
     Wiebe, J., Wilson, T. & Bell, M. (2001). Identifying Collocations for Recognizing Opinions. Proc. ACL 01 Workshop on Collocation. Toulouse, France, July 2001.
     Wilson, T., Wiebe, J. & Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005.
