
Unsupervised Detection of Anomalous Text

Unsupervised Detection of Anomalous Text. David Guthrie, The University of Sheffield. Textual Anomalies: Computers are routinely used to detect differences from what is normal or expected (fraud, network attacks). The principal focus of this research is to similarly detect text that is irregular.


Presentation Transcript


  1. Unsupervised Detection of Anomalous Text David Guthrie The University of Sheffield

  2. Textual Anomalies • Computers are routinely used to detect differences from what is normal or expected • fraud • network attacks • Principal focus of this research is to similarly detect text that is irregular • We view text that deviates from its context as a type of anomaly

  3. Anomalous Documents? New Document Collection

  4. Anomalous Documents? Find text that is unusual

  5. New Document Anomalous Segments?

  6. New Document Anomalous Segments?

  7. New Document Anomalous Segments?

  8. New Document Anomalous Segments? Anomalous

  9. Motivation • Plagiarism • Writing style of plagiarized passages is anomalous with respect to the rest of the author's work • Detect such passages because the writing is “odd”, not by using external resources (web) • Improving Corpora • Automatically gathered corpora can contain errors; improve their integrity and homogeneity • Unsolicited Email • E.g. spam constructed from sentences • Undesirable Bulletin Board or Wiki posts • E.g. rants on Wikipedia

  10. Goals • To develop a general approach which recognizes: • different dimensions of anomaly • fairly small segments (50 to 100 words) • Multiple anomalous segments

  11. Unsupervised • For this task we assume there is no training data available to characterize “normal” or “anomalous” language • When we first look at a document we have no idea which segments are “normal” and which are “anomalous” • Segments are anomalous with respect to the rest of the document not to a training corpus

  12. Outlier Detection • Treat the problem as a type of outlier detection • We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’

  13. Characterizing Text • 166 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …) • Simple Surface Features • Readability Measures • POS Distributions (RASP) • Vocabulary Obscurity • Emotional Affect (General Inquirer Dictionary)

  14. Readability Measures • Attempt to provide a rough indication of the reading level required for a text • Purported to correspond to how “easily” a text is read • Work well for differentiating certain texts (scores are Flesch Reading Ease): • Romeo & Juliet 84 • Plato’s Republic 69 • Comic Books 92 • Sports Illustrated 63 • New York Times 39 • IRS Code -6

  15. Readability Measures • Flesch-Kincaid Reading Ease • Flesch-Kincaid Grade Level • Gunning-Fog Index • Coleman-Liau Formula • Automated Readability Index • Lix Formula • SMOG Index
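Two of the measures listed above can be sketched in a few lines. This is only an illustration, not the implementation used in the talk: the sentence splitter, the word pattern, and the vowel-group syllable counter are crude assumptions, while the constants are the standard published ones for Flesch Reading Ease and the Automated Readability Index.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Return (number of sentences, list of words) using naive splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return len(sentences), words


def flesch_reading_ease(text):
    """Higher scores mean easier text."""
    n_sentences, words = text_stats(text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / n_sentences)
            - 84.6 * (syllables / len(words)))


def automated_readability_index(text):
    """Higher scores mean harder text; needs no syllable counts."""
    n_sentences, words = text_stats(text)
    characters = sum(len(w) for w in words)
    return (4.71 * (characters / len(words))
            + 0.5 * (len(words) / n_sentences)
            - 21.43)
```

Each score would become one column of a segment's feature vector. On segments of 50 to 100 words the sentence counts are small, so the scores are likely to be noisy.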

  16. Obscurity of Vocabulary • Implemented new features to capture the richness of the vocabulary used in a segment of text • Lists of the most frequent words in Gigaword • Measure how the words in a segment of text are distributed across each group of words • Top 1,000 words • Top 5,000 words • Top 10,000 words • Top 50,000 words • Top 100,000 words • Top 200,000 words • Top 300,000 words
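A minimal sketch of one way to turn such frequency lists into features. It assumes a list of words already sorted from most to least frequent (here called `ranked_vocab`, a hypothetical name) and reports, for each cutoff, the fraction of a segment's tokens that fall within the top N words. Whether the original features were cumulative like this or computed per band is not stated on the slide.

```python
def band_profile(tokens, ranked_vocab,
                 cutoffs=(1000, 5000, 10000, 50000, 100000, 200000, 300000)):
    """Fraction of tokens found among the top-N most frequent words, per cutoff.

    ranked_vocab: words ordered from most to least frequent (assumed input,
    e.g. derived from a large reference corpus).
    """
    rank = {word: i + 1 for i, word in enumerate(ranked_vocab)}
    tokens = [t.lower() for t in tokens]
    profile = {}
    for cutoff in cutoffs:
        covered = sum(1 for t in tokens if rank.get(t, float("inf")) <= cutoff)
        profile[cutoff] = covered / len(tokens) if tokens else 0.0
    return profile
```

A segment with unusually obscure vocabulary would show low coverage even at the larger cutoffs.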

  17. Part-of-Speech • All segments are passed through the RASP (Robust and Accurate Statistical Parser) part-of-speech tagger • All words tagged with one of 155 part-of-speech tags from the CLAWS 2 tagset

  18. Part-of-Speech • Diversity of POS trigrams • % articles • % prepositions • % pronouns • % adjectives • % conjunctions • Ratio of adjectives to nouns • % of sentences that begin with a subordinating or coordinating conjunction (but, so, then, yet, if, because, unless, or, …)

  19. Morphological Analysis • Texts are also run through the RASP morphological analyser, which produces word lemmas and inflectional affixes (e.g. were → be + ed, made → make + ed, thinking → think + ing, apples → apple + s) • Gather statistics about the percentage of passive sentences and the amount of nominalization

  20. Rank Features • Store lists ordered by the frequency of occurrence of certain stylistic phenomena • Most frequent POS trigrams list • Most frequent POS bigram list • Most frequent POS list • Most frequent Articles list • Most frequent Prepositions list • Most frequent Conjunctions list • Most frequent Pronouns list

  21. List Rank Similarity • To calculate the similarity between two segments' lists, we use Spearman’s Rank Correlation measure
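As a rough illustration of the rank comparison, the sketch below computes Spearman's rank correlation with the textbook no-ties formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). It assumes both lists contain exactly the same items with no repeats. How the original work handled items that appear in only one list is not stated, so that case is rejected here rather than guessed at.

```python
def spearman_rank_correlation(list_a, list_b):
    """Spearman's rho for two orderings of the same items (no ties).

    list_a, list_b: items ordered from most to least frequent.
    Returns a value in [-1, 1]; 1 means identical order.
    """
    if len(list_a) != len(list_b) or len(set(list_a)) != len(list_a) \
            or set(list_a) != set(list_b):
        raise ValueError("lists must be permutations of the same distinct items")
    n = len(list_a)
    if n < 2:
        raise ValueError("need at least two items")
    rank_b = {item: r for r, item in enumerate(list_b)}
    d_squared = sum((r - rank_b[item]) ** 2 for r, item in enumerate(list_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

For example, two segments whose most frequent prepositions appear in the same order get a similarity of 1, and fully reversed orders get -1.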

  22. Sentiment • General Inquirer Dictionary (Developed by social science department at Harvard) 7,800 words tagged with 114 categories: • Positive • Negative • Strong • Weak • Active • Passive • Overstated • Understated • Agreement • Disagreement • Negate • Casual slang • Think • Know • Compare • Person Relations • Need • Power Gain • Power Loss • Affection • Work • and many more …

  23. Representation • Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features • Use these vectors to construct a matrix, X, which has a number of rows equal to the number of pieces of text in the corpus and a number of columns equal to the number of features
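The construction of X can be sketched as follows. The three feature functions are hypothetical stand-ins chosen for brevity; the talk uses 166 features, none of which are reproduced here.

```python
import re


def words(text):
    return re.findall(r"[A-Za-z]+", text.lower())


# Hypothetical stand-in features, each mapping a piece of text to one number.
def avg_word_length(text):
    w = words(text)
    return sum(len(x) for x in w) / len(w) if w else 0.0


def type_token_ratio(text):
    w = words(text)
    return len(set(w)) / len(w) if w else 0.0


def word_count(text):
    return float(len(words(text)))


FEATURES = [avg_word_length, type_token_ratio, word_count]


def feature_matrix(segments, features=FEATURES):
    """One row per piece of text, one column per feature."""
    return [[f(segment) for f in features] for segment in segments]
```

Calling `feature_matrix(segments)` on a list of k segments yields a k-by-3 matrix here, matching the rows-by-features layout described on the slide.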

  24. Document or corpus Represent each piece of text as a vector of features Feature Matrix X

  25. Document or corpus Identify outlying Text Feature Matrix X

  26. Approaches • Mean Distance: • Compute a segment's average distance from the other segments • Comp Distance: • Compute a segment’s difference from its complement • SDE Distance: • Find the projection of the data where segments appear farthest

  27. Mean Distance

  28. Finding Outlying Segments • Calculate the distance from segment 1 to segment 2 (Dist = .5)

  29. Finding Outlying Segments • Calculate the distance from segment 1 to segment 3 (Dist = .3)

  30. Finding Outlying Segments • Build a Distance Matrix from the Feature Matrix

  31. Finding Outlying Segments • Choose the segment that is most different according to the Distance Matrix (the outlier)

  32. Ranking Segments • Feature Matrix (segments 1 … n by features f1 … fn) • Distance Matrix (segments 1 … n by segments 1 … n) • Produce a ranking of segments (List of Segments)
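The mean-distance procedure walked through on these slides can be sketched as below. Euclidean distance is used only as a default for illustration, and the function names are not from the original work.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(X, dist=euclidean):
    """Pairwise distances between all rows of the feature matrix X."""
    n = len(X)
    return [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]


def rank_by_mean_distance(X, dist=euclidean):
    """Return segment indices, most outlying first.

    Each segment is scored by its average distance to every other segment
    (the zero self-distance on the diagonal is excluded by dividing by n - 1).
    """
    n = len(X)
    if n < 2:
        raise ValueError("need at least two segments")
    D = distance_matrix(X, dist)
    mean_dist = [sum(row) / (n - 1) for row in D]
    return sorted(range(n), key=lambda i: mean_dist[i], reverse=True)
```

The first index in the returned list is the candidate outlier, and the whole list is the ranking of segments.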

  33. Distance Measures • Euclidean Distance • City Block Distance • Cosine Similarity Measure (d = 1 - s) • Pearson Correlation Coefficient (d = 1 - r)
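For reference, the four measures can be written out as plain functions over two feature vectors. This is a sketch using the standard definitions. Cosine and Pearson are converted to distances as on the slide (d = 1 - s and d = 1 - r), and both are undefined for zero or constant vectors, which this sketch does not guard against.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def city_block(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """d = 1 - s, where s is the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


def pearson_distance(u, v):
    """d = 1 - r, where r is the Pearson correlation of u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    cov = sum(a * b for a, b in zip(du, dv))
    return 1 - cov / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
```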

  34. Standardizing Variables • Desirable for all variables to have about the same influence • We can express each of them as deviations from its mean in units of standard deviations (z-score) • Or standardize all variables to have a minimum of zero and a maximum of one
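Both standardizations are applied column by column to the feature matrix. A minimal sketch follows; using the population standard deviation and mapping constant columns to 0.0 are assumptions made here to keep the example well defined.

```python
import statistics


def zscore_columns(X):
    """Express every value as deviations from its column mean, in std units."""
    columns = list(zip(*X))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(x - m) / s if s else 0.0
             for x, m, s in zip(row, means, stds)] for row in X]


def minmax_columns(X):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)] for row in X]
```

Without this step a feature measured in hundreds (such as a word count) would dominate Euclidean or city block distances over a feature measured in fractions.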

  35. Comp Distance

  36. Distance from complement New Document or corpus

  37. Distance from complement Segment the text

  38. Distance from complement Characterize one segment

  39. Distance from complement • Characterize the complement of the segment (features f1 … fn)

  40. Distance from complement • Compute the distance between the two vectors (D = .4)

  41. Distance from complement • For all segments

  42. Distance from complement • Compute distance between segments (D = .4, D = .6)

  43. Rank by distance from complement • Next, segments are ranked by their distance from the complement • In this scenario we can make good use of list features
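The complement-based ranking can be sketched as follows, with segments given as lists of tokens. The `featurize` function is a hypothetical two-feature stand-in for the full feature set, and Euclidean distance is an arbitrary choice for the example.

```python
import math


def featurize(tokens):
    """Hypothetical stand-in: [average word length, fraction of long words]."""
    if not tokens:
        return [0.0, 0.0]
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    long_frac = sum(1 for t in tokens if len(t) > 6) / len(tokens)
    return [avg_len, long_frac]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_complement_distance(segments, featurize=featurize, dist=euclidean):
    """Score each segment by its distance from the rest of the text.

    The complement of segment i is all tokens of every other segment,
    characterized as a single piece of text.
    Returns (indices sorted most outlying first, list of scores).
    """
    scores = []
    for i, segment in enumerate(segments):
        complement = [t for j, other in enumerate(segments) if j != i
                      for t in other]
        scores.append(dist(featurize(segment), featurize(complement)))
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Because the complement is a large piece of text rather than a single short segment, frequency-ordered list features computed on it are likely to be more stable, which may be why list features are said to work well in this scenario.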

  44. SDE Dist

  45. SDE • Use the Stahel-Donoho Estimator (SDE) to identify outliers • Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension • For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score • Especially suited to data with a large number of dimensions (features)
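One common way to approximate Stahel-Donoho outlyingness is to try many random projection directions and keep, for each point, the largest robust z-score seen. The sketch below does that, with the robust z-score taken as |x - median| / (1.4826 * MAD). The random search, the number of projections, and the 1.4826 consistency constant are assumptions for illustration; the slides do not say how the maximizing projection was found.

```python
import random
import statistics


def sde_outlyingness(X, n_projections=500, seed=0):
    """Approximate Stahel-Donoho outlyingness for each row of X.

    For each random direction, project all rows to one dimension and compute
    each row's robust z-score; a row's outlyingness is its maximum over all
    directions tried.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    scores = [0.0] * len(X)
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        projected = [sum(a * d for a, d in zip(row, direction)) for row in X]
        median = statistics.median(projected)
        mad = statistics.median([abs(p - median) for p in projected])
        if mad == 0:
            continue  # degenerate direction: robust z-score undefined
        for i, p in enumerate(projected):
            z = abs(p - median) / (1.4826 * mad)
            scores[i] = max(scores[i], z)
    return scores
```

Ranking segments by these scores gives the SDE distance ordering. With very few segments the MAD can become tiny in some directions and inflate scores, so a sketch like this is only meaningful when there are many more segments than in a toy example.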

  46. Outliers are ‘hidden’

  47. Robust z-score of the furthest point is < 3
