Using TF-IDF Anomalies to Cluster Documents on Subject Matter

Natural Language Processing and Computational Linguistics • Using TF-IDF Anomalies to Cluster Documents on Subject Matter • An Analysis Using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies • Whitney St.Charles, Research Alliance in Math and Science, 2007 • Mentors: Yu (Cathy) Jiao, Ph.D., and Robert Patton, Ph.D. • Computational Sciences and Engineering Division

Purposes of document clustering • Data overabundance • YouTube generates 200 terabytes of data per day • How do we sift through those kinds of quantities? • Searching • Reduces the set tremendously • Document Clustering • Is a knowledge discovery technique • Categorizes results into meaningful groups • Allows the user to browse quickly to the target

Document clustering users • Financial analysts • Identify certain trends to develop forecasts about a particular company • Business Intelligence • Identify products that are associated with or dependent upon one another • Military • Identify terrorist cells from blog activity and movement of materials • You! • Narrow down hundreds of thousands of internet search results to find the kinds of sites you want

Current document clustering technique • A word-by-word comparison of each document is made to determine similarity • Unfortunately, this method… • Does not handle context very well • Compares several hundred to several thousand words for each document • Is very computationally expensive • Requires expensive SIMD machines

Contributions to the field • Identify only those words which are more indicative of the subject matter • If airline occurs 20% more than is “normal,” it has something to do with the subject • Examine both simple and complex noun phrases to address the context of the document • Generate much smaller vectors, containing an average of 82% fewer terms! • Cluster more accurately because only “important” words are chosen

Our method

Establishing the baseline • Train the program to recognize what is “normal” for a given term • Need an entire English language corpus • Corpus: a large, structured set of texts compiled to be representative of a language • Uses hundreds of thousands of words in every allowable way • Using a corpus, the program can • Establish usage statistics • Learn linguistic rules • Example: The Brown Corpus, http://www.edict.com.hk/concordance/WWWConcappE.htm
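As a rough illustration of "establishing usage statistics," the sketch below computes each term's relative frequency in a reference corpus. The function name `baseline_frequencies` and the toy corpus are assumptions for illustration only; the slides do not say which statistics the program actually stores.

```python
from collections import Counter

def baseline_frequencies(corpus_tokens):
    """Return each term's relative frequency in the reference corpus."""
    counts = Counter(token.lower() for token in corpus_tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

# Toy stand-in for a real corpus such as the Brown Corpus (hypothetical data).
corpus = "the intern tried to keep the team awake while the team worked".split()
baseline = baseline_frequencies(corpus)
```

A real baseline would be built once from the full corpus and reused for every document.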

Extracting words and phrases

Part-of-speech tagging • Tags every word in the sentence with the correct part-of-speech • Achieves an accuracy of 97.24% • Is necessary because token extraction methods are each dependent upon correct tagging • Passes the tagged sentence to the token extractor • Example: The/dt desperate/adj summer/n intern/n tried/vbd to/to keep/vb everyone/n awake/adj.
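The tagger itself is not described in the slides, but its output format (word/tag pairs) is. A minimal sketch of how a token extractor might read that format, assuming whitespace-separated `word/tag` tokens; `parse_tagged` is a hypothetical helper name:

```python
def parse_tagged(sentence):
    """Split a 'word/tag' sentence into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = parse_tagged(
    "The/dt desperate/adj summer/n intern/n tried/vbd to/to keep/vb everyone/n awake/adj"
)
```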

Token extractor • Extracts • Words • Simple noun phrases • Complex noun phrases • [Diagram: Document → Words, Noun phrases]

Word extraction • Uses POS-tagged data to identify only adjectives, verbs, and nouns • Uses the Porter stemmer to identify unique words • Cuts common suffixes such as -ing, -tion, -e, -es, -s • Example: “recreation” and “recreational” are both identified as “recreat”
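The two steps on this slide (filter by part of speech, then stem) can be sketched as follows. Note the assumption: `crude_stem` is a deliberately simple suffix stripper standing in for the Porter stemmer, not an implementation of it, and the tag names follow the example sentence on the tagging slide.

```python
CONTENT_TAGS = ("adj", "vb", "n")  # tag prefixes to keep; 'vb' also matches 'vbd'
SUFFIXES = ("ional", "ion", "ing", "es", "s", "e")  # crude stand-in, NOT Porter

def crude_stem(word):
    """Strip the first matching suffix, leaving a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_words(tagged):
    """Return the set of stems for adjectives, verbs, and nouns only."""
    return {crude_stem(w) for w, tag in tagged if tag.startswith(CONTENT_TAGS)}

stems = extract_words([("The", "dt"), ("interns", "n"), ("tried", "vbd"), ("to", "to")])
```

In practice a tested Porter implementation would replace `crude_stem`; the filtering logic would stay the same.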

Why nouns? • Are named entities • Answer the question “What” • Are less ambiguous than verbs • Example: “cook up a good meal” or “cook up a new solution”

Simple noun phrase extraction • Accepts only consecutive nouns • Example: summer intern, union representative • Provides a set of short, highly descriptive phrases
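A minimal sketch of this rule, assuming a simple noun phrase is a run of two or more consecutive tokens tagged `n` (the two-word minimum is inferred from the examples, not stated on the slide):

```python
def simple_noun_phrases(tagged):
    """Collect runs of two or more consecutive nouns as phrases."""
    phrases, run = [], []
    for word, tag in list(tagged) + [("", "")]:  # sentinel flushes the final run
        if tag == "n":
            run.append(word)
        else:
            if len(run) >= 2:
                phrases.append(" ".join(run))
            run = []
    return phrases

sentence = [("The", "dt"), ("desperate", "adj"), ("summer", "n"), ("intern", "n"),
            ("tried", "vbd"), ("to", "to"), ("keep", "vb"), ("everyone", "n"),
            ("awake", "adj")]
phrases = simple_noun_phrases(sentence)
```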

Complex noun phrase extraction techniques • Static rule-based / finite state automata • Rely on the aptitude of the linguist formulating the rule set • Machine learning • Relies on the “completeness” of the training set

Static rule-based extraction • Establishes a list of linguistic rules • A determiner preceding a noun marks the beginning of a noun phrase • A determiner may not precede a noun phrase • [Diagram: finite state automaton for NP with states S0 and S1; transition labels: noun/pronoun/determiner, determiner/adjective, noun/pronoun, adjective, relative clause/prepositional phrase/noun]
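To make the finite-state idea concrete, here is a small hand-written chunker. It is an illustrative automaton of my own, not the one drawn on the slide: it assumes the pattern "optional determiner, any adjectives, one or more nouns" and uses hypothetical state names `OUT`, `PRE`, and `HEAD`.

```python
def chunk_noun_phrases(tagged):
    """Extract phrases matching: optional determiner, adjectives, then nouns."""
    phrases, current, state = [], [], "OUT"
    for word, tag in list(tagged) + [("", "END")]:  # sentinel flushes the last phrase
        if state == "HEAD" and tag != "n":
            phrases.append(" ".join(current))  # head nouns ended: emit the phrase
            current, state = [], "OUT"
        if tag == "dt":
            current, state = [word], "PRE"     # a determiner opens a new phrase
        elif tag == "adj" and state in ("OUT", "PRE"):
            current.append(word)               # adjectives may precede the head
            state = "PRE"
        elif tag == "n":
            current.append(word)               # nouns form the head of the phrase
            state = "HEAD"
        else:
            current, state = [], "OUT"         # any other tag abandons the phrase
    return phrases

sentence = [("The", "dt"), ("desperate", "adj"), ("summer", "n"), ("intern", "n"),
            ("tried", "vbd"), ("to", "to"), ("keep", "vb"), ("everyone", "n"),
            ("awake", "adj")]
chunks = chunk_noun_phrases(sentence)
```

Every rule here had to be anticipated by hand, which is the dependence on the rule writer that static approaches carry.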

Static extraction shortcomings • Unanticipated rules • The subjective nature of language • Difficulty finding non-recursive, base NPs • [The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP. • [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP. • Structural ambiguity

Structural ambiguity example “I saw the man with the telescope.”

Machine learning extraction • Is all about TRAINING • Uses a corpus • Is based on statistics • The more it sees a particular occurrence, the more likely it is to prefer it • Makes better educated guesses about structural ambiguity • Discovers thousands of unanticipated rules

Transformation-based complex noun phrase extraction An ‘error-driven’ approach for learning an ordered set of rules 1. Generate all rules that correct at least one error. 2. For each rule: (a) Apply to a copy of the most recent state of the training set. (b) Score result 3. Select rule with best score. 4. Update training set by applying selected rule. 5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
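The five steps above can be sketched on a toy tag-correction task. This is a simplified illustration under stated assumptions: a single rule template ("change tag A to B when the previous tag is C"), a score equal to the net gain in correct tags, and hypothetical names `apply_rule`, `count_correct`, and `learn_rules`. The slides do not specify the actual templates or scoring used for noun phrase extraction.

```python
def apply_rule(rule, tags):
    """Apply 'change from_tag to to_tag when the previous tag is prev_tag'."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))

def learn_rules(tags, gold, threshold=1):
    """Learn an ordered rule list; threshold should be positive to guarantee progress."""
    learned = []
    while True:
        # Step 1: generate rules that each correct at least one current error.
        candidates = {(tags[i], gold[i], tags[i - 1])
                      for i in range(1, len(tags)) if tags[i] != gold[i]}
        # Step 2: apply each rule to a copy of the current state and score it.
        base = count_correct(tags, gold)
        scored = [(count_correct(apply_rule(rule, tags), gold) - base, rule)
                  for rule in sorted(candidates)]
        # Step 5: stop when no rule reaches the threshold.
        if not scored or max(scored)[0] < threshold:
            return learned
        # Steps 3 and 4: select the best rule and update the training state.
        _, best_rule = max(scored)
        tags = apply_rule(best_rule, tags)
        learned.append(best_rule)

initial = ["dt", "n", "vb", "dt", "vb"]   # current (partly wrong) tags
gold = ["dt", "n", "vb", "dt", "n"]       # reference tags
rules = learn_rules(initial, gold)
```

Because each accepted rule strictly increases the number of correct tags, the loop terminates once no candidate improves the training set by at least the threshold.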

Determining anomaly sets • TF-IDF: Term Frequency – Inverse Document Frequency • Number of local occurrences of term multiplied by uniqueness measure of term in document set • TF-ICF: Term Frequency – Inverse Corpus Frequency • Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
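As a worked illustration of the TF-IDF definition, the sketch below uses one common variant: raw term count multiplied by log(N / document frequency). The exact weighting in the original work is not given on the slide, so treat this formula as an assumption.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

docs = [["airline", "fare", "airline"], ["fare", "market"], ["fare", "intern"]]
weights = tf_idf(docs)
```

A term that appears in every document ("fare" here) gets weight zero. TF-ICF follows the same shape but takes its uniqueness measure from a fixed reference corpus rather than from the document set being clustered, so that part can be precomputed.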

Each document has its own anomaly vector
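The slides do not spell out how an anomaly vector is built. One plausible reading, based on the earlier "occurs 20% more than is normal" example, is to keep only terms whose relative frequency in the document exceeds the baseline by a chosen ratio. The function name, the `min_ratio` parameter, and the decision to skip terms missing from the baseline are all assumptions.

```python
from collections import Counter

def anomaly_vector(doc_tokens, baseline, min_ratio=1.2):
    """Keep terms occurring at least min_ratio times more often than the baseline."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    vector = {}
    for term, count in counts.items():
        expected = baseline.get(term)
        if expected is None:
            continue  # assumption: skip terms with no baseline statistics
        ratio = (count / total) / expected
        if ratio >= min_ratio:
            vector[term] = ratio
    return vector

# Hypothetical baseline relative frequencies and a toy document.
baseline = {"the": 0.5, "airline": 0.1, "fare": 0.25}
vector = anomaly_vector(["the", "airline", "airline", "fare"], baseline)
```

Only "airline" survives in this toy case, which shows how the vectors end up much shorter than a full word list.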

Clustering the data • Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
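UPGMA is average-linkage agglomerative clustering: repeatedly merge the two clusters with the smallest average pairwise distance. A minimal pure-Python sketch follows, assuming a precomputed distance matrix and a fixed target number of clusters (the stopping criterion used in the original work is not stated):

```python
def upgma(distance, n_items, n_clusters):
    """Merge closest clusters by average pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(n_items)]

    def average_distance(a, b):
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(average_distance(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        _, x, y = min(pairs)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters

# Toy symmetric distance matrix for four documents (hypothetical values).
D = [[0, 1, 9, 8],
     [1, 0, 7, 9],
     [9, 7, 0, 2],
     [8, 9, 2, 0]]
groups = upgma(D, 4, 2)
```

This version recomputes averages on every pass for clarity; it is fine for small sets but not tuned for large ones.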

Performance Metrics Used • Precision (P) = (number of correct responses) / (number of responses) • Recall (R) = (number of correct responses) / (number correct in key) • F-measure = 2RP / (R + P)
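These three metrics can be computed directly from a set of system responses and an answer key. The sketch below treats both as sets of document identifiers; the identifiers and the function name are hypothetical.

```python
def precision_recall_f(responses, key):
    """Return (precision, recall, F-measure) for a response set against a key set."""
    responses, key = set(responses), set(key)
    correct = len(responses & key)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

Here two of four responses are correct (P = 0.5) and two of three key items are found (R = 2/3), giving F = 4/7.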

Results • [Figure: results chart showing 80% and 89%] • With 82% fewer comparisons!

Future Work • Determine clustering results for both simple and complex noun phrases • Could be applied to other clustering techniques, such as swarming

Acknowledgements • The Research Alliance in Math and Science program • Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy. • Dr. Cathy Jiao • Dr. Robert Patton • Dr. Thomas Potok

QUESTIONS?
