
Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge



Presentation Transcript


  1. Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Xia Hu, Nan Sun, Chao Zhang, Tat-Seng Chua Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), 2009 Speaker: Chien-Liang Wu

  2. Outline • Motivation • Framework for feature construction • Hierarchical Resolution • Feature Generation • Feature Selection • Evaluation

  3. Motivation • Aggregated search • Gathers search results from various resources • Presents them in a succinct format • One of the key issues in aggregated search technology • "How should the information be presented to the user?" • Traditional browsing of search results • A ranked list • Inconvenient for users to effectively locate their interests

  4. Motivation (contd.) • Some commercial aggregated search systems, such as DIGEST and Clusty • Provide clustering of relevant search results • Make the information more systematic and manageable • Short texts • Snippets, product descriptions, QA passages and image captions play important roles in Web and IR applications • Successfully processing short texts is essential for aggregated search systems • Short texts consist of only a few phrases or 2–3 sentences • They present great challenges for clustering

  5. Framework • Example query: "The Dark Knight" • Google snippet: "Jul 18, 2008 ... It is the best American film of the year so far and likely to remain that way. Christopher Nolan's The Dark Knight is revelatory, visceral ..." • [Framework diagram: (1) hierarchical resolution, (2) feature generation, (3) feature selection]

  6. Hierarchical Resolution • Exploits the internal semantics of short texts • Preserves contextual information • Avoids data sparsity • Uses NLP techniques to construct a syntax tree • Decomposes the original text into a three-level, top-down hierarchical structure • Segment level, phrase level and word level

  7. Three Levels • Segment level • The text is split into segments with the help of punctuation • Segments are generally ambiguous • They often fail to convey the exact information needed to represent the short text • Phrase level • Adopt shallow parsing to divide sentences into a series of chunks • Apply stemming and remove stop-words from the phrases • The NP and VP chunks are employed as phrase-level features

  8. Three Levels (contd.) • Word level • Decompose the phrase-level features directly • Build word-level features • Choose the non-stop words contained in the NP and VP chunks • Further remove the meaningless words in the short texts • Original feature set • Select features at the phrase level and word level • Phrase level: contains the original contextual information • Word level: avoids the problem of data sparseness
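A minimal sketch of the three-level decomposition described on slides 6–8, assuming NLTK stands in for the NLP toolkit; the chunk grammar, tokenizer and stop-word list below are illustrative assumptions, not the authors' exact pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('stopwords')
STOP = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Illustrative shallow-parsing grammar; the slides do not specify the chunker used.
chunker = nltk.RegexpParser(r"""
    NP: {<DT>?<JJ.*>*<NN.*>+}
    VP: {<VB.*>+<RB.*>*<JJ.*>*}
""")

def hierarchical_resolution(text):
    # Segment level: split the text on punctuation.
    segments = [s.strip() for s in re.split(r'[.,;:!?]+', text) if s.strip()]
    phrases, words = [], []
    for seg in segments:
        tagged = nltk.pos_tag(nltk.word_tokenize(seg))
        tree = chunker.parse(tagged)
        for sub in tree.subtrees(lambda t: t.label() in ('NP', 'VP')):
            # Phrase level: stemmed, stop-word-free NP/VP chunks.
            phrase = ' '.join(stemmer.stem(w.lower()) for w, _ in sub.leaves()
                              if w.lower() not in STOP)
            if phrase:
                phrases.append(phrase)
                # Word level: the individual non-stop words of each chunk.
                words.extend(phrase.split())
    return segments, phrases, words

print(hierarchical_resolution(
    "Christopher Nolan's The Dark Knight is revelatory, visceral ..."))
```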

  9. Feature Generation • Goal: • Build semantic relationships with other relevant concepts • Example: "The Dark Knight" and "Batman" • Two steps • Select seed phrases from the internal semantics • Generate external features from the seed phrases

  10. Seed Phrase Selection • Use the features at the segment level and phrase level to construct the seed phrases • Redundancy problem • Segment-level feature: "Christopher Nolan's The Dark Knight is revelatory visceral" • Three phrase-level features: [NP Christopher Nolan's], [NP The Dark Knight] and [NP revelatory visceral] • Eliminate information redundancy • Measure the semantic similarity between phrase-level features and their parent segment-level feature

  11. Semantic Similarity Measure • A phrase–phrase semantic similarity measure algorithm • Use co-occurrence double-checking in Wikipedia to reduce semantic duplicates • Download the XML corpus of Wikipedia • Build a Solr index of all XML articles • Let P = {p1, p2, ..., pn} • P: a segment-level feature, pi: a phrase-level feature • InfoScore(pi): the semantic similarity between pi and {p1, p2, ..., pn}

  12. Semantic Similarity Measure (contd.) • Given two phrases pi and pj • Use pi and pj separately as queries to retrieve the top C Wikipedia pages • f(pi): total occurrences of pi in the top C Wikipedia pages retrieved by query pi • f(pi|pj): total occurrences of pi in the top C Wikipedia pages retrieved by query pj • Variants of three popular co-occurrence measures are computed from these counts

  13. Semantic Similarity Measure (contd.) • The similarity scores are normalized into values in the [0, 1] range • The final similarity is a linear combination of the three normalized measures, weighted by α and β (a sketch of the computation follows below)
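A hedged sketch of slides 11–13. The slides do not name the three co-occurrence measures, so Jaccard-, Overlap- and Dice-style variants are assumed here, and `retrieve_top_pages` is a hypothetical wrapper around the Solr index of the Wikipedia dump.

```python
def count_occurrences(phrase, pages):
    """Total occurrences of `phrase` across a list of page texts."""
    return sum(page.lower().count(phrase.lower()) for page in pages)

def pair_similarity(p_i, p_j, retrieve_top_pages, C=100, alpha=1/3, beta=1/3):
    pages_i = retrieve_top_pages(p_i, C)        # top-C pages for query p_i
    pages_j = retrieve_top_pages(p_j, C)        # top-C pages for query p_j
    f_i = count_occurrences(p_i, pages_i)       # f(p_i)
    f_j = count_occurrences(p_j, pages_j)       # f(p_j)
    # "Double check": count each phrase in the pages retrieved by the other.
    co = count_occurrences(p_i, pages_j) + count_occurrences(p_j, pages_i)
    if co == 0 or f_i == 0 or f_j == 0:
        return 0.0
    jaccard = co / (f_i + f_j - co) if f_i + f_j > co else 1.0
    overlap = co / min(f_i, f_j)
    dice = 2 * co / (f_i + f_j)
    gamma = 1 - alpha - beta                    # weights of the linear combination
    # Clamped to [0, 1]; the paper's exact normalisation is not shown on the slides.
    return min(1.0, alpha * jaccard + beta * overlap + gamma * dice)

def info_score(p_i, phrases, retrieve_top_pages):
    """InfoScore(p_i): similarity between p_i and the other phrase-level features."""
    return sum(pair_similarity(p_i, p_j, retrieve_top_pages)
               for p_j in phrases if p_j != p_i)
```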

  14. Semantic Similarity Measure (contd.) • For each segment-level feature • Rank the information scores of its child node features at the phrase level • Remove the phrase-level feature p* that carries the most duplicated information with respect to the segment-level feature P (see the sketch below)
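Continuing the sketch, the pruning step of slide 14. That exactly one phrase is dropped per segment, and that the parent segment itself also becomes a seed phrase, is read from slides 10 and 14; treat these details as assumptions.

```python
def select_seed_phrases(segment_to_phrases, retrieve_top_pages):
    """segment_to_phrases: mapping segment-level feature -> its phrase-level children."""
    seeds = []
    for segment, phrases in segment_to_phrases.items():
        kept = list(phrases)
        if len(kept) > 1:
            scores = {p: info_score(p, kept, retrieve_top_pages) for p in kept}
            p_star = max(scores, key=scores.get)   # most redundant child phrase
            kept.remove(p_star)
        seeds.append(segment)                      # segment-level features are seeds too
        seeds.extend(kept)
    return seeds
```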

  15. Feature Generator • For each seed phrase, retrieve the top w Wikipedia articles • External features are built from: (1) titles and bold terms (links) in the retrieved Wikipedia pages, and (2) key phrases extracted from the Wikipedia pages by the Lingo algorithm • WordNet example: for "in his car", the synsets of "car" give "auto", "automobile", "autocar"
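A sketch of external-feature generation for one seed phrase. `solr_search`, `extract_titles_and_bold` and `lingo_keyphrases` are hypothetical helpers standing in for the Solr query, page parsing and the Lingo algorithm; only the WordNet expansion uses a real API (NLTK's wordnet corpus, requires nltk.download('wordnet')).

```python
from nltk.corpus import wordnet as wn

def wordnet_expansion(seed_phrase):
    """Synset lemma names for the words of a seed phrase,
    e.g. 'car' expands to 'auto', 'automobile', ..."""
    feats = set()
    for word in seed_phrase.split():
        for synset in wn.synsets(word):
            feats.update(lemma.name().replace('_', ' ') for lemma in synset.lemmas())
    return feats

def generate_external_features(seed_phrase, solr_search,
                               extract_titles_and_bold, lingo_keyphrases, w=20):
    pages = solr_search(seed_phrase, rows=w)         # hypothetical: top-w Wikipedia articles
    feats = set()
    for page in pages:
        feats.update(extract_titles_and_bold(page))  # (1) titles and bold terms (links)
        feats.update(lingo_keyphrases(page))         # (2) Lingo key phrases
    feats.update(wordnet_expansion(seed_phrase))     # WordNet synsets, e.g. "in his car" -> "auto"
    return feats
```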

  16. Feature Selection • Overzealous external features • Can have an adverse impact on effectiveness • Dilute the influence of the valuable original information • Empirical rules to refine the unstructured features obtained from Wikipedia pages • Remove too-general seed phrases (more than 10,000 occurrences) • Transform features used for Wikipedia management or administration, e.g. "List of hotels" → "hotels", "List of twins" → "twins" • Phrase sense stemming using the Porter stemmer • Remove features related to chronology, e.g. "year", "decade" and "centuries"
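A sketch of the empirical refinement rules on this slide. The rule order, and the assumption that the 10,000-occurrence threshold is checked against a Wikipedia occurrence count passed in by the caller, are illustrative choices rather than the authors' stated procedure.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
CHRONOLOGY = {'year', 'years', 'decade', 'decades', 'century', 'centuries'}

def refine_external_feature(feature, wiki_occurrences):
    """Returns the cleaned feature, or None if it should be dropped."""
    # Rule 1: drop overly general entries (more than 10,000 occurrences).
    if wiki_occurrences > 10_000:
        return None
    # Rule 2: strip Wikipedia management/administration patterns,
    # e.g. "List of hotels" -> "hotels".
    feature = re.sub(r'^list of\s+', '', feature, flags=re.IGNORECASE)
    # Rule 3: drop chronology-related features ("year", "decade", "centuries", ...).
    if all(tok.lower() in CHRONOLOGY for tok in feature.split()):
        return None
    # Rule 4: phrase sense stemming with the Porter stemmer.
    return ' '.join(stemmer.stem(tok) for tok in feature.lower().split())
```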

  17. External Feature Collection • Construct an (n1 + n2)-dimensional feature space • n1 original features, n2 external features • A parameter θ controls the proportion of external features: θ = 0 → no external features, θ = 1 → no original features • One seed phrase si (0 < i ≤ m) may generate k external features {fi1, fi2, ..., fik} • Select one feature fi* for each seed phrase • The top n2 − m features are then extracted from the remaining external features based on their frequency
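A sketch of assembling the final feature space. The slide only states the boundary behaviour of θ, so the mapping n2 = θ / (1 − θ) · n1 and the frequency-based choice of fi* are illustrative assumptions.

```python
from collections import Counter

def collect_feature_space(original_features, seed_to_external, theta=0.5):
    """seed_to_external: mapping seed phrase s_i -> its candidate external features."""
    originals = list(original_features)
    if theta <= 0:
        return originals          # theta = 0: no external features
    if theta >= 1:
        originals = []            # theta = 1: no original features

    freq = Counter(f for feats in seed_to_external.values() for f in feats)

    # Step 1: keep one representative external feature f_i* per seed phrase
    # (chosen here by frequency; the slide does not state the criterion).
    selected, remaining = [], []
    for feats in seed_to_external.values():
        if not feats:
            continue
        best = max(feats, key=lambda f: freq[f])
        selected.append(best)
        remaining.extend(f for f in feats if f != best)

    # Step 2: fill the remaining n2 - m slots with the most frequent leftovers.
    m = len(selected)
    n2 = int(theta / (1 - theta) * len(originals)) if originals else m + len(remaining)
    remaining.sort(key=lambda f: freq[f], reverse=True)
    selected.extend(remaining[:max(n2 - m, 0)])
    return originals + selected
```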

  18. Evaluation • Datasets • Reuters-21578 • Remove the texts that contain more than 50 words • Filter out clusters with fewer than 5 texts or more than 500 texts • This leaves 19 clusters comprising 879 texts • Web Dataset • Ten hot queries from Google Trends • The top 100 snippets for each query are retrieved • This builds a 10-category Web Dataset with 1,000 texts
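A small sketch of the Reuters filtering step, assuming the corpus is already loaded as a category-to-texts mapping; whitespace word counting is a simplification.

```python
def build_reuters_subset(clusters, max_words=50, min_texts=5, max_texts=500):
    """clusters: mapping from category label to a list of texts."""
    kept = {}
    for label, texts in clusters.items():
        short = [t for t in texts if len(t.split()) <= max_words]  # drop long texts
        if min_texts <= len(short) <= max_texts:                   # drop tiny/huge clusters
            kept[label] = short
    return kept
```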

  19. Clustering Methods • Two clustering algorithms: K-means and EM • Six different text representation methods • BOW (baseline 1): the traditional "bag of words" model with the tf-idf weighting scheme • BOW+WN (baseline 2): BOW integrated with additional features from WordNet • BOW+Wiki (baseline 3): BOW integrated with additional features from Wikipedia • BOW+Know (baseline 4): BOW integrated with additional features from Wikipedia and WordNet • BOF: the bag of original features extracted with the hierarchical view • SemKnow: the proposed framework
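A minimal sketch of the BOW baseline under the two clustering algorithms, assuming scikit-learn (GaussianMixture stands in for EM clustering; the toolkit actually used is not stated on the slides). The other representations would only change the text fed to the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_bow(texts, n_clusters):
    """tf-idf 'bag of words' vectors clustered with K-means and with EM."""
    X = TfidfVectorizer(stop_words='english').fit_transform(texts)
    kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # GaussianMixture (an EM algorithm) needs a dense array.
    em_labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(X.toarray())
    return kmeans_labels, em_labels
```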

  20. Evaluation Criteria • F1 measure • F1 Score = 2*(precision * recall) / (precision + recall) • Average Accuracy
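A short sketch of the two criteria: the F1 formula is taken from the slide, while the accuracy computation (majority-class mapping per cluster) is an assumption about what "average accuracy" means here.

```python
from collections import Counter

def f1(precision, recall):
    """F1 score as on the slide: 2 * (precision * recall) / (precision + recall)."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def average_accuracy(true_labels, cluster_labels):
    """Map each cluster to its majority true class and count correct assignments."""
    members = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        members.setdefault(cluster, []).append(truth)
    correct = sum(Counter(group).most_common(1)[0][1] for group in members.values())
    return correct / len(true_labels)
```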

  21. Performance Evaluation • Parameter settings: C = 100, w = 20, α = β = ⅓, θ = 0.5 • [Results using the K-means algorithm] • [Results using the EM algorithm]

  22. Effect of External Features • With a small external feature set (θ = 0.2 or θ = 0.3), SemKnow achieves the best performance • [Reuters dataset, K-means algorithm] • [Web Dataset, K-means algorithm]

  23. Optimal results using the two algorithms

  24. Detailed Analysis • Feature space for the example snippet
