
Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge



Presentation Transcript


  1. Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Xia Hu, Nan Sun, Chao Zhang, Tat-Seng Chua Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), 2009 Speaker: Chien-Liang Wu

  2. Outline • Motivation • Framework for feature construction • Hierarchical Resolution • Feature Generation • Feature Selection • Evaluation

  3. Motivation • Aggregated search • Gathers search results from various resources • Presents them in a succinct format • One of the key issues in aggregated search technology • "How should the information be presented to the user?" • Traditional browsing of search results • A ranked list • Inconvenient for users to effectively locate their interests

  4. Motivation (contd.) • Some commercial aggregated search systems, such as DIGEST and Clusty • Provide clustering of relevant search results • Make the information more systematic and manageable • Short texts • Snippets, product descriptions, QA passages and image captions play important roles in Web and IR applications • Successfully processing short texts is essential for aggregated search systems • Short texts consist of only a few phrases or 2–3 sentences • They present great challenges for clustering

  5. Framework • Example query: "The Dark Knight" • Google snippet: "Jul 18, 2008 ... It is the best American film of the year so far and likely to remain that way. Christopher Nolan's The Dark Knight is revelatory, visceral ..." • [Framework diagram: (1) hierarchical resolution, (2) feature generation, (3) feature selection]

  6. Hierarchical Resolution • Exploits the internal semantics of short texts • Preserves contextual information • Avoids data sparsity • Uses NLP techniques to construct a syntax tree • Decomposes the original text into a three-level, top-down hierarchical structure • Segment level, phrase level and word level

  7. Three Levels • Segment level • The text is split into segments with the help of punctuation • Segments are generally ambiguous • They often fail to convey the exact information needed to represent the short text • Phrase level • Adopt shallow parsing to divide sentences into a series of chunks • Apply stemming and remove stop-words from the phrases • The NP and VP chunks are employed as phrase-level features

  8. Three Levels (contd.) • Word level • Decompose the phrase-level features directly • Build word-level features • Choose the non-stop words contained in the NP and VP chunks • Further remove the meaningless words in the short texts • Original feature set • Select features at the phrase level and word level • Phrase level: contains the original contextual information • Word level: avoids the problem of data sparseness
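A minimal sketch of the three-level decomposition described on slides 6–8, assuming NLTK stands in for the NLP toolkit; the chunk grammar, tokenizer and stop-word list below are illustrative assumptions, not the authors' exact pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('stopwords')
STOP = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Illustrative shallow-parsing grammar; the slides do not specify the chunker used.
chunker = nltk.RegexpParser(r"""
    NP: {<DT>?<JJ.*>*<NN.*>+}
    VP: {<VB.*>+<RB.*>*<JJ.*>*}
""")

def hierarchical_resolution(text):
    # Segment level: split the text on punctuation.
    segments = [s.strip() for s in re.split(r'[.,;:!?]+', text) if s.strip()]
    phrases, words = [], []
    for seg in segments:
        tagged = nltk.pos_tag(nltk.word_tokenize(seg))
        tree = chunker.parse(tagged)
        for sub in tree.subtrees(lambda t: t.label() in ('NP', 'VP')):
            # Phrase level: stemmed, stop-word-free NP/VP chunks.
            phrase = ' '.join(stemmer.stem(w.lower()) for w, _ in sub.leaves()
                              if w.lower() not in STOP)
            if phrase:
                phrases.append(phrase)
                # Word level: the individual non-stop words of each chunk.
                words.extend(phrase.split())
    return segments, phrases, words

print(hierarchical_resolution(
    "Christopher Nolan's The Dark Knight is revelatory, visceral ..."))
```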

  9. Feature Generation • Goal: • Build semantic relationships with other relevant concepts • Example: "The Dark Knight" and "Batman" • Two steps • Select seed phrases from the internal semantics • Generate external features from the seed phrases

  10. Seed Phrase Selection • Use the features at the segment level and phrase level to construct the seed phrases • Redundancy problem • Segment-level feature: "Christopher Nolan's The Dark Knight is revelatory visceral" • Three phrase-level features: [NP Christopher Nolan's], [NP The Dark Knight] and [NP revelatory visceral] • Eliminate information redundancy • Measure the semantic similarity between phrase-level features and their parent segment-level feature

  11. Semantic Similarity Measure • A phrase–phrase semantic similarity measure algorithm • Use co-occurrence double-checking in Wikipedia to reduce semantic duplicates • Download the XML corpus of Wikipedia • Build a Solr index of all XML articles • Let P = {p1, p2, ..., pn} • P: a segment-level feature, pi: a phrase-level feature • InfoScore(pi): the semantic similarity between pi and {p1, p2, ..., pn}

  12. Semantic Similarity Measure (contd.) • Given two phrases pi and pj • Use pi and pj separately as queries to retrieve the top C Wikipedia pages • f(pi): total occurrences of pi in the top C Wikipedia pages retrieved by query pi • f(pi|pj): total occurrences of pi in the top C Wikipedia pages retrieved by query pj • Variants of three popular co-occurrence measures are computed from these counts

  13. Semantic Similarity Measure (contd.) • The similarity scores are normalized into values in the [0, 1] range • The final similarity is a linear combination of the three normalized measures, weighted by α and β (a sketch of the computation follows below)
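A hedged sketch of slides 11–13. The slides do not name the three co-occurrence measures, so Jaccard-, Overlap- and Dice-style variants are assumed here, and `retrieve_top_pages` is a hypothetical wrapper around the Solr index of the Wikipedia dump.

```python
def count_occurrences(phrase, pages):
    """Total occurrences of `phrase` across a list of page texts."""
    return sum(page.lower().count(phrase.lower()) for page in pages)

def pair_similarity(p_i, p_j, retrieve_top_pages, C=100, alpha=1/3, beta=1/3):
    pages_i = retrieve_top_pages(p_i, C)        # top-C pages for query p_i
    pages_j = retrieve_top_pages(p_j, C)        # top-C pages for query p_j
    f_i = count_occurrences(p_i, pages_i)       # f(p_i)
    f_j = count_occurrences(p_j, pages_j)       # f(p_j)
    # "Double check": count each phrase in the pages retrieved by the other.
    co = count_occurrences(p_i, pages_j) + count_occurrences(p_j, pages_i)
    if co == 0 or f_i == 0 or f_j == 0:
        return 0.0
    jaccard = co / (f_i + f_j - co) if f_i + f_j > co else 1.0
    overlap = co / min(f_i, f_j)
    dice = 2 * co / (f_i + f_j)
    gamma = 1 - alpha - beta                    # weights of the linear combination
    # Clamped to [0, 1]; the paper's exact normalisation is not shown on the slides.
    return min(1.0, alpha * jaccard + beta * overlap + gamma * dice)

def info_score(p_i, phrases, retrieve_top_pages):
    """InfoScore(p_i): similarity between p_i and the other phrase-level features."""
    return sum(pair_similarity(p_i, p_j, retrieve_top_pages)
               for p_j in phrases if p_j != p_i)
```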

  14. Semantic Similarity Measure (contd.) • For each segment-level feature • Rank the information scores of its child node features at the phrase level • Remove the phrase-level feature p* that carries the most duplicated information with respect to the segment-level feature P (see the sketch below)
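Continuing the sketch, the pruning step of slide 14. That exactly one phrase is dropped per segment, and that the parent segment itself also becomes a seed phrase, is read from slides 10 and 14; treat these details as assumptions.

```python
def select_seed_phrases(segment_to_phrases, retrieve_top_pages):
    """segment_to_phrases: mapping segment-level feature -> its phrase-level children."""
    seeds = []
    for segment, phrases in segment_to_phrases.items():
        kept = list(phrases)
        if len(kept) > 1:
            scores = {p: info_score(p, kept, retrieve_top_pages) for p in kept}
            p_star = max(scores, key=scores.get)   # most redundant child phrase
            kept.remove(p_star)
        seeds.append(segment)                      # segment-level features are seeds too
        seeds.extend(kept)
    return seeds
```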

  15. Feature Generator • For each seed phrase, retrieve the top w Wikipedia articles • External features are built from: (1) titles and bold terms (links) in the retrieved Wikipedia pages, and (2) key phrases extracted from the Wikipedia pages by the Lingo algorithm • WordNet example: for "in his car", the synsets of "car" give "auto", "automobile", "autocar"
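A sketch of external-feature generation for one seed phrase. `solr_search`, `extract_titles_and_bold` and `lingo_keyphrases` are hypothetical helpers standing in for the Solr query, page parsing and the Lingo algorithm; only the WordNet expansion uses a real API (NLTK's wordnet corpus, requires nltk.download('wordnet')).

```python
from nltk.corpus import wordnet as wn

def wordnet_expansion(seed_phrase):
    """Synset lemma names for the words of a seed phrase,
    e.g. 'car' expands to 'auto', 'automobile', ..."""
    feats = set()
    for word in seed_phrase.split():
        for synset in wn.synsets(word):
            feats.update(lemma.name().replace('_', ' ') for lemma in synset.lemmas())
    return feats

def generate_external_features(seed_phrase, solr_search,
                               extract_titles_and_bold, lingo_keyphrases, w=20):
    pages = solr_search(seed_phrase, rows=w)         # hypothetical: top-w Wikipedia articles
    feats = set()
    for page in pages:
        feats.update(extract_titles_and_bold(page))  # (1) titles and bold terms (links)
        feats.update(lingo_keyphrases(page))         # (2) Lingo key phrases
    feats.update(wordnet_expansion(seed_phrase))     # WordNet synsets, e.g. "in his car" -> "auto"
    return feats
```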

  16. Feature Selection • Overzealous external features • Can have an adverse impact on effectiveness • Dilute the influence of the valuable original information • Empirical rules to refine the unstructured features obtained from Wikipedia pages • Remove too-general seed phrases (more than 10,000 occurrences) • Transform features used for Wikipedia management or administration, e.g. "List of hotels" → "hotels", "List of twins" → "twins" • Phrase sense stemming using the Porter stemmer • Remove features related to chronology, e.g. "year", "decade" and "centuries"
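A sketch of the empirical refinement rules on this slide. The rule order, and the assumption that the 10,000-occurrence threshold is checked against a Wikipedia occurrence count passed in by the caller, are illustrative choices rather than the authors' stated procedure.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
CHRONOLOGY = {'year', 'years', 'decade', 'decades', 'century', 'centuries'}

def refine_external_feature(feature, wiki_occurrences):
    """Returns the cleaned feature, or None if it should be dropped."""
    # Rule 1: drop overly general entries (more than 10,000 occurrences).
    if wiki_occurrences > 10_000:
        return None
    # Rule 2: strip Wikipedia management/administration patterns,
    # e.g. "List of hotels" -> "hotels".
    feature = re.sub(r'^list of\s+', '', feature, flags=re.IGNORECASE)
    # Rule 3: drop chronology-related features ("year", "decade", "centuries", ...).
    if all(tok.lower() in CHRONOLOGY for tok in feature.split()):
        return None
    # Rule 4: phrase sense stemming with the Porter stemmer.
    return ' '.join(stemmer.stem(tok) for tok in feature.lower().split())
```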

  17. External Feature Collection • Construct an (n1 + n2)-dimensional feature space • n1 original features, n2 external features • A parameter θ controls the proportion of external features: θ = 0 → no external features, θ = 1 → no original features • One seed phrase si (0 < i ≤ m) may generate k external features {fi1, fi2, ..., fik} • Select one feature fi* for each seed phrase • The top n2 − m features are then extracted from the remaining external features based on their frequency
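A sketch of assembling the final feature space. The slide only states the boundary behaviour of θ, so the mapping n2 = θ / (1 − θ) · n1 and the frequency-based choice of fi* are illustrative assumptions.

```python
from collections import Counter

def collect_feature_space(original_features, seed_to_external, theta=0.5):
    """seed_to_external: mapping seed phrase s_i -> its candidate external features."""
    originals = list(original_features)
    if theta <= 0:
        return originals          # theta = 0: no external features
    if theta >= 1:
        originals = []            # theta = 1: no original features

    freq = Counter(f for feats in seed_to_external.values() for f in feats)

    # Step 1: keep one representative external feature f_i* per seed phrase
    # (chosen here by frequency; the slide does not state the criterion).
    selected, remaining = [], []
    for feats in seed_to_external.values():
        if not feats:
            continue
        best = max(feats, key=lambda f: freq[f])
        selected.append(best)
        remaining.extend(f for f in feats if f != best)

    # Step 2: fill the remaining n2 - m slots with the most frequent leftovers.
    m = len(selected)
    n2 = int(theta / (1 - theta) * len(originals)) if originals else m + len(remaining)
    remaining.sort(key=lambda f: freq[f], reverse=True)
    selected.extend(remaining[:max(n2 - m, 0)])
    return originals + selected
```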

  18. Evaluation • Datasets • Reuters-21578 • Remove the texts that contain more than 50 words • Filter out clusters with fewer than 5 texts or more than 500 texts • This leaves 19 clusters comprising 879 texts • Web Dataset • Ten hot queries from Google Trends • The top 100 snippets for each query are retrieved • This builds a 10-category Web Dataset with 1,000 texts
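A small sketch of the Reuters filtering step, assuming the corpus is already loaded as a category-to-texts mapping; whitespace word counting is a simplification.

```python
def build_reuters_subset(clusters, max_words=50, min_texts=5, max_texts=500):
    """clusters: mapping from category label to a list of texts."""
    kept = {}
    for label, texts in clusters.items():
        short = [t for t in texts if len(t.split()) <= max_words]  # drop long texts
        if min_texts <= len(short) <= max_texts:                   # drop tiny/huge clusters
            kept[label] = short
    return kept
```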

  19. Clustering Methods • Two clustering algorithms: K-means and EM • Six different text representation methods • BOW (baseline 1): the traditional "bag of words" model with the tf-idf weighting scheme • BOW+WN (baseline 2): BOW integrated with additional features from WordNet • BOW+Wiki (baseline 3): BOW integrated with additional features from Wikipedia • BOW+Know (baseline 4): BOW integrated with additional features from Wikipedia and WordNet • BOF: the bag of original features extracted with the hierarchical view • SemKnow: the proposed framework
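A minimal sketch of the BOW baseline under the two clustering algorithms, assuming scikit-learn (GaussianMixture stands in for EM clustering; the toolkit actually used is not stated on the slides). The other representations would only change the text fed to the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_bow(texts, n_clusters):
    """tf-idf 'bag of words' vectors clustered with K-means and with EM."""
    X = TfidfVectorizer(stop_words='english').fit_transform(texts)
    kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # GaussianMixture (an EM algorithm) needs a dense array.
    em_labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(X.toarray())
    return kmeans_labels, em_labels
```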

  20. Evaluation Criteria • F1 measure • F1 Score = 2*(precision * recall) / (precision + recall) • Average Accuracy
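A short sketch of the two criteria: the F1 formula is taken from the slide, while the accuracy computation (majority-class mapping per cluster) is an assumption about what "average accuracy" means here.

```python
from collections import Counter

def f1(precision, recall):
    """F1 score as on the slide: 2 * (precision * recall) / (precision + recall)."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def average_accuracy(true_labels, cluster_labels):
    """Map each cluster to its majority true class and count correct assignments."""
    members = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        members.setdefault(cluster, []).append(truth)
    correct = sum(Counter(group).most_common(1)[0][1] for group in members.values())
    return correct / len(true_labels)
```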

  21. Performance Evaluation • Parameter settings: C = 100, w = 20, α = β = ⅓, θ = 0.5 • [Results using the K-means algorithm] • [Results using the EM algorithm]

  22. Effect of External Features • With a small external feature set (θ = 0.2 or θ = 0.3), SemKnow achieves the best performance • [Reuters dataset, K-means algorithm] • [Web Dataset, K-means algorithm]

  23. Optimal results using the two algorithms

  24. Detailed Analysis • Feature space for the example snippet
