
Text Similarity & Clustering



  1. Text Similarity & Clustering Qinpei Zhao 15.Feb.2011

  2. Outline • String matching metrics • Implementation and applications • Online Resources • Location-based clustering

  3. String Matching Metrics

  4. Exact String Matching • Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T. • Example: T = “AGCTTGA”, P = “GCT” • Applications: • Searching keywords in a file • Search engines (like Google) • Database searching
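This is not on the slides, but a minimal Java sketch of the naive approach may help: slide every alignment of P across T and compare characters. It runs in O(nm) worst case; classic algorithms such as Knuth-Morris-Pratt or Boyer-Moore improve on this.

    public class ExactMatch {
        // Report every index where pattern P occurs in text T (naive O(nm) scan).
        static java.util.List<Integer> occurrences(String T, String P) {
            java.util.List<Integer> hits = new java.util.ArrayList<>();
            for (int i = 0; i + P.length() <= T.length(); i++) {
                if (T.regionMatches(i, P, 0, P.length())) hits.add(i);
            }
            return hits;
        }
        public static void main(String[] args) {
            System.out.println(occurrences("AGCTTGA", "GCT")); // prints [1]
        }
    }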

  5. Approximate String Matching • Determine whether a text string T of length n and a pattern string P of length m “partially” match. • Consider the string “approximate”. Which of these are partial matches? aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate • A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2) • A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa) • The former case is known as a revise (substitution) difference, the latter is a delete or insert difference. • What about two characters that appear out of position? For instance, approximate vs. apporximate?

  6. Approximate String Matching • Example query: “Schwarrzenger” (a misspelling of “Schwarzenegger”) • Query errors: limited knowledge about the data; typos; limited input device (cell phone) • Data errors: typos; web data; OCR • Similarity functions: edit distance, q-gram, cosine, … • Applications: spellchecking, query relaxation, …

  7. Edit distance (Levenshtein distance) • Given two strings T and P, the edit distance is the minimum number of substitutions, insertions and deletions that will transform T into P. • Time complexity by dynamic programming: O(mn)

  8. Edit distance (Wagner & Fischer, 1974) Dynamic programming: m[i][j] = min{ m[i-1][j]+1, m[i][j-1]+1, m[i-1][j-1]+d(i,j) }, where d(i,j) = 0 if T[i] = P[j] and d(i,j) = 1 otherwise
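A Java sketch of this dynamic program (a direct transcription of the recurrence, not code from the presentation):

    public class EditDistance {
        // dp[i][j] = edit distance between the first i chars of t and the first j chars of p.
        static int editDistance(String t, String p) {
            int n = t.length(), m = p.length();
            int[][] dp = new int[n + 1][m + 1];
            for (int i = 0; i <= n; i++) dp[i][0] = i; // delete all i characters
            for (int j = 0; j <= m; j++) dp[0][j] = j; // insert all j characters
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int d = (t.charAt(i - 1) == p.charAt(j - 1)) ? 0 : 1; // d(i,j)
                    dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1,   // deletion
                                                 dp[i][j - 1] + 1),  // insertion
                                        dp[i - 1][j - 1] + d);       // substitution
                }
            }
            return dp[n][m];
        }
        public static void main(String[] args) {
            System.out.println(editDistance("bingo", "going")); // prints 3
        }
    }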

  9. Q-grams • A q-gram is a substring of fixed length q [figure: a 2-gram window sliding over “bingo”] • Count filter: if ed(T, P) <= k, then the number of grams common to T and P is at least (# of T's grams) − k·q

  10. Q-grams T = “bingo”, P = “going” gram1 = {#b, bi, in, ng, go, o#} gram2 = {#g, go, oi, in, ng, g#} Unique(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, oi, g#} gram1.length = (T.length + (q - 1) * 2 + 1) - q gram2.length = (P.length + (q - 1) * 2 + 1) - q L = gram1.length + gram2.length Similarity = (L - (# of non-shared grams)) / L = 2 · (# of common grams) / L Here the common grams are {in, ng, go}, so Similarity = (12 - 6) / 12 = 0.5
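A Java sketch of this computation (method names are mine, not from the slides); grams are padded with '#' on both ends as in the example:

    public class QGram {
        // q-grams of s with (q-1) '#' pads on each end: n + q - 1 grams in total.
        static java.util.List<String> grams(String s, int q) {
            StringBuilder pad = new StringBuilder();
            for (int i = 0; i < q - 1; i++) pad.append('#');
            String padded = pad + s + pad.toString();
            java.util.List<String> out = new java.util.ArrayList<>();
            for (int i = 0; i + q <= padded.length(); i++) out.add(padded.substring(i, i + q));
            return out;
        }
        // Similarity = 2 * (shared grams) / (total grams), as derived on the slide.
        static double similarity(String t, String p, int q) {
            java.util.List<String> g1 = grams(t, q);
            java.util.List<String> rest = new java.util.ArrayList<>(grams(p, q));
            int L = g1.size() + rest.size(), common = 0;
            for (String g : g1) if (rest.remove(g)) common++; // multiset intersection
            return 2.0 * common / L;
        }
        public static void main(String[] args) {
            System.out.println(similarity("bingo", "going", 2)); // prints 0.5
        }
    }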

  11. Cosine similarity • For two vectors A and B, the angle θ between them is given by the dot product and magnitudes: cos θ = (A · B) / (‖A‖ ‖B‖) • Set-of-words implementation: Cosine similarity = (Common Terms) / (sqrt(Number of terms in String1) * sqrt(Number of terms in String2))

  12. Cosine similarity T = “bingo right”, P = “going right” T1 = {bingo, right}, P1 = {going, right} L1 = unique(T1).length; L2 = unique(P1).length; Unique(T1 ∪ P1) = {bingo, right, going} L3 = Unique(T1 ∪ P1).length; Common terms = (L1 + L2) - L3 = (2 + 2) - 3 = 1 Similarity = common terms / (sqrt(L1) * sqrt(L2)) = 1 / (sqrt(2) * sqrt(2)) = 0.5
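A Java sketch of this set-of-words cosine (helper names are mine):

    public class CosineSim {
        // Cosine over word sets: |A ∩ B| / (sqrt(|A|) * sqrt(|B|)).
        static double cosine(String t, String p) {
            java.util.Set<String> a = new java.util.HashSet<>(java.util.Arrays.asList(t.split("\\s+")));
            java.util.Set<String> b = new java.util.HashSet<>(java.util.Arrays.asList(p.split("\\s+")));
            java.util.Set<String> union = new java.util.HashSet<>(a);
            union.addAll(b);
            int common = a.size() + b.size() - union.size(); // (L1 + L2) - L3
            return common / (Math.sqrt(a.size()) * Math.sqrt(b.size()));
        }
        public static void main(String[] args) {
            System.out.println(cosine("bingo right", "going right")); // prints 0.5
        }
    }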

  13. Dice coefficient • Similar to cosine similarity • Dice's coefficient = (2 * Common Terms) / (Number of terms in String1 + Number of terms in String2)
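For comparison, a short sketch of Dice's coefficient over the same word sets; only the normalisation differs from cosine:

    public class DiceSim {
        // Dice = 2 * |A ∩ B| / (|A| + |B|).
        // For "bingo right" vs "going right": 2 * 1 / (2 + 2) = 0.5.
        static double dice(java.util.Set<String> a, java.util.Set<String> b) {
            java.util.Set<String> inter = new java.util.HashSet<>(a);
            inter.retainAll(b); // common terms
            return 2.0 * inter.size() / (a.size() + b.size());
        }
    }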

  14. Implementation & Applications

  15. Similarity metrics • Edit distance • Q-gram • Cosine similarity • Dice coefficient • … • Demo: similarity between two strings

  16. Applications in MOPSI • Cleaning duplicated records • Spelling check (e.g. “Communication” & “comunication”) • Query relevance/expansion • Text-level annotation recommendation * • Keyword clustering * • MOPSI search engine **

  17. Annotation recommendation [screenshot of the recommendation demo; response time ~500 ms]

  18. String clustering • The similarity between every string pair is calculated as the basis for determining the clusters • The vector model is used for clustering • A similarity measure is required to calculate the similarity between two strings

  19. String clustering (Cont.) • The final step in creating clusters is to determine when two objects (words) belong in the same cluster • Hierarchical agglomerative clustering (HAC) – start with unclustered items and perform pair-wise similarity measures to determine the clusters (a minimal sketch follows below) • Hierarchical divisive clustering – start with one cluster and break it down into smaller clusters
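A minimal single-linkage HAC sketch in Java (my illustration, not MOPSI code). It reuses the q-gram similarity sketched earlier and merges until no cluster pair reaches a similarity threshold:

    public class Hac {
        // Merge the most similar cluster pair until no pair reaches the threshold.
        static java.util.List<java.util.List<String>> cluster(String[] items, double threshold) {
            java.util.List<java.util.List<String>> clusters = new java.util.ArrayList<>();
            for (String s : items) {
                java.util.List<String> c = new java.util.ArrayList<>();
                c.add(s);
                clusters.add(c);
            }
            while (true) {
                int bi = -1, bj = -1;
                double best = threshold;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double sim = linkage(clusters.get(i), clusters.get(j));
                        if (sim >= best) { best = sim; bi = i; bj = j; }
                    }
                if (bi < 0) return clusters;                  // nothing close enough left
                clusters.get(bi).addAll(clusters.remove(bj)); // merge the closest pair
            }
        }
        // Single linkage: similarity of the two most similar members.
        static double linkage(java.util.List<String> a, java.util.List<String> b) {
            double best = 0;
            for (String s : a) for (String t : b) best = Math.max(best, QGram.similarity(s, t, 2));
            return best;
        }
    }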

  20. Objectives of a Hierarchy of Clusters • Reduce the overhead of search • Perform top-down searches of the centroids of the clusters in the hierarchy and trim branches that are not relevant • Provide a visual representation of the information space • Visual cues on the size of clusters (size of ellipse) and the strength of the linkage between clusters (dashed line, solid line, …) • Expand the retrieval of relevant items • A user, once having identified an item of interest, can request to see other items in the cluster • The user can increase the specificity of items by going to child clusters, or increase the generality of items by going to a parent cluster

  21. Keyword clustering (semantic) • Thesaurus-based: WordNet • An advanced web interface to browse the WordNet database • Thesauri are not available for every language, e.g. Finnish • Example

  22. Resources

  23. Useful resources • Similarity metrics (http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html ) • Similarity metrics (javascript) (http://cs.joensuu.fi/~zhao/Link/ ) • Flamingo package (http://flamingo.ics.uci.edu/releases/4.0/ ) • WordNet (http://wordnet.princeton.edu/wordnet/related-projects/ )

  24. Location-based clustering

  25. DBSCAN – density-based clustering (KDD'96) • Parameters: MinPts, eps • Time complexity: O(log n) per getNeighbours query (with a spatial index), O(n log n) total • Advantages: clusters of arbitrary shape; noise is handled explicitly
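A compact DBSCAN sketch over 2-D points in Java (my transcription of the published algorithm, not the MOPSI implementation). The brute-force neighbour scan here is O(n) per query; the O(log n) figure above assumes a spatial index such as an R*-tree:

    public class Dbscan {
        static final int NOISE = -1, UNSEEN = 0;
        // Returns a cluster id (1..k) per point, or NOISE (-1).
        static int[] cluster(double[][] pts, double eps, int minPts) {
            int[] label = new int[pts.length]; // all UNSEEN initially
            int cluster = 0;
            for (int p = 0; p < pts.length; p++) {
                if (label[p] != UNSEEN) continue;
                java.util.List<Integer> seeds = neighbours(pts, p, eps);
                if (seeds.size() < minPts) { label[p] = NOISE; continue; } // not a core point
                label[p] = ++cluster;
                for (int i = 0; i < seeds.size(); i++) {          // expand the cluster
                    int q = seeds.get(i);
                    if (label[q] == NOISE) label[q] = cluster;    // border point, do not expand
                    if (label[q] != UNSEEN) continue;
                    label[q] = cluster;
                    java.util.List<Integer> qn = neighbours(pts, q, eps);
                    if (qn.size() >= minPts) seeds.addAll(qn);    // q is also a core point
                }
            }
            return label;
        }
        // Brute-force epsilon-neighbourhood (the point itself included).
        static java.util.List<Integer> neighbours(double[][] pts, int p, double eps) {
            java.util.List<Integer> out = new java.util.ArrayList<>();
            for (int q = 0; q < pts.length; q++)
                if (Math.hypot(pts[p][0] - pts[q][0], pts[p][1] - pts[q][1]) <= eps) out.add(q);
            return out;
        }
    }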

  26. DBSCAN result [maps: clustering results around Joensuu (29.76, 62.60) and Helsinki (24, 60)]

  27. Gaussian Mixture Model • Maximum likelihood estimation (via the Expectation-Maximization algorithm) • Parameters required: number of components, number of iterations • Advantages: probabilistic (fuzzy) memberships
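A minimal EM sketch for a Gaussian mixture in Java (one-dimensional to stay short; the slides fit 2-D locations). The two parameters listed above, the number of components K and the iteration count, drive the loop; the initialisation here is deliberately crude:

    public class GmmEm {
        // Fit a K-component 1-D Gaussian mixture to x with a fixed number of EM iterations.
        static void fit(double[] x, int K, int iters) {
            int n = x.length;
            double[] w = new double[K], mu = new double[K], var = new double[K];
            for (int k = 0; k < K; k++) {                 // crude initialisation
                w[k] = 1.0 / K;
                mu[k] = x[k * (n - 1) / Math.max(1, K - 1)];
                var[k] = 1.0;
            }
            double[][] r = new double[n][K];              // responsibilities p(k | x_i)
            for (int it = 0; it < iters; it++) {
                for (int i = 0; i < n; i++) {             // E-step
                    double sum = 1e-12;
                    for (int k = 0; k < K; k++) { r[i][k] = w[k] * gauss(x[i], mu[k], var[k]); sum += r[i][k]; }
                    for (int k = 0; k < K; k++) r[i][k] /= sum;
                }
                for (int k = 0; k < K; k++) {             // M-step
                    double nk = 1e-12, m = 0, v = 0;
                    for (int i = 0; i < n; i++) { nk += r[i][k]; m += r[i][k] * x[i]; }
                    mu[k] = m / nk;
                    for (int i = 0; i < n; i++) v += r[i][k] * (x[i] - mu[k]) * (x[i] - mu[k]);
                    var[k] = Math.max(v / nk, 1e-6);      // keep the variance from collapsing
                    w[k] = nk / n;
                }
            }
            for (int k = 0; k < K; k++)
                System.out.printf("component %d: weight %.3f, mean %.3f, var %.3f%n", k, w[k], mu[k], var[k]);
        }
        static double gauss(double x, double mu, double var) {
            return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
        }
    }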

  28. GMMs [map: GMM clustering result around Joensuu (29.76, 62.60) and Helsinki (24, 60)]

  29. GMMs [map: GMM clustering result around Joensuu (29.76, 62.60) and Helsinki (24, 60)]

  30. My activity area

  31. Thanks!
