
Hamming Distance

Hamming distance: very efficient, but defined only for strings of the same length. It simply counts the number of positions at which the characters differ. Won't help us much. Levenshtein distance: it measures distance in terms of the number of "operations" required to transform one string into another.

joy

Presentation Transcript


  1. Hamming Distance • Very efficient, but defined only for strings of the same length. • It simply counts the number of positions at which the corresponding characters differ. • Won't help us much.
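The definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the function name `hamming_distance` and the choice to raise an error on unequal lengths are assumptions.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Hamming distance is undefined for strings of different lengths.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# "karolin" and "kathrin" differ at three positions (r/t, o/h, l/r).
print(hamming_distance("karolin", "kathrin"))
```

The single pass over both strings is what makes the measure cheap, and the length check shows why it is of limited use for comparing sentences of different sizes.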

  2. Levenshtein distance • It measures distance in terms of the number of "operations" required to transform one string into another. • These operations are insertion, deletion and substitution. • Damerau-Levenshtein distance also includes transposition of adjacent characters. • This may be useful for spelling correction, but I am not sure how efficient it will be in our case.
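As a rough sketch of how the operation count can be computed, the following uses the standard dynamic-programming recurrence with unit costs, keeping only two rows of the table. The function name `levenshtein_distance` is an assumption for illustration; the transposition step of Damerau-Levenshtein is deliberately left out.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                      # delete ch_a
                current[j - 1] + 1,                   # insert ch_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))
```

The running time is proportional to the product of the two string lengths, which is the cost to weigh when judging whether it is efficient enough here.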

  3. Needleman-Wunsch • This algorithm is similar to Levenshtein distance with weighted edit costs (separate scores for matches, mismatches and gaps); it is used in biology. • Mainly used for sequence alignment. • So we most likely don't need it.

  4. Smith–Waterman algorithm • Like the Needleman-Wunsch algorithm, this is mainly used for alignment (local rather than global). • It is also used in biology. • Gotoh distance is also used to find alignments.

  5. Jaro-Winkler Similarity • The order of occurrence is an essential factor in determining similarity. • For instance, in the strings "martha" and "marhta" every character counts as a match, because the transposed "th" and "ht" are within 2 characters of each other; the transposition itself lowers the score slightly. • The more transpositions found between the two strings, the smaller the overall matching weight.
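The "martha" / "marhta" example can be traced with a small sketch of the usual Jaro and Jaro-Winkler definitions. The function names, the prefix limit of 4 and the scaling factor of 0.1 follow the commonly quoted formulation and are assumptions here, not details taken from the slides.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        start = max(0, i - window)
        end = min(len(b), i + window + 1)
        for j in range(start, end):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 characters)."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


print(round(jaro_similarity("martha", "marhta"), 4))
print(round(jaro_winkler_similarity("martha", "marhta"), 4))
```

For "martha" and "marhta" all six characters match and there is one transposition, so the Jaro score is (1 + 1 + 5/6) / 3, roughly 0.944; the shared prefix "mar" raises the Jaro-Winkler score to roughly 0.961.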

  6. Matching coefficient • This is essentially the same idea as Hamming distance with one change: position is not important. • It simply counts the number of terms the two inputs share. • |a ∩ b| - It doesn't take into account the sizes of a and b. • Some metrics use the same count but also take the sizes of a and b into account; they are described on the following slides. • Any one of these may be helpful for us.

  7. Jaccard coefficient • Each sentence is tokenized into words, and the words of one sentence are compared with the words of the other. • |a ∩ b| / |a ∪ b| • This is one of the most efficient algorithms. • The overlap coefficient is similar, with a slightly modified formula: |a ∩ b| / min(|a|, |b|)
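The set-based formulas on slides 6 and 7 can be sketched as follows. The tokenizer (lower-casing and splitting on whitespace), the function names and the choice to return 0.0 when a denominator would be zero are all assumptions made for illustration.

```python
def tokenize(sentence: str) -> set:
    """Hypothetical tokenizer: lower-case the sentence and split on whitespace."""
    return set(sentence.lower().split())


def matching_coefficient(a: str, b: str) -> int:
    """|a ∩ b|: number of shared words, ignoring the sizes of a and b."""
    return len(tokenize(a) & tokenize(b))


def jaccard_coefficient(a: str, b: str) -> float:
    """|a ∩ b| / |a ∪ b|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # both sentences empty; returning 0.0 is an arbitrary choice
    return len(tokens_a & tokens_b) / len(union)


def overlap_coefficient(a: str, b: str) -> float:
    """|a ∩ b| / min(|a|, |b|)."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    smaller = min(len(tokens_a), len(tokens_b))
    if smaller == 0:
        return 0.0  # one sentence empty; returning 0.0 is an arbitrary choice
    return len(tokens_a & tokens_b) / smaller


print(matching_coefficient("the cat sat", "the cat ran"))
print(jaccard_coefficient("the cat sat", "the cat ran"))
print(overlap_coefficient("the cat", "the cat sat down"))
```

With these inputs the two sentences share "the" and "cat": the Jaccard coefficient is 2/4, while the overlap coefficient reaches 1.0 whenever the shorter sentence is fully contained in the longer one.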

  8. Sørensen Similarity • Based on the same word sets as Jaccard similarity, with a different formula. • Similarity = 2 * |a ∩ b| / (|a| + |b|) • This is identical to Dice's coefficient. • These may all be considered normalised versions of the simple matching coefficient.
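A minimal sketch of the Sørensen (Dice) formula, under the same assumptions as before: words are obtained by lower-casing and splitting on whitespace, and the function name `dice_coefficient` is hypothetical.

```python
def dice_coefficient(a: str, b: str) -> float:
    """2 * |a ∩ b| / (|a| + |b|) over the sets of words in each sentence."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    total = len(tokens_a) + len(tokens_b)
    if total == 0:
        return 0.0  # both sentences empty; returning 0.0 is an arbitrary choice
    return 2 * len(tokens_a & tokens_b) / total


# Two shared words out of 3 + 3 gives 2 * 2 / 6.
print(dice_coefficient("the cat sat", "the cat ran"))
```

For the same pair of sentences the Jaccard coefficient is 0.5, and the two values are related by Dice = 2J / (1 + J), so they always rank pairs of sentences in the same order.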

  9. Other metrics • Other metrics such as SFS, Tau, confusion probability, skew divergence, cosine and TF-IDF are either not useful for us or require heavy computation that is not feasible in our case.
