
7. Sequence Mining



Presentation Transcript


  1. 7. Sequence Mining
  • Sequences and Strings
  • Recognition with Strings
  • MM & HMM
  • Sequence Association Rules
  Data Mining by H. Liu, ASU

  2. Sequences and Strings
  • A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence
  • "Sequence" and "string" are often used interchangeably
  • String elements (characters, letters, or symbols) are nominal
  • A particularly long string is called a text
  • |x| denotes the length of sequence x, e.g., |AGCTTC| = 6
  • Any contiguous string that is part of x is called a substring, segment, or factor of x, e.g., GCT is a factor of AGCTTC (see the sketch below)
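A minimal Python sketch of these definitions (the helper name is_factor is an illustrative choice, not from the slides):

    x = "AGCTTC"
    print(len(x))                # |x| = 6

    def is_factor(f, x):
        """A factor is any contiguous substring of x; Python's `in`
        operator tests exactly that for strings."""
        return f in x

    print(is_factor("GCT", x))   # True
    print(is_factor("GTA", x))   # False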

  3. Recognition with Strings
  • String matching: given x and a text, determine whether x is a factor of the text
  • Edit distance: given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y

  4. String Matching
  • Given |text| >> |x|, each character is drawn from an alphabet A
  • A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}
  • A shift s is the offset needed to align the first character of x with character number s+1 in the text
  • Find whether there exists a valid shift, i.e., one where every character of x matches the corresponding character of the text

  5. Naïve String Matching
  • Given alphabet A, pattern x, and text, with n = |text| and m = |x|:

    # Try every shift s, advancing one position at a time
    def naive_match(x, text):
        n, m = len(text), len(x)
        for s in range(n - m + 1):
            if text[s:s+m] == x:
                print("pattern occurs at shift", s)

  • Time complexity (worst case): O((n-m+1)m)
  • Shifting one character at a time is not necessary

  6. Boyer-Moore String Matching
  • Given A, x, text, n = |text|, m = |x|
  • F(x) = last-occurrence function; G(x) = good-suffix function

    s = 0
    while s ≤ n - m:
        j = m
        while j > 0 and x[j] = text[s+j]:
            j = j - 1
        if j = 0:
            print "pattern occurs at shift" s
            s = s + G(0)
        else:
            s = s + max(G(j), j - F(text[s+j]))

  • A runnable sketch of the last-occurrence heuristic follows below
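Implementing the good-suffix function G is fairly involved; a common simplification keeps only the last-occurrence (bad-character) heuristic F, giving the Boyer-Moore-Horspool variant. A minimal sketch, assuming a non-empty pattern (the name horspool_match is illustrative):

    def horspool_match(x, text):
        """Boyer-Moore-Horspool: Boyer-Moore restricted to the
        last-occurrence heuristic; the good-suffix function is omitted."""
        n, m = len(text), len(x)
        # Shift table: distance from the last occurrence of each
        # character of x[:-1] to the end of the pattern
        shift = {c: m - 1 - i for i, c in enumerate(x[:-1])}
        s = 0
        while s <= n - m:
            if text[s:s+m] == x:
                print("pattern occurs at shift", s)
            # Jump by the shift of the text character aligned with the
            # last pattern position (m if that character is not in x)
            s += shift.get(text[s + m - 1], m)

    horspool_match("GCT", "AGCTTCGCT")   # shifts 1 and 6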

  7. Edit Distance
  • The edit distance (ED) between x and y is the minimum number of fundamental operations required to transform x into y
  • Fundamental operations (x = 'excused', y = 'exhausted'):
  • Substitution: 'c' is replaced by 'h'
  • Insertion: 'a' is inserted into x after 'h'
  • Deletion: a character of x is deleted
  • ED is one way of measuring the similarity between two strings

  8. Classification Using ED
  • The nearest-neighbor algorithm can be applied for pattern recognition
  • Training: strings are stored along with their class labels
  • Classification (testing): a test string is compared to each stored string, an ED is computed, and the label of the nearest stored string is assigned to the test string
  • The key is how to calculate ED; a worked sketch of calculating ED follows below
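A minimal dynamic-programming sketch of the edit-distance computation (the name edit_distance is illustrative); on the slide's example it returns 3, i.e., one substitution plus two insertions:

    def edit_distance(x, y):
        """Minimum number of insertions, deletions, and substitutions
        needed to transform x into y."""
        m, n = len(x), len(y)
        # d[i][j] = edit distance between x[:i] and y[:j]
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                       # delete all of x[:i]
        for j in range(n + 1):
            d[0][j] = j                       # insert all of y[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if x[i-1] == y[j-1] else 1
                d[i][j] = min(d[i-1][j] + 1,          # deletion
                              d[i][j-1] + 1,          # insertion
                              d[i-1][j-1] + cost)     # substitution/match
        return d[m][n]

    print(edit_distance("excused", "exhausted"))   # 3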

  9. Hidden Markov Model
  • Markov model: transitions between states
  • Hidden Markov model: additional visible states
  • Evaluation
  • Decoding
  • Learning

  10. Markov Model
  • The Markov property: given the current state, the transition probability is independent of any previous states
  • A simple Markov model:
  • State ω(t) at time t
  • Sequence of length T: ω^T = {ω(1), ω(2), …, ω(T)}
  • Transition probability: P(ωj(t+1) | ωi(t)) = aij
  • It is not required that aij = aji (see the sketch below)
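Under the Markov property, the probability of a state sequence is the product of one-step transition probabilities. A minimal sketch (the matrix values and uniform start are illustrative assumptions):

    import numpy as np

    # aij = P(state j at t+1 | state i at t); note a[0][1] != a[1][0]
    a = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.4, 0.1, 0.5]])
    initial = np.array([1/3, 1/3, 1/3])      # assumed uniform start

    def sequence_prob(states, a, initial):
        """P(omega^T) = P(omega(1)) * product of transition probs."""
        p = initial[states[0]]
        for i, j in zip(states, states[1:]):
            p *= a[i][j]
        return p

    print(sequence_prob([0, 1, 1, 2], a, initial))   # 1/3 * 0.2 * 0.5 * 0.2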

  11. Hidden Markov Model
  • Visible states: V^T = {v(1), v(2), …, v(T)}
  • Probability of emitting a visible state vk(t): P(vk(t) | ωj(t)) = bjk
  • Only the visible states vk(t) are accessible; the hidden states ωi(t) are unobservable (see the sketch below)
  • A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state
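A minimal sketch of generating a visible sequence from an HMM; the transition matrix a and emission matrix b are illustrative assumptions. Only the emitted symbols would be observed, while the hidden states are discarded:

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([[0.6, 0.4],      # aij: hidden-state transitions
                  [0.3, 0.7]])
    b = np.array([[0.9, 0.1],      # bjk: P(visible k | hidden j)
                  [0.2, 0.8]])

    def sample_visible(a, b, T, start=0):
        """Walk the hidden chain for T steps, emitting one visible
        symbol per step; return only the visible sequence."""
        state, visible = start, []
        for _ in range(T):
            visible.append(rng.choice(len(b[state]), p=b[state]))
            state = rng.choice(len(a[state]), p=a[state])
        return visible

    print(sample_visible(a, b, 10))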

  12. Three Key Issues with HMMs
  • Evaluation: given an HMM, complete with transition probabilities aij and emission probabilities bjk, determine the probability that a particular sequence of visible states V^T was generated by that model (a forward-algorithm sketch follows this slide)
  • Decoding: given an HMM and a sequence of observations V^T, determine the most likely sequence of hidden states ω^T that led to V^T
  • Learning: given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk
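The evaluation problem is commonly solved with the forward algorithm, which sums over all hidden paths in O(T c^2) time for c hidden states. A minimal sketch with illustrative aij, bjk, and initial distribution (all assumptions, not from the slides):

    import numpy as np

    a = np.array([[0.6, 0.4],      # aij: hidden-state transitions
                  [0.3, 0.7]])
    b = np.array([[0.9, 0.1],      # bjk: P(visible k | hidden j)
                  [0.2, 0.8]])
    pi = np.array([0.5, 0.5])      # assumed initial state distribution

    def forward(obs, a, b, pi):
        """P(V^T | model): alpha[j] holds the probability of the
        observations so far ending in hidden state j."""
        alpha = pi * b[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ a) * b[:, o]
        return alpha.sum()

    print(forward([0, 1, 1, 0], a, b, pi))   # probability of this V^T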

  13. Sequence Association Rule Mining
  • SPADE (Sequential Pattern Discovery using Equivalence classes)
  • Constrained sequence mining (SPIRIT)

  14. Bibliography
  • R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification, 2nd Edition. Wiley-Interscience.
