160 likes | 292 Vues
This paper presents a novel approach for effectively finding approximate repeating patterns in music databases, addressing the challenges posed by traditional methods. Utilizing suffix trees and a combination of pitch and interval strings, it seeks to reduce extraction time while enhancing accuracy. Key definitions, such as edit distance and histogram distance, are introduced, along with an extraction procedure involving candidate generation and pruning. The proposed method demonstrates its efficiency and effectiveness through experimentation and detailed examples.
E N D
An Efficient Approach to Extracting Approximate Repeating Patterns in Music Databases Ning-Han Liu, Yi-Hung Wu, Arbee L.P. Chen Proceeding of International Conference on Database Systems for Advanced Applications (DASFAA 05) Speaker: Pei-Min Chou Date:2006/4/21
Introduction • Problem finding all the repeating patterns • exact repeating pattern • Suffix tree • String-join • Similarity pattern • costs a lot of time
Pitch String: 67,64,64, 65,62,62, 60,62,64,65, 67,67,67 Interval String: -3, 0, +1, -3, 0, -2, +2,+2,+1, +2, 0, 0 Definition • Pitch string: P=(p1,p2,…pm), ordered list, |P|=m • Interval string: D=(d1,d2,…dm-1), di=pi+1-pi • Interval segment: S[i:j]=(di,di+1,…,dj) • Constraints: Min_len , Max_len • Ex: String:(a,b,c,d) Min_len=2, Max_len=3 Segment (min_len): (a,b),(b.c),(c,d) Segment (max_len):(a,b,c),(b.c.d)
3 Def. Similar • Edit distance: edit(P,Q)=distance of transform segment P into segment Q • Distance threshold:δP=|P|*γ • |P|:segment length, γ:distance threshold ratio, γ<1 • Similar segment: if edit(P,Q)≤ δP • Q length is at least=|P|- δP • Ex:γ:50% • P=(1,2,3,1,1,2,3,1),Q=(1,3,3,1,2,3,1),edit(P,Q)=2 • δP=8*0.5=4 • edit(P,Q)=2 <4Q is similar to P
Def. Histogram vector • Histogram vector: • D={a1,a2,…an} • HV(S) = <h1S,h2S,….,hnS>, hkS: count of ak • Ex: • Segment S=(1,2,3,1,1,2,3,1) • D={1,2,3} • HV(S)=<4,2,2> • |HV(S)|=8 4 3 2 1 0 1 2 3
Def. Histogram distance • Histogram distance • HD(S1,S2)=max(ins(HV(S1), HV(S2)), ins(HV(S2),HV(S1))) • Ex1: • S1=(1,2,3,1,1,2,3,1) • S2=(1,3,3,1,2,3,1) • S1=S2={1,2,3} • HV(S1)=<4,2,2> ins(HV(S1), HV(S2))=1 • HV(S2)=<3,1,3> ins(HV(S2), HV(S1))=2 • Ex2: • S1=(1,2,3,1,1,2,3,1,3,4,5,3,4,5) • S2=(1,3,1,2,3,1,4,5,2,4,4) • edit(S1,S2)=5, HD(S1,S2)=4 ->HD(S1,S2)=2
Def. ARP • Extension P (Ext(P)) : • pivot P and all similar segments S • Each of segments satisfy overlapping threshold • Ex: P=(a,b,c,d) and min_len=2,max_len=3,δP=1 • segments=((a,b,d)(a,b,c)(b,c,d)(a,b,c,b)(a,b,c,d)) • Support: • The number of segments in an extension= |Ext(P)| • Approximate Repeating Pattern(APR) • |Ext(P)|≥min_sup
Def. Overlap • Overlapping degree: • I[a:b] and J[c:d], a≤c≤b and I similar to J • Overlapping degree=(b-c+1) • Ex:12122212 • Overlapping threshold: • OIJ=min(|I|,|J|)*ρ • ρ:overlapping threshold ratio,0≤ρ≤1 a c b d Degree=(7-4+1)=4
Extraction procedure • Index construction • Candidate generation • Range query formulation • MBR retrieval • Estimation for max number of similar segments • Candidate pruning after HD computations • Candidate pruning after HD computations • ARP extraction
S2 S5,S6 S7 S1,S3 S4 I=(0,0)(2,2) MBR(R1) R1 R3 RM pair={(1:5,2)} R2 Child-p2 I=(1,0)(2,1) I=(0,2)(1,2) R2 R3 RM pair={(1:2,2),(3:5,2)} RM pair={(1:4,2)} S1,S3,S4,S7 S2,S5,S6 Index construction • Construct parametric R*-tree • P(1,2,2,1,1) DP={1,2} • Segment len_2:S1(1,2),S2(2,2),S3(2,1),S4(1,1) • Segment len_3:S5(1,2,2),S6(2,2,1),S7(2,1,1) • Minimal bounding rectangle (MBR) • RM pair :(R,M), R: union range ,M: minimal length Level1 Level2 Segment level Dimention2 Child-p1 Dimention1 S1(1,2)->HV(S1)=<1,1> S2(2,2)->HV(S2)=<0,2>
S2 S5,S6 S7 S1,S3 R1 R3 S4 R2 Candidate generation • γ=0.5 , ρ=0.5 • Range query formulation • S7=P=(2,1,1) , Dp={1,2} , HV(P)=<2,1> • δP=3*0.5=1.5 • Range query=(HV(P), δP)=(<2,1>,1) • MBR retrieval • R1 and R2 included
Estimation for max number of similar segments • MLx=max(|P|-δP,Mx) • Mx :Each RM pair(Rx,Mx) in MBR • Numx:(n-L)/(L-m)+1,n=b-a+1,m=ρ*MLx • NumR: sum max num of MBR • Ex(Cont.) P=(2,1,1) • RM pair in R2=(1:2,2),(3:5,2) • ML1=max(3-1,2)=2, ML2=2 ,m=0.5*2=1 • num1=(2-2)/(2-1)+1=1,num2=1 • numR=1+1=2
m m m m m m m m Numx explain Numx:(n-L)/(L-m)+1 ,n=b-a+1,m=ρ*MLx =((b-a+1-L)/(L-m)) +1 L-m L a b n
Candidate pruning before & after HD computation • Before HD computation • Processed at level above segment level • If numR<min_sup ->prune and go next query • Else (numR≥min_sup)->recursive lower level • After HD computation • Processed at segment level • Compute the Hdistance between P and segment in MBRs. • All segment satisfy δPpermutated and compute max_num • If max_num<min_sup ->prune • Else(max_num≥min_sup) ->candidate ARP and segments
ARP extraction • Given a candidate ARP and its candidate segments • compute the edit distance • If candidate ARP and segments distance> threshold ->remove • Generate all the extention of candidate ARP by consider overlapping threshold • If support of extension < min_sup ->prune • Else Find a ARP
Experiment 1614 12 10 8 6 4 2 0 14 12 10 8 6 4 2 0 20 18 16 14 12 10 8 6 4 Min_len (interval number) 10 12 14 16 18 20 22 24 26 28 30 Max_len (interval number) 181614 12 10 8 6 4 2 0 20 15 10 5 0 0 5 10 15 20 25 30 35 40 45 50 distance threshold(%) 10 9 8 7 6 5 4 3 2 min_sup