Synthesizing High-Frequency Rules from Different Data Sources Xindong Wu and Shichao Zhang IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003
Pre-work • Knowledge management • Knowledge discovery • Data mining • Data warehouse
Knowledge Management • Building data warehouses through knowledge management
Knowledge Discovery and Data Mining • Data mining is a tool for knowledge discovery
Why data mining • If a supermarket manager, Simon, wants to arrange commodities in his supermarket, how should he place them to increase revenue and customer convenience? • If a customer buys milk, then he is likely to buy bread too, so...
Why data mining • Before long, if Simon wants to send advertising letters to customers, accounting for individual differences becomes an important task. • Mary always buys diapers and milk powder; she may have a baby, so...
The role of data mining • Preprocessed data → useful patterns → knowledge and strategy
Mining association rules • Example: IF bread is bought THEN milk is bought (bread → milk)
Mining steps • Step 1: define minsup and minconf (e.g., minsup = 50%, minconf = 50%) • Step 2: find large itemsets • Step 3: generate association rules
Example • Large itemsets (the table appears as a figure in the original slides)
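To make steps 2 and 3 concrete, here is a minimal Python sketch on a toy transaction set; the data and function names are ours, not the paper's, and the itemset search is brute force rather than Apriori:

```python
from itertools import combinations

# Toy transactions, purely illustrative.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def find_large_itemsets(transactions, minsup):
    """Step 2: enumerate all itemsets whose support reaches minsup.
    (Brute force for clarity; Apriori would prune candidates.)"""
    items = set().union(*transactions)
    large = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            s = support(set(cand), transactions)
            if s >= minsup:
                large[frozenset(cand)] = s
    return large

def generate_rules(large, minconf):
    """Step 3: split each large itemset X into A -> X-A and keep
    rules whose confidence supp(X)/supp(A) reaches minconf."""
    rules = []
    for itemset, supp_x in large.items():
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                ante = frozenset(ante)
                conf = supp_x / large[ante]   # ante is large by monotonicity
                if conf >= minconf:
                    rules.append((set(ante), set(itemset - ante), supp_x, conf))
    return rules

for a, c, s, conf in generate_rules(find_large_itemsets(transactions, 0.5), 0.5):
    print(f"{a} -> {c}  supp={s:.2f} conf={conf:.2f}")
```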
Outline • Introduction • Weights of Data Sources • Rule Selection • Synthesizing High-Frequency Rules Algorithm • Relative Synthesizing Model • Experiments • Conclusion
Introduction • Framework: each data source DB1, DB2, ..., DBn is mined locally, yielding rule sets RD1, RD2, ..., RDn (e.g., AB→C, A→D, B→E); these are synthesized into a global rule base GRB • Synthesizing high-frequency rules by • Weighting • Ranking
Weights of Data Sources • Definitions • Di: data sources • Si: set of association rules from Di • Ri: association rule • 3 Steps • Step 1: union of all Si • Step 2: assign each Ri a weight • Step 3: assign each Di a weight & normalize
Example • 3 Data Sources (minsupp=0.2, minconf=0.3) • S1 • AB→C with supp=0.4, conf=0.72 • A→D with supp=0.3, conf=0.64 • B→E with supp=0.34, conf=0.7 • S2 • B→C with supp=0.45, conf=0.87 • A→D with supp=0.36, conf=0.7 • B→E with supp=0.4, conf=0.6 • S3 • AB→C with supp=0.5, conf=0.82 • A→D with supp=0.25, conf=0.62
Step 1 • Union of all Si: S' = S1 ∪ S2 ∪ S3 • R1: AB→C, in S1, S3 (2 times) • R2: A→D, in S1, S2, S3 (3 times) • R3: B→E, in S1, S2 (2 times) • R4: B→C, in S2 (1 time)
Step 2 • Assigning each Ri a weight, WR = Num(R) / Σ Num(Rj) • WR1 = 2 / (2+3+2+1) = 0.25 • WR2 = 3 / (2+3+2+1) = 0.375 • WR3 = 2 / (2+3+2+1) = 0.25 • WR4 = 1 / (2+3+2+1) = 0.125
Step 3 • Assigning each Di a weight • WD1 = 2*0.25 + 3*0.375 + 2*0.25 = 2.125 • WD2 = 1*0.125 + 3*0.375 + 2*0.25 = 1.75 • WD3 = 2*0.25 + 3*0.375 = 1.625 • Normalization (total = 2.125 + 1.75 + 1.625 = 5.5) • WD1 = 2.125/5.5 ≈ 0.386 • WD2 = 1.75/5.5 ≈ 0.318 • WD3 = 1.625/5.5 ≈ 0.295
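The three weighting steps can be sketched in a few lines of Python; the variable names are ours, and only rule frequencies matter for the weights, so supports and confidences are omitted:

```python
from collections import Counter

# The three rule sets from the example above.
rule_sets = {
    "D1": ["AB->C", "A->D", "B->E"],
    "D2": ["B->C", "A->D", "B->E"],
    "D3": ["AB->C", "A->D"],
}

# Step 1: union of all Si, counting how often each rule occurs (Num(R)).
num = Counter(r for rules in rule_sets.values() for r in rules)

# Step 2: weight of each rule, WR = Num(R) / sum of all Num values.
total = sum(num.values())
w_rule = {r: n / total for r, n in num.items()}

# Step 3: raw weight of each source, WDi = sum of Num(R) * WR over its rules,
# then normalize so the source weights sum to 1.
raw = {d: sum(num[r] * w_rule[r] for r in rules) for d, rules in rule_sets.items()}
z = sum(raw.values())
w_source = {d: v / z for d, v in raw.items()}

print(w_rule)    # AB->C: 0.25, A->D: 0.375, B->E: 0.25, B->C: 0.125
print(raw)       # D1: 2.125, D2: 1.75, D3: 1.625
print(w_source)  # D1: ~0.386, D2: ~0.318, D3: ~0.295
```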
Why Rule Selection? • Goal: extract high-frequency rules; low-frequency rules are noise • Solution • If Num(Ri) / n < λ (n: number of data sources, Num(Ri): frequency of Ri, λ: a given frequency threshold) • Then rule Ri is wiped out
Rule Selection • Example: 10 data sources • D1~D9: {R1: X→Y} • D10: {R1: X→Y, R2: X1→Y1, …, R11: X10→Y10} • Let λ = 0.8 • Num(R1) / 10 = 10/10 = 1 ≥ λ → keep • Num(R2~R11) / 10 = 1/10 = 0.1 < λ → wiped out • After selection, D1~D10: {R1: X→Y} • WR1 = 10/10 = 1; WD1~10 = (10*1) / (10 * 10*1) = 0.1 each
Comparison • Without rule selection • WD1~9 ≈ 0.099 • WD10 ≈ 0.109 • With rule selection • WD1~10 = 0.1 • From the high-frequency-rule point of view, the weight errors are • D1~9: |0.1 - 0.099| = 0.001 • D10: |0.1 - 0.109| = 0.009 • Total error = 0.001 + 0.009 = 0.01
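A sketch of the selection filter on this 10-source example (the rule names and helper are illustrative):

```python
from collections import Counter

# D1..D9 each report only X->Y; D10 also reports 10 rare rules.
rule_sets = {f"D{i}": ["X->Y"] for i in range(1, 10)}
rule_sets["D10"] = ["X->Y"] + [f"X{j}->Y{j}" for j in range(1, 11)]

def select_rules(rule_sets, lam):
    """Keep only rules reported by at least lam * n of the n sources."""
    n = len(rule_sets)
    num = Counter(r for rules in rule_sets.values() for r in rules)
    kept = {r for r, c in num.items() if c / n >= lam}
    return {d: [r for r in rules if r in kept] for d, rules in rule_sets.items()}

print(select_rules(rule_sets, 0.8))
# X->Y is kept (10/10 = 1 >= 0.8); X1->Y1 ... X10->Y10 are wiped out (0.1 < 0.8)
```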
Synthesizing High-Frequency Rules Algorithm • 5 Steps • Step 1: rule selection • Step 2: weights of data sources • Step 2.1: union of all Si • Step 2.2: assign each Ri a weight • Step 2.3: assign each Di a weight & normalize • Step 3: compute the supp & conf of each Ri • Step 4: rank all rules by support • Step 5: output the high-frequency rules
An Example • 3 data sources • λ = 0.4, minsupp = 0.2, minconf = 0.3
Step 1 • Rule selection • R1: AB→C, in S1, S3 (2 times); Num(R1) / 3 ≈ 0.67 ≥ λ → keep • R2: A→D, in S1, S2, S3 (3 times); Num(R2) / 3 = 1 ≥ λ → keep • R3: B→E, in S1, S2 (2 times); Num(R3) / 3 ≈ 0.67 ≥ λ → keep • R4: B→C, in S2 (1 time); Num(R4) / 3 ≈ 0.33 < λ → wiped out
Step 2: Weights of Data Sources • Weights of Ri (recomputed over the remaining rules) • WR1 = 2 / (2+3+2) ≈ 0.29 • WR2 = 3 / (2+3+2) ≈ 0.42 • WR3 = 2 / (2+3+2) ≈ 0.29 • Weight of Di • WD1 = 2*0.29 + 3*0.42 + 2*0.29 = 2.42 • WD2 = 3*0.42 + 2*0.29 = 1.84 • WD3 = 2*0.29 + 3*0.42 = 1.84 • Normalization • WD1 = 2.42/(2.42+1.84+1.84) ≈ 0.396 • WD2 = 1.84/6.1 ≈ 0.302 • WD3 = 1.84/6.1 ≈ 0.302
Step 3 • WD1 = 0.396, WD2 = 0.302, WD3 = 0.302 • Computing the supp & conf of each Ri (A→D occurs in all three sources, so its sums have three terms) • Support • AB→C: 0.396*0.4 + 0.302*0.5 = 0.309 • A→D: 0.396*0.3 + 0.302*0.36 + 0.302*0.25 = 0.303 • B→E: 0.396*0.34 + 0.302*0.4 = 0.255 • Confidence • AB→C: 0.396*0.72 + 0.302*0.82 = 0.533 • A→D: 0.396*0.64 + 0.302*0.7 + 0.302*0.62 = 0.652 • B→E: 0.396*0.7 + 0.302*0.6 = 0.458
Step 4 & Step 5 • Ranking all rules by support & output • minsupp = 0.2, minconf = 0.3 • Ranking • 1. AB→C (0.309) • 2. A→D (0.303) • 3. B→E (0.255) • Output: 3 rules • AB→C (0.309, 0.533) • A→D (0.303, 0.652) • B→E (0.255, 0.458)
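Steps 3 to 5 of this example in sketch form, with the per-source (supp, conf) pairs and source weights taken from the slides above (names are ours):

```python
# Source weights from step 2.
w = {"D1": 0.396, "D2": 0.302, "D3": 0.302}

# rule -> {source: (supp, conf)}, only for sources that contain the rule.
rules = {
    "AB->C": {"D1": (0.40, 0.72), "D3": (0.50, 0.82)},
    "A->D":  {"D1": (0.30, 0.64), "D2": (0.36, 0.70), "D3": (0.25, 0.62)},
    "B->E":  {"D1": (0.34, 0.70), "D2": (0.40, 0.60)},
}

# Step 3: weighted support and confidence of each rule.
synth = {}
for r, per_source in rules.items():
    supp = sum(w[d] * s for d, (s, _) in per_source.items())
    conf = sum(w[d] * c for d, (_, c) in per_source.items())
    synth[r] = (supp, conf)

# Steps 4-5: rank by synthesized support, output rules passing the thresholds.
minsupp, minconf = 0.2, 0.3
for r, (s, c) in sorted(synth.items(), key=lambda kv: -kv[1][0]):
    if s >= minsupp and c >= minconf:
        print(f"{r}: supp={s:.3f} conf={c:.3f}")
# AB->C: supp=0.309 conf=0.533
# A->D:  supp=0.303 conf=0.652
# B->E:  supp=0.255 conf=0.458
```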
Relative Synthesizing Model • Framework: when the data sources Di are unknown, the same rule may be collected from the Internet, journals, books, and the Web with different confidences (e.g., X→Y with conf = 0.7, 0.72, and 0.68); what confidence should the synthesized X→Y get? • Synthesizing • clustering method • rough method
Synthesizing Methods • Physical meaning • If the confidences are irregularly distributed • Maximum synthesizing operator • Minimum synthesizing operator • Average synthesizing operator • If the confidences X follow a normal distribution, find a clustering interval [a, b] satisfying • 1. P{a ≤ X ≤ b} (= m/n, where m of the n confidences fall in [a, b]) ≥ α • 2. |b - a| ≤ ε • 3. a, b > minconf
Clustering Method • 5 Steps • Step 1: closeness, c_ij = 1 - |conf_i - conf_j| → the distance relation table • Step 2: closeness degree measure → the confidence-confidence matrix • Step 3: are two confidences close enough? → the confidence relationship matrix • Step 4: creating classes → [a, b], the interval of the confidence of rule X→Y • Step 5: verifying the interval → does it satisfy the constraints?
An Example • Assume rule X→Y with • conf1=0.7, conf2=0.72, conf3=0.68, conf4=0.5, conf5=0.71, conf6=0.69, conf7=0.7, conf8=0.91 • 3 parameters • α = 0.7 • ε = 0.08 • δ = 0.69 (closeness threshold)
Step 1: Closeness • Example • conf1 = 0.7, conf2 = 0.72 • c1,2 = 1 - |conf1 - conf2| = 1 - |0.70 - 0.72| = 0.98
Step 2: Closeness Degree Measure • Example: the confidence-confidence matrix (shown as a table in the original slides)
Step 3: Close Enough? • Example • closeness-degree threshold = 6.9 • matrix entries > 6.9 → close enough • entries < 6.9 → not close enough
Step 4: Creating Classes • Example • Class 1: conf1~conf3, conf5~conf7 • Class 2: conf4 • Class 3: conf8
Step 5: Interval Verifying • Example • Class 1 • conf1=0.7, conf2=0.72, conf3=0.68, conf5=0.71, conf6=0.69, conf7=0.7 • [a, b] = [min, max] = [conf3, conf2] = [0.68, 0.72] • constraint 1: P{0.68 ≤ X ≤ 0.72} = 6/8 = 0.75 ≥ α (0.7) • constraint 2: |0.72 - 0.68| = 0.04 ≤ ε (0.08) • constraint 3: 0.68, 0.72 > minconf (0.65) • In the same way, Class 2 and Class 3 are wiped out • Result: X→Y with conf = [0.68, 0.72] • Support? The support is synthesized into an interval in the same way
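A sketch of the whole clustering run, under one plausible reading of steps 2 and 3: the closeness-degree measure of two confidences is taken as the fuzzy composition Σk min(c_ik, c_jk), which together with the example's threshold 6.9 reproduces the classes above. This reading is our assumption, not spelled out in the slides:

```python
confs = [0.70, 0.72, 0.68, 0.50, 0.71, 0.69, 0.70, 0.91]
alpha, eps, minconf, theta = 0.7, 0.08, 0.65, 6.9
n = len(confs)

# Step 1: closeness relation c_ij = 1 - |conf_i - conf_j|.
c = [[1 - abs(x - y) for y in confs] for x in confs]

# Step 2 (assumed form): closeness-degree (confidence-confidence) matrix,
# m_ij = sum_k min(c_ik, c_jk).
m = [[sum(min(u, v) for u, v in zip(c[i], c[j])) for j in range(n)]
     for i in range(n)]

# Steps 3-4: put each confidence into the first class holding a member
# whose closeness degree to it reaches theta; otherwise open a new class.
classes = []
for i in range(n):
    for cls in classes:
        if any(m[i][j] >= theta for j in cls):
            cls.append(i)
            break
    else:
        classes.append([i])

# Step 5: verify the interval [a, b] = [min, max] of each class.
for cls in classes:
    vals = [confs[i] for i in cls]
    a, b = min(vals), max(vals)
    ok = (len(cls) / n >= alpha      # constraint 1: P{a <= X <= b} >= alpha
          and b - a <= eps           # constraint 2: interval narrow enough
          and a > minconf)           # constraint 3: a, b > minconf (b >= a)
    print(f"[{a:.2f}, {b:.2f}] size={len(cls)}: {'kept' if ok else 'wiped out'}")
# [0.68, 0.72] size=6: kept; [0.50, 0.50] and [0.91, 0.91]: wiped out
```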
Rough Method • Example • R: AB→C • supp1=0.4, conf1=0.72 (from one source) • supp2=0.5, conf2=0.82 (from another) • Maximum • max(supp(R)) = max(0.4, 0.5) = 0.5 • max(conf(R)) = max(0.72, 0.82) = 0.82 • Minimum → 0.4, 0.72 • Average → 0.45, 0.77
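The rough operators reduce to one-liners; a sketch on the AB→C example above:

```python
# Per-source (supp, conf) pairs for rule AB->C, from the example.
pairs = [(0.4, 0.72), (0.5, 0.82)]

supps, confs = zip(*pairs)
print("max:", max(supps), max(confs))      # 0.5 0.82
print("min:", min(supps), min(confs))      # 0.4 0.72
print("avg:", sum(supps) / len(supps),     # 0.45
      sum(confs) / len(confs))             # 0.77
```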
Experiments • Running time • SWNBS (without rule selection) vs. SWBRS (with rule selection) • SWNBS takes more time than SWBRS • Error (first 20 frequent itemsets) • Max = 0.000065 • Avg = 0.00003165
Conclusion • Synthesizing model • Data sources known → weighting • Data sources unknown → clustering method, rough method
Future works • Sequential patterns • Combining genetic algorithms (GA) and other techniques