Stratified Sampling for Data Mining on the Deep Web

Presentation Transcript


  1. Stratified Sampling for Data Mining on the Deep Web Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa, agrawal}@cse.ohio-state.edu Dec. 16, 2010

  2. Outline • Introduction • Background Knowledge • Association Rule Mining • Differential Rule Mining • Basic Formulation • Main Technical Approach • A Greedy Stratification Method • Experiment Result • Conclusion

  3. Introduction • Deep Web • Query interface vs. backend database • Input attribute vs. Output attribute • Data mining on the deep web • High level summary of the data • Challenge • Databases cannot be accessed directly • Sampling • Deep web querying is time consuming • Efficient Sampling Method

  4. Background Knowledge-Association Rule Mining • Aim: co-occurrence patterns for items • Frequent itemset: the support of the itemset is larger than a threshold • Rule X ⇒ Y: • X ∪ Y is a frequent itemset • Confidence P(Y | X) is larger than a threshold
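
As a concrete illustration of these definitions (not code from the paper), the sketch below computes support and confidence for a candidate rule X ⇒ Y over transactions modeled as item sets; the thresholds and example data are illustrative assumptions.

```python
# Minimal sketch: support and confidence of a candidate rule X => Y.
# Transactions are modeled as sets of items; thresholds are illustrative.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """Estimated P(Y | X): support of X ∪ Y divided by support of X."""
    return support(X | Y, transactions) / support(X, transactions)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
X, Y = {"a"}, {"b"}
frequent = support(X | Y, transactions) >= 0.4     # support threshold (assumed)
confident = confidence(X, Y, transactions) >= 0.6  # confidence threshold (assumed)
```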

  5. Background Knowledge-Differential Rule Mining • Aim: differences between two deep web data sources • E.g., the price of the same hotels on two web sites • Identical attributes vs. differential attributes: same vs. different values across the sources • Rule X ⇒ t differs between D1 and D2, where • X: frequent itemset composed of identical attributes • t: differential or target attribute • D1, D2: data sources
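
A rough sketch of the idea, assuming records are dictionaries and using a hypothetical hotel-price example and difference threshold; the paper's exact rule form and significance test are not given in the slides.

```python
# Sketch: does frequent itemset X (over identical attributes) go with a shift
# in the mean of target attribute t between data sources D1 and D2?
def mean_target(records, X, t):
    """Mean of attribute t over records matching every attribute-value pair in X."""
    vals = [r[t] for r in records if all(r.get(k) == v for k, v in X.items())]
    return sum(vals) / len(vals) if vals else None

# Hypothetical hotel-price example; attribute names and threshold are assumptions.
D1 = [{"city": "Columbus", "stars": 3, "price": 120},
      {"city": "Columbus", "stars": 3, "price": 110}]
D2 = [{"city": "Columbus", "stars": 3, "price": 150},
      {"city": "Columbus", "stars": 3, "price": 140}]

X = {"city": "Columbus", "stars": 3}     # itemset over identical attributes
m1, m2 = mean_target(D1, X, "price"), mean_target(D2, X, "price")
is_differential = abs(m1 - m2) > 20      # illustrative difference threshold
```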

  6. Basic Formulation-Problem Formulation • Two-step sampling procedure • A pilot sample • Randomly drawn from the deep web • Interesting rules are identified • An additional sample • Verifies the identified rules • Association rules and differential rules • Sampling more data records satisfying X • If X only contains input attributes – easy • If X contains output attributes • Random sampling? Not efficient! • How?

  7. Basic Formulation-Problem Formulation in Detail • Considering rules with a single output attribute A on the left-hand side • Association Rule • Estimate the support or confidence of the rule • Differential Rule • Estimate the mean of the target attribute t given A = a • Goal – sampling with • High estimation accuracy • Low sampling cost

  8. Basic Formulation-Stratified Sampling • Sampling separately from strata • Heterogeneous across strata & homogeneous within a stratum • Estimating the mean value of y: ȳ_st = Σ_h (N_h / N) · ȳ_h • N_h: size of stratum h; ȳ_h: its sampled mean value • Association Rule Mining • y: whether an itemset is contained in a transaction • If the itemset is contained in a transaction, y = 1; otherwise y = 0 • Differential Rule Mining • y: the value of the target attribute
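
A minimal sketch of the stratified estimator above, assuming the stratum sizes N_h are known; with 0/1 indicator values it estimates support (association rules), and with the target attribute it estimates its mean (differential rules). The stratum sizes and sample values are made up.

```python
# Stratified estimator of a mean: y_st = sum_h (N_h / N) * ybar_h,
# where N_h is the size of stratum h and ybar_h its sampled mean.
def stratified_mean(strata):
    """`strata` maps stratum id -> (N_h, list of sampled y values)."""
    N = sum(N_h for N_h, _ in strata.values())
    return sum((N_h / N) * (sum(ys) / len(ys)) for N_h, ys in strata.values())

# With 0/1 values (itemset contained or not), the estimate is the support.
strata = {
    "h1": (6000, [1, 0, 1, 1]),   # stratum of size 6000, four sampled indicators
    "h2": (4000, [0, 0, 1, 0]),
}
support_estimate = stratified_mean(strata)   # 0.6 * 0.75 + 0.4 * 0.25 = 0.55
```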

  9. Background-Neyman Allocation • Sample Allocation • Determining the sample size for each stratum • Fixed total sample size • Neyman Allocation • Minimizes the variance of the stratified estimate • Problem when applied to the Deep Web • The probability of A = a in each stratum is not considered • Possibly large sampling cost • Sampling cost: number of queries submitted to the deep web
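
The sketch below implements the textbook Neyman allocation formula, n_h = n · N_h σ_h / Σ_l N_l σ_l, which sends more samples to larger and more variable strata; the stratum sizes and standard deviations are made-up inputs.

```python
# Neyman allocation: n_h = n * (N_h * sigma_h) / sum_l (N_l * sigma_l),
# which minimizes the variance of the stratified estimator for a fixed total n.
def neyman_allocation(n, sizes, stds):
    """`sizes[h]` is the stratum size N_h, `stds[h]` the within-stratum std dev."""
    total = sum(sizes[h] * stds[h] for h in sizes)
    return {h: round(n * sizes[h] * stds[h] / total) for h in sizes}

sizes = {"h1": 6000, "h2": 4000}              # assumed stratum sizes
stds = {"h1": 0.2, "h2": 0.5}                 # assumed within-stratum std devs
print(neyman_allocation(100, sizes, stds))    # the noisier stratum gets more samples
```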

  10. Sampling Cost • Sampling cost on the Deep Web • Aim: obtain data records with A = a • Sampling cost depends on • n: the number of data records needed with A = a • p: the probability of finding a data record with A = a • Integrated Cost • Combining sampling cost and estimation variance • Two adjustable weights
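
The slides do not spell out the integrated cost formula, so the sketch below is only one plausible reading: a weighted sum of the stratified estimator's variance and the expected number of queries n_h / p_h, with two adjustable weights; all names and numbers are assumptions.

```python
# Illustrative integrated cost: weight_v * estimated variance + weight_c * expected queries.
# Expected queries to obtain n_h matching records in stratum h is taken as n_h / p_h,
# where p_h is the probability that a record in stratum h satisfies A = a.
def integrated_cost(alloc, sizes, stds, p_match, weight_v, weight_c):
    N = sum(sizes.values())
    variance = sum(((sizes[h] / N) ** 2) * (stds[h] ** 2) / alloc[h] for h in alloc)
    expected_queries = sum(alloc[h] / p_match[h] for h in alloc)
    return weight_v * variance + weight_c * expected_queries

sizes = {"h1": 6000, "h2": 4000}
stds = {"h1": 0.2, "h2": 0.5}
p_match = {"h1": 0.5, "h2": 0.1}              # assumed chance of hitting A = a
cost = integrated_cost({"h1": 38, "h2": 62}, sizes, stds, p_match, 0.7, 0.3)
```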

  11. Main Technical Approach – Stratification Process • Stratification by a tree on the query space • Built in a top-down manner • Best split to create child nodes: the input attribute with the smallest integrated cost • The splitting process stops when the integrated cost at each leaf node is small • Leaf nodes: final strata for sampling
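
A simplified sketch of a greedy, top-down stratification: at each node, try splitting on each remaining input attribute, keep the split with the smallest integrated cost, and stop when a node's cost is already small. The cost function and stopping threshold are abstract parameters, so this is an illustration rather than the paper's exact algorithm.

```python
# Greedy top-down stratification (sketch). A node holds a set of pilot records;
# splitting on an input attribute partitions them by that attribute's value.
def build_strata(records, input_attrs, cost_fn, stop_threshold):
    if cost_fn([records]) <= stop_threshold or not input_attrs:
        return [records]                      # leaf node: one final stratum

    def split_on(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r)
        return list(groups.values())

    # Best split: the input attribute whose children have the smallest integrated cost.
    best_attr = min(input_attrs, key=lambda a: cost_fn(split_on(a)))
    remaining = [a for a in input_attrs if a != best_attr]
    strata = []
    for child in split_on(best_attr):
        strata.extend(build_strata(child, remaining, cost_fn, stop_threshold))
    return strata
```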

  12. Experiment Result • Data Set: US Census • The income of US households from the 2008 US Census • 40,000 data records • 7 categorical and 2 numerical attributes • Two Metrics • Variance of Estimation • Sampling Cost

  13. Experiment Result-Settings • Five sampling procedures • Four different weights for variance and sampling cost • Full_Var: weight placed entirely on estimation variance • Var7, Var5, Var3: progressively lower weight on variance (and higher weight on sampling cost) • Rand: simple random sampling

  14. Experiment Result – Variance of Estimation • Association Rule Mining • The variance of estimation increases as the weight on variance decreases • Random sampling has a higher variance of estimation

  15. Experiment Result – Sampling Cost • Association Rule Mining • The sampling cost decreases as the weight on variance decreases • Random sampling has a higher sampling cost

  16. Conclusion • Stratified sampling for data mining on the deep web • Considers both estimation accuracy and sampling cost • A tree model captures the relation between input attributes and output attributes • A greedy stratification method maximally reduces an integrated cost metric • Our experiments show • Higher sampling accuracy and lower sampling cost compared with simple random sampling • Sampling costs can be further reduced by trading off a fraction of estimation error

  17. Questions & Comments?
