Data Mining over Hidden Data Sources
This article explores the concept of the Deep Web and focuses on data mining techniques for hidden data sources. It discusses several contributions, including stratified k-means clustering, two-phase sampling based outlier detection, and hierarchical clustering over Deep Web data sources.
Data Mining over Hidden Data Sources Tantan Liu, Dept. of Computer Science & Engineering, Ohio State University, July 23, 2012
Outline • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012) • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (Submitted to ICDM 2012) • Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012) • An Active Learning Based Frequent Itemset Mining (ICDE, 2011) • Differential Rule Mining (ICDM Workshops, 2010) • Stratified Sampling for Deep Web Mining (ICDM, 2010) • Conclusion and Future work
Deep Web • Data sources hidden from standard search engines • Online query interface vs. database • The database is accessible only through the online interface • Input attributes vs. output attributes • An example of a Deep Web source
Data Mining over the Deep Web • High-level summary of data • Scenario 1: a user wants to relocate to a county • What is the summary of the residences in the county? • Age, price, square footage • The county property assessor's web site only allows simple queries • Scenario 2: a user is thinking about his or her career path • High-level knowledge about the job posts in the market • Job type, salary, education, experience, skills, … • Job web sites, e.g., LinkedIn and MSN Careers, provide millions of job posts
Challenges • Databases cannot be accessed directly • A sampling method is needed for Deep Web mining • Obtaining data is time consuming • An efficient sampling method is needed • High accuracy with low sampling cost
Contributions • Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012) • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012) • Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012) • An Active Learning Based Frequent Itemset Mining (ICDE, 2011) • Differential Rule Mining (ICDM Workshops, 2010) • Stratified Sampling for Deep Web Mining (ICDM, 2010)
Roadmap • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source • Stratification Based Hierarchical Clustering on a Deep Web Data Source • An Active Learning Based Frequent Itemset Mining • Differential Rule Mining • Stratified Sampling for Deep Web Mining • Conclusion and Future work
k-means clustering over a deep web data source • Goal: estimating k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers in the whole population
Overview of Method • Stratification • Sample Allocation
Stratification on the deep web • Partitioning the entire population into strata • Stratifies on the query space of input attributes • Goal: homogeneous query subspaces • Radius of a query subspace • Rule: choose the input attribute that most decreases the radius of a node • For each input attribute, compute the decrease of radius its split would produce
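The split rule above can be sketched in code. The transcript lost the exact radius formula, so this is a minimal sketch under an assumption: the radius of a subspace is taken to be the maximum distance of any pilot-sample point from the subspace's mean vector, and the attribute names and record layout are hypothetical.

```python
def radius(points):
    """Radius of a query subspace: max distance of any pilot-sample
    point from the subspace's mean vector (one plausible definition;
    the talk does not spell out the exact formula)."""
    if not points:
        return 0.0
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return max(sum((p[d] - center[d]) ** 2 for d in range(dim)) ** 0.5
               for p in points)

def best_split_attribute(pilot, input_attrs):
    """Choose the input attribute whose split most decreases radius.
    Each pilot record carries its input-attribute values ("in") and an
    output vector ("out"); splitting on attribute a groups records by
    a's value in the query space."""
    base = radius([r["out"] for r in pilot])
    best, best_decrease = None, -1.0
    for a in input_attrs:
        groups = {}
        for r in pilot:
            groups.setdefault(r["in"][a], []).append(r["out"])
        # weighted average radius of the child subspaces after the split
        child = sum(len(g) / len(pilot) * radius(g) for g in groups.values())
        if base - child > best_decrease:
            best, best_decrease = a, base - child
    return best, best_decrease
```

Splitting on the attribute that most shrinks the children's radii drives the tree toward homogeneous query subspaces, which is the stated stratification goal.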
Sampling Methods • We have created c*k partitions and c*k subspaces • A pilot sample • c*k-means clustering generates the c*k partitions • Representative sampling • Good estimation of the statistics of the c*k subspaces • Centers • Proportions
Representative Sampling: Centers • Center of a subspace • Mean vector of all data points belonging to the subspace • Let the sample be S = {DR1, DR2, …, DRn} • For the i-th subspace, the center is estimated as the mean of the sampled points that fall in it
Distance Function • Between the c*k estimated centers and the true centers • Using Euclidean distance • Integrated variance • Defined in terms of subspace, stratum, and output attributes • Computed based on the pilot sample • With the number of samples drawn from the j-th stratum as a parameter
Optimized Sampling Allocation • Goal: minimize the integrated variance under a fixed sampling budget • Using Lagrange multipliers • We sample more heavily from strata with large variance • Their data are spread over a wide area, and more data are needed to represent the population
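For a fixed total budget, the Lagrange-multiplier minimization of a stratified variance yields the classical Neyman allocation, where stratum j receives samples in proportion to N_j * sigma_j. The talk's integrated variance may weight strata differently, so treat this as a sketch of the standard closed form rather than the paper's exact allocation:

```python
def neyman_allocation(sizes, stds, budget):
    """Allocate `budget` samples across strata proportionally to
    N_j * sigma_j (classical Neyman allocation, the closed form the
    Lagrange-multiplier argument yields). Strata with larger standard
    deviation -- data spread over a wider area -- get more samples."""
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # degenerate case: fall back to allocation proportional to size
        total_n = sum(sizes)
        return [round(budget * n / total_n) for n in sizes]
    return [round(budget * w / total) for w in weights]
```

For example, two equal-size strata whose standard deviations differ by a factor of three receive a 1:3 split of the budget.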
Active Learning Based Sampling Method • In machine learning • Passive learning: data are randomly chosen • Active learning: certain data are selected to help build a better model, useful when obtaining data is costly and/or time-consuming • For each stratum i, the estimated decrease of the distance function is computed • Iterative sampling process • At each iteration, the stratum with the largest decrease of the distance function is selected for sampling • The integrated variance is then updated
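The iterative loop above can be sketched as a greedy allocator. Since the slide's exact distance-function decrease was lost in transcription, this sketch assumes the standard stratified-variance objective sum_j N_j^2 * sigma_j^2 / n_j and scores each stratum by how much one more batch would reduce it; in practice the variances would be re-estimated after each draw.

```python
def active_sampling(sizes, vars_, budget, batch=1, n0=2):
    """Greedy iterative allocation: each iteration samples the stratum
    whose extra batch most reduces sum_j N_j^2 * sigma_j^2 / n_j
    (an assumed objective standing in for the talk's distance-function
    decrease). n0 is the pilot sample already drawn per stratum."""
    n = [n0] * len(sizes)
    spent = 0
    while spent < budget:
        # estimated decrease from adding `batch` samples to stratum j
        gains = [N * N * v * (1.0 / nj - 1.0 / (nj + batch))
                 for N, v, nj in zip(sizes, vars_, n)]
        j = gains.index(max(gains))
        n[j] += batch
        spent += batch
    return n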
Representative Sampling: Proportions • Proportion of a subspace • Fraction of data records belonging to the subspace • Depends on the proportion of the subspace within each stratum, e.g., the j-th stratum • Risk function • Distance between the estimated parameters and their true values • Iterative sampling process • At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling • The parameters are then updated
Stratified K-means Clustering • Weight for data records in the i-th stratum: the ratio of the stratum's population size to its sample size • Similar to k-means clustering • The center of the i-th cluster is the weighted mean of its assigned sample points
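A minimal sketch of the weighted clustering step: each sampled record from stratum i carries weight N_i / n_i so the stratified sample mimics the full population, and centers are weight-weighted means. The data layout and fixed iteration count are illustrative choices, not the paper's implementation.

```python
import random

def stratified_kmeans(samples, weights, k, iters=20):
    """Weighted k-means: record r from stratum i carries weight
    w_i = N_i / n_i. Assignment uses squared Euclidean distance;
    the center update is the weighted mean of assigned points."""
    random.seed(0)  # deterministic initialization for the sketch
    centers = random.sample(samples, k)
    dim = len(samples[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        wsum = [0.0] * k
        for p, w in zip(samples, weights):
            # assign each sampled record to its nearest center
            j = min(range(k),
                    key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                      for d in range(dim)))
            for d in range(dim):
                sums[j][d] += w * p[d]
            wsum[j] += w
        centers = [tuple(s[d] / wsum[j] if wsum[j] else centers[j][d]
                         for d in range(dim))
                   for j, s in enumerate(sums)]
    return centers
```

With uniform weights this reduces to ordinary k-means; unequal stratum weights shift the centers toward under-sampled, heavily weighted strata.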
Contributions • Sampling methods for clustering over a deep web data source • Representative sampling • Partition on the space of output attributes • Centers • Optimized sampling method • Active learning based sampling method • Proportions • Active learning based sampling method
Experiment Results • Data sets • Noisy synthetic data set: 4,000 data records with 4 input attributes and 2 output attributes, plus 400 added noise points • Yahoo! data set: data on used cars, 8,000 data records • Metric: average distance (AvgDist)
Representative Sampling: Noisy Data Set • Benefit of stratification • Compared with rand, the decreases in AvgDist are 26.9%, 35.5%, 37.4%, and 38.6% • Benefit of representative sampling • Compared with rand_st, the decreases in AvgDist are 11.8%, 14.4%, and 16.1% • Center based sampling methods perform better • The optimized sampling method performs better in the long run
Representative Sampling: Yahoo! Data Set • Benefit of stratification • Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0%, and 16.8% • Benefit of representative sampling • Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, and 10.5% • Center based sampling methods perform better • The optimized sampling method performs better in the long run
Scalability • The execution time of each method is linear in the size of the data set
Roadmap • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source • Stratification Based Hierarchical Clustering on a Deep Web Data Source • An Active Learning Based Frequent Itemset Mining • Differential Rule Mining • Stratified Sampling for Deep Web Mining • Conclusion and Future work
Outlier Detection • Outlier: an observation that deviates greatly from the other observations • DB(p; D) outlier: an object for which at least a fraction p of the objects lie at a distance greater than D from it • Challenges for outlier detection over a deep web data source • Recall: finding as large a fraction of the outliers as possible • Precision: accurately identifying outliers among the sampled data
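The DB(p; D) definition above admits a direct, if naive, check. This sketch scans all pairwise distances in O(n^2); over a Deep Web source it would run only on the sampled records, which is exactly why the sampling strategy matters.

```python
def db_outliers(points, p, D):
    """Return indices of DB(p; D) outliers: objects for which at least
    a fraction p of the other objects lie at Euclidean distance
    greater than D (naive all-pairs scan)."""
    out = []
    n = len(points)
    for i, a in enumerate(points):
        far = sum(1 for j, b in enumerate(points)
                  if j != i and
                  sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5 > D)
        if far >= p * (n - 1):
            out.append(i)
    return out
```

For instance, a point sitting far from a tight cluster has nearly all other points beyond distance D and is flagged, while cluster members are not.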
Two-phase Sampling Method • Neighborhood sampling • Aims at improving recall • Query subspaces with a high probability of containing outliers are explored • Uncertain driven sampling • Aims at improving precision
Outliers in Stratified Sampling • Stratified sampling has better performance • Stratification • Similar to the stratification used in k-means clustering over a deep web data source • Controls the number of strata • Outlier detection • For a data object, consider the fraction of data objects at a distance greater than D • This fraction is estimated from the stratified sample
Neighbor Nodes • Similar data objects tend to come from the same query subspace or from neighboring query subspaces • Neighbor nodes for a node • Its left and right cousins and the nodes with the same parent
Neighborhood Sampling • [Tree over the query space: the root splits by year (Y=1980, 1990, 2000, 2010), then by B=1–4, then by Ba=1, 2]
Post-Stratification • The original strata are further stratified after the additional sampling • New stratum: leaf nodes with the same sample rate under the same original stratum • For each data record, the fraction of data objects at a distance greater than D is estimated, along with its variance • This yields the probability of the record being an outlier
Uncertain Driven Sampling • For a sampled data record • Outlier: its estimated probability of being an outlier exceeds an upper threshold • Normal data object: the probability falls below a lower threshold • Otherwise, it is an uncertain data object • Task: obtain a sample for identifying the uncertain data objects
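The three-way triage above can be sketched as follows. The slide's threshold symbols were lost in transcription, so the `upper` and `lower` values here are purely illustrative:

```python
def triage(prob_outlier, upper=0.9, lower=0.1):
    """Classify a sampled record by its estimated probability of being
    an outlier. `upper` and `lower` are hypothetical thresholds (the
    talk's exact values/symbols did not survive the transcript).
    Records in between are 'uncertain' and become the targets of the
    second sampling phase."""
    if prob_outlier > upper:
        return "outlier"
    if prob_outlier < lower:
        return "normal"
    return "uncertain"
```

Only the middle band triggers further sampling, which is how the second phase spends its budget where the current estimate is least reliable.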
Sample Allocation • For the uncertain data objects and their estimated fractions • To obtain better estimates, minimize the total variance of the estimates • Solved using a Lagrange multiplier
Outlier in Stratified Sampling • The distance between each pair of sampled data objects is computed • A sampled data record is an outlier if its estimated fraction of data objects at a distance greater than D exceeds p, and a normal data object otherwise • The estimate is based on the fraction of its sampled neighbors within the D-neighborhood
Efficient Outlier Detection • It can be shown that a sufficient condition allows a sampled record to be declared a normal data object or an outlier early • When the condition holds, the record is labeled without examining all of its sampled neighbors; otherwise further distances are computed
Experiment Results • Data set • Yahoo! data set: data on used cars, 8,000 data records • Evaluation • Precision: fraction of the identified outliers that are true outliers • Recall: fraction of the true outliers that appear in the sample
Recall • Benefit of stratification • Increase over SRS: 108.2%, 116.7%, and 74.7% • Benefit of neighborhood sampling • Increase over SSTS: 19.1% and 28.1% • Uncertain sampling decreases recall by 3.7%
Precision • All four methods perform well • The average precision is over 0.9 • Stratified sampling methods have lower precision • Compared with SSTS, the decreases are 1.7%, 0.68%, and 4.3% • Benefit of uncertain sampling • Compared with NS, the increase is 2.7%
Trade-off between Precision and Recall • Benefit of stratification • TPS, NS, and SSTS improve recall for precision in 0.75–0.975 • Benefit of neighborhood sampling • TPS and NS improve recall for precision in 0.75–0.975 • Benefit of uncertain sampling • TPS improves recall for precision in 0.92–1.0
Roadmap • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source • Stratification Based Hierarchical Clustering on a Deep Web Data Source • An Active Learning Based Frequent Itemset Mining • Differential Rule Mining • Stratified Sampling for Deep Web Mining • Conclusion and Future work
Stratification Based Hierarchical Clustering on a Deep Web Data Source • Hierarchical clustering based on stratified sampling • Stratification • Sample allocation • Representative sampling: the mean values of the output attributes are close to their true values • Uncertain sampling: sample heavily on the boundaries between clusters
An Active Learning Based Frequent Itemset Mining • Frequent itemset mining • Estimating the support of itemsets • The number of itemsets can be huge, so we consider 1-itemsets • Bayesian network • Models the relationship between input attributes and output attributes • A risk function on the estimated parameters • Active learning based sampling • Data records are selected step by step • Sample the query subspaces with the greatest decrease in the risk function
Differential Rule Mining • Different data sources give different values for the same data object • e.g., prices of commodities • Goal: analyzing the differences between data sources • Differential rule • Left-hand side: a frequent itemset • Right-hand side: the behavior of a differential attribute • Differential rule mining • Apriori algorithm • Statistical hypothesis testing
Stratified Sampling for Association Rule Mining and Differential Rule Mining • Data mining • Association rule mining & differential rule mining • Stratified sampling • Stratification • Combining estimation variance and sampling cost • A tree recursively built on the query space • Sample allocation • An optimized method minimizing an integrated cost of variance and sampling cost
Conclusion • Data mining on the Deep Web is challenging • We proposed methods for data mining on the Deep Web • A stratified k-means clustering method • A two-phase sampling based outlier detection method • A stratified hierarchical clustering method • An active learning based frequent itemset mining method • A stratified sampling method for data mining on the Deep Web • Differential rule mining • The experimental results show the efficiency of our methods
Future Work • Outlier detection over a deep web data source • Consider statistical-distribution based outlier detection • Mining multiple deep web data sources • Instance-based schema matching • Efficiently sampling instances from the Deep Web to facilitate schema matching • Mining the data coverage of multiple deep web data sources • Efficient sampling methods for estimating the data coverage of multiple data sources