
Differential Analysis on Deep Web Data Sources

Differential Analysis on Deep Web Data Sources. Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal ({liut,wangfa,zhujie,agrawal}@cse.ohio-state.edu). December 14, 2010. Outline: Introduction; Problem Definition; Differential Analysis and Approaches; Experiment Results; Conclusion.


Presentation Transcript


  1. Differential Analysis on Deep Web Data Sources Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal {liut,wangfa,zhujie,agrawal}@cse.ohio-state.edu December 14, 2010

  2. Outline • Introduction • Problem Definition • Differential Analysis and Approaches • Experiment Results • Conclusion

  3. Introduction • Deep web • Query forms vs. backend databases • Similar information from multiple data sources • How do they differ? • Application: guiding users’ search process • Higher-level knowledge summary • Patterns of values with respect to the same entity

  4. Problem Definition • Goal • Difference between multiple data sources in the same domain • Patterns of values of the same entity • Different values for the same data entity • For example: prices of commodities • How different is the data, and under what conditions? • Differential Rules • Capturing the difference of values

  5. Differential Analysis and Approaches • Summarizing difference between two data sources • Data queried from the deep web • A relational table • Attributes • Assumption: data sources have same attributes • Identical attributes • Same values for the same data object • Differential attributes • Different values for the same data object • Quantitative attributes • Differences in values of quantitative attributes

  6. Differential Analysis and Approaches-Useful Identifiers • Two data sources, A and B • Identical attributes • Differential attributes • :attribute in data source • Combining the relational tables of A and B • Differential rule X->d • Profile X: the left-hand side of the rule
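
The combination of the two relational tables described on this slide can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the sample records, the field names, the use of a shared `hotel_id` as the join key, and the helper name `combine`. The slides do not say how entities are matched across sources.

```python
# Hypothetical records; field names loosely mirror the hotel attributes
# listed later in the talk. A shared hotel_id is an assumption.
source_a = [
    {"hotel_id": 1, "city": "Paris", "star": 4, "price": 210.0},
    {"hotel_id": 2, "city": "Paris", "star": 3, "price": 120.0},
    {"hotel_id": 3, "city": "Rome", "star": 4, "price": 180.0},
]
source_b = [
    {"hotel_id": 1, "city": "Paris", "star": 4, "price": 195.0},
    {"hotel_id": 2, "city": "Paris", "star": 3, "price": 125.0},
    {"hotel_id": 4, "city": "Rome", "star": 5, "price": 300.0},
]


def combine(table_a, table_b, key, identical, differential):
    """Join two tables on `key`.

    Identical attributes are kept once; each differential attribute is
    kept as a (value_in_A, value_in_B) pair.
    """
    index_b = {row[key]: row for row in table_b}
    combined = []
    for row_a in table_a:
        row_b = index_b.get(row_a[key])
        if row_b is None:
            continue  # entity appears only in A
        if any(row_a[attr] != row_b[attr] for attr in identical):
            continue  # assumption: drop entities whose identical attributes disagree
        merged = {key: row_a[key]}
        merged.update({attr: row_a[attr] for attr in identical})
        merged.update({attr: (row_a[attr], row_b[attr]) for attr in differential})
        combined.append(merged)
    return combined


combined = combine(source_a, source_b, "hotel_id", ["city", "star"], ["price"])
for row in combined:
    print(row)
```

Only hotels 1 and 2 appear in both sources, so the combined table has two rows, each carrying one price pair that a later test can compare.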

  7. Differential Analysis and Approaches-Differential Rule Mining • Frequent Item Set Mining • Apriori algorithm • A concept hierarchy • Identifying patterns for target attributes • For each frequent itemset X • Decide • Paired Z-test • : difference between two random variables • Hypothesis test vs. • if > , then • if >0, then
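
The paired Z-test named on this slide can be sketched as follows. The transcript has lost the slide's symbols, so the null hypothesis (mean paired difference equals `delta`), the two-sided decision rule, the default critical value of 1.96 (the two-sided 5% level of the standard normal), and the function name `paired_z_test` are all assumptions of this sketch, not the authors' exact formulation. A Z-test also presumes a reasonably large number of pairs; the tiny samples below are only for illustration.

```python
import math


def paired_z_test(values_a, values_b, delta=0.0, z_critical=1.96):
    """Return (z, direction) for paired samples from sources A and B.

    direction is "A>B" or "A<B" when |z| exceeds z_critical, else None.
    """
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 in the denominator).
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    std_error = math.sqrt(variance / n)
    if std_error == 0.0:
        # All differences are equal: no spread to test against.
        if mean == delta:
            return 0.0, None
        z = math.copysign(math.inf, mean - delta)
    else:
        z = (mean - delta) / std_error
    if abs(z) <= z_critical:
        return z, None
    return z, "A>B" if z > 0 else "A<B"


# Hypothetical prices of the same five hotels under one profile.
z, direction = paired_z_test([210, 205, 220, 215, 208], [195, 200, 202, 199, 197])
print(round(z, 2), direction)
```

In a full miner, a test like this would be run once per frequent itemset X found by Apriori, using only the rows of the combined table that match X.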

  8. Differential Analysis and Approaches-Pruning Rules • Pruning rules • A large number of rules are generated • Essential rules predict unessential rules • Identifying essential rules • Direction of rules

  9. Differential Analysis and Approaches-Ancestors of Rules • Rules R1, R2 are complementary ancestors of rule R • R1: Y->d, R2: Z->d • R: X->d, and • Rule R is predicted by complementary ancestors R1 and R2

  10. Differential Analysis and Approaches-Profile Representation • Identifying essential rules • Rules are processed level by level • For a rule R at level k, all the rules from level 1 to k-1 are visited • Computation is expensive • Profile Representation • Uniquely describes the items contained in the profile X of a rule R • For profile , define • would be extremely large when profile X is large • Thus, we modify

  11. Differential Analysis and Approaches-Process of Pruning • Hash tables are used to store differential rules • Each level corresponds to a hash table • For each rule R in the k-th level • The ancestor rules from levels 1 to k/2 are visited • Identifying complementary rules by profile representation • R is an unessential rule • Predicted by a pair of complementary ancestor rules • Process the next rule
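
The pruning loop on this slide can be sketched in Python under several stated assumptions. The numeric profile representation from the previous slide is not recoverable from the transcript, so this sketch swaps in a `frozenset` of items as the hash key. It also assumes that complementary ancestors Y and Z split the profile X into two disjoint parts, that "predicted" means both ancestors carry the same direction d as R, and that ancestors are looked up among all mined rules. The names `is_predicted` and `prune` and the sample rules are hypothetical.

```python
from itertools import combinations


def is_predicted(profile, direction, rules_by_level):
    """True if some pair of complementary ancestors has the same direction."""
    k = len(profile)
    items = sorted(profile)
    # One side of any split has at most k/2 items, so levels 1..k/2 suffice.
    for i in range(1, k // 2 + 1):
        small_table = rules_by_level.get(i, {})
        large_table = rules_by_level.get(k - i, {})
        for subset in combinations(items, i):
            left = frozenset(subset)
            right = profile - left
            if (left in small_table and right in large_table
                    and small_table[left] == direction
                    and large_table[right] == direction):
                return True
    return False


def prune(rules_by_level):
    """Keep only rules not predicted by a pair of complementary ancestors.

    rules_by_level maps level k to a hash table {profile: direction},
    where each profile is a frozenset of k items.
    """
    essential = {}
    for k in sorted(rules_by_level):
        for profile, direction in rules_by_level[k].items():
            if not is_predicted(profile, direction, rules_by_level):
                essential.setdefault(k, {})[profile] = direction
    return essential


# Hypothetical mined rules: one hash table per level.
rules = {
    1: {
        frozenset({"city=Paris"}): "A>B",
        frozenset({"star=4"}): "A>B",
        frozenset({"star=3"}): "A<B",
    },
    2: {
        frozenset({"city=Paris", "star=4"}): "A>B",
        frozenset({"city=Paris", "star=3"}): "A>B",
    },
}
essential = prune(rules)
print(essential[2])
```

Here the rule for {city=Paris, star=4} is dropped because both of its single-item ancestors already point the same way, while {city=Paris, star=3} survives because its ancestors disagree.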

  12. Experiment Results • Data Set: four of the most popular travel sites. • 120 randomly selected cities all over the world • Attributes • Hotel ID, City, Star, Customer Rating, Cleanness Rating, Price, Service Rating • Concept Hierarchy for attribute: city

  13. Experiment Results - effectiveness

  14. Experiment Results – Pruning effectiveness

  15. Experiment Results- Efficiency

  16. Experiment Results -Mining-Utility of the Approach

  17. Conclusion • A method to extract a high-level summary of the differences between multiple data sources • Differential rule mining: a new data mining problem • Statistical test for discovering differential rules • A method to prune unessential rules • Hash tables are used to speed up the process • Experiments on four travel-related deep web data sources show good results.

  18. Questions?
