This paper presents a clustering approach for organizing query schemas from various web sources, focusing on the implicit domains of data, such as airfares, automobiles, and books. The study introduces a new objective function for model-based clustering, maximizing the dissimilarity of representational models of clustered data. It evaluates the effectiveness of current clustering methods by analyzing categorical data and exploring the relationships between diverse web databases. The proposed methodology aims to improve the integration and retrieval of deep web query results.
Organizing Structured Web Sources by Query Schemas: A Clustering Approach
Bin He
Joint work with: Tao Tao, Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign
Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
MetaQuerier: System architecture
• Front-end: Query Execution
  • Result Compilation
  • Query Translation
  • Source Selection
• Deep Web Repository
  • Query Interfaces
  • Query Capabilities
  • Subject Domains
  • Unified Interfaces
• Back-end: Semantics Discovery
  • Database Crawler
  • Interface Extraction
  • Source Organization
  • Schema Matching
[Figure labels: The Deep Web, Query Web databases, Find Web databases]
In MetaQuerier, source organization means clustering query interfaces into implicit domains
• Example domains: Airfares, Automobiles, Books
What is the representative feature of query interfaces? Is the query schema the feature we are looking for?
• Query Interface → Interface Extraction [SIGMOD 2004] → Query Schema
• Example query schema: [Author; {contain}; text], [Title; {contain}; text], …, [Format; {=}; {hardcopy, paperback, …}], …
Query schemas are appropriate representatives of Web databases: distinctive property
[Figure: number of observations against attribute index, one panel each for Airfares, Movies, and Hotels]
• Each domain contains a dominant range of attributes, distinctive from other domains
• Some attributes are observed in only one domain (anchor attributes), for example ISBN for Books and MPAA Rating for Movies
• Source organization becomes the clustering of query schemas
Query schemas can be viewed as categorical data
• Query schemas as transactions:
  S1: {author, title, subject, ISBN}
  S2: {author, title, category, publisher}
  S3: {make, model, price, zip code}
  S4: {manufacturer, model, price}
  S5: {from, to, departure date, return date, number of passengers}
  S6: {departure city, arrival city, number of adults, number of children}
  ……
• Thus, we can apply algorithms for clustering categorical data
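The transaction view above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name `attribute_counts` are assumptions, not part of the original work. Each schema is stored as a set of attribute names, and counting attribute occurrences gives the vocabulary that a categorical clustering algorithm would operate on.

```python
from collections import Counter

# Query schemas from the slide, each stored as a set of attribute names
# (a "transaction"). The dictionary layout is an illustrative choice.
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
    "S5": {"from", "to", "departure date", "return date",
           "number of passengers"},
    "S6": {"departure city", "arrival city", "number of adults",
           "number of children"},
}


def attribute_counts(schemas):
    """Count in how many schemas each attribute appears."""
    counts = Counter()
    for attributes in schemas.values():
        counts.update(attributes)
    return counts


counts = attribute_counts(schemas)
vocabulary = sorted(counts)  # every distinct attribute name
# Attributes that occur in more than one schema.
shared = sorted(a for a, n in counts.items() if n > 1)
```

In this small sample, the attributes that repeat do so only within one domain (author and title among the book schemas, model and price among the automobile schemas), which is the kind of signal the clustering relies on.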
Clustering categorical data: Objective function
• Clustering needs an objective function to evaluate the quality of clusters
• Existing objective functions
  • Likelihood [1998] (Model-based clustering)
  • Context Linkage [ROCK 2000]
  • Entropy [COOLCAT 2002]
• In this paper, we propose a new objective function
  • Model-Differentiation
Model-Differentiation: A new objective function for model-based clustering
• Assumption of model-based clustering: each cluster Ci has a generative model Mi that generates its data with probabilistic behavior
• What is a good clustering result? (our observation)
  • Data in different clusters are very dissimilar
  • Hence, models of different clusters are very dissimilar
  • This gives a new objective function: maximize the dissimilarity of models
• To realize it, we need to answer three questions:
  • How to model the data?
  • How to estimate the model, given data?
  • How to measure the dissimilarity of models?
Modeling: Multinomial distribution
• Each attribute is an independent event
• A schema is generated by a series of samplings from the model M
• Vocabulary of M: author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7), …
• Example schema: {title, author, ISBN}, generated by sampling author (P1), title (P3), and ISBN (P4)
• Probability: P1*P3*P4
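As a minimal sketch of this model, the probability of a schema is the product of the probabilities of its attributes, as in the P1*P3*P4 example. The numeric probabilities below are made up for illustration (the slide gives only the symbols P1 to P7), and the function name `schema_probability` is hypothetical.

```python
import math

# Hypothetical attribute probabilities for one model M; the slide only
# names them P1..P7, so these values are assumptions that sum to 1.
model = {
    "author": 0.3,     # P1
    "publisher": 0.1,  # P2
    "title": 0.2,      # P3
    "ISBN": 0.1,       # P4
    "city": 0.1,       # P5
    "price": 0.1,      # P6
    "model": 0.1,      # P7
}


def schema_probability(model, schema):
    """Probability of a schema as the product of its attribute
    probabilities, treating each attribute as an independent event.
    Attributes missing from the model get probability 0."""
    return math.prod(model.get(attribute, 0.0) for attribute in schema)


# The slide's example schema: P1 * P3 * P4.
p = schema_probability(model, {"title", "author", "ISBN"})
```

With these assumed values the result is 0.3 * 0.2 * 0.1. The product follows the slide and omits any multinomial coefficient for the ordering of the samples.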
Model estimation: Given a set of data, how to estimate its model?
• Maximum likelihood estimation
• Example data:
  S1 = {title, author, ISBN}
  S2 = {author, ISBN, publisher}
  S3 = {author, title, price}
  S4 = {author, title, price}
• Vocabulary: author, title, ISBN, price, publisher
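For a multinomial model, the maximum likelihood estimate of each attribute probability is its relative frequency. The sketch below assumes the estimate is taken over all attribute occurrences in the cluster (occurrences of the attribute divided by total occurrences); the slide does not spell out the formula, and the function name `estimate_model` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def estimate_model(schemas):
    """Maximum likelihood estimate of a multinomial model: each
    attribute's probability is its share of all attribute occurrences."""
    counts = Counter()
    for schema in schemas:
        counts.update(schema)
    total = sum(counts.values())
    # Fractions keep the estimates exact for this small example.
    return {attribute: Fraction(n, total) for attribute, n in counts.items()}


# The four example schemas from the slide.
S1 = {"title", "author", "ISBN"}
S2 = {"author", "ISBN", "publisher"}
S3 = {"author", "title", "price"}
S4 = {"author", "title", "price"}

estimated = estimate_model([S1, S2, S3, S4])
```

Under this assumption there are 12 attribute occurrences in total, so author (4 occurrences) gets 4/12, title 3/12, ISBN and price 2/12 each, and publisher 1/12.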
Measuring the dissimilarity of models: Statistical hypothesis testing
• Multinomial distributions can be tested directly with the χ² test
• Example: S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
• Candidate merges, each comparing the resulting models (probability over attributes):
  1. Combining S1 and S2: M<1,2> vs. M3
  2. Combining S1 and S3: M<1,3> vs. M2
  3. Combining S2 and S3: M<2,3> vs. M1
• This inspires a hierarchical agglomerative clustering (HAC) algorithm
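The three candidate merges can be compared with a hand-written Pearson χ² statistic on a 2 × k table of attribute counts (one row per cluster, one column per attribute). This is a sketch under assumptions: the slide names χ² testing but not the exact test setup, significance level, or degrees of freedom, and the function names here are hypothetical. A larger statistic is read as stronger evidence that the two models differ.

```python
from collections import Counter


def attribute_counts(cluster):
    """Total occurrences of each attribute across a cluster's schemas."""
    counts = Counter()
    for schema in cluster:
        counts.update(schema)
    return counts


def chi_square_statistic(cluster_a, cluster_b):
    """Pearson chi-square statistic for the 2 x k contingency table
    built from the attribute counts of two non-empty clusters."""
    counts_a = attribute_counts(cluster_a)
    counts_b = attribute_counts(cluster_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for attribute in set(counts_a) | set(counts_b):
        column_total = counts_a[attribute] + counts_b[attribute]
        for observed, row_total in ((counts_a[attribute], total_a),
                                    (counts_b[attribute], total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


S1 = {"title", "author", "ISBN"}
S2 = {"author", "ISBN", "price"}
S3 = {"make", "model", "price"}

# Each candidate merge leaves two clusters whose models are compared.
candidates = {
    "merge S1,S2": ([S1, S2], [S3]),
    "merge S1,S3": ([S1, S3], [S2]),
    "merge S2,S3": ([S2, S3], [S1]),
}
scores = {name: chi_square_statistic(a, b)
          for name, (a, b) in candidates.items()}
best = max(scores, key=scores.get)  # most dissimilar resulting models
```

On this toy input, merging S1 and S2 yields the most dissimilar pair of models, which matches the intuition that the two book schemas belong together. Counts this small are far too sparse for the χ² test to be statistically meaningful, so the numbers only illustrate the mechanics.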
Hypothesis testing needs sufficient observations: Pre-clustering to form small clusters
• Distinguishable schema S1: one with anchor attributes
• S1 and S2 should be in the same domain and are thus pre-clustered
• How to decide whether a schema S is “distinguishable”?
[Figure labels: S1, S2, Sup(S1); any Si, Sj in Sup(S1)]
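Post-classification: Handling “loners”
• Pipeline: Pre-clustering → Model clustering, with loners separated out and classified afterwards
• Loners: clusters too small for the χ² test after pre-clustering
• Loners are assigned to clusters by Naïve Bayesian classification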
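A minimal sketch of the post-classification step: each final cluster is treated as a class, and a loner schema is assigned to the class with the highest Naïve Bayes score. The slide says only "Naïve Bayesian", so the add-one (Laplace) smoothing, the priors proportional to cluster size, and the names `train` and `classify` are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(clusters):
    """Build per-cluster attribute counts and log priors.
    `clusters` maps a cluster name to a list of schemas (sets)."""
    vocabulary = {a for schemas in clusters.values()
                  for schema in schemas for a in schema}
    total_schemas = sum(len(schemas) for schemas in clusters.values())
    models = {}
    for name, schemas in clusters.items():
        counts = Counter(a for schema in schemas for a in schema)
        models[name] = {
            # Assumed prior: the cluster's share of all schemas.
            "log_prior": math.log(len(schemas) / total_schemas),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return vocabulary, models


def classify(loner, vocabulary, models):
    """Return the cluster name with the highest Naive Bayes log score,
    using add-one smoothing over the vocabulary (an assumption)."""
    def log_score(model):
        denominator = model["total"] + len(vocabulary)
        return model["log_prior"] + sum(
            math.log((model["counts"][a] + 1) / denominator) for a in loner)
    return max(models, key=lambda name: log_score(models[name]))


# Toy clusters built from schemas that appear earlier in the deck.
clusters = {
    "Books": [{"author", "title", "ISBN"},
              {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"},
                    {"manufacturer", "model", "price"}],
}
vocabulary, models = train(clusters)
label = classify({"author", "title"}, vocabulary, models)
```

Working in log space avoids underflow when schemas have many attributes, and smoothing keeps an unseen attribute from zeroing out a cluster's score.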
Experiments
• Data
• Questions to answer:
  • Can schema clustering effectively organize Web databases?
  • Can it build a domain hierarchy correctly?
We also try existing objective functions
• Three existing objective functions
  • Likelihood: maximize likelihood
  • Entropy: minimize the expected entropy of the clusters
  • Context Linkage: minimize cross links
• To be fair, we keep pre-clustering and post-classification, and change only the measure used in the clustering step
Effectiveness of Clustering
• 8 domains, 8 clusters
• Most Web databases are clustered correctly
• Quantitative analysis: Conditional Entropy (the smaller, the better)
  • Model-Differentiation: 0.32
  • Likelihood: 0.42
  • Entropy: 0.38
  • Context Linkage: 0.61
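Conditional entropy can be computed as the size-weighted average, over clusters, of the entropy of the true domain labels inside each cluster. The sketch below assumes that standard definition with base-2 logarithms; the slide reports the scores but does not state the formula or log base, and the function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(domains, clusters):
    """H(domain | cluster): for each cluster, the entropy of the true
    domain labels it contains, weighted by the cluster's relative size.
    0 means every cluster holds schemas from a single domain."""
    members_by_cluster = defaultdict(list)
    for domain, cluster in zip(domains, clusters):
        members_by_cluster[cluster].append(domain)
    n = len(domains)
    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        entropy = -sum((k / size) * math.log2(k / size)
                       for k in Counter(members).values())
        total += (size / n) * entropy
    return total


# Made-up labels for illustration, not the experimental data.
true_domains = ["Books", "Books", "Movies", "Movies"]
perfect = conditional_entropy(true_domains, [0, 0, 1, 1])
mixed = conditional_entropy(true_domains, [0, 0, 0, 0])
```

A clustering that separates the two domains scores 0, while lumping everything into one cluster scores 1 bit here, which is why smaller values indicate better clustering.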
To build a domain hierarchy
• After reaching 8 clusters, continue running the HAC algorithm to merge them
• The result is consistent with common sense: close concepts are merged first
Conclusions
• Cluster Web databases using their query schemas
  • First work on clustering Web databases, not pages
  • Query schemas are good representatives
  • Essentially a problem of clustering categorical data
• A new objective function: Model-Differentiation
  • Realized by statistical hypothesis testing
  • Derives a different similarity measure for HAC