410 likes | 537 Vues
This paper presents a hierarchical, cost-sensitive framework for acquiring web resources and solving record matching problems. We motivate our approach with the challenges posed by existing methods, which often overlook the costs associated with web resource acquisition. Our framework utilizes a resource dependency graph to optimize the selection of records, balancing search costs and utility. We discuss solutions for record linkage, evaluation methods, and address common issues in web-based record matching, ultimately aiming to provide more efficient resource acquisition strategies.
E N D
A Framework for Hierarchical Cost-sensitive Web Resource Acquisition Tan Yee Fan WING Group Meeting 2009 November 10
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion
Contents • Background and Motivation • Record linkage • Using web-based resources • Issues with existing work • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion
Record Linkage A B • Input • Two lists of items, A and B • Output • For each item pair (a, b) in A× B, are a and b a match? KDD PAKDD DKMD KDID Knowledge Discovery and Data Mining Pacific-Asia Conference on Knowledge Discovery and Data Mining Research Issues on Data Mining and Knowledge Discovery Knowledge Discovery in Inductive Databases ? Input information is sometimes incomplete or insufficientMay be useful to acquire additional information from the web
Information from Search Engine Hit count Title Snippet URL → Web page Possible queries “jcdl” “jcdl” “joint conference on digital libraries” “jcdl” OR “joint conference on digital libraries” jcdl digital libraries …
Using Search Engine Results • Inverse host frequency • Counts from snippets • Number of snippets containing a particular term • Web page similarity • TF-IDF cosine similarity between two pages • Test instance for classification • Frequency counts • Similarity metrics • …
Web-based Record Matching • Matching entities to concepts(Cimiano et al., 2005) • Semantic similarity between words(Bollegala et al., 2007) • Word sense disambiguation(Mihalcea and Moldovan, 1999) • Machine transliteration(Ohand and Isahara, 2008) • Record linkage(Elmacioglu et al., 2007) • Author name disambiguation(Tan et al., 2006) • Web people search(Kalashnikov et al., 2008)
Web-based Record Matching • Most existing work use this scheme • Query search engine with queries of one or more of these types: a, b, and a b • Optionally, append predefined terms or tokens t, e.g.,a t • Compute similarity metrics • Optionally, construct test instances for classification
Cost of Acquiring Web Resources • Acquiring web resources are slow and may incur other access costs • For a dataset of 1000 records, querying all records individually takes a day or more, querying them pairwise takes even longer • For large datasets, not feasible to query and download everything • Existing work sidesteps or ignores this issue • Some discussion in Tan et al. (2008)
Hierarchical Dependencies • Basically ignored in existing work! • Same resource may be used to extract different types of attribute values • Different instances can share attribute values Search engine results Web pages Term frequencies
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Resource dependency graph • Resource acquisition problem • Observations for record matching problems • Solution for Record Matching Problems • Evaluation • Conclusion
search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b search(a1b1) search(b1) search(a1) hitcount(a1) hitcount(a1b1) hitcount(b1) dice(a1, b1) search(a1b2) search(a2b1) hitcount(a2b1) hitcount(a1b2) dice(a1, b2) dice(a2, b1) search(a2) search(a2b2) search(b2) hitcount(a2b2) hitcount(b2) hitcount(a2) dice(a2, b2) Resource Dependency Graph • Directed acyclic graph G = (V, E) • Vertex types T with |T| = 7
Resource Acquisition Graph • Feasible vertex set V’ V: • V’ can be acquired as-is without violating acquisition dependencies • If v V’, then for all u → v E, we require u V’ • Set of feasible subsets of vertices F(G) 2V: • All sets of feasible vertices
search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b Cost and Utility • Cost • search(.) cost 10 each • All other cost 1 each • Utility • Positive utility for vertices that are attribute values in test instances • Zero utility otherwise
Resource Acquisition Problem • Given • Resource dependency graph G = (V, E) • Acquisition cost function cost: V → R+ {0} • Utility function utility: F(G) → R • Vertex type function type: V → T • Budget budget • Select V’ F(G) with cost(V’) ≤ budget to maximize obj(V’) = utility(V’) – cost(V’)
Applications • Easy to construct resource dependency graphs • When web pages can be downloaded • When TF or TF-IDF cosine similarity is involved • Blocking is applied • …
search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(a) a b dice(a, b) instance xa, b Observations for Record Matching • Structure of resource acquisition graph • Depth of 2 to 4 is typical • Vertex out-degree varies widelye.g., from 1 to max(|A|, |B|)
search(b) search(a) search(ab) search(ab) search(b) search(a) hitcount(ab) hitcount(b) hitcount(ab) hitcount(b) hitcount(a) hitcount(a) dice(a, b) a b a b instance xa, b dice(a, b) instance xa, b Observations for Record Matching • Usually possible to flatten graph to two levels • Root level: expensive search engine results and web page downloads • Leaf level: cheap attribute values
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Application of Tabu search • More intelligent legal moves • Propagation of utility values • Utility function when classifier is SVM • Evaluation • Conclusion
Search Algorithm • State space • F(G), set of feasible subsets of vertices • Search process • Current state denoted by V’ F(G) • Starting state V’ = φ • Legal moves add or remove vertices from V’ to reach next state • Objective • Maximize obj(V’) = utility(V’) – cost(V’)
search(b) search(a) search(ab) hitcount(ab) hitcount(b) hitcount(a) dice(a, b) a b instance xa, b Issues • Heuristic search required • Not feasible to enumerate all possible V’ to acquire • Many local maxima • V’ = φ is a local maxima • Simple greedy search does not work • Easy for states to be revisited • No guidance as to which root vertices to acquire • Each search(.) has the same objective value
Solutions • Application of Tabu search • More intelligent legal moves • Propagation of utility values
Tabu search • Simple hill climbing but with Tabu list • Moves in Tabu list are prohibited until expiry of prohibition, this avoids repeated visiting of states • Except when new objective value exceeds current known best objective value • When a move is made, moves that reverse the effect of the move are added to Tabu list • Expiration set by current iteration plus Tabu tenure • Stop when best objective value never improve for some number of iterations
Legal Moves • Let D(V’) be the set of children from any vertex in V’ that are not in V’ • If V’ is empty, then set D(V’) = V • Legal moves • Add(v): Add v from D(V’) and its ancestors into V’ • Remove(v): Remove v and its descendants from V’ • AddType(t): Add a maximal set of vertices (within acquisition budget) of type t from D(V’) into V’
Propagated Utility u v v
Surrogate Utility and Objective • Current state V’ and we consider adding V’’ • Surrogate utility function • Surrogate objective function • surr-obj(V’, V’’) = surr-utility(V’, V’’) – cost(V’’)
Utility Function for SVM • Decompose utility(V’) into sum ofutility(x, A(V’, x)) over test instances x • Compute utility(x, A’) as expected decrease in misclassification cost for acquiring attribute values A’ • Apply earlier work on cost-sensitive attribute value acquisition to compute utility(x, A’)
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Experimental setup and datasets • Results • Conclusion
search(sf) search(lf) search(sflf) hitcount(sflf) hitcount(lf) hitcount(sf) cosine(sf, lf) sf lf snippetcount(sf → lf) snippetcount(sf ← lf) instance xsf, lf Evaluation Task • Linkage of short forms (SF) to long forms (LF) • Evaluation metric • Total cost of acquisitions and misclassifications cost = 10 cost = 1
Datasets • Human genomes • 307 short forms × 307 long forms • Training: 154 × 154 pairs • 23716 training instances • Testing: 153 × 153 pairs • 23409 test instances • 117657 vertices, 140760 edges • Misclassification costs • 50 false positive, 500 false negative • DBLP venues • 920 short forms ×906 long forms • Training: 460 × 453 pairs • Randomly selected 100000 training instances • Testing: 460 × 453 pairs • 213444 test instances • 1069068 vertices, 1281588 edges • Misclassification costs • 50 false positive, 1500 false negative
Algorithms • Baselines • Random • Least cost • Highest utility • Best cost-utility ratio • Best type • Our algorithm • Manual (with access to solution set)
Evaluation Methodology • Start with graph with no vertices acquired • For each iteration • Run acquisition algorithm with budget 100 • Record data point • Total cost of acquisitions • Total cost of misclassifications • Total cost of acquisitions and misclassifications • Run for 100 iterations
Results • GENOMES
Results • DBLP
Results • Almost 50% improvement in average difference in misclassification cost!
Contents • Background and Motivation • Cost-sensitive Record Acquisition Framework • Solution for Record Matching Problems • Evaluation • Conclusion • Contributions • Future work
Contributions • Framework for performing cost-sensitive acquisitions of resources with hierarchical dependencies • Model using resource dependency graph • Acquisition algorithm for record matching problems, given cost and utility function • Utility function when classifier is SVM • Evaluation demonstrate effectiveness on record matching problems when using SVM
Future Work • Apply resource acquisition framework to clustering problems