Building Ranked Mashups of Unstructured Sources with Uncertain Information
Mohamed Soliman, Ihab Ilyas, Mina Saleeb
University of Waterloo
http://prefex.cs.uwaterloo.ca/MashRank/

Mashups
• Situational applications that join data sources of different formats
• Web data sources can be huge databases behind search interfaces
• Ranking is an effective data exploration tool
• Data are usually unstructured
• Uncertain Web data (e.g., inexact/missing values) are common
Mohamed A. Soliman - VLDB2010

Motivation
• Finding a good hotel!
[Figure: hotel prices are extracted from www.vianet.travel and hotel ratings from www.tvtrip.com; the extracted records are matched and then ranked.]

Motivation
• Challenges for automating the extract-match-rank process
  • Data Extraction
    • Web sources often lack schema and attribute annotations
  • Interleaving Extraction with Processing
    • Need to avoid exhaustive extraction to leverage the early-out nature of ranking
  • Handling Uncertainty
    • Web data can contain missing/inexact values due to privacy concerns, data integration, presentation formats, etc.
    • Uncertainty can impact queried/ranked attributes

Mashing-up Sources with Uncertain Scores
[Figure: a Rank Join Query over www.vianet.travel and www.tvtrip.com, with one input attribute marked as an Uncertain Score.]

Agenda
• MashRank Architecture
• Information Extraction
• Mashup Planning
• Mashup Processing
• Handling Uncertain Scores
• Uncertain Rank Join
• Probabilistic Ranking
• Experiments
• Summary and Conclusions

MashRank Architecture
[Figure: three layers. Information Extraction: Grabber Threads fetch raw data from the Web (HTML, AJAX, WebAPI) and from Relational Sources (DB, JDBC) into Local Buffers as asynchronous updates (raw data); Content Wrappers turn the raw data into tuples and push them into Synchronized Buffers as asynchronous updates (tuples). Mashup Planning: the User builds a data flow in the Mashup Editor, and the Planner produces a Mashup Plan. Mashup Execution: the Executor runs the plan and returns incremental output.]

Information Extraction
• addExample: training examples are marked in the page source
  <div class="search-card" id="location_id_24343">
  <h2> <a href="/visit/24343">Black Jack Point</a> </h2>
  <div class="price-range">Priced from $125 to $149 </div>
  ….
  <div class="search-card" id="location_id_26487">
  <h2> <a href="/visit/26487">Bayview Studio Matapouri</a> </h2>
  <div class="price-range">Priced from $90 </div>
• learn: computes the Left Wrapper and Right Wrapper that delimit the examples (LR Wrapper [Kushmerick et al, AAAI 2000])

Information Extraction
• extract: returns the text enclosed between the learned left and right wrappers

Mashup Planning
• Rank Join assumes sorted inputs
  • If an index on a ranking attribute is available, rank join can leverage cheap sorted access to the input
  • Otherwise, we may need to apply an early sort
• Many Web sources provide sorted access methods (similar to indexes in relational databases)
  • www.vianet.travel/list?region=auckland&sort by=highest-price
• Offloading sort to the source side
  • Pipeline sorted results while extraction is still in progress
  • Limit extraction to a subset of pages by upper-bounding scores in non-extracted pages

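
The last point, limiting extraction by upper-bounding scores in non-extracted pages, can be sketched as follows. This is a minimal illustration under stated assumptions, not MashRank's implementation: the function name, the `(name, price_score, rating)` record shape, the additive ranking function, and the `max_rating` cap are all hypothetical. Pages are assumed non-empty and globally sorted by `price_score` in descending order by the source.

```python
import heapq

def top_k_with_early_out(pages, k, max_rating):
    """Return (top-k results, number of pages extracted).

    pages: sequence of non-empty pages, each a list of
    (name, price_score, rating) tuples, sorted by price_score descending
    across pages (the sort is assumed to be done by the source).
    """
    best = []          # min-heap holding the k best (total_score, name) so far
    pages_read = 0
    for page in pages:
        pages_read += 1
        for name, price_score, rating in page:
            heapq.heappush(best, (price_score + rating, name))
            if len(best) > k:
                heapq.heappop(best)   # drop the current worst candidate
        # No record on a later page can score above this bound.
        bound = page[-1][1] + max_rating
        if len(best) == k and best[0][0] >= bound:
            break                     # remaining pages cannot change the top-k
    return sorted(best, reverse=True), pages_read

pages = [
    [("A", 100, 4), ("B", 95, 1)],
    [("C", 90, 5), ("D", 80, 2)],
    [("E", 70, 5), ("F", 60, 5)],
]
result, pages_read = top_k_with_early_out(pages, k=2, max_rating=5)
```

With this made-up data the third page is never extracted, because the bound for unseen records falls below the second-best total already found.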
Mashup Planning
• For each source, the user can specify a sort key representing the order of extracted results from the source
• If the sort key is part of the mashup ranking function, sorting is offloaded to the source side
[Figure: Mashup Data Flow (created by user): Vianet and TVTrip feed a Join labeled {price_up, rating}. Mashup Physical Plan (created by MashRank): Scan and Extractor operators over Sync Buffer1 and Sync Buffer2 feed an NL-RankJoin; a Sort operator is kept on one input and omitted on the other ("Sort is avoided here").]

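
The sort-offloading rule can be written as a small planning helper. This is only a sketch: the function name and the operator strings are hypothetical labels borrowed from the figure, and the real planner is not described in enough detail on the slide to reproduce.

```python
def plan_source_input(source, sort_key, ranking_attrs):
    """Build the operator chain for one mashup input.

    sort_key: the order the source itself delivers (None if unspecified).
    ranking_attrs: attributes used by the mashup ranking function.
    """
    ops = [f"Scan({source})", "Extractor"]
    if sort_key not in ranking_attrs:
        # The source order does not help ranking, so sort locally.
        ops.append("Sort")
    return ops

ranking = {"price_up", "rating"}
sorted_by_source = plan_source_input("Vianet", "price_up", ranking)
sorted_locally = plan_source_input("TVTrip", None, ranking)
```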
Mashup Processing
• Ranking+Uncertainty Aware Plan (input: raw tuples from the Uncertain Database)
  • Rank-aware query planning
  • Uncertain rank join for early pruning
• Probabilistic Ranking (input: non-pruned query results, based on scores)
  • Monte-Carlo simulation for ordering join results
  • Join-aware sampling
  • Incremental ranking

Modeling Uncertain Scores
• The score of a tuple t_i is a PDF f_i defined on an interval [lo_i, up_i]
• Scoring PDFs of base tuples are assumed independent
• Probabilistic Partial Order Model [Soliman and Ilyas, ICDE 2009]
  • Non-intersecting intervals are totally ordered
  • Intersecting intervals define a Probabilistic Dominance relationship
[Figure: partial order over tuples t1 … t6 with score intervals; labels in extracted order: [7,7] t5, [4,8] t2, t1 [6,6], [3,5] t4 t3 [2,3.5], t6 [1,1].]

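
The two cases of the partial order model can be illustrated with a small classifier over score intervals. This is a sketch only; the function name and return strings are hypothetical, and touching intervals are treated as ordered (lo >= up), which is an assumption about the boundary case.

```python
def relation(a, b):
    """Classify two score intervals a = (lo_a, up_a) and b = (lo_b, up_b)."""
    (lo_a, up_a), (lo_b, up_b) = a, b
    if lo_a >= up_b:
        return "a dominates b"      # a's worst score is at least b's best
    if lo_b >= up_a:
        return "b dominates a"
    return "probabilistic"          # intervals intersect: order depends on the PDFs

ordered = relation((7, 7), (6, 6))
uncertain = relation((4, 8), (6, 6))
reverse = relation((1, 1), (2, 3.5))
```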
URankJoin Query
• Uncertain score dominance
  • t_i > t_j iff lo_i >= up_j
• Compute a total order ω* of the join results dominated by fewer than k join results
  • The total order ω* is defined based on probabilistic ranking semantics
• Example: URankJoin({R,S}, (R.a1+S.a1)/2, 2), where the second argument is the scoring function F and the third is k
[Figure: relations R (r1 … r4) and S (s1 … s3), each with columns a1 and jk; join results (r1,s2), (r3,s1), (r2,s2), (r4,s3); panels "Top-2 Join Results" and "Possible Orderings Space".]

URankJoin Query
• Basic Idea
  • Prune all join results whose up-scores < the kth largest lo-score
  • Run two rank join instances on lo and up scores
  • Pipeline join results whose up-scores >= the score upper bound of the lo-score rank join
  • Stop after reporting k results from the lo-score rank join
[Figure: two rank join instances over R and S, RJlo (Lo joins) and RJup (Up joins), with k = 1 and F = (R.a1+S.a1)/2; the first frame shows r2 and s2 with bounds U=infinity and U=.65.]

URankJoin Query (animation frames)
[Figure: six further frames repeat the Basic Idea bullets of the URankJoin Query slide while RJlo and RJup read more tuples (r1 … r4, s1 … s4) from R and S, with k = 1 and F = (R.a1+S.a1)/2. The displayed bound pairs U are, frame by frame: .75/.65, .65/.75, .65/.7, .65/.55, .5/.55, .5/.45. The last frame is marked "Stop !".]

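
The pruning criterion from the Basic Idea can be sketched in isolation. The version below is deliberately exhaustive: it materializes every join result and then prunes, so it shows only which results survive, not the pipelined two-instance rank join that stops early. The function name, the tuple layout `(id, join_key, lo, up)`, and the data are hypothetical; the scoring function is the average used on the slides.

```python
def urank_join_prune(R, S, k):
    """Return join results whose up-score >= the kth largest lo-score.

    R, S: lists of (id, join_key, lo, up) with uncertain scores [lo, up].
    Each result is ((r_id, s_id), lo_score, up_score) under F = average.
    """
    results = []
    for rid, rjk, rlo, rup in R:
        for sid, sjk, slo, sup in S:
            if rjk == sjk:
                results.append(((rid, sid), (rlo + slo) / 2, (rup + sup) / 2))
    if len(results) <= k:
        return results
    kth_lo = sorted((lo for _, lo, _ in results), reverse=True)[k - 1]
    # A result with up < kth_lo is dominated by at least k other results.
    return [r for r in results if r[2] >= kth_lo]

# Made-up example data, not the values from the slides.
R = [("r1", "x", 0.5, 0.9), ("r2", "x", 0.7, 0.7), ("r3", "y", 0.1, 0.2)]
S = [("s1", "y", 0.2, 0.3), ("s2", "x", 0.6, 0.6)]
survivors = [ids for ids, _, _ in urank_join_prune(R, S, k=1)]
```

Here (r3, s1) is pruned because even its best possible score is below the largest lo-score, while (r1, s2) survives because its interval overlaps the top candidate's.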
Probabilistic Ranking
• Monte-Carlo Simulation
  • Transform the space of possible orderings into a space of possible score combinations
    • Can be easily sampled at random
  • Sample from each [lo_i, up_i] a score value biased by f_i(x)
  • Allows estimating the probabilities of having a tuple t in a rank range i … j
• Can't be applied for joins!
• $\Pr(t \text{ at rank } i \ldots j) = \mathrm{Avg}_x\big(\prod_i f_i(x)\big) \times \mathrm{Vol}(\Gamma)$, where Γ stands for the region drawn on the slide: the space of score combinations with t at rank i … j, a subset of the space of all possible score combinations

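
A plain frequency-based Monte-Carlo estimate of a rank-range probability might look like the sketch below. It assumes uniform PDFs on each interval and independent base tuples, and it samples score combinations directly from the PDFs rather than using the volume-based estimator on the slide. The function name, parameters, and data are hypothetical.

```python
import random

def rank_range_probability(intervals, target, i, j, samples=20000, seed=0):
    """Estimate Pr(target is ranked between i and j), ranks starting at 1.

    intervals: dict mapping tuple id -> (lo, up); scores are drawn
    independently and uniformly from each interval (an assumption).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        # Rank = 1 + number of tuples scoring strictly higher than the target.
        rank = 1 + sum(1 for t, s in scores.items()
                       if t != target and s > scores[target])
        if i <= rank <= j:
            hits += 1
    return hits / samples

intervals = {"a": (4, 8), "b": (6, 6), "c": (1, 1)}
p_a_first = rank_range_probability(intervals, "a", 1, 1)
p_c_last = rank_range_probability(intervals, "c", 3, 3)
```

With these made-up intervals, "a" beats the fixed score 6 about half the time, and "c" is always last.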
Probabilistic Ranking
• Compute the probability of a ranking of join results
• Join result scores may be correlated (can't be sampled independently)
• Join-aware sampling
[Figure: relations R (r1 … r4) and S (s1 … s3), each with columns a1 and jk.]

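
One way to respect the correlation between join results is to sample each base tuple's score once per round and derive all join scores from those shared draws. The sketch below assumes uniform PDFs and the averaging score function from the slides; the function names and data are hypothetical, and this is not claimed to be the sampling scheme MashRank uses.

```python
import random

def sample_join_scores(base, joins, rng):
    """Draw one score per base tuple, then score every join result from them."""
    drawn = {t: rng.uniform(lo, up) for t, (lo, up) in base.items()}
    return {(r, s): (drawn[r] + drawn[s]) / 2 for r, s in joins}

def prob_a_beats_b(base, joins, a, b, samples=5000, seed=0):
    """Estimate Pr(score(a) > score(b)) for two join results a and b."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        scores = sample_join_scores(base, joins, rng)
        wins += scores[a] > scores[b]
    return wins / samples

# Both join results share s2, so s2's uncertainty cancels out.
base = {"r1": (0.8, 0.8), "r2": (0.6, 0.6), "s2": (0.2, 0.9)}
joins = [("r1", "s2"), ("r2", "s2")]
p = prob_a_beats_b(base, joins, ("r1", "s2"), ("r2", "s2"))
```

Because both results reuse the same draw for s2, (r1, s2) outranks (r2, s2) in every sample; sampling the two join scores independently from their own intervals would wrongly report a probability below 1.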
Probabilistic Ranking
• In many Web application scenarios, users only inspect a small prefix of the ranked answers list
  • Computing a full ranking of all answers can be overkill
• Incremental Ranking
  • Computing an approximate ranking that improves in accuracy as more results are discovered
  • We make use of the incremental output of intermediate results to compute an approximation of ω*

Experiments
• Data Sources

Experiments
• Mashup Examples
M1:
  SELECT * FROM Vianet v, TvTrip t, Menus m
  WHERE v.Hotel ≈ t.Hotel AND v.City=m.City
  ORDER BY 500-v.Price+ 100* (t.Rating+m.Rating)
  LIMIT k
M2:
  SELECT * FROM Epinion e, Flickr f
  WHERE e.Brand contains f.Brand
  ORDER BY e.Price+ (100-f.Rank)
  LIMIT k
M3:
  SELECT * FROM Pubs p, GScholar g
  WHERE p.PaperTitle ≈ g.Paper
  ORDER BY nCitations
  LIMIT k
M4:
  SELECT * FROM Apartments a, Restaurants r
  WHERE a.Zip = r.Zip
  ORDER BY a.Price
  LIMIT k

Summary and Conclusions
• We address integrating information extraction with joining and ranking under uncertainty
  • A system architecture for managing and processing mashups of live unstructured sources
  • Formulating the problem of rank join under uncertainty and providing an implementation of a ranking- and uncertainty-aware join operator
  • An infrastructure for supporting probabilistic ranking queries using MC simulation
• The MashRank prototype is accessible at http://prefex.cs.uwaterloo.ca/MashRank/

MashRank
• A mashup authoring and processing system focusing on integrating probabilistic ranking with information extraction
• Leveraging early-out capabilities of ranking to conduct extraction on demand, guided by rank-aware processing
• Formulating rank join under uncertainty, and extending rank join methods to handle uncertain scores
• A new infrastructure for processing uncertain rank join queries based on Monte-Carlo simulation

Information Extraction
• MashRank uses wrapper induction techniques to transform unstructured HTML data into the relational model
• An interface to a generic wrapper induction algorithm with three main functions
  • addExample: adds a new training example (e.g., a text node representing the value of some data attribute)
  • learn: processes training examples to compute an extraction rule
  • extract: applies the learned extraction rule to a given page, and returns a set of extracted records

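
The three-function interface can be illustrated with a toy single-attribute LR wrapper. This is a simplified sketch, not MashRank's code: the class and method names are hypothetical Python renderings of addExample, learn, and extract; the learning rule (longest common text before and after the examples) is a simplification of LR wrapper induction; and the page string is modeled on the price snippets shown earlier in the slides.

```python
import os

class LRWrapper:
    """Toy LR wrapper: learns one left and one right delimiter."""

    def __init__(self):
        self.examples = []   # (page, start, end) offsets of example values
        self.left = None
        self.right = None

    def add_example(self, page, value):
        # Assumes the first occurrence of value in page is the example.
        start = page.index(value)
        self.examples.append((page, start, start + len(value)))

    def learn(self):
        # Left delimiter: longest common suffix of the text before each example.
        before = [page[:s][::-1] for page, s, _ in self.examples]
        # Right delimiter: longest common prefix of the text after each example.
        after = [page[e:] for page, _, e in self.examples]
        self.left = os.path.commonprefix(before)[::-1]
        self.right = os.path.commonprefix(after)
        if not self.left or not self.right:
            raise ValueError("examples share no common delimiters")

    def extract(self, page):
        records, pos = [], 0
        while True:
            start = page.find(self.left, pos)
            if start < 0:
                break
            start += len(self.left)
            end = page.find(self.right, start)
            if end < 0:
                break
            records.append(page[start:end])
            pos = end + len(self.right)
        return records

train = ('<div class="price-range">Priced from $125 to $149 </div>'
         '<div class="price-range">Priced from $90 </div>')
wrapper = LRWrapper()
wrapper.add_example(train, "$125 to $149")
wrapper.add_example(train, "$90")
wrapper.learn()
prices = wrapper.extract('<div class="price-range">Priced from $200 </div>')
```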
Mashup Planning
• The planner starts by labeling each node in the mashup data flow with its corresponding ranking attributes
• Labeling starts with the leaves (data sources) and moves up in the data flow tree
• The union of the ranking attributes of all children of a node p gives the ranking attributes of p
• Example
  • Join Vianet and TVTrip sources based on hotel name
  • Order results by price+rating
[Figure: Mashup Data Flow (created by user): Vianet {price} and TVTrip {rating} feed a Join labeled {price, rating}. Mashup Physical Plan (created by MashRank): over Sync Buffer1 and Sync Buffer2, each input runs Scan, Extractor, and Sort, feeding an NL-RankJoin.]

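
The bottom-up labeling step can be sketched as a short recursive pass over the data flow tree. The `Node` class and `label` function are hypothetical names introduced for illustration; only the rule itself (a node's ranking attributes are the union of its children's) comes from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    ranking_attrs: set = field(default_factory=set)   # preset on leaves only
    children: list = field(default_factory=list)

def label(node):
    """Label node and its descendants; return node's ranking attributes."""
    if node.children:
        # Inner node: union of the labels of all children.
        node.ranking_attrs = set().union(*(label(c) for c in node.children))
    return node.ranking_attrs

vianet = Node("Vianet", {"price"})
tvtrip = Node("TVTrip", {"rating"})
join = Node("Join", children=[vianet, tvtrip])
attrs = label(join)
```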
Handling Uncertain Scores
• Probabilistic Ranking Semantics
  • Uncertainty gives rise to multiple possible semantics of ranking queries
    • Expected scores
    • Expected ranks
    • Most probable ranking
[Table: columns ID, Score, E[t], ER[t].]

Probabilistic Ranking
• Incremental Ranking
  • For each top rank r, compute bounds on Pr(t_i, r) for each join result t_i produced by a URankJoin plan
  • Use the bounds on Pr(t_i, r) to approximate a prefix of ω*
  • Bounds are progressively tightened as more tuples are retrieved