
Building Ranked Mashups of Unstructured Sources with Uncertain Information



Presentation Transcript


  1. Building Ranked Mashups of Unstructured Sources with Uncertain Information • Mohamed Soliman, Ihab Ilyas, Mina Saleeb • University of Waterloo • http://prefex.cs.uwaterloo.ca/MashRank/

  2. Mashups • Situational applications that join data sources of different formats • Web data sources can be huge databases behind search interfaces • Ranking is an effective data exploration tool • Data are usually unstructured • Uncertain Web data (e.g., inexact/missing values) are common Mohamed A. Soliman - VLDB2010

  3. Motivation • Finding a good hotel! [Figure: hotel prices are extracted from www.vianet.travel and hotel ratings from www.tvtrip.com; the extracted records are matched, then ranked]

  4. Motivation • Challenges for automating the extract-match-rank process • Data Extraction: Web sources often lack schema and attribute annotations • Interleaving Extraction with Processing: need to avoid exhaustive extraction to leverage the early-out nature of ranking • Handling Uncertainty: Web data can contain missing/inexact values due to privacy concerns, data integration, presentation formats, etc.; uncertainty can impact queried/ranked attributes

  5. Mashing-up Sources with Uncertain Scores [Figure: a rank join query over www.vianet.travel and www.tvtrip.com, with an uncertain score marked]

  6. Agenda • MashRank Architecture • Information Extraction • Mashup Planning • Mashup Processing • Handling Uncertain Scores • Uncertain Rank Join • Probabilistic Ranking • Experiments • Summary and Conclusions

  7. MashRank Architecture [Figure: architecture diagram with these labeled parts: Mashup Planning (User, Mashup Editor, Planner, Mashup Plan); Mashup Execution (Executor, incremental output, data flow over Synchronized Buffers receiving asynchronous updates of tuples); Information Extraction (Content Wrappers for DB, Web API and HTML, receiving asynchronous updates of raw data); Grabber Threads with Local Buffers reaching relational sources via JDBC and the Web via AJAX]

  8. Information Extraction • addExample: <div class="search-card" id="location_id_24343"> <h2> <a href="/visit/24343">Black Jack Point</a> </h2> <div class="price-range">Priced from $125 to $149 </div> …. <div class="search-card" id="location_id_26487"> <h2> <a href="/visit/26487">Bayview Studio Matapouri</a> </h2> <div class="price-range">Priced from $90 </div> • learn: a Left Wrapper and a Right Wrapper, i.e. an LR Wrapper [Kushmerick et al, AAAI 2000]

  9. Information Extraction • extract text enclosed between the learned left and right wrappers
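
The addExample / learn / extract cycle of slides 8 and 9 can be sketched in Python as follows. This is a minimal sketch, not MashRank's implementation: it assumes a simplified LR wrapper whose left delimiter is the longest common suffix of the text before each example and whose right delimiter is the longest common prefix of the text after it. The class name `LRWrapper` and its method names are hypothetical.

```python
import os


class LRWrapper:
    """Toy left/right-delimiter wrapper (hypothetical, for illustration only)."""

    def __init__(self):
        self.examples = []  # (page, start, end) spans of labelled values
        self.left = ""
        self.right = ""

    def add_example(self, page, value):
        # Assumption: the labelled value is the first occurrence in the page.
        start = page.index(value)
        self.examples.append((page, start, start + len(value)))

    def learn(self):
        # Left delimiter: longest common suffix of the text before each example.
        befores = [page[:s][::-1] for page, s, _ in self.examples]
        self.left = os.path.commonprefix(befores)[::-1]
        # Right delimiter: longest common prefix of the text after each example.
        afters = [page[e:] for page, _, e in self.examples]
        self.right = os.path.commonprefix(afters)
        if not self.left or not self.right:
            raise ValueError("examples share no non-empty left/right context")

    def extract(self, page):
        if not (self.left and self.right):
            raise ValueError("call learn() before extract()")
        out, pos = [], 0
        while True:
            i = page.find(self.left, pos)
            if i < 0:
                return out
            i += len(self.left)
            j = page.find(self.right, i)
            if j < 0:
                return out
            out.append(page[i:j])  # text enclosed between the two delimiters
            pos = j + len(self.right)
```

Two labelled prices from the slide 8 snippet are enough for this toy learner to recover a delimiter pair around the price text; a real wrapper induction algorithm handles multiple attributes and noisier pages.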

  10. Mashup Planning • Rank Join assumes sorted inputs • If an index on a ranking attribute is available, rank join can leverage cheap sorted access to the input • Otherwise, we may need to apply an early sort • Many Web sources provide sorted access methods (similar to indexes in relational databases), e.g., www.vianet.travel/list?region=auckland&sort by=highest-price • Offloading sort to the source side • Pipeline sorted results while extraction is still in progress • Limit extraction to a subset of pages by upper-bounding scores in non-extracted pages
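
The last bullet, limiting extraction by upper-bounding scores in non-extracted pages, can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: a single source, rows delivered in descending order of one sorted attribute, and a score equal to that attribute plus an unsorted bonus bounded by `max_bonus`. The function name `top_k_with_early_stop` is hypothetical.

```python
import heapq


def top_k_with_early_stop(pages, k, max_bonus):
    """Return (top-k list, number of pages read).

    pages: iterable of lists of (name, sorted_attr, bonus) rows, delivered in
    descending order of sorted_attr (source-side sort); bonus lies in
    [0, max_bonus]. Assumed score: sorted_attr + bonus.
    """
    best = []  # min-heap holding the current top k as (score, name)
    pages_read = 0
    for rows in pages:
        pages_read += 1
        if not rows:
            continue
        for name, sorted_attr, bonus in rows:
            heapq.heappush(best, (sorted_attr + bonus, name))
            if len(best) > k:
                heapq.heappop(best)  # drop the lowest-scoring candidate
        # No row on a later page can score above this bound.
        bound = rows[-1][1] + max_bonus
        if len(best) == k and best[0][0] >= bound:
            break  # remaining pages need not be extracted
    return sorted(best, reverse=True), pages_read
```

If `pages` is a lazy iterator that fetches and extracts a page only when asked, the `break` is what avoids exhaustive extraction.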

  11. Mashup Planning • For each source, the user can specify a sort key representing the order of extracted results from the source • If the sort key is part of the mashup ranking function, sorting is offloaded to the source side [Figure: Mashup Data Flow (created by user): Vianet and TVTrip are each scanned and extracted, then joined; Mashup Physical Plan (created by MashRank): Sync Buffer1 and Sync Buffer2 feed an NL-RankJoin labeled {price_up, rating}, with inputs labeled {price_up} and {rating}; a Sort node is annotated "Sort is avoided here"]

  12. Mashup Processing • Ranking+Uncertainty Aware Plan: rank-aware query planning; uncertain rank join for early pruning • Probabilistic Ranking: Monte-Carlo simulation for ordering join results; join-aware sampling; incremental ranking [Figure: raw tuples from the uncertain database enter the ranking+uncertainty aware plan; its non-pruned query results (based on scores) are passed to probabilistic ranking]

  13. Modeling Uncertain Scores • The score of a tuple ti is a PDF fi defined on an interval [loi, upi] • Scoring PDFs of base tuples are assumed independent • Probabilistic Partial Order Model [Soliman and Ilyas, ICDE 2009] • Non-intersecting intervals are totally ordered • Intersecting intervals define a Probabilistic Dominance relationship [Figure: partial order over tuples t1…t6 with score intervals t5 = [7,7], t2 = [4,8], t1 = [6,6], t6 = [1,1]; t3 and t4 carry the intervals [3,5] and [2,3.5]]
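
A small Python sketch of this score model follows. It assumes uniform PDFs on each interval, which the slide does not state (fi may be any PDF); the names `UncertainScore`, `dominates` and `prob_greater` are hypothetical. Non-intersecting intervals are ordered with certainty, while intersecting ones only get a probability, estimated here by sampling.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainScore:
    lo: float
    up: float

    def sample(self, rng):
        # Assumption: f_i is uniform on [lo, up].
        return rng.uniform(self.lo, self.up)


def dominates(a, b):
    """Certain order: a's lowest possible score is at least b's highest."""
    return a.lo >= b.up


def prob_greater(a, b, n=20000, seed=0):
    """Estimate Pr(score of a > score of b) for independent scores."""
    rng = random.Random(seed)
    hits = sum(a.sample(rng) > b.sample(rng) for _ in range(n))
    return hits / n


t5 = UncertainScore(7, 7)
t2 = UncertainScore(4, 8)
t1 = UncertainScore(6, 6)
```

With these three intervals from the slide, `t5` dominates `t1`, while `t2` and `t1` intersect, so their relative order is only probabilistic.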

  14. URankJoin Query • Uncertain score dominance: ti > tj iff loi >= upj • Compute a total order ω* of the join results dominated by fewer than k join results • The total order ω* is defined based on probabilistic ranking semantics • Example: URankJoin({R,S}, F, k) with F = (R.a1+S.a1)/2 and k = 2 [Figure: relations R (r1…r4) and S (s1…s3) with attributes a1 and jk; join results (r1,s2), (r2,s2), (r3,s1), (r4,s3); the top-2 join results and the space of possible orderings]
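
The candidate set of a URankJoin query, the join results dominated by fewer than k others, can be sketched directly from the dominance rule. The function name `topk_candidates` and the intervals in the example are hypothetical; the join is assumed to be already materialized, which the actual operator avoids.

```python
def topk_candidates(results, k):
    """Keep the results dominated by fewer than k other results.

    results: dict mapping a result name to its (lo, up) score interval.
    A result is dominated by another one whose lo-score >= its up-score.
    """
    keep = {}
    for name, (lo, up) in results.items():
        dominated_by = sum(
            1
            for other, (other_lo, _) in results.items()
            if other != name and other_lo >= up
        )
        if dominated_by < k:
            keep[name] = (lo, up)
    return keep


# Hypothetical score intervals, for illustration only.
example = {
    "a": (7, 7), "b": (4, 8), "c": (6, 6),
    "d": (3, 5), "e": (2, 3.5), "f": (1, 1),
}
```

Only the surviving results need to be ordered by the probabilistic ranking step.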

  15. URankJoin Query • Basic Idea • Prune all join results whose up-scores < the kth largest lo-score • Run two rank join instances, one on lo-scores and one on up-scores • Pipeline join results whose up-scores >= the score upper bound of the lo-score rank join • Stop after reporting k results from the lo-score rank join [Figure: first animation frame of the two rank joins RJlo (lo joins) and RJup (up joins) over R and S, with k = 1 and F = (R.a1+S.a1)/2; score upper bounds shown: U=infinity and U=.65]

  16. URankJoin Query • Basic idea as on slide 15 [Figure: next animation frame of RJlo and RJup; score upper bounds shown: U=.75 and U=.65]

  17. URankJoin Query • Basic idea as on slide 15 [Figure: next animation frame of RJlo and RJup; score upper bounds shown: U=.65 and U=.75]

  18. URankJoin Query • Basic idea as on slide 15 [Figure: next animation frame of RJlo and RJup; score upper bounds shown: U=.65 and U=.7]

  19. URankJoin Query • Basic idea as on slide 15 [Figure: next animation frame of RJlo and RJup; score upper bounds shown: U=.65 and U=.55]

  20. URankJoin Query • Basic idea as on slide 15 [Figure: next animation frame of RJlo and RJup; score upper bounds shown: U=.5 and U=.55]

  21. URankJoin Query • Basic idea as on slide 15 [Figure: final animation frame of RJlo and RJup; score upper bounds shown: U=.5 and U=.45; marked "Stop!"]
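
The score upper bound U that changes from frame to frame can be illustrated with a conventional rank join over certain scores. The sketch below is a simplified threshold-based rank join, not the URankJoin operator itself: URankJoin would run one such instance on lo-scores and one on up-scores, and that coordination is not shown. The function name `rank_join`, the alternating read strategy and the example data are assumptions.

```python
import heapq


def rank_join(R, S, k, f):
    """Top-k join of R and S on key, scored by a monotone f(r_score, s_score).

    R and S are lists of (key, score) sorted by score, highest first.
    Returns (results, number of tuples read from each input).
    """
    inputs = (R, S)
    seen = ([], [])   # tuples read so far from R and from S
    pos = [0, 0]
    buffer = []       # max-heap of join results, stored as (-score, key)
    results = []
    turn = 0
    while len(results) < k:
        # Alternate between inputs, skipping an exhausted one.
        if pos[turn] >= len(inputs[turn]):
            turn = 1 - turn
            if pos[turn] >= len(inputs[turn]):
                break  # both inputs exhausted
        key, score = inputs[turn][pos[turn]]
        pos[turn] += 1
        seen[turn].append((key, score))
        # Join the new tuple with everything seen on the other side.
        for other_key, other_score in seen[1 - turn]:
            if other_key == key:
                r, s = (score, other_score) if turn == 0 else (other_score, score)
                heapq.heappush(buffer, (-f(r, s), key))
        turn = 1 - turn
        # U bounds the score of any join result not yet produced.
        if seen[0] and seen[1]:
            U = max(f(R[0][1], seen[1][-1][1]), f(seen[0][-1][1], S[0][1]))
        else:
            U = float("inf")
        while buffer and len(results) < k and -buffer[0][0] >= U:
            neg, out_key = heapq.heappop(buffer)
            results.append((out_key, -neg))
    # Inputs exhausted: buffered results are final.
    while buffer and len(results) < k:
        neg, out_key = heapq.heappop(buffer)
        results.append((out_key, -neg))
    return results, pos


R = [("x", 9), ("y", 8), ("z", 3)]
S = [("y", 7), ("x", 5), ("z", 2)]
```

With k = 1 and f = sum of the two scores, the join on this hypothetical data can report its top result before either input is fully read, which is the early-out behavior the slides rely on.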

  22. Probabilistic Ranking • Monte-Carlo Simulation • Transform the space of possible orderings into a space of possible score combinations, which can be easily sampled at random • Sample from each [loi, upi] a score value biased by fi(x) • Allows estimating the probability of having a tuple t in a rank range i…j: Pr(t at rank i…j) = Vol(space of score combinations with t at rank i…j) × Avg over x of Π fi(x) • Can't be applied for joins! [Figure: the space of score combinations with t at rank i…j, drawn inside the space of all possible score combinations]
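
The sampling bullet can be sketched in Python as below. It assumes independent base tuples with uniform PDFs (the slides allow any fi) and estimates the probability as the fraction of sampled score combinations in which the tuple lands in the rank range; the function name `rank_range_probability` is hypothetical.

```python
import random


def rank_range_probability(intervals, target, i, j, n=20000, seed=0):
    """Estimate Pr(target is ranked between i and j), ranks starting at 1.

    intervals: dict mapping a tuple name to its (lo, up) score interval.
    Assumption: scores are independent and uniform on their intervals.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # One sampled score combination.
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        # Rank = 1 + number of tuples with a strictly higher sampled score.
        rank = 1 + sum(s > scores[target] for s in scores.values())
        hits += i <= rank <= j
    return hits / n


intervals = {"t5": (7, 7), "t2": (4, 8), "t1": (6, 6)}
```

For these three intervals the estimates can be checked by hand: t5 is first unless t2's score exceeds 7, which under the uniform assumption happens with probability 0.25.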

  23. Probabilistic Ranking • Compute the probability of a ranking of join results • Join result scores may be correlated (can't be sampled independently) • Join-aware sampling [Figure: relations R (r1…r4) and S (s1…s3) with attributes a1 and jk]
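
Why join results cannot be sampled independently is easy to see in a small Python sketch. It assumes uniform PDFs, the scoring function F = r + s, and hypothetical scores for r1, r2 and s2; the function names are hypothetical as well. The join-aware version draws each base tuple once per round, so two join results that share s2 see the same sampled value.

```python
import random


def join_aware_prob(base, joins, a, b, n=10000, seed=0):
    """Estimate Pr(join result a outscores b), sampling base tuples once per round.

    base: dict tuple_id -> (lo, up); joins: dict result_id -> (r_id, s_id).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        drawn = {t: rng.uniform(lo, up) for t, (lo, up) in base.items()}
        score = {j: drawn[r] + drawn[s] for j, (r, s) in joins.items()}
        wins += score[a] > score[b]
    return wins / n


def naive_prob(base, joins, a, b, n=10000, seed=0):
    """Same estimate, but wrongly treating join results as independent."""
    rng = random.Random(seed)

    def draw(result_id):
        r, s = joins[result_id]
        return rng.uniform(*base[r]) + rng.uniform(*base[s])

    return sum(draw(a) > draw(b) for _ in range(n)) / n


# Hypothetical scores: r1 and r2 are certain, s2 is uncertain.
base = {"r1": (6, 6), "r2": (5, 5), "s2": (0, 10)}
joins = {"r1s2": ("r1", "s2"), "r2s2": ("r2", "s2")}
```

Here (r1,s2) always outscores (r2,s2) because both share the same s2 score; the naive estimate misses this correlation and reports a probability well below 1.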

  24. Probabilistic Ranking • In many Web application scenarios, users only inspect a small prefix of the ranked answers list • Computing a full ranking of all answers can be overkill • Incremental Ranking • Computing an approximate ranking that improves in accuracy as more results are discovered • We make use of the incremental output of intermediate results to compute an approximation of ω∗

  25. Experiments • Data Sources

  26. Experiments • Mashup Examples • M1: SELECT * FROM Vianet v, TvTrip t, Menus m WHERE v.Hotel ≈ t.Hotel AND v.City = m.City ORDER BY 500 - v.Price + 100 * (t.Rating + m.Rating) LIMIT k • M2: SELECT * FROM Epinion e, Flickr f WHERE e.Brand contains f.Brand ORDER BY e.Price + (100 - f.Rank) LIMIT k • M3: SELECT * FROM Pubs p, GScholar g WHERE p.PaperTitle ≈ g.Paper ORDER BY nCitations LIMIT k • M4: SELECT * FROM Apartments a, Restaurants r WHERE a.Zip = r.Zip ORDER BY a.Price LIMIT k

  27. Experiments

  28. Experiments

  29. Experiments

  30. Summary and Conclusions • We address integrating information extraction with joining and ranking under uncertainty • A system architecture for managing and processing mashups of live unstructured sources • Formulating the problem of rank join under uncertainty and providing an implementation of a ranking- and uncertainty-aware join operator • An infrastructure for supporting probabilistic ranking queries using MC simulation • The MashRank prototype is accessible at http://prefex.cs.uwaterloo.ca/MashRank/

  32. MashRank • A mashup authoring and processing system focusing on integrating probabilistic ranking with information extraction • Leveraging early-out capabilities of ranking to conduct extraction on demand guided by rank-aware processing • Formulating rank join under uncertainty, and extending rank join methods to handle uncertain scores • A new infrastructure for processing uncertain rank join queries based on Monte-Carlo simulation

  33. Information Extraction • MashRank uses wrapper induction techniques to transform unstructured HTML data into the relational model • An interface to a generic wrapper induction algorithm with three main functions • addExample: adds a new training example (e.g., a text node representing the value of some data attribute) • learn: processes training examples to compute an extraction rule • extract: applies the learned extraction rule to a given page and returns a set of extracted records

  34. Mashup Planning • The planner starts by labeling each node in the mashup data flow with its corresponding ranking attributes • Labeling starts with the leaves (data sources) and moves up the data flow tree • The union of the ranking attributes of all children of a node p gives the ranking attributes of p • Example: join the Vianet and TVTrip sources based on hotel name; order results by price+rating [Figure: Mashup Data Flow (created by user): Scan and Extractor nodes over Vianet {price} and TVTrip {rating}, followed by Sort nodes and a Join labeled {price, rating}; Mashup Physical Plan (created by MashRank): Sync Buffer1 and Sync Buffer2 feeding an NL-RankJoin]
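
The bottom-up labeling rule can be sketched in a few lines of Python. The dictionary-based tree representation and the function name `label_ranking_attributes` are assumptions made for illustration; only the rule itself (a node's ranking attributes are the union of its children's) comes from the slide.

```python
def label_ranking_attributes(node):
    """Label every node of a data flow tree with its ranking attributes.

    node: dict with a 'name', an optional 'attrs' list (leaves only) and an
    optional 'children' list. Adds node['rank_attrs'] and returns it.
    """
    children = node.get("children", [])
    if not children:
        # Leaves (data sources) carry their own ranking attributes.
        node["rank_attrs"] = set(node.get("attrs", ()))
    else:
        # Inner nodes take the union of their children's labels.
        node["rank_attrs"] = set().union(
            *(label_ranking_attributes(child) for child in children)
        )
    return node["rank_attrs"]


# The slide's example flow: Vianet {price} joined with TVTrip {rating}.
flow = {
    "name": "Join",
    "children": [
        {"name": "Extractor", "children": [{"name": "Vianet", "attrs": ["price"]}]},
        {"name": "Extractor", "children": [{"name": "TVTrip", "attrs": ["rating"]}]},
    ],
}
```

After labeling, the root Join carries {price, rating}, which matches the label shown on the slide.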

  35. Handling Uncertain Scores • Probabilistic Ranking Semantics • Uncertainty gives rise to multiple possible semantics of ranking queries • Expected scores • Expected ranks • Most probable ranking [Table with columns ID, Score, E[t], ER[t]]

  36. Probabilistic Ranking • Incremental Ranking • For each top rank r, compute bounds on Pr(ti, r) for each join result ti produced by a URankJoin plan • Use bounds of Pr(ti, r) to approximate a prefix of ω∗ • Bounds are progressively tightened as more tuples are retrieved
