
Merging Ranks from Heterogeneous Internet Sources

Hector Garcia-Molina and Luis Gravano, Stanford University





Presentation Transcript


  1. Merging Ranks from Heterogeneous Internet Sources
     Hector Garcia-Molina, Luis Gravano
     Stanford University

  2. Users Have Many Available Information Sources
     User Query: “Houses near Palo Alto for around $300K.”
     Source 1: h11, h12, h13, ...
     Source 2: Nothing!
     ...
     Query Results

  3. Challenges
     • Sources are too numerous
     • Sources are heterogeneous (query language, model, results)
     • Users want a single query result

  4. Metasearcher
     • Selects the good sources for a query
     • Extracts and combines the query results from the sources

  5. Text Sources Rank Query Results
     Query: “Distributed Databases”
     Text Source returns: Doc 1: 0.8, Doc 2: 0.6, ...

  6. Structured Sources on the Internet also Rank Results
     A real-estate agent receives queries on Location and Price:
     Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

  7. The Agent Ranks its Houses Based on its Own Scoring Function
     Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

  8. A Metasearcher then Faces Two Problems
     • Extracting the top objects from the underlying sources
     • Merging the results from the various sources

  9. Merging Query Results is Easy with Enough Information
     Given a record (shown on the original slide), the metasearcher ignores the Source score and computes its Target score from the Location and Price.
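The merging step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `House` record layout, the per-attribute match scores in [0, 1], and the use of Min as the metasearcher's Target function are all hypothetical choices, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class House:
    ident: str
    location_score: float  # assumed: how well Location matches the query, in [0, 1]
    price_score: float     # assumed: how well Price matches the query, in [0, 1]
    source_score: float    # score assigned by the source; ignored when merging


def target_score(h: House) -> float:
    # Hypothetical Target function: Min over the attribute match scores.
    return min(h.location_score, h.price_score)


def merge(results_per_source, k):
    """Merge the result lists of several sources by recomputing Target scores."""
    merged = [h for results in results_per_source for h in results]
    # Rank by the metasearcher's own score, not by each source's score.
    merged.sort(key=target_score, reverse=True)
    return merged[:k]
```

Because every record carries the attribute information that Target needs, the Source scores never have to be compared across sources.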

  10. Extracting the Top Objects from a Source is Hard
      The metasearcher’s scoring function might be different from the source’s!

  11. We Want to Avoid Extracting All the Source’s Contents
      Assume a house h with:
      • Source(Q, h) = 0 (worst for source)
      • Target(Q, h) = 1 (best for metasearcher)
      Problem!

  12. The Example Query is Not Manageable at the Agent
      A query Q is manageable at a source if ∃ ε < 1 such that, for all objects h:
      Source(Q, h) ≥ Target(Q, h) − ε
      [Figure: Source score plotted against Target score on the unit square from (0,0) to (1,1), with ε and 1−ε marked.]
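To make the manageability condition concrete, the sketch below computes the smallest ε that satisfies Source(Q, h) ≥ Target(Q, h) − ε over a set of (Source, Target) score pairs. The function names are hypothetical, and the check only covers the pairs it is given, whereas the definition ranges over every possible object.

```python
def manageability_epsilon(pairs):
    """Smallest eps >= 0 with source >= target - eps for all given pairs.

    pairs: iterable of (source_score, target_score), each assumed in [0, 1].
    """
    return max([0.0] + [target - source for source, target in pairs])


def is_manageable(pairs):
    # The slide's condition: some eps < 1 must satisfy the inequality.
    return manageability_epsilon(pairs) < 1.0
```

For the house on the previous slide, with Source score 0 and Target score 1, the only ε that works is 1, so the query is not manageable.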

  13. Single-Attribute Queries Are More Likely to be Manageable
      Single-attribute queries for Q:
      • Q1: Location = Palo Alto
      • Q2: Price = $300K

  14. The Example Becomes Tractable!
      ... if the top Target objects for Q are among the top Source objects for Q1 and Q2

  15. A Cover Bounds the Target Scores for Q
      Single-attribute queries Q1, ..., Qm form a cover for Q if ∃ g1, ..., gm, G such that:
      Target(Qi, h) ≤ gi for all i = 1, ..., m ⇒ Target(Q, h) ≤ G
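As a small check of the cover condition, assume that Target(Q, h) is a function of the single-attribute scores Target(Q1, h), ..., Target(Qm, h), as it is for the Min and Max functions used in the performance slides. The hypothetical helper below tests on a coarse grid of scores whether thresholds g1, ..., gm and a bound G satisfy the implication; it is a sanity check on sample points, not a proof.

```python
from itertools import product


def check_cover(target, g, G, steps=10):
    """Return True if no grid point violates the cover condition.

    target: combines a tuple of single-attribute scores into Target(Q, h).
    g:      thresholds g1..gm on the single-attribute scores.
    G:      claimed upper bound on Target(Q, h).
    """
    grid = [i / steps for i in range(steps + 1)]
    for scores in product(grid, repeat=len(g)):
        within = all(s <= gi for s, gi in zip(scores, g))
        if within and target(scores) > G:
            return False  # counterexample: below every gi, yet above G
    return True
```

For Min the bound G = min(g1, ..., gm) works, and for Max the bound G = max(g1, ..., gm) works; a smaller G fails in both cases.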

  16. Having a Manageable Cover for a Query is Sufficient...
      Manageable Cover for query Q at source S ⇒ “Efficient” Executions Possible at S

  17. Having a Manageable Cover for a Query is Sufficient...
      (1) Pick a manageable cover C = {Q1, ..., Qm} for Q at S
      (2) For i = 1 to m: Find εi for Qi
      (3) Pick 0 ≤ g1, ..., gm, G < 1 for cover C
      (4) For i = 1 to m
      (5)     Retrieve all objects t with Source(Qi, t) ≥ Gi = gi − εi
      (6) Compute Target(Q, t) for all objects t retrieved
      (7) If ∃ i such that Gi ≤ 0 Then Go to Step (11)
      (8) If for all t retrieved, Target(Q, t) ≤ G Then
      (9)     Find new, lower 0 ≤ g1, ..., gm, G < 1 for C
      (10)    Go to Step (4)
      (11) Output those objects retrieved with the highest Target score
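The steps on this slide can be sketched as a runnable Python function over an in-memory stand-in for a source. Several things here are assumptions for illustration only: the `objects` dictionary, the `source_score(i, t)` callback, the use of one shared threshold level for all gi, the fixed schedule of levels, and taking G = target(g), which is a valid cover bound only when Target is a monotone function of the single-attribute scores (Min and Max both are).

```python
def extract_top(objects, source_score, target, eps,
                levels=(0.9, 0.7, 0.5, 0.3, 0.1, 0.0)):
    """Return (ids with the highest Target score, number of objects retrieved).

    objects:      dict id -> tuple of single-attribute Target scores in [0, 1]
    source_score: function (i, id) -> Source(Qi, id); assumed >= Target(Qi, id) - eps[i]
    target:       monotone function combining attribute scores, e.g. min or max
    eps:          list of eps_i, one per single-attribute query Qi
    levels:       decreasing threshold schedule; must end at 0.0 so the loop
                  eventually retrieves everything if nothing better is found
    """
    m = len(eps)
    scores = {}
    for level in levels:                                  # steps (3) and (9)
        g = [level] * m
        G = target(g)                                     # bound for unretrieved objects
        thresholds = [gi - ei for gi, ei in zip(g, eps)]  # Gi = gi - eps_i
        retrieved = {t for t in objects for i in range(m)
                     if source_score(i, t) >= thresholds[i]}      # steps (4)-(5)
        scores = {t: target(objects[t]) for t in retrieved}       # step (6)
        if any(th <= 0 for th in thresholds):             # step (7): all retrieved
            break
        if any(s > G for s in scores.values()):           # step (8) fails: done
            break
    if not scores:
        return [], 0
    best = max(scores.values())                           # step (11)
    return sorted(t for t, s in scores.items() if s == best), len(scores)
```

The stopping test works because any object that was not retrieved has Source(Qi, t) < gi − εi for every i, hence Target(Qi, t) < gi and Target(Q, t) ≤ G. Once a retrieved object scores above G, no unretrieved object can beat it.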

  18. Algorithm to Extract Top Target Objects
      [Figure: score axes from 0 to 1 for Q1 and Q2, with thresholds g1 and g2; Target(Q, h) ≤ G.]

  19. Algorithm to Extract Top Target Objects
      [Figure: score axes from 0 to 1 for Q1 and Q2, with lower thresholds g1’ and g2’; Target(Q, h) ≤ G’; an object h’ with Target(Q, h’) > G’!]

  20. Preliminary Performance Results for our Algorithm
      Setting: 10,000 objects, 4 query attributes, ε = 0
      • Target = Min: 14% objects retrieved
      • Target = Max: 4% objects retrieved

  21. Preliminary Performance Results for our Algorithm
      Setting: 10,000 objects, 4 query attributes, ε = 0.10
      • Target = Min: 25% objects retrieved
      • Target = Max: 44% objects retrieved

  22. Having a Manageable Cover for a Query is Also Necessary...
      No Manageable Cover for query Q at source S ⇒ Efficient Executions Impossible at S

  23. A Manageable Cover is Necessary: Proof
      Consider Q1, Q2, Q3 minimal cover for Q with: Q1, Q2 manageable, Q3 not manageable
      • For any “efficient” execution, build h such that:
        • h is not retrieved
        • Target(Q, h) > G = max{Target(Q, o) | o retrieved}

  24. A Manageable Cover is Necessary: Proof
      [Figure: score axes from 0 to 1 for Q1, Q2, and Q3, with thresholds g1, g2, and g3.]

  25. [Figure: object h’ marked on the score axes; Target(Q, h’) > G!]

  26. [Figure: objects h’ and h marked on the score axes; Target(Q, h’) > G; Target(Q3, h) ≥ Target(Q, h’); Target(Q, h) > G!]

  27. We Studied Two Metasearching Problems
      • Extracting the top objects from the underlying sources
      • Merging the results from the various sources

  28. Related Work: Collection Fusion
      • Voorhees et al.
      • Callan/Lu/Croft
      • Gauch/Wang
