Internet Search Engine freshness by Web Server help - PowerPoint PPT Presentation

Presentation Transcript

  1. Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro

  2. Introduction • Search engines are an important source of information, and keeping them up to date results in more accurate answers to search queries. • Search engines create their databases by probing web servers on a per-URL basis with a little help from the web servers.

  3. Main Problem • There is no standard for facilitating the push of updates from servers to search engines: • It takes up to six months for a new page to be indexed by popular web search engines; • The data indexed by the search engines is often stale.

  4. Solution… • Having web servers help keep search engines fresh results in a favorable situation for web sites, search engines and users.

  5. …and its problems • The number of updates per second is very large. • Must balance between: • The number of interactions between web sites and search engines, and • The freshness of the search engines.

  6. Page rank impact • Popular pages have higher page ranks: • Use popularity, in addition to age and freshness, to compute the mismatch between a web site and a search engine.

  7. Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues

  8. Some definitions • Update: an update u to a file f is a modification to f that has been flushed to the disk; • Propagation of an update: an update is said to be propagated when the web site has informed the search engine about the update. The search engine may or may not retrieve that update; • Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates. The web site informs the search engine about all the updates in U(t);
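
The definitions above can be sketched as a small data model. This is only an illustration: the class and field names (`Update`, `WebSite`, `unpropagated`) are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Update:
    """A modification to a file that has been flushed to the disk."""
    file: str    # hypothetical identifier of the modified file f
    time: float  # time at which the modification reached the disk


@dataclass
class WebSite:
    """Keeps U(t), the updates the search engine has not been told about."""
    unpropagated: list = field(default_factory=list)

    def record(self, update: Update) -> None:
        # A new update starts out unpropagated.
        self.unpropagated.append(update)

    def meta_update_propagation(self) -> list:
        # Inform the search engine about every update in U(t) at once;
        # afterwards U(t) is empty. The search engine may or may not
        # go on to retrieve the updated files.
        batch, self.unpropagated = self.unpropagated, []
        return batch
```

Calling `meta_update_propagation()` returns the whole batch and clears the pending set, which matches the "all the updates in U(t)" wording of the definition.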

  9. Some definitions (2) • Weight of a file: given a content file f, its weight w_f (non-negative) denotes the importance of the file; the weights are chosen such that: • Last_modification_time(u,t): the last time before t when the file f(u) was updated.

  10. The Cost Model • Components: • Communication cost; • Opportunity cost: represents the staleness of the search engine data as compared to the data on the web server. • CPU cost is ignored.

  11. Opportunity cost (OC) • Given an unpropagated update u to a content file f(u), the opportunity cost for update u at time t is: OC(u,t) = w_{f(u)} × (t − last_modification_time(u,t)), where w_{f(u)} is the weight of file f(u) • Definition for meta-update propagation:
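
A minimal sketch of the opportunity cost, reading the per-update formula as the file's weight multiplied by the time since its last modification. The function names are hypothetical. The slide's definition for meta-update propagation did not survive in the transcript, so the total below is assumed to be a plain sum over the unpropagated updates; that is an assumption, not the presentation's stated definition.

```python
def opportunity_cost(weight: float, last_modification_time: float, t: float) -> float:
    """OC(u,t) = weight of f(u) times (t - last_modification_time(u,t))."""
    if weight < 0:
        raise ValueError("weights are non-negative")
    if t < last_modification_time:
        raise ValueError("t must not precede the last modification")
    return weight * (t - last_modification_time)


def total_opportunity_cost(pending, t: float) -> float:
    """ASSUMPTION: the total is the sum of OC(u,t) over all unpropagated updates.

    `pending` is an iterable of (weight, last_modification_time) pairs.
    """
    return sum(opportunity_cost(w, lmt, t) for w, lmt in pending)
```

For example, a file of weight 0.5 last modified at time 10 has an opportunity cost of 2.0 at time 14, and the cost keeps growing for as long as the update stays unpropagated.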

  12. Communication cost (CC) • size_{f(u)}(t): the size of file f(u) at time t;

  13. Potential Communication cost (PCC) • Represents the communication cost which would need to be incurred in case update u were to be propagated after time t:

  14. The Cost Function • Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:

  15. Summary • Definition and Cost Model • Algorithm • Analysis • Practical issues

  16. FreshFlow Algorithm When OC_tot equals PCC_tot at any time t, the web server can inform the search engine about all the unpropagated updates.
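
The trigger rule can be sketched as follows, under several stated assumptions. The transcript does not preserve the formula for PCC_tot, so it is passed in as a plain number and held constant for simplicity; OC_tot is assumed to be the sum of weight × elapsed time over the pending updates; and because the sketch checks discrete ticks, it fires at the first tick where OC_tot has reached PCC_tot rather than at exact equality. The function names are hypothetical.

```python
def should_propagate(oc_tot: float, pcc_tot: float) -> bool:
    """Fire once the accumulated opportunity cost has caught up with PCC_tot."""
    return oc_tot >= pcc_tot


def first_propagation_tick(pending, pcc_tot: float, ticks):
    """Return the first tick at which the trigger fires, or None.

    `pending` holds (weight, last_modification_time) pairs.
    ASSUMPTION: OC_tot is the sum of weight * (t - last_modification_time)
    over updates that have already happened by tick t.
    """
    for t in ticks:
        oc_tot = sum(w * (t - lmt) for w, lmt in pending if lmt <= t)
        if should_propagate(oc_tot, pcc_tot):
            return t
    return None
```

With two updates of weight 0.5 made at times 0 and 2 and a PCC_tot of 3.0, checking ticks 0 through 19 fires at tick 4, the point where the accumulated opportunity cost (2.0 + 1.0) reaches 3.0.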

  17. Summary • Definition and Cost Model • Algorithm • Analysis • Practical issues

  18. Analysis • The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV).

  19. Analysis (2) • Lemma (1): OC(u,t) is monotonically non-decreasing; • Lemma (2): consider an update u to a file f that FF transmits but ADV does not. Then OC_ADV(u,t) ≥ OC_FF(u,t). • Lemma (3): if the update is transmitted by the adversary (ADV), then CC_ADV(u,t) ≥ CC_FF(u,t).

  20. Theorem • FF is 2-competitive: Cost_FF(u,t) ≤ 2 × Cost_ADV(u,t)

  21. Summary • Definition and Cost Model • Algorithm • Analysis • Practical issues

  22. Practical issues • There are multiple search engines: • Synchronization effect: pushing the updates would put pressure on the last-hop link to the web server; • Search engine load: some search engines might deny the receipt of updates.

  23. The middleman approach • Each web server contacts only one middleman for sending its updates; • Could be a group of middlemen.

  24. Benefits • The middleman can solve some additional issues: • Verifying trustworthiness of web servers; • Restricting the rate at which updates get transmitted to search engines;
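
The two benefits listed above could look roughly like this in code. Everything here is a hypothetical illustration: the presentation does not describe how trust is verified or how the rate is restricted, so the sketch assumes a fixed set of trusted server names and a simple cap on notifications forwarded per time window.

```python
from collections import deque


class Middleman:
    """Sits between web servers and search engines (illustrative only)."""

    def __init__(self, trusted_servers, max_per_window: int):
        self.trusted = set(trusted_servers)   # ASSUMPTION: trust is a static allow-list
        self.max_per_window = max_per_window  # ASSUMPTION: rate limit is a per-window cap
        self.queue = deque()

    def receive(self, server: str, url: str) -> bool:
        """Accept an update notification only from a trusted web server."""
        if server not in self.trusted:
            return False
        self.queue.append((server, url))
        return True

    def flush_window(self) -> list:
        """Forward at most max_per_window queued notifications, oldest first."""
        n = min(self.max_per_window, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]
```

Notifications that exceed the cap stay queued and go out in a later window, so search engines see a bounded update rate regardless of how many web servers push at once.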

  25. Limitations • The algorithm has not been used in practice; • The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.

  26. Conclusions • The FreshFlow algorithm is a solution that improves how search engines receive data updates while maintaining a high level of efficiency and performance; • The authors are planning to implement the algorithm in a real system (and have a future publication!)