## Internet Search Engine freshness by Web Server help

Presented by: Alessandro Barilari

**Introduction**
• Search engines are an important source of information, and keeping them up to date results in more accurate answers to search queries.
• Search engines create their databases by probing web servers on a per-URL basis, with a little help from the web servers.

**Main Problem**
• There is no standard for facilitating the push of updates from servers to search engines:
  • It takes up to six months for a new page to be indexed by popular web search engines;
  • The data indexed by the search engines is often stale.

**Solution…**
• Having web servers help keep search engines fresh benefits web sites, search engines and users alike.

**…and its problems**
• The number of updates per second is very large.
• A balance must be struck between:
  • The number of interactions between web sites and search engines, and
  • The freshness of the search engines.

**Page rank impact**
• Popular pages have higher page ranks:
  • Use popularity, in addition to age and freshness, to compute the mismatch between a web site and a search engine.

**Summary**
• Definitions and Cost Model
• Algorithm
• Analysis
• Practical issues

**Some definitions**
• Update: an update $u$ to a file $f$ is a modification to $f$ that has been flushed to disk;
• Propagation of an update: an update is said to be propagated when the web site has informed the search engine about it. A search engine (SE) may or may not retrieve that update;
• Meta-update propagation: at any time $t$, let $U(t)$ be the set of unpropagated updates. The web site informs the search engine about all the updates in $U(t)$.

**Some definitions (2)**
• Weight of a file: given a content file $f$, its weight $w_f$ (non-negative) denotes the importance of the file; the weights are chosen such that:
• $\text{last\_modification\_time}(u,t)$: the last time before $t$ when the file $f(u)$ was updated.

**The Cost Model**
• Components:
  • Communication cost;
  • Opportunity cost: represents the staleness of the search engine data compared with the data on the web server.
• CPU cost is ignored.

**Opportunity cost (OC)**
• Given an unpropagated update $u$ to a content file $f$, the opportunity cost for update $u$ at time $t$ is: $OC(u,t) = w_{f(u)} \times (t - \text{last\_modification\_time}(u,t))$
• Definition for meta-update propagation:

**Communication cost (CC)**
• $\text{size}_{f(u)}(t)$: the size of file $f(u)$ at time $t$;

**Potential Communication cost (PCC)**
• Represents the communication cost that would be incurred if update $u$ were propagated after time $t$:

**The Cost Function**
• Given that an update $u$ is unpropagated at time $t$, the cost function for that update at time $t$ is given by:

**FreshFlow Algorithm**
When $OC_{tot}$ equals $PCC_{tot}$ at any time $t$, the web server can inform the search engine about all the unpropagated updates.

**Analysis**
• The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV, the adversary).

**Analysis (2)**
• Lemma (1): $OC(u,t)$ is monotonically non-decreasing;
• Lemma (2): consider an update $u$ to a file $f$ that FF transmits but ADV does not. Then $OC_{ADV}(u,t) \ge OC_{FF}(u,t)$;
• Lemma (3): if the update is transmitted by the adversary (ADV), then $CC_{ADV}(u,t) \ge CC_{FF}(u,t)$.

**Theorem**
• FF is 2-competitive: $Cost_{FF}(u,t) \le 2 \times Cost_{ADV}(u,t)$

**Practical issues**
• There are multiple search engines:
  • Synchronization effect: pushing the updates to all of them would put pressure on the last-hop link to the web server;
  • Search engine load: some search engines might refuse to receive updates.

**The middleman approach**
• Each web server contacts only one middleman to send its updates;
• The middleman could be a group of middlemen.

**Benefits**
• The middleman can solve some additional issues:
  • Verifying the trustworthiness of web servers;
  • Restricting the rate at which updates are transmitted to search engines.

**Limitations**
• The algorithm has not been used in practice;
• The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.

**Conclusions**
• The FreshFlow algorithm is a solution that improves how quickly search engine data is updated while maintaining a high level of efficiency and performance;
• The authors are planning to implement the algorithm in a real system (and have a future publication!)
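
The FreshFlow trigger rule (propagate all pending updates once the total opportunity cost reaches the total potential communication cost) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The opportunity cost follows the definition on the slides (file weight times time since last modification), but the slides do not reproduce the PCC formula, so the sketch simply assumes PCC is proportional to file size. The names `Update`, `cost_per_byte` and `should_propagate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Update:
    """An unpropagated update to a content file (hypothetical structure)."""
    weight: float         # importance (weight) of the updated file, non-negative
    last_modified: float  # last_modification_time(u, t), in seconds
    size: float           # size of the updated file, in bytes


def opportunity_cost(u: Update, t: float) -> float:
    """OC(u, t): file weight times the time elapsed since the last modification."""
    return u.weight * (t - u.last_modified)


def total_opportunity_cost(pending: List[Update], t: float) -> float:
    """OC_tot: sum of OC(u, t) over all unpropagated updates."""
    return sum(opportunity_cost(u, t) for u in pending)


def total_potential_communication_cost(
    pending: List[Update], cost_per_byte: float = 1.0
) -> float:
    """PCC_tot under an ASSUMED model: cost proportional to file size.

    The slides do not give the PCC formula, so this is a stand-in.
    """
    return cost_per_byte * sum(u.size for u in pending)


def should_propagate(
    pending: List[Update], t: float, cost_per_byte: float = 1.0
) -> bool:
    """Trigger rule: propagate once OC_tot has reached PCC_tot.

    The slides say "equals"; ">=" is used here because a real server would
    check at discrete times and could step past the exact crossing point.
    """
    if not pending:
        return False
    return total_opportunity_cost(pending, t) >= total_potential_communication_cost(
        pending, cost_per_byte
    )


if __name__ == "__main__":
    pending = [
        Update(weight=2.0, last_modified=0.0, size=100.0),
        Update(weight=1.0, last_modified=10.0, size=50.0),
    ]
    for t in (20.0, 60.0):
        print(t, total_opportunity_cost(pending, t), should_propagate(pending, t))
```

Because the opportunity cost only grows while an update stays unpropagated (Lemma 1), checking the condition periodically is enough under this model: once it holds, it keeps holding until the pending set is flushed.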