This presentation by Alessandro Barilari discusses the significance of maintaining freshness in search engines to provide accurate results. Current challenges include outdated data and a lack of standard processes for real-time updates. The FreshFlow algorithm addresses these issues by balancing the interaction between web servers and search engines, ensuring timely propagation of updates. It analyzes communication costs and offers solutions to improve update efficiency and resource management, ultimately leading to a better experience for users and webmasters alike.
Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro
Introduction • Search engines are an important source of information, and keeping them up to date results in more accurate answers to search queries. • Search engines create their databases by probing web servers on a per-URL basis, with little help from the web servers. Alessandro Barilari
Main Problem • There is no standard for facilitating the push of updates from servers to search engines: • It takes up to six months for a page to be indexed by popular web search engines; • The data indexed by the search engines is often stale.
Solution… • Help from web servers in keeping search engines fresh results in a favorable situation for web sites, search engines and users.
…and its problems • The number of updates per second is very large. • Must balance between: • The number of interactions between web sites and search engines, and • The freshness of the search engines.
Page rank impact • Pages which are popular will have higher page ranks: • Use popularity in addition to age and freshness to compute the mismatch between a web site and a search engine
Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues
Some definitions • Update: an update u to a file f is a modification to f that has been flushed to the disk; • Propagation of an update: an update is said to be propagated when the web site has informed the search engine about the update. The search engine may or may not retrieve that update; • Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates. The web site informs the search engine about all the updates in U(t).
Some definitions (2) • Weight of a file: given a content file f, its weight $w_f$ (non-negative) denotes the importance of the file; the weights are chosen such that: • Last_modification_time(u,t): the last time before t when the file f(u) was updated.
The Cost Model • Components: • Communication cost; • Opportunity cost: represents the staleness of the search engine data as compared to the data on the web server. • CPU cost is ignored.
Opportunity cost (OC) • Given an unpropagated update u to a content file f, the opportunity cost for update u at time t is: $OC(u,t) = w_{f(u)} \times (t - \mathrm{last\_modification\_time}(u,t))$, where $w_{f(u)}$ is the weight of the file f(u) • Definition for meta-update propagation:
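The per-update opportunity cost above can be sketched in a few lines of Python. This is only an illustration: the `Update` class and both function names are hypothetical, and because the slide's formula for the meta-update case is not preserved in this transcript, the total below simply assumes a sum over all unpropagated updates U(t).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Update:
    """Hypothetical record for one unpropagated update u."""
    file_weight: float     # weight of the file f(u), non-negative
    last_modified: float   # last_modification_time(u, t), e.g. in seconds


def opportunity_cost(update: Update, t: float) -> float:
    """OC(u, t) = weight of f(u) * (t - last_modification_time(u, t))."""
    return update.file_weight * (t - update.last_modified)


def total_opportunity_cost(unpropagated: Iterable[Update], t: float) -> float:
    # ASSUMPTION: the meta-update cost is taken as the sum over U(t);
    # the slide's own definition is missing from the transcript.
    return sum(opportunity_cost(u, t) for u in unpropagated)
```

For instance, a file of weight 0.5 that was last modified at time 10 has accumulated an opportunity cost of 0.5 × 20 = 10 by time 30.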
Communication cost (CC) • $size_{f(u)}(t)$: the size of file f(u) at time t;
Potential Communication cost (PCC) • Represents the communication cost that would be incurred if update u were propagated after time t:
The Cost Function • Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:
FreshFlow Algorithm • When $OC_{tot}$ equals $PCC_{tot}$ at any time t, the web server can inform the search engine about all the unpropagated updates.
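The trigger rule above can be sketched as a small Python class. This is a hedged illustration, not the authors' implementation: the class and method names are hypothetical, OC_tot is assumed to be the sum of the per-update opportunity costs, and PCC_tot is assumed to be proportional to the total size of the pending files (the slide's exact PCC formula is not preserved in this transcript). Since the server checks at discrete times, the sketch fires when OC_tot reaches or exceeds PCC_tot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PendingUpdate:
    """Hypothetical record for one unpropagated update."""
    weight: float          # weight of the updated file
    last_modified: float   # last modification time of that file
    size_bytes: int        # current size of that file


@dataclass
class FreshFlowServer:
    cost_per_byte: float = 1.0  # ASSUMPTION: stand-in for the real PCC model
    pending: List[PendingUpdate] = field(default_factory=list)

    def record_update(self, update: PendingUpdate) -> None:
        self.pending.append(update)

    def oc_tot(self, t: float) -> float:
        # ASSUMPTION: total opportunity cost is the sum over pending updates.
        return sum(u.weight * (t - u.last_modified) for u in self.pending)

    def pcc_tot(self) -> float:
        # ASSUMPTION: potential communication cost grows with file size.
        return sum(self.cost_per_byte * u.size_bytes for u in self.pending)

    def tick(self, t: float) -> List[PendingUpdate]:
        """Return the batch to announce at time t, or [] if it is too early."""
        if self.pending and self.oc_tot(t) >= self.pcc_tot():
            batch, self.pending = self.pending, []
            return batch
        return []
```

Calling `tick` periodically means the server stays silent while staleness is cheap relative to communication and announces every pending update in one batch once the two totals meet.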
Analysis • The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV)
Analysis (2) • Lemma (1): OC(u,t) is monotonically non-decreasing; • Lemma (2): consider an update u to a file f that FF transmits but ADV does not. Then $OC_{ADV}(u,t) \ge OC_{FF}(u,t)$; • Lemma (3): if the update is transmitted by the adversary (ADV), then $CC_{ADV}(u,t) \ge CC_{FF}(u,t)$.
Theorem • FF is 2-competitive: $Cost_{FF}(u,t) \le 2 \times Cost_{ADV}(u,t)$
Practical issues • There are multiple search engines: • Synchronization effect: pushing the updates would put pressure on the last-hop link to the web server; • Search engine load: some search engines might deny the receipt of updates.
The middleman approach • Each web server contacts only one middleman for sending its updates; • There could be a group of middlemen.
Benefits • The middleman can solve some additional issues: • Verifying trustworthiness of web servers; • Restricting the rate at which updates get transmitted to search engines.
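The two middleman duties listed above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Middleman` class, the allow-list of trusted server names, and the fixed per-window cap are hypothetical, since the slides do not say how trust is verified or how the rate is limited.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


class Middleman:
    """Hypothetical relay between web servers and search engines."""

    def __init__(self, trusted_servers: Iterable[str], max_per_window: int):
        # ASSUMPTION: trust is modelled as a simple allow-list of server names.
        self.trusted = set(trusted_servers)
        # ASSUMPTION: rate limiting is a fixed cap on notices per time window.
        self.max_per_window = max_per_window
        self.queue: Deque[Tuple[str, str]] = deque()

    def receive(self, server_id: str, url: str) -> bool:
        """Queue an update notice; reject it if the server is not trusted."""
        if server_id not in self.trusted:
            return False
        self.queue.append((server_id, url))
        return True

    def flush_window(self) -> List[Tuple[str, str]]:
        """Forward at most max_per_window notices; the rest wait in order."""
        count = min(self.max_per_window, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

A search engine would then see at most `max_per_window` notices each time `flush_window` is called, regardless of how many web servers push updates at once.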
Limitations • The algorithm has not been used in practice; • The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.
Conclusions • The FreshFlow algorithm is a solution that improves how search engine data is updated while maintaining a high level of efficiency and performance; • The authors are planning to implement the algorithm in a real system (and have a future publication!)