Disseminate web feeds in a distributed manner to boost server scalability using P2P methods. Explore challenges in content searching, data dissemination, and free riding prevention, proposing an adaptive fetching algorithm. Analyze feed publishing rates and entry counts to address network management overhead and neighbor selection issues. Experiment with entry searching, entry storage, and neighbor evaluation methods to improve data delivery efficiency in a collaborative news feed ecosystem, enhancing scalability and user experience.
FeedEx: Collaborative Exchange of News Feeds Seung Jun, Mustaque Ahamad Georgia Institute of Technology WWW 2006
Outline • One line comment • Motivation/Problem • Approach • Analysis of feed publishing • Challenges • Experiments • Critique
One line comment • Disseminate web feeds in a distributed (P2P) manner to increase the scalability of web servers • [Figure: traditional method vs. P2P method, nodes A and B receiving RSS] • RSS reveals visitors to content providers • RSS decouples the fetch operation from the read operation
Motivation & Problem (Scalability) • RSS/Atom feeds have become increasingly popular • Published by most traditional media and blogs • Feeding mechanism • Publisher (e.g. nyt.com, http://nyt.com/../feed.xml): updates the feed page as contents are added • RSS reader: polls the server (HTTP request / HTTP response) to check for updates
Approach • P2P overlay + gossip-based protocol • P2P: resources grow scalably with service demand • Gossip: scalable and robust to joins and leaves • Feature of this overlay • Does not have to guarantee delivery or delay • Challenges • Overlay construction • Fetching interval determination • Data dissemination • Free riding prevention • Content searching (?)
Analysis of Feed Publishing • Methodology • 245 popular feeds monitored for 10 days • Popularity information taken from Gmail’s web clips and Bloglines • Feeds fetched every 2 minutes • Measured • Publishing rate • Entry count in a feed • Entry lifetime
Publishing Rate by Rank • Large differences between publishers • Partly follows a Zipf distribution
Entry Count • Does a high publishing rate mean a higher entry count? No • Entry lifetimes are short • Entries can be lost with infrequent requests
Publishing Rate by Time • 4 types of publishing patterns
Challenges – Overlay Construction (1/2) – • Goal: Minimize network management overhead • Join • Contact a well-known host OR previous neighbors • Share subscription set info • Update subscription set info to the network • Leave • Soft-state • Update subscription set periodically • [Figure: gateway, neighbor list, subscription set]
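The soft-state leave handling above can be illustrated with a small Python sketch. It is only one possible reading: the class name `SoftStateTable`, the `ttl_s` timeout, and the explicit `now` argument are hypothetical and do not come from the slides.

```python
class SoftStateTable:
    """Neighbor table whose entries vanish unless refreshed (soft state).

    Hypothetical sketch: the TTL and the method names are assumptions.
    """

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.entries = {}  # neighbor -> (subscription set, time of last refresh)

    def refresh(self, neighbor, subscriptions, now):
        # Called whenever a neighbor re-announces its subscription set.
        self.entries[neighbor] = (frozenset(subscriptions), now)

    def expire(self, now):
        # A neighbor that left silently stops refreshing and ages out here,
        # so no explicit "leave" message is needed.
        stale = [n for n, (_, t) in self.entries.items() if now - t > self.ttl_s]
        for n in stale:
            del self.entries[n]
        return stale
```

With a 60-second TTL, a neighbor last refreshed at time 0 is dropped by `expire(now=70)`, while one refreshed at time 50 is kept.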
Challenges – Overlay Construction (2/2) – • Neighbor selection • Many neighbors may incur overhead • Need to adapt to my resource status • Select neighbors that are “useful” to me • Those whose subscription set is similar to mine • [Figure: nodes A and B; 1 direct, 1 one-hop, 1 two-hop]
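As a rough illustration of choosing neighbors by subscription-set similarity, the Python sketch below ranks candidates by Jaccard similarity. The slides only say "similar", so the Jaccard measure, the function names, and the `max_neighbors` cap are all assumptions made for this example.

```python
def similarity(mine, theirs):
    """Jaccard similarity of two subscription sets (an assumed measure)."""
    union = mine | theirs
    return len(mine & theirs) / len(union) if union else 0.0


def select_neighbors(my_subs, candidates, max_neighbors):
    """Keep the candidates whose subscription sets overlap most with mine.

    candidates maps a neighbor name to its subscription set; max_neighbors
    stands in for "adapt to my resource status" (hypothetical parameter).
    """
    ranked = sorted(
        candidates.items(),
        key=lambda kv: (-similarity(my_subs, kv[1]), kv[0]),  # ties by name
    )
    return [name for name, _ in ranked[:max_neighbors]]
```

For example, a node subscribed to three feeds would prefer a candidate sharing all three of them over one sharing two, and both over one sharing none.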
Challenges – Fetching interval determination – • Adaptive Fetching • Problem: Few hints about the publishing rate or entry lifetime • Frequent polling: overloads servers, consumes clients’ network bandwidth • Lazy polling: increases delay or misses entries • Adaptive Algorithm • Intuition: frequent fetching yields few new entries • Freshness rate: fraction of new entries in the fetched document • If freshness rate < target freshness: halve the fetching rate • If freshness rate > target freshness: double the fetching rate • [Figure: fetching the entries in a feed (HANI): Report 1, Report 2, Report 3, …]
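The halve/double rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the target freshness of 0.5 and the interval bounds are assumed values, and the rule is expressed on the polling interval (halving the fetching rate doubles the interval).

```python
def freshness_rate(fetched_ids, seen_ids):
    """Fraction of entries in the fetched document that were not seen before."""
    if not fetched_ids:
        return 0.0
    new = [e for e in fetched_ids if e not in seen_ids]
    return len(new) / len(fetched_ids)


def next_interval(interval_s, freshness, target=0.5,
                  min_interval_s=120, max_interval_s=7200):
    """Adjust the polling interval from the last observed freshness rate.

    target, min_interval_s and max_interval_s are assumed values.
    """
    if freshness < target:
        interval_s *= 2      # few new entries: halve the fetching rate
    elif freshness > target:
        interval_s /= 2      # many new entries: double the fetching rate
    return max(min_interval_s, min(max_interval_s, interval_s))
```

A node that polls every 600 seconds and finds only one new entry out of four would back off to 1200 seconds; the clamping keeps the interval from collapsing or growing without bound.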
Challenges – Data dissemination – • Goal: Minimize bandwidth consumption • Limit the boundary of delivery • Forward only to matching neighbors (subscription set, hop_count) • Reduces forwarding overhead • Reduce the unit of delivery • Unit of delivery: entry bundle • A set of new entries (old entries filtered out) • Reduces redundant content delivery • Check before forwarding • Exchange the ID of an entry bundle (ID: SHA-1 digest of the bundle) • If the bundle is undelivered, deliver it • [Figure: fetching the HANI feed, max subset hops = 1]
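The check-before-forward step can be sketched in Python with the standard `hashlib` module. The procedure names `check_did` and `put_entries` appear in the results slides, but their signatures, the `Node` class, and the way a bundle is serialized before hashing are assumptions made for this example.

```python
import hashlib


def bundle_id(entries):
    """SHA-1 digest of an entry bundle (the serialization is an assumption)."""
    h = hashlib.sha1()
    for entry in entries:
        h.update(entry.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()


class Node:
    """Hypothetical receiver that remembers which bundles it already has."""

    def __init__(self):
        self.delivered = set()
        self.store = []

    def check_did(self, did):
        # True if the bundle with this ID has not been delivered yet.
        return did not in self.delivered

    def put_entries(self, did, entries):
        self.delivered.add(did)
        self.store.extend(entries)


def forward(bundle, neighbor):
    """Send only the small ID first; ship the bundle only if it is new."""
    did = bundle_id(bundle)
    if neighbor.check_did(did):
        neighbor.put_entries(did, bundle)
        return True
    return False
```

Forwarding the same bundle twice to one neighbor transfers the entries only once; the second attempt costs just the ID check.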
Challenges – Free riding prevention – • Nodes may exhibit selfish behavior • Only receive, without forwarding • Lie about their subscription set to become a preferred neighbor • Solution: Provide a neighbor evaluation method • Contribution metric • Credit nodes that forward feeds I subscribe to, and feeds my near neighbors subscribe to • Level of contribution: direct subscription, 1-hop subscription, 2-hop subscription, … • cm_{i,j} += w_f − h_f • Cut off unhelpful neighbors: I helped it, but it did not help me • d_{i,j} = cm_{i,j} − cm_{j,i} • Feature • Uses local information only • Easy to implement and enforce the mechanism
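A possible Python sketch of the neighbor evaluation follows. It assumes that cm_{i,j} is the contribution neighbor j made to node i and that node i approximates cm_{j,i} by what it sent to j itself; the slides do not spell out either point. The per-delivery credit is passed in as `amount` rather than computed, and the class name and cut-off threshold are hypothetical.

```python
from collections import defaultdict


class NeighborLedger:
    """Per-neighbor contribution bookkeeping using only local information."""

    def __init__(self):
        self.received = defaultdict(float)  # cm_{i,j}: what neighbor j gave me (assumed reading)
        self.given = defaultdict(float)     # cm_{j,i}: what I gave neighbor j (assumed reading)

    def credit_received(self, neighbor, amount):
        self.received[neighbor] += amount

    def credit_given(self, neighbor, amount):
        self.given[neighbor] += amount

    def deficit(self, neighbor):
        # d_{i,j} = cm_{i,j} - cm_{j,i}
        return self.received[neighbor] - self.given[neighbor]

    def unhelpful(self, threshold):
        """Neighbors whose d value is below the (assumed) cut-off threshold."""
        names = set(self.received) | set(self.given)
        return sorted(n for n in names if self.deficit(n) < threshold)
```

Under this reading, a neighbor that has received more than it has forwarded ends up with a negative d value and becomes a candidate for removal.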
Challenges – Entry searching – • Overlay as a distributed storage • Iterative searching • Strong points: searching latency, query traffic • Recursive searching (flooding) • Strong points: low overhead for the requester, caching of popular queries, can be reflected in neighbor evaluation (?)
Benefits of FeedEx • Scalability • Archivability • Storage of entries • Controllability • Compared to web-based readers, e.g. control over the fetch interval • Filtering and recommendation • Share opinions on entries (e.g. voting) • Feed recommendation • Privacy • Users can fetch documents for others • Anonymizes actual users
Architecture of FeedEx • Prototype: Python • Networking: Twisted • Protocol: XML-RPC • Interoperability, fast prototyping • Entry storage: SQLite (lightweight RDB) • RSS parser: feedparser.org
Experimental Setup • Two modes • Stand-alone mode (SLN) • FeedEx mode (XCH) • Metrics • Time lag • Missing entries • Communication cost • Experiments • Use 189 PlanetLab nodes • Run for 22 hours on a weekday • Primary factor: 6 fetching intervals • Let each node subscribe to 20 out of 70 feeds
Results: Time Lag • Average time lag • Average of node averages • Without applying the adaptive fetching algorithm • Regardless of the fetching interval, contents are delivered quickly • 15.8 times
Results: Missing Entries • Rate of missing entries • # of entries in a node / # of entries in a reference node • Low missing rate • Despite problems in the network (DNS errors or routing errors) • Sometimes better than the reference node
Results: Communication Cost • Two most frequently called procedures: check_did, put_entries • check_did call: single IP packet • put_entries: 2 calls / minute, delivering 2.67 entries / call • Low communication cost
Critique • Strong points • Made a new problem from an old domain, “web caching” • Free from delay / failure of nodes • Draws out possible benefits/extensions • Simple! • Practically deployable • Tried to find a mechanism good for both servers and clients
Critique • Weak points • Overload due to RSS feed delivery? • Only a small text file is delivered • Should have considered podcasting (multimedia RSS) • Will the clients donate their resources? • Is “short delay” a strong incentive? • Is “low bandwidth consumption” a strong incentive? • Will people’s subscription sets really overlap a lot? • Not effective for SPs providing diverse RSS feeds • e.g. Naver blog, egloos.. • Is it really robust to frequent leaves and joins? • Lack of server-side evaluation • Server load & network resources • Delivering critical data (e.g. timely news) using RSS?
Entry Lifetime • Generally CNN, • Publishers have policies (probably)
New idea • Topic based feed pub/sub system • Why should we register the address of a feed? • Need to find addresses providing contents I want • A feed may contain contents that I don’t want • [Figure: topic of interest (maybe tags?), topic based feed pub/sub (P2P based), feeds with contents related to the topic, feeds from web content providers]
New idea • Topic based feeding services have already been launched • Baebo • Creates new feeds by keyword from the Amazon, Yahoo, and eBay feeds • Say4 • Extracts entries containing sentences from the Bible from the BBC feed • But a centralized server runs the service • Limits the number of input feeds • Harder to add input feeds dynamically than with a P2P approach