Distributed News Feed Exchange: Enhancing Web Scalability

FeedEx: Collaborative Exchange of News Feeds • Seung Jun, Mustaque Ahamad • Georgia Institute of Technology • WWW 2006

Outline • One line comment • Motivation/Problem • Approach • Analysis of feed publishing • Challenges • Experiments • Critique

One line comment • Disseminate web feeds in a distributed (P2P) manner to increase the scalability of web servers • RSS reveals visitors to content providers • RSS decoupled the fetch operation from the read operation • [Figure: traditional method vs. P2P method of delivering RSS to clients A and B]

Motivation & Problem • RSS/Atom feeds have become increasingly popular • Published by most traditional media and blogs • Feeding mechanism • Publisher (nyt.com, http://nyt.com/../feed.xml): updates the page as contents are added • RSS reader: polls the server (HTTP request / HTTP response) to check for updates • [Figure label: Scalability]

Approach • The Approach • P2P overlay + gossip-based protocol • P2P: resources grow scalably with service demand • Gossip: scalable, robust to joins and leaves • Feature of this overlay • Does not have to guarantee delivery or delay • Challenges • Overlay construction • Fetching interval determination • Data dissemination • Free riding prevention • Content searching (?)

Analysis of Feed Publishing • Methodology • 245 popular feeds monitored for 10 days • Most popular feeds: selected using information from Gmail’s web clips and Bloglines • Feeds fetched every 2 minutes • Measured • Publishing rate • Entry count in a feed • Entry lifetime

Publishing Rate by Rank • Large differences between publishers • Partly follows a Zipf distribution

Entry Count • Does a high publishing rate mean a higher entry count? No • Entry lifetimes are short → entries can be lost with infrequent requests

Publishing Rate by Time • 4 types of publishing patterns

Challenges – Overlay Construction (1/2) – • Goal: Minimize network management overhead • Join • Contact a well-known host or previous neighbors • Share subscription set info • Update subscription set info to the network • Leave • Soft-state • Update subscription set periodically • [Figure: gateway, neighbor list, subscription set]
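
The soft-state leave handling above can be sketched as a neighbor table whose entries expire unless they are refreshed by a periodic subscription-set update. This is a minimal sketch, not the FeedEx implementation: the class name `SoftStateNeighbors`, the `ttl_seconds` parameter, and the explicit `now` argument are all assumptions made for illustration.

```python
import time


class SoftStateNeighbors:
    """Neighbor list whose entries vanish unless periodically refreshed.

    Hypothetical helper: the name, the TTL, and the method signatures are
    illustrative assumptions, not taken from the FeedEx prototype.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        # neighbor id -> (subscription set, time of last refresh)
        self._entries = {}

    def refresh(self, neighbor, subscriptions, now=None):
        """Record a join or a periodic subscription-set update."""
        now = time.time() if now is None else now
        self._entries[neighbor] = (frozenset(subscriptions), now)

    def expire(self, now=None):
        """Drop neighbors that were silent for longer than the TTL."""
        now = time.time() if now is None else now
        stale = sorted(
            n for n, (_, seen) in self._entries.items() if now - seen > self.ttl
        )
        for n in stale:
            del self._entries[n]
        return stale

    def alive(self):
        """Return the ids of neighbors still considered present."""
        return sorted(self._entries)
```

With this shape, a node that leaves needs no explicit goodbye message: it stops refreshing, and its neighbors drop it on their next `expire()` pass.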

Challenges – Overlay Construction (2/2) – • Neighbor selection • Many neighbors may incur overhead • Need to adapt to my resource status • Select neighbors that are “useful” to me • Those whose subscription sets are similar to mine • [Figure: nodes A and B; 1 direct, 1 one-hop, 1 two-hop]
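
One way to make "similar subscription sets" concrete is to rank candidates by set overlap and keep only as many as local resources allow. The slide does not say which similarity measure is used; Jaccard similarity is an assumption here, and `select_neighbors` and `max_neighbors` are hypothetical names.

```python
def jaccard(a, b):
    """Jaccard similarity of two subscription sets (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_neighbors(my_subs, candidates, max_neighbors):
    """Pick the candidates whose subscriptions best match mine.

    candidates: dict mapping a peer id to its subscription set.
    max_neighbors: cap reflecting the local resource budget (assumption).
    Ties are broken by peer id so the result is deterministic.
    """
    ranked = sorted(
        candidates,
        key=lambda peer: (-jaccard(my_subs, candidates[peer]), peer),
    )
    return ranked[:max_neighbors]
```

For example, with subscriptions `{"a", "b", "c"}`, a peer subscribed to all three feeds ranks ahead of a peer subscribed to only one, and a peer with no feed in common ranks last.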

Challenges – Fetching interval determination – • Adaptive Fetching • Problem: Few hints about the publishing rate or entry lifetime • Frequent polling: overloads servers, consumes clients’ network bandwidth • Lazy polling: increases delay or misses entries • Adaptive Algorithm • Intuition: Frequent fetching → few new entries • Freshness rate: fraction of new entries in the fetched document • If freshness rate < target freshness → halve the fetching rate • If freshness rate > target freshness → double the fetching rate • [Figure: fetching the entries in a feed (HANI: Report 1, Report 2, Report 3, …)]
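
The halve/double rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names, the use of seconds, and the `min_interval` / `max_interval` bounds are hypothetical. Halving the fetching rate is expressed as doubling the interval between fetches.

```python
def freshness_rate(fetched_ids, seen_ids):
    """Fraction of entries in the fetched document that are new."""
    fetched = list(fetched_ids)
    if not fetched:
        return 0.0
    new = sum(1 for entry_id in fetched if entry_id not in seen_ids)
    return new / len(fetched)


def next_interval(interval, freshness, target,
                  min_interval=120, max_interval=86400):
    """Adapt the fetching interval (seconds) to the observed freshness.

    The bounds are assumptions added so the interval cannot collapse to
    zero or grow without limit; the slide does not specify any.
    """
    if freshness < target:
        interval *= 2      # too few new entries: halve the fetching rate
    elif freshness > target:
        interval /= 2      # many new entries: double the fetching rate
    return max(min_interval, min(max_interval, interval))
```

A node would call `freshness_rate` after each fetch and feed the result into `next_interval` to schedule the following fetch.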

Challenges – Data dissemination – • Goal: Minimize bandwidth consumption • Limit the boundary of delivery • Forward only to matching neighbors (subscription set, hop_count) → reduces forwarding overhead • Reduce the unit of delivery • Unit of delivery: entry bundle • A set of new entries (old entries filtered out) → reduces redundant content delivery • Check before forwarding • Exchange the ID of an entry bundle (ID: SHA-1 digest of the bundle) • If the bundle is undelivered → deliver it • [Figure: fetching from HANI; max subset hops = 1]
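
The three steps above (filter out old entries, identify the bundle by its SHA-1 digest, check before delivering) can be sketched as follows. The slide only states that the ID is a SHA-1 digest of the bundle; the JSON serialization, the `Neighbor` class, and all function names are assumptions, and the hop_count limit is left out to keep the sketch short.

```python
import hashlib
import json


def make_bundle(entries, seen_entry_ids):
    """Keep only new entries (each entry is a dict with an 'id' key)."""
    return [e for e in entries if e["id"] not in seen_entry_ids]


def bundle_id(bundle):
    """SHA-1 digest of the bundle; the canonical JSON form is an assumption."""
    payload = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class Neighbor:
    """Hypothetical in-memory stand-in for a remote peer."""

    def __init__(self, name, subscriptions):
        self.name = name
        self.subscriptions = set(subscriptions)
        self.bundles = {}  # bundle id -> bundle

    def has_bundle(self, did):
        return did in self.bundles

    def deliver(self, did, bundle):
        self.bundles[did] = bundle


def forward(feed, bundle, neighbors):
    """Send the bundle to subscribed neighbors that do not have it yet."""
    did = bundle_id(bundle)
    delivered = []
    for n in neighbors:
        if feed not in n.subscriptions:
            continue              # limit the boundary of delivery
        if n.has_bundle(did):
            continue              # cheap ID check avoids a redundant transfer
        n.deliver(did, bundle)
        delivered.append(n.name)
    return delivered
```

Exchanging the 40-character digest first means the full bundle crosses the network only when the neighbor actually lacks it.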

Challenges – Free riding prevention – • Nodes may manifest selfish behavior • Only receive, without forwarding • Lie about their subscription set to become a preferred neighbor • Solution: Provide a neighbor evaluation method • Contribution metric • Credits nodes that forward feeds that I subscribe to, or that my near neighbors subscribe to • Level of contribution: direct subscription, 1-hop subscription, 2-hop subscription, … • cmi,j += wf−hf • Cut out unhelpful neighbors: I helped it, but it did not help me • di,j = cmi,j − cmj,i • Feature • Uses local information only → the mechanism is easy to implement and enforce
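
The deficit test d(i,j) = cm(i,j) − cm(j,i) can be sketched with a purely local ledger. Several things here are assumptions: that `received` plays the role of one contribution metric and `given` the other, that a neighbor is dropped when the balance falls below a `threshold`, and all names. The per-forward credit is passed in as `amount` rather than computed, because the sketch does not try to reproduce the weighting formula from the slide.

```python
from collections import defaultdict


class ContributionLedger:
    """Per-node bookkeeping built from local observations only."""

    def __init__(self):
        self.received = defaultdict(float)  # credit a neighbor earned by helping me
        self.given = defaultdict(float)     # credit I earned by helping that neighbor

    def credit_received(self, neighbor, amount):
        """A neighbor forwarded something useful to me."""
        self.received[neighbor] += amount

    def credit_given(self, neighbor, amount):
        """I forwarded something useful to a neighbor."""
        self.given[neighbor] += amount

    def balance(self, neighbor):
        """Positive when the neighbor helped me more than I helped it."""
        return self.received[neighbor] - self.given[neighbor]

    def unhelpful(self, threshold):
        """Neighbors whose balance is below the (assumed) cut-off."""
        peers = set(self.received) | set(self.given)
        return sorted(p for p in peers if self.balance(p) < threshold)
```

Because both counters are maintained by the node itself, a selfish peer cannot improve its standing by misreporting; it can only do so by actually forwarding entries.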

Challenges – Entry searching – • Overlay as a distributed storage • Iterative searching • Strong points: searching latency, query traffic • Recursive searching (flooding) • Strong points: low overhead for the requester, caching for popular queries, can be reflected in neighbor evaluation (?)

Benefits of FeedEx • Scalability • Archivability • Storage of entries • Controllability • Compared to web-based readers, e.g. fetch interval • Filtering and recommendation • Share opinions on entries (e.g. voting) • Feed recommendation • Privacy • Users can fetch documents for others → anonymizes actual users

Architecture of FeedEx • Prototype: Python • Networking: Twisted • Protocol: XML-RPC • Interoperability, fast prototyping • Entry storage: SQLite (lightweight RDB) • RSS parser: feedparser.org
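
To give a feel for how these pieces fit together, here is a minimal sketch that stores entries in SQLite and exposes two procedures over XML-RPC. It deliberately uses the standard library's `xmlrpc.server` instead of Twisted, which the prototype used. The procedure names `check_did` and `put_entries` appear in the communication-cost results of this deck, but their signatures, their semantics, and the table schema shown here are assumptions.

```python
import sqlite3
from xmlrpc.server import SimpleXMLRPCServer


class EntryStore:
    """SQLite-backed entry storage (schema is an illustrative assumption)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "did TEXT, entry_id TEXT PRIMARY KEY, title TEXT)"
        )

    def check_did(self, did):
        """Assumed semantics: is a bundle with this digest ID already stored?"""
        row = self.db.execute(
            "SELECT 1 FROM entries WHERE did = ? LIMIT 1", (did,)
        ).fetchone()
        return row is not None

    def put_entries(self, did, entries):
        """Store a bundle's entries; return how many rows the bundle now has."""
        with self.db:  # commit on success, roll back on error
            self.db.executemany(
                "INSERT OR IGNORE INTO entries (did, entry_id, title) "
                "VALUES (?, ?, ?)",
                [(did, e["id"], e["title"]) for e in entries],
            )
        count = self.db.execute(
            "SELECT COUNT(*) FROM entries WHERE did = ?", (did,)
        ).fetchone()[0]
        return count


def serve(store, host="127.0.0.1", port=8000):
    """Expose the store over XML-RPC (host and port are placeholders)."""
    server = SimpleXMLRPCServer((host, port), allow_none=True)
    server.register_function(store.check_did, "check_did")
    server.register_function(store.put_entries, "put_entries")
    server.serve_forever()
```

A peer would call `check_did` first and follow up with `put_entries` only when the answer is false, which matches the check-before-forwarding idea described earlier in the deck.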

Experimental Setup • Two modes • Stand-alone mode → SLN • FeedEx mode → XCH • Metrics • Time lag • Missing entries • Communication cost • Experiments • Used 189 PlanetLab nodes • Ran for 22 hours on a weekday • Primary factor: 6 fetching intervals • Each node subscribes to 20 out of 70 feeds

Results: Time Lag • Average time lag • Average of node averages • Without applying the adaptive fetching algorithm → Regardless of the fetching interval, contents are delivered quickly • [Figure annotation: 15.8 times]

Results: Missing Entries • Rate of missing entries • # of entries in a node / # of entries in a reference node • Low missing rate • Despite problems in the network (DNS errors or routing errors) • Sometimes better than the reference node

Results: Communication Cost • Two most frequently called procedures: check_did, put_entries • check_did call: single IP packet • put_entries: 2 calls / minute → delivers 2.67 entries / call • Low communication cost

Critique • Strong points • Framed a new problem within an old domain, “web caching” • Not affected by delay or failure of nodes • Drew out possible benefits/extensions • Simple! • Practically deployable • Tried to find a mechanism that is good for both servers and clients

Critique • Weak points • Is there really overload due to RSS feed delivery? • Only a small text file is delivered • Should have considered podcasting (multimedia RSS) • Will the clients donate their resources? • Is “short delay” a strong incentive? • Is “low bandwidth consumption” a strong incentive? • Will people’s subscription sets really overlap a lot? • Not effective for SPs providing diverse RSS feeds • e.g. Naver blog, egloos • Is it really robust to frequent leaves and joins? • Lack of server-side evaluation • Server load & network resources • Delivering critical data (e.g. timely news) using RSS?

Supplementary slides

Entry Lifetime • Generally CNN, • Publishers have policies (probably)

New idea • Topic based feed pub/sub system • Why should we register the address of a feed? • Need to find addresses providing contents I want • A feed may contain contents that I don’t want • [Figure: topic of interest (maybe tags?); topic based feed pub/sub (P2P based); contents related to the topic; feeds; web content providers]

New idea • Topic based feeding services have already launched • Baebo • Creates new feeds by keyword from the Amazon, Yahoo, and eBay feeds • Say4 • Extracts entries containing sentences from the Bible from the BBC feed • But a centralized server runs the service • Limit on the number of input feeds • Hard to add input feeds dynamically compared to the P2P approach
