CSC407: Software Architecture Winter 2007 Peer to Peer

CSC407: Software Architecture • Winter 2007 • Peer to Peer • Greg Wilson • BA 4234 • gvwilson@cs.utoronto.ca

Overview • A peer-to-peer (P2P) system is one that relies on the computing power and bandwidth of all participating machines, rather than on that of a relatively small number of distinguished servers • Each servent (SERVer + cliENT) has the same capabilities, and fills the same role • Fluid, asynchronous membership • No single point of failure • Or censorship

Centralization • First architectural issue is how the overlay network is structured • Purely decentralized: all nodes perform exactly the same tasks • Partially centralized: some otherwise-normal nodes (temporarily) play a special role • “You’re the boss today” • Hybrid decentralized: central server(s) coordinate or bootstrap the P2P overlay network

Network Structure • Is the overlay network completely ad hoc, or are rules followed when adding nodes? • Unstructured: placement of content and capabilities is completely arbitrary • Means that content and capabilities must somehow be located each time they’re needed • Best for highly-transient populations with simple requirements • Structured: content placement follows rules • Rules help participants know where to look for things • More scalable, but only if you know exactly what you’re looking for
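A minimal sketch of the difference between the two placement styles, assuming a flat list of node names and using SHA-1 only as a convenient way to map names and keys onto one number line. The function names and the "closest hash wins" rule are illustrative assumptions, not the scheme of any particular system.

```python
import hashlib
import random


def _h(text: str) -> int:
    """Map a string onto a large integer so names and keys share one space."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def place_unstructured(key: str, nodes: list, rng: random.Random) -> str:
    """Unstructured: content lands anywhere, so lookups must search."""
    return rng.choice(nodes)


def place_structured(key: str, nodes: list) -> str:
    """Structured: a rule (here, nearest hash) says where the key lives."""
    target = _h(key)
    return min(nodes, key=lambda node: abs(_h(node) - target))


nodes = ["node-a", "node-b", "node-c", "node-d"]
home = place_structured("report.pdf", nodes)
```

Any participant can recompute `place_structured` and go straight to the right node, but only for the exact key: a slightly different string hashes somewhere unrelated, which is the "only if you know exactly what you're looking for" caveat.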

SETI@Home • Screensaver looking for signals from outer space • Anyone can participate… • …but reliant on central servers • Is it P2P or client/server? • Does it matter to anyone besides the marketing team?

Napster • A relatively small number of distinguished servers provide an indexing service • Once files are located, further communication between network participants is direct • [Diagram: Napster, "you", and "me", connected by steps numbered 1 to 4]
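The split between central indexing and direct transfer can be sketched roughly as follows. The class and method names are hypothetical, and the real Napster protocol is not reproduced here: the point is only that the server stores locations, never file contents.

```python
from collections import defaultdict


class IndexServer:
    """Central index: remembers which peers claim to share which files."""

    def __init__(self):
        self._where = defaultdict(set)  # filename -> addresses of peers sharing it

    def register(self, peer_addr, filenames):
        """Called by a peer when it signs in and announces its shared files."""
        for name in filenames:
            self._where[name].add(peer_addr)

    def unregister(self, peer_addr):
        """Called when a peer leaves; its entries disappear from the index."""
        for peers in self._where.values():
            peers.discard(peer_addr)

    def locate(self, filename):
        """Return peer addresses only; the download itself is peer-to-peer."""
        return sorted(self._where.get(filename, ()))


server = IndexServer()
server.register("10.0.0.5:6699", ["song.mp3", "talk.mp3"])
server.register("10.0.0.9:6699", ["song.mp3"])
sources = server.locate("song.mp3")
```

Once `locate` has answered, the requesting peer contacts one of the returned addresses directly, so the server carries lookup traffic but none of the file transfer load.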

Instant Messaging • Most IM systems work the same way as Napster • Signing in tells the system where you are located • Communication with your friends can then travel point-to-point • Q1: where does account information live? • In particular, is it replicated or not? • Q2: how are multi-party chats implemented? • Centralized, leader/follower, broadcast, or other?

Gnutella • No centralization of any kind • Protocol (on top of TCP/IP) uses four message types: • Ping: ask a host if it’s a member of the network • Pong: confirmation (including IP and port, and inventory of files being shared) • Query: what to look for, and speed requirements • Query Hits: IP, port, and speed of host, number of matching files, and an indexed result set
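One way to picture the four message types is as plain records. This is a sketch of the fields as described on the slide, not the Gnutella wire format; every field name, the `message_id` field, and the helper `answer_query` are assumptions made for illustration, and the inventory is assumed to be a simple file count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    message_id: str          # "are you a member of the network?"


@dataclass(frozen=True)
class Pong:
    message_id: str
    ip: str
    port: int
    files_shared: int        # summary of the inventory (assumed to be a count)


@dataclass(frozen=True)
class Query:
    message_id: str
    search: str              # what to look for
    min_speed_kbps: int      # speed requirement


@dataclass(frozen=True)
class QueryHits:
    message_id: str
    ip: str
    port: int
    speed_kbps: int
    results: tuple           # indexed result set: (index, filename) pairs


def answer_query(query, ip, port, speed_kbps, shared):
    """Build a QueryHits reply, or None if this host is too slow or has no match."""
    if speed_kbps < query.min_speed_kbps:
        return None
    matches = tuple(
        (i, name) for i, name in enumerate(shared)
        if query.search.lower() in name.lower()
    )
    if not matches:
        return None
    return QueryHits(query.message_id, ip, port, speed_kbps, matches)
```

The number of matching files is simply `len(reply.results)`; the reply reuses the query's ID so it can be matched to the request that caused it.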

…Gnutella • Bootstrap via gnutellahosts.com • Ping any node to “get on the network” • Use flood (broadcast) to find files • Ask your neighbors, who ask their neighbors • Prevent overload by including a time to live (TTL) header in each message • Use unique message IDs to prevent cycles • Once a file is found, download point-to-point
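The flooding search, with its two safeguards, can be simulated in a few lines. This is a sketch over an in-memory graph rather than real sockets; the function name, the dictionaries, and the message counter are assumptions for illustration. The `seen` set stands in for "I have already handled this message ID".

```python
from collections import deque


def flood_query(neighbors, files, origin, wanted, ttl):
    """Flood a query outward from origin; return (nodes with the file, messages sent)."""
    seen = {origin}                    # nodes that already handled this message ID
    hits = []
    messages = 0
    frontier = deque([(origin, ttl)])
    while frontier:
        node, remaining = frontier.popleft()
        if wanted in files.get(node, ()):
            hits.append(node)
        if remaining == 0:
            continue                   # TTL exhausted: do not forward further
        for peer in neighbors[node]:
            messages += 1              # every forward costs bandwidth...
            if peer in seen:
                continue               # ...even when the duplicate is dropped
            seen.add(peer)
            frontier.append((peer, remaining - 1))
    return hits, messages


neighbors = {
    "a": ["b", "d"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["a", "c", "e"],
    "e": ["d"],
}
files = {"e": {"song.mp3"}}
```

On this five-node graph a TTL of 1 sends 2 messages and finds nothing, while a TTL of 2 finds the file on `e` at a cost of 7 messages, several of which are duplicates that the message-ID check discards.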

Random Walks • Flooding (even with TTL) quickly overloads the network • Use random walks instead • Message wanders around until it finds the desired file • Works best with proactive object replication • Eventually evolve into distributed agent systems • Move a bit of code from place to place instead of trying to squeeze the query into a straitjacket
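For comparison with flooding, a single random walker can be sketched as below. The function name, the step budget, and the in-memory graph are illustrative assumptions; a real system would typically launch several walkers and check back with the originator.

```python
import random


def random_walk(neighbors, files, origin, wanted, max_steps, rng):
    """Hop to one random neighbor at a time; return (node with the file, hops used)."""
    node = origin
    steps = 0
    while True:
        if wanted in files.get(node, ()):
            return node, steps
        if steps == max_steps:
            return None, steps         # budget spent without finding the file
        node = rng.choice(neighbors[node])
        steps += 1


neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
files = {"b": {"song.mp3"}}
found = random_walk(neighbors, files, "a", "song.mp3", 10, random.Random(1))
```

Each hop costs exactly one message, so the load is bounded by `max_steps` rather than growing with the number of neighbors. The walk succeeds sooner when more nodes hold a copy, which is why proactive replication helps.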

Kazaa • Some nodes elect to be supernodes • Chosen based on bandwidth and processing power • Nodes may opt out (configuration file) • Supernodes index the files shared by peers connected to them, and proxy search requests • Reduces discovery time • Takes advantage of heterogeneity • Without introducing single point of failure
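A rough sketch of the supernode idea, assuming one level of forwarding between supernodes. The class, the eligibility thresholds, and the `cpu_score` measure are all hypothetical; Kazaa's actual protocol and election rules are not reproduced here.

```python
class Supernode:
    """Indexes files for the ordinary peers attached to it and proxies searches."""

    def __init__(self, name):
        self.name = name
        self.index = {}     # filename -> set of attached peers sharing it
        self.others = []    # other supernodes this one forwards searches to

    def attach(self, peer, filenames):
        """An ordinary peer connects and reports what it shares."""
        for filename in filenames:
            self.index.setdefault(filename, set()).add(peer)

    def search(self, filename, forward=True):
        """Answer from the local index, then ask other supernodes once."""
        found = set(self.index.get(filename, ()))
        if forward:
            for other in self.others:
                found |= other.search(filename, forward=False)
        return found


def may_be_supernode(bandwidth_kbps, cpu_score, opted_out,
                     min_bandwidth_kbps=512, min_cpu_score=50):
    """Eligibility check; the thresholds are made-up illustrative values."""
    return (not opted_out
            and bandwidth_kbps >= min_bandwidth_kbps
            and cpu_score >= min_cpu_score)
```

A search touches a handful of well-provisioned supernodes instead of every peer, and because any capable node can take on the role, losing one supernode only orphans its attached peers until they reconnect elsewhere.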

Freenet • Loosely structured: nodes can estimate which other node is most likely to store certain content • Use chain mode propagation to forward messages along the most likely path • Each file identified by three keys: • Simplest is hash of short descriptive text string • Files are placed at nodes possessing files with similar keys (and replicated) • Propagation radius limited

…Freenet • Search messages are propagated most-likely-first • When successful replies come back, intermediate nodes remember them to speed up future searches • Freenet also supports indirect files • Named according to likely keyword searches • “Content” is a reference to the real file • Distributed equivalent of symbolic links (?)

…Freenet • Nodes tend to specialize in searching for similar keys over time • Nodes store similar keys over time (due to caching of files after successful queries) • System stays balanced because similarity of keys does not reflect similarity of files • Routing independent of underlying network topology
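The most-likely-first search and the caching on the reply path can be sketched as follows. This is a simplification under stated assumptions: keys are small integers and "similar" means numerically close, whereas Freenet keys are hashes; the `Node` class, its fields, and the `visited` set are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}     # key -> data held (or cached) at this node
        self.routes = {}    # key -> neighbor believed to hold keys near it

    def request(self, key, hops_to_live, visited=None):
        """Depth-first search, trying the most promising neighbor first."""
        visited = set() if visited is None else visited
        visited.add(self.name)
        if key in self.store:
            return self.store[key]
        if hops_to_live == 0:
            return None                              # propagation radius reached
        # Order neighbors by how close their known key is to the wanted key.
        candidates = sorted(self.routes.items(),
                            key=lambda item: abs(item[0] - key))
        for _known_key, neighbor in candidates:
            if neighbor.name in visited:
                continue                             # avoid loops
            data = neighbor.request(key, hops_to_live - 1, visited)
            if data is not None:
                self.store[key] = data               # cache on the way back
                self.routes[key] = neighbor          # remember who answered
                return data
        return None                                  # every path failed: backtrack


a, b, c = Node("a"), Node("b"), Node("c")
a.routes = {10: b, 90: c}
b.store = {12: "contents of file 12"}
```

After `a.request(12, 3)` succeeds, node `a` holds its own copy and a routing entry for key 12, so the next request for that key, or for one near it, is answered faster. This is the mechanism behind nodes drifting toward specialization.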

Replication • Passive: occurs naturally as nodes copy files • Cache-based: keep a copy of everything that passes through you • Active: proactively migrate content to: • Balance load • Reduce search radius • Accommodate failure

Validating Content • How to be sure the file you’re downloading is the file that was uploaded? • Self-certifying data: data is indexed by a hash of its content • Doesn’t support fuzzy or partial lookup • Separate forwarding from storing, so that file location(s) are hidden • Only defers the problem
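Self-certifying data can be illustrated with a content hash used as the lookup key. This sketch assumes the key is the SHA-256 digest of the file's bytes; the function names are hypothetical and real systems differ in the hash and key format they use.

```python
import hashlib
import hmac


def key_for(data: bytes) -> str:
    """The key under which the data is published: a hash of its content."""
    return hashlib.sha256(data).hexdigest()


def verify(key: str, data: bytes) -> bool:
    """The downloader recomputes the hash and compares it with the key it asked for."""
    return hmac.compare_digest(key_for(data), key)


original = b"the file that was uploaded"
key = key_for(original)
```

Anyone who retrieves data under `key` can check it without trusting the node that served it. The cost is the one noted above: the key reveals nothing about the content, so the scheme cannot answer fuzzy or partial queries.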

…Validating Content • What about malicious routing? • A node joins the network, then pretends to be forwarding messages when in fact it’s responding locally with fabricated data • We don’t have an answer to this even on centralized systems

Garbage Collection • Every file system is eventually 99% full • Owner deletes? • Hard in asynchronous network • Content expiration? • Requires confirmation that the file found is the file the user was searching for • In practice, people don’t fill in forms

Anonymity • The raison d’être of many P2P systems • Politics, payment, and porn • May want to anonymize any or all of: • The author/publisher of content • The identity of a node storing content • The identity and details of the content itself • The details of a query for content retrieval

…Anonymity • Freenet replies retrace the request’s steps to make tracing as difficult as possible • Any node in the chain can claim to be the source, or claim that someone else was • Hops-to-live value is randomized to obscure search radius • OceanStore and PAST store encrypted content without keys • How you get the key is your business

Incentives • The tragedy of the commons: everyone wants to be a client, no-one wants to be a server • Paying people per download will bankrupt you • eBay centralized reputations? • EigenTrust collates upload histories from a dynamic set of servents • Possible (though not easy) to lie • Resource trading becoming popular • But again, how to verify?

Legal Issues • Is this part of software architecture? • Accessibility and safety are part of physical architecture • Never mind the hosting: to what extent are the designers and coders responsible? • If you make a bomb, you’re an accomplice to murder • What if you publish a description of how to smuggle pamphlets past a dictator’s border guards?
