CC5212-1 Procesamiento Masivo de Datos Otoño 2014

CC5212-1 Procesamiento Masivo de Datos, Otoño 2014
Aidan Hogan (aidhog@gmail.com)
Lecture III: 2014/03/23

Lab 1.1: Mensaje • New deadline: Tuesday 10am  • −1 (out of 10) for every day late after that

TYPES OF DISTRIBUTED SYSTEMS …

Client–Server Model • Client makes request to server • Server acts and responds (For example: Email, WWW, Printing, etc.)

Client–Server: Three-Tier Server
• Presentation: receives “HTTP GET: Total salary of all employees”; creates HTML page
• Logic: adds all the salaries
• Data: answers “SQL: Query salary of all employees”
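
The three tiers above can be sketched in a few lines. This is only an illustration under assumptions: the table name `employees`, the sample rows, and the function names are made up, and an in-memory SQLite database stands in for the data tier.

```python
import sqlite3


def make_database():
    """Data tier: an in-memory SQLite table (hypothetical sample rows)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    db.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("Ana", 1000), ("Luis", 1500), ("Sofia", 2000)],
    )
    return db


def query_salaries(db):
    """Data tier: 'SQL: Query salary of all employees'."""
    return [row[0] for row in db.execute("SELECT salary FROM employees")]


def total_salary(db):
    """Logic tier: 'Add all the salaries'."""
    return sum(query_salaries(db))


def render_page(db):
    """Presentation tier: 'Create HTML page' for the client's request."""
    return "<html><body>Total salary: {}</body></html>".format(total_salary(db))


print(render_page(make_database()))
```

Each tier only talks to the one directly beneath it, so any of the three could in principle be moved to its own machine.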

Peer-to-Peer: Unstructured
• Example request sent to peers: “Pixie’s new album?”
• (For example: Kazaa, Gnutella)

Peer-to-Peer: Structured (DHT)
• Circular DHT (nodes 000, 001, 010, 011, 100, 101, 110, 111 in a ring):
  • Only aware of neighbours
  • O(n) lookups
• Implement shortcuts:
  • Skips ahead
  • Enables binary-search-like behaviour
  • O(log(n)) lookups
• Example request: “Pixie’s new album?” (node 111)
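
A rough sketch of why shortcuts help, counting hops on a ring of n nodes. The assumptions here are mine, not the slide's: messages travel in one direction only, and the "shortcuts" let a node jump ahead by any power of two (similar in spirit to finger tables). Function names are hypothetical.

```python
def hops_neighbours_only(start, target, n):
    """Forward the lookup one neighbour at a time around the ring."""
    hops = 0
    node = start
    while node != target:
        node = (node + 1) % n
        hops += 1
    return hops


def hops_with_shortcuts(start, target, n):
    """At each node, take the largest power-of-two jump that does not overshoot."""
    hops = 0
    node = start
    while node != target:
        distance = (target - node) % n
        jump = 1
        while jump * 2 <= distance:
            jump *= 2
        node = (node + jump) % n
        hops += 1
    return hops


# The 8-node ring from the slide: look up node 111 starting at node 000.
print(hops_neighbours_only(0b000, 0b111, 8))  # 7 hops
print(hops_with_shortcuts(0b000, 0b111, 8))   # 3 hops: 000 -> 100 -> 110 -> 111
```

With 1024 nodes the same lookup from node 0 to node 1023 takes 1023 hops with neighbours only, but 10 with shortcuts.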

Desirable Criteria for Distributed Systems
• Transparency: appears as one machine
• Flexibility: supports more machines, more applications
• Reliability: system doesn’t fail when a machine does
• Performance: quick runtimes, quick processing
• Scalability: handles more machines/data efficiently

LIMITATIONS OF DISTRIBUTED SYSTEMS: EIGHT FALLACIES

Eight Fallacies
• By L. Peter Deutsch (1994)
• James Gosling (1997)
“Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause big trouble and painful learning experiences.” (L. Peter Deutsch)
• Each fallacy is a false statement!

Based on our experience, what might these fallacies of distributed computing be?

1. The network is reliable
Machines fail, connections fail, firewalls eat messages
• flexible routing • retry messages • acknowledgements!
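
A minimal sketch of "retry messages" plus "acknowledgements", using a simulated lossy channel. Everything here is an assumption for illustration: the loss rate, the function names, and the simplification that only the message (never the acknowledgement) can be lost.

```python
import random


def unreliable_send(message, inbox, rng, loss_rate):
    """Simulated network: drops the message with probability loss_rate.

    Returns True (an acknowledgement) only if the message was delivered.
    """
    if rng.random() < loss_rate:
        return False  # message lost, so no acknowledgement comes back
    inbox.append(message)
    return True


def send_with_retries(message, inbox, rng, loss_rate=0.5, max_attempts=10):
    """Resend until acknowledged; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if unreliable_send(message, inbox, rng, loss_rate):
            return attempt
    raise TimeoutError("no acknowledgement after {} attempts".format(max_attempts))


inbox = []
attempts = send_with_retries("hello", inbox, random.Random(42))
print(attempts, inbox)
```

In a real network the acknowledgement itself can be lost, so the receiver may see the same message twice and has to cope with duplicates.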

2. Latency is zero
There are significant communication delays
[Figure: machines M1 and M2; “M1: Store X” and “M2: Copy X from M1”]
• avoid “races” • local order ≠ remote order • acknowledgements • minimise remote calls • batch data! • avoid waiting • multiple threads
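
A back-of-the-envelope cost model for "minimise remote calls" and "batch data". The numbers (50 ms latency per call, 1 ms of transfer per item) and the function name are made up for illustration.

```python
def total_time_ms(num_items, latency_ms, per_item_ms, batch_size):
    """Time to send num_items when every remote call pays a fixed latency."""
    num_calls = -(-num_items // batch_size)  # ceiling division
    return num_calls * latency_ms + num_items * per_item_ms


# 1000 items, one remote call per item:
print(total_time_ms(1000, latency_ms=50, per_item_ms=1, batch_size=1))    # 51000
# The same 1000 items, sent in batches of 100:
print(total_time_ms(1000, latency_ms=50, per_item_ms=1, batch_size=100))  # 1500
```

The data transferred is identical in both cases; only the number of times the latency is paid changes.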

3. Bandwidth is infinite
The amount of data that can be transferred is limited
[Figure: machines M1 and M2; the request “M1: Copy X (10GB)” appears twice]
• avoid resending data • direct connections • caching!!

4. The network is secure
Network is vulnerable to hackers, eavesdropping, viruses, etc.
[Figure: request “M1: Send Medical History” sent to machine M1]
• send sensitive data directly • isolate hacked nodes • hack one node ≠ hack all nodes • authenticate messages • secure connections
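
One common way to "authenticate messages" is a keyed hash (HMAC); the slide does not name a technique, so this choice is an assumption. The key and the function names `sign` and `verify` are hypothetical, and how two machines come to share the key is not covered.

```python
import hashlib
import hmac

SHARED_KEY = b"demo-key-not-for-production"  # hypothetical key known to both machines


def sign(message, key=SHARED_KEY):
    """Sender: compute an authentication tag to send along with the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message, tag, key=SHARED_KEY):
    """Receiver: accept the message only if the tag matches."""
    return hmac.compare_digest(sign(message, key), tag)


message = b"M1: Send Medical History"
tag = sign(message)
print(verify(message, tag))                      # True
print(verify(b"M1: Send Credit Card", tag))      # False: message was altered
```

This detects tampering but does not hide the message contents; that would need encryption ("secure connections").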

5. Topology doesn’t change
How machines are physically connected may change (“churn”)!
[Figure: machines M1 to M5; “Message M5 thru M2, M3, M4”]
• avoid fixed routing • next-hop routing? • abstract physical addresses • flexible content structure

6. There is one administrator
Different machines have different policies!
• Beware of firewalls! • Don’t assume most recent version • Backwards compatibility

7. Transport cost is zero
It costs time/money to transport data: not just bandwidth
(Again) • minimise redundant data transfer • avoid shuffling data • caching • direct connection • compression?
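
A quick check of the "compression?" idea, using Python's standard `zlib` module on a made-up, highly repetitive payload. Real data will usually compress less well than this.

```python
import zlib

# Hypothetical payload: repetitive text compresses very well.
payload = ("la casa de la playa " * 500).encode("utf-8")

compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)

print(len(payload), "bytes before,", len(compressed), "bytes after")
print(restored == payload)  # True: compression is lossless
```

The trade-off is CPU time on both ends in exchange for fewer bytes on the wire.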

8. The network is homogeneous
Devices and connections are not uniform
• interoperability! • Java vs. .NET? • route for speed, not hops • load-balancing

Eight Fallacies (to avoid)
• Severity of fallacies varies in different scenarios!
Which fallacies apply/do not apply for:
• Gigabit ethernet LAN? • BitTorrent • The Web • Laboratorio II
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

LIMITATIONS OF DISTRIBUTED COMPUTING: CAP THEOREM

But first … ACID
Have you heard of ACID guarantees in a database class?
For traditional (non-distributed) databases …
• Atomicity: transactions all or nothing: fail cleanly
• Consistency: doesn’t break constraints/rules
• Isolation: parallel transactions act as if sequential
• Durability: system remembers changes
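
Atomicity and Consistency can be seen in a small sketch with an in-memory SQLite database. The `accounts` table, the balances, and the `transfer` function are hypothetical; the rule "balance may not go negative" plays the role of the constraint.

```python
import sqlite3


def make_accounts():
    """Create a hypothetical accounts table with a 'no negative balance' rule."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
    )
    db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    db.commit()
    return db


def balances(db):
    return dict(db.execute("SELECT name, balance FROM accounts"))


def transfer(db, src, dst, amount):
    """Move money in one transaction: both updates happen, or neither does."""
    try:
        with db:  # commits on success, rolls back if an exception is raised
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # constraint broken: the whole transaction was undone


db = make_accounts()
print(transfer(db, "alice", "bob", 500), balances(db))  # fails cleanly, nothing changed
print(transfer(db, "alice", "bob", 30), balances(db))   # succeeds
```

In the failed transfer the credit to bob is applied first and then rolled back when the debit breaks the constraint, which is the "all or nothing" behaviour.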

What is CAP?
Three guarantees a distributed system could make:
• Consistency: all nodes have a consistent view of the system
• Availability: every read/write is acted upon
• Partition-tolerance: the system works even if messages are lost

A Distributed System (Replication)
[Figure: machines holding the key ranges A–E, F–J, K–S and T–Z; each range appears twice]

Consistency
[Figure: the replicated ranges A–E, F–J, K–S, T–Z; two machines both report “There’s 891 users in ‘M’”]

Availability
[Figure: the replicated ranges A–E, F–J, K–S, T–Z; question “How many users start with ‘M’”, answer shown: 891]

Partition-Tolerance
[Figure: the replicated ranges A–E, F–J, K–S, T–Z; question “How many users start with ‘M’”, answer shown: 891]

The CAP Question
Can a distributed system guarantee consistency (all nodes have the same up-to-date view), availability (every read/write is acted upon) and partition-tolerance (the system works even if messages are lost) at the same time?
What do you think?

The CAP Answer

The CAP “Proof”
[Figure: the replicated ranges A–E, F–J, K–S, T–Z. Labels: “There’s 891 users in ‘M’” (twice) and “There’s 892 users in ‘M’”. Question: “How many users start with ‘M’”. Answer shown: 891]

The CAP “Proof” (in boring words)
• Consider machines m1 and m2 on either side of a partition:
  • If an update is allowed on m2 (Availability), then m1 cannot see the change (loses Consistency)
  • To make sure that m1 and m2 have the same, up-to-date view (Consistency), neither m1 nor m2 can accept any requests/updates (loses Availability)
  • Thus, only when m1 and m2 can communicate (loses Partition-tolerance) can Availability and Consistency be guaranteed
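
The argument above can be acted out with a toy model of two replicas, m1 and m2. This is a sketch under assumptions: the `Cluster` class, its `mode` switch and the `partitioned` flag are invented for illustration, and the "CP" mode follows the wording above by refusing all requests during a partition.

```python
class Replica:
    def __init__(self, count):
        self.count = count  # e.g. number of users starting with 'M'


class Cluster:
    """Two replicas that normally copy every update to each other."""

    def __init__(self, mode, count=891):
        self.mode = mode          # "CP" or "AP" (hypothetical switch)
        self.m1 = Replica(count)
        self.m2 = Replica(count)
        self.partitioned = False  # True: m1 and m2 cannot exchange messages

    def write(self, replica, new_count):
        if not self.partitioned:
            self.m1.count = self.m2.count = new_count  # update reaches both
            return True
        if self.mode == "CP":
            return False           # refuse: stays consistent, loses availability
        replica.count = new_count  # AP: accept locally, replicas now disagree
        return True

    def read(self, replica):
        if self.partitioned and self.mode == "CP":
            return None            # refuse rather than risk a stale answer
        return replica.count


ap = Cluster("AP")
ap.partitioned = True
ap.write(ap.m2, 892)
print(ap.read(ap.m1), ap.read(ap.m2))  # 891 892: available but inconsistent

cp = Cluster("CP")
cp.partitioned = True
print(cp.write(cp.m2, 892), cp.read(cp.m1))  # False None: consistent but unavailable
```

Without a partition, both modes behave the same: the write reaches both replicas and every read returns 892.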

The CAP Theorem
A distributed system cannot guarantee consistency (all nodes have the same up-to-date view), availability (every read/write is acted upon) and partition-tolerance (the system works even if messages are lost) at the same time.
(“Proof” as shown on previous slide)

The CAP Triangle
[Figure: triangle with corners C, A and P, labelled “Choose Two”]

CAP Systems
[Figure: C, A and P as overlapping regions; the three-way overlap is marked “No intersection”]
• CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
• CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
• AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

CA System
[Figure: the replicated ranges A–E, F–J, K–S, T–Z. Labels: “There’s 891 users in ‘M’” (twice) and “There’s 892 users in ‘M’” (twice). Question: “How many users start with ‘M’”. Answer shown: 892]

CP System
[Figure: the replicated ranges A–E, F–J, K–S, T–Z. Labels: “There’s 891 users in ‘M’” (twice). Question: “How many users start with ‘M’”. Answer shown: 891]

AP System
[Figure: the replicated ranges A–E, F–J, K–S, T–Z. Labels: “There’s 891 users in ‘M’” (twice) and “There’s 892 users in ‘M’”. Question: “How many users start with ‘M’”. Answer shown: 891]

BASE (AP)
• Basically Available: pretty much always “up”
• Soft State: replicated, cached data
• Eventual Consistency: stale data tolerated, for a while
• Amazon, eBay, Google, DNS …

The CAP Theorem
• C,A in CAP ≠ C,A in ACID
• Simplified model:
  • Partitions are rare
  • Systems may be a mix of CA/CP/AP
  • C/A/P often continuous in reality!
• But concept useful/frequently discussed:
  • How to handle Partitions?
  • Availability? or
  • Consistency?

LABS PREP: AIDAN LEARNS SPANISH

Help me learn Spanish!
What are the top 500 most common words in Spanish?
[Figure: table with columns “Word” and “Count”]

Help me learn Spanish! • How should we design the distributed system? • (for now it will be in-memory) • How can we distribute the word count? • How can we call the machines / send the data? • How can we merge the word counts? • How to implement in the lab?
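
One possible answer to the questions above, sketched in memory on a single machine: split the lines into chunks (one per "machine"), count each chunk independently, then merge the partial counts. This is not the lab solution; the function names and sample sentences are made up, and actually sending chunks to remote machines is left out.

```python
from collections import Counter


def split_into_chunks(lines, num_machines):
    """Deal the lines out round-robin, one chunk per (simulated) machine."""
    return [lines[i::num_machines] for i in range(num_machines)]


def count_words(lines):
    """What each machine does on its own chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(partial_counts):
    """Naive merge: add up the partial counts from every machine."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def top_words(lines, num_machines, k):
    chunks = split_into_chunks(lines, num_machines)
    partials = [count_words(chunk) for chunk in chunks]  # independent per chunk
    return merge_counts(partials).most_common(k)


lines = ["la casa de la playa", "el perro de la casa", "la playa"]
print(top_words(lines, num_machines=2, k=1))  # [('la', 4)]
```

Because each chunk is counted without looking at any other chunk, the counting step could run on separate machines in parallel; only the merge needs all the partial results in one place.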

RECAP

Distributed Systems have limitations
• Eight fallacies and what they mean:
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Distributed Systems have limitations
CAP Theorem: A distributed system cannot guarantee consistency (all nodes have the same up-to-date view and will give a correct answer), availability (every request is acted upon) and partition-tolerance (the system works even if messages are lost) at the same time.

CAP Systems
[Figure: C, A and P as overlapping regions; the three-way overlap is marked “No intersection”]
• CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
• CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
• AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

Design of a Distributed Algorithm • How to distribute/split data for processing • Embarrassingly parallel execution • How to merge data (naively for now) • How to help me learn Spanish

Questions?
