This overview explores the evolution of Google's search engine, starting from its inception as BackRub in 1996 with a mere 24 GB of storage, to the monumental advancements by 2012. It discusses the traditional design principles favoring supercomputers versus distributed desktop CPUs, highlighting Google's innovative parallel processing using simple, cost-effective machines. The text provides insights into Google's processing model, emphasizing how the company handles failures, maintains reliability, and maximizes performance through systems like the Google File System, BigTable, and MapReduce.
CSC313: Advanced Programming Topics
Map-Reduce: Win or Epic Win
Brief History of Google
• BackRub: 1996
• 4 disk drives
• 24 GB total storage
Brief History of Google
• Google: 1998
• 44 disk drives
• 366 GB total storage
Traditional Design Principles
• If big enough, a supercomputer can process the workload
  • Uses desktop CPUs, just a lot more of them
  • Also provides huge bandwidth to memory, equivalent to many machines’ bandwidth at once
• But supercomputers are VERY, VERY expensive
  • Maintenance is also expensive once the machine is bought
  • You do get something for the money: high quality == low downtime
• The safe, expensive solution to very large problems
How Was Search Performed?
• Browser requests http://www.yahoo.com/search?p=pager
• DNS resolves www.yahoo.com to 209.191.122.70
• Request becomes http://209.191.122.70/search?p=pager and goes to that server
Google’s Big Insight
• Performing search is “embarrassingly parallel”
  • No need for a supercomputer and all that expense
  • Can instead do this using lots & lots of desktops
  • Identical effective bandwidth & performance
• But the problem is that desktop machines are unreliable
  • Budget for 2 replacements, since machines are cheap
  • Just expect failure; software provides the quality
Brief History of Google
• Google: 2012
• ?0,000 total servers
• ??? PB total storage
How Is Search Performed Now?
• Request: http://209.85.148.100/search?q=android
• The query fans out to many specialized services at once:
  • Spell checker
  • Ad server
  • Document servers (TBs of data)
  • Index servers (TBs of data)
Google’s Processing Model
• Buy cheap machines & prepare for the worst
  • Machines are going to fail, but this is still the cheaper approach
• Important steps keep the whole system reliable
  • Replicate data so that information losses are limited
  • Move data freely so loads can always be rebalanced
• These decisions lead to many other benefits
  • Scalability helped by the focus on balancing
  • Search speed improved; performance much better
  • Resources fully utilized, since search demand varies
Heterogeneous Processing
• Buying the cheapest computers means variance is high
  • Programs must handle both homogeneous & heterogeneous systems
  • A centralized work queue helps cope with different machine speeds (see the sketch below)
• This approach also has a few small downsides
  • Space
  • Power consumption
  • Cooling costs
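A minimal sketch of the work-queue idea in Java, with threads standing in for machines; the class and task names here are hypothetical, not Google’s actual implementation. Fast workers simply pull tasks more often, so load balances itself without any central scheduler knowing each machine’s speed.

import java.util.concurrent.*;

public class WorkQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // One shared queue; workers of any speed pull from it.
        BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) {
            final int id = i;
            tasks.add(() -> System.out.println("processed task " + id));
        }
        // Simulate a heterogeneous cluster with 4 worker threads.
        ExecutorService cluster = Executors.newFixedThreadPool(4);
        for (int w = 0; w < 4; w++) {
            cluster.submit(() -> {
                Runnable task;
                // Each worker takes the next task whenever it is free;
                // a fast worker naturally ends up doing more tasks.
                while ((task = tasks.poll()) != null) {
                    task.run();
                }
            });
        }
        cluster.shutdown();
        cluster.awaitTermination(1, TimeUnit.MINUTES);
    }
}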
Complexity at Google
• Avoid this nightmare using abstractions
Google Abstractions
• Google File System
  • Handles replication to provide scalability & durability
• BigTable
  • Manages large structured data sets
• Chubby
  • Gonna skip past that joke; distributed locking service
• MapReduce
  • If the job fits, easy parallelism possible without much work
MapReduce Overview
• Programming model provides a good Façade
  • Automatic parallelization & load balancing
  • Network and disk I/O optimization
  • Robust performance even if machines fail
• Idea came from 2 Lisp (functional) primitives, illustrated below
  • Map: process each entry in a list using some function
  • Reduce: recombine the data using a given function
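The same two primitives exist in Java’s streams, which gives a compact illustration of the idea: map transforms every element independently, then reduce folds the results into one value.

import java.util.List;

public class MapReducePrimitives {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");
        int totalLetters = words.stream()
                .map(String::length)        // Map: word -> its length
                .reduce(0, Integer::sum);   // Reduce: combine lengths into a sum
        System.out.println(totalLetters);   // prints 13
    }
}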
Typical MapReduce Problem
• Read lots and lots of data (e.g., TBs)
• Map
  • Extract important data from each entry in the input
• Combine Maps and sort entries by key
• Reduce
  • Process each key’s entries to get the result for that key
• Output final result & watch the money roll in
The template method is always the same; just the hook methods change (sketched below)
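A toy sketch of that template-method view in Java, assuming everything fits in memory; the class MiniMapReduce and its generics are hypothetical teaching names, not Google’s API. The run() template method is fixed, while map() and reduce() are the hook methods each job supplies.

import java.util.*;

public abstract class MiniMapReduce<K, V, K2, V2, R> {
    // Hook methods: the only parts that change from job to job.
    protected abstract List<Map.Entry<K2, V2>> map(K key, V value);
    protected abstract R reduce(K2 key, List<V2> values);

    // Template method: map every input, shuffle by key, reduce each group.
    public Map<K2, R> run(Map<K, V> input) {
        Map<K2, List<V2>> grouped = new HashMap<>();
        for (Map.Entry<K, V> e : input.entrySet()) {
            for (Map.Entry<K2, V2> out : map(e.getKey(), e.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
            }
        }
        Map<K2, R> result = new HashMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}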
Ex: Count Word Frequencies
• Map processes files separately & counts word frequencies in each
  • Input: Key=URL, Value=text on page
  • Output: Key’=word, Value’=count
• In the shuffle step, Maps are combined & entries sorted by key
• Reduce combines each key’s entries to compute the final output
  • (“be”, “1”), (“be”, “1”) → (“be”, “2”)
  • (“or”, “1”) → (“or”, “1”)
  • (“not”, “1”) → (“not”, “1”)
  • (“to”, “1”), (“to”, “1”) → (“to”, “2”)
Word Frequency Pseudo-code

Map(String input_key, String input_values) {
  // input_key: URL; input_values: text on the page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, "1");   // each occurrence counts as 1
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word; intermediate_values: all the "1"s emitted for it
  int result = 0;
  foreach v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(result);                 // total count for this word
}
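Assuming the toy MiniMapReduce skeleton sketched earlier (again, hypothetical names, not Google’s actual API), the same job becomes a small subclass, which is exactly the template-method payoff:

import java.util.*;

public class WordCount extends MiniMapReduce<String, String, String, Integer, Integer> {
    @Override
    protected List<Map.Entry<String, Integer>> map(String url, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split(" ")) {
            out.add(Map.entry(w, 1));         // one entry per occurrence
        }
        return out;
    }

    @Override
    protected Integer reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;      // sum the 1s for this word
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of("page1", "to be or not to be");
        System.out.println(new WordCount().run(pages));
        // e.g. {not=1, be=2, or=1, to=2}
    }
}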
Ex: Build Search Index
• Map processes files separately & records the words found on each
  • Input: Key=URL, Value=text on page
  • Output: Key’=word, Value’=URL
• To get the search Map, Reduce combines each key’s results
  • Output: Key=word, Value=URLs containing that word
Search Index Pseudo-code

Map(String input_key, String input_values) {
  // input_key: URL; input_values: text on the page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, input_key);   // word -> URL it appears on
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word; intermediate_values: URLs containing it
  List result = new ArrayList();
  foreach v in intermediate_values {
    result.add(v);                    // collect every URL for this word
  }
  Emit(result);
}
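For contrast with the word-count job, the same hypothetical skeleton could express the index builder; only the hook methods change, which is the whole point of the template:

import java.util.*;

public class IndexBuilder extends MiniMapReduce<String, String, String, String, List<String>> {
    @Override
    protected List<Map.Entry<String, String>> map(String url, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : text.split(" ")) {
            out.add(Map.entry(w, url));       // word -> URL it appears on
        }
        return out;
    }

    @Override
    protected List<String> reduce(String word, List<String> urls) {
        return new ArrayList<>(urls);         // all URLs containing this word
    }
}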
Ex: Page Rank Computation
• Google’s algorithm for ranking pages’ relevance
• Map: input is Key=<URL, rank>, Value=links on page
  • For each of a page’s N links, emit Key’=link, Value’=<URL, rank/N>
• Reduce: for each URL, sum the incoming Value’=<src, rank/N> contributions to compute its new rank
• Iterate until the ranks converge (a sketch of one iteration follows below)
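A minimal sketch of one iteration as a map/reduce-style computation, assuming an in-memory graph and ignoring the damping factor and dangling links for brevity; all names are hypothetical, and the reduce (summing) is folded into the same loop via merge().

import java.util.*;

public class PageRankStep {
    public static Map<String, Double> step(Map<String, List<String>> links,
                                           Map<String, Double> ranks) {
        Map<String, Double> incoming = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            // Map: each page sends rank/N to every page it links to.
            double share = ranks.getOrDefault(e.getKey(), 0.0) / e.getValue().size();
            for (String target : e.getValue()) {
                // Reduce: sum the shares arriving at each page.
                incoming.merge(target, share, Double::sum);
            }
        }
        return incoming;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> ranks = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        for (int i = 0; i < 10; i++) ranks = step(links, ranks);
        System.out.println(ranks);   // ranks shift toward heavily linked pages
    }
}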