This overview explores the evolution of Google's search engine, starting from its inception as BackRub in 1996 with a mere 24 GB of storage, to the monumental advancements by 2012. It discusses the traditional design principles favoring supercomputers versus distributed desktop CPUs, highlighting Google's innovative parallel processing using simple, cost-effective machines. The text provides insights into Google's processing model, emphasizing how the company handles failures, maintains reliability, and maximizes performance through systems like the Google File System, BigTable, and MapReduce.
CSC313: Advanced Programming Topics
Map-Reduce: Win or Epic Win
Brief History of Google
• BackRub: 1996
• 4 disk drives
• 24 GB total storage
Brief History of Google
• Google: 1998
• 44 disk drives
• 366 GB total storage
Traditional Design Principles
• If big enough, a supercomputer can process the workload
  • Uses desktop CPUs, just a lot more of them
  • Also provides huge bandwidth to memory, equivalent to many machines’ bandwidth at once
• But supercomputers are VERY, VERY expensive
  • Maintenance is also expensive once the machine is bought
  • You do get something for the money: high quality == low downtime
• The safe, expensive solution to very large problems
How Was Search Performed?
• Browser requests http://www.yahoo.com/search?p=pager
• DNS resolves www.yahoo.com to 209.191.122.70
• Request becomes http://209.191.122.70/search?p=pager and goes to that server
Google’s Big Insight
• Performing search is “embarrassingly parallel”
  • No need for a supercomputer and all that expense
  • Can instead do this using lots & lots of desktops
  • Identical effective bandwidth & performance
• But the problem is that desktop machines are unreliable
  • Budget for 2 replacements, since machines are cheap
  • Just expect failure; software provides the quality
Brief History of Google
• Google: 2012
• ?0,000 total servers
• ??? PB total storage
How Is Search Performed Now?
• Request: http://209.85.148.100/search?q=android
• The query fans out to many specialized services at once:
  • Spell checker
  • Ad server
  • Document servers (TBs of data)
  • Index servers (TBs of data)
Google’s Processing Model
• Buy cheap machines & prepare for the worst
  • Machines are going to fail, but this is still the cheaper approach
• Important steps keep the whole system reliable
  • Replicate data so that information losses are limited
  • Move data freely so loads can always be rebalanced
• These decisions lead to many other benefits
  • Scalability helped by the focus on balancing
  • Search speed improved; performance much better
  • Resources fully utilized, since search demand varies
Heterogeneous Processing
• Buying the cheapest computers means variance is high
  • Programs must handle both homogeneous & heterogeneous systems
  • A centralized work queue helps cope with different machine speeds (see the sketch below)
• This approach also has a few small downsides
  • Space
  • Power consumption
  • Cooling costs
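A minimal sketch of the work-queue idea in Java, with threads standing in for machines; the class and task names here are hypothetical, not Google’s actual implementation. Fast workers simply pull tasks more often, so load balances itself without any central scheduler knowing each machine’s speed.

import java.util.concurrent.*;

public class WorkQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // One shared queue; workers of any speed pull from it.
        BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) {
            final int id = i;
            tasks.add(() -> System.out.println("processed task " + id));
        }
        // Simulate a heterogeneous cluster with 4 worker threads.
        ExecutorService cluster = Executors.newFixedThreadPool(4);
        for (int w = 0; w < 4; w++) {
            cluster.submit(() -> {
                Runnable task;
                // Each worker takes the next task whenever it is free;
                // a fast worker naturally ends up doing more tasks.
                while ((task = tasks.poll()) != null) {
                    task.run();
                }
            });
        }
        cluster.shutdown();
        cluster.awaitTermination(1, TimeUnit.MINUTES);
    }
}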
Complexity at Google
• Avoid this nightmare using abstractions
Google Abstractions
• Google File System
  • Handles replication to provide scalability & durability
• BigTable
  • Manages large structured data sets
• Chubby
  • Gonna skip past that joke; distributed locking service
• MapReduce
  • If the job fits, easy parallelism possible without much work
MapReduce Overview
• Programming model provides a good Façade
  • Automatic parallelization & load balancing
  • Network and disk I/O optimization
  • Robust performance even if machines fail
• Idea came from 2 Lisp (functional) primitives, illustrated below
  • Map: process each entry in a list using some function
  • Reduce: recombine the data using a given function
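The same two primitives exist in Java’s streams, which gives a compact illustration of the idea: map transforms every element independently, then reduce folds the results into one value.

import java.util.List;

public class MapReducePrimitives {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");
        int totalLetters = words.stream()
                .map(String::length)        // Map: word -> its length
                .reduce(0, Integer::sum);   // Reduce: combine lengths into a sum
        System.out.println(totalLetters);   // prints 13
    }
}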
Typical MapReduce Problem
• Read lots and lots of data (e.g., TBs)
• Map
  • Extract important data from each entry in the input
• Combine Maps and sort entries by key
• Reduce
  • Process each key’s entries to get the result for that key
• Output final result & watch the money roll in
The template method is always the same; just the hook methods change (sketched below)
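A toy sketch of that template-method view in Java, assuming everything fits in memory; the class MiniMapReduce and its generics are hypothetical teaching names, not Google’s API. The run() template method is fixed, while map() and reduce() are the hook methods each job supplies.

import java.util.*;

public abstract class MiniMapReduce<K, V, K2, V2, R> {
    // Hook methods: the only parts that change from job to job.
    protected abstract List<Map.Entry<K2, V2>> map(K key, V value);
    protected abstract R reduce(K2 key, List<V2> values);

    // Template method: map every input, shuffle by key, reduce each group.
    public Map<K2, R> run(Map<K, V> input) {
        Map<K2, List<V2>> grouped = new HashMap<>();
        for (Map.Entry<K, V> e : input.entrySet()) {
            for (Map.Entry<K2, V2> out : map(e.getKey(), e.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
            }
        }
        Map<K2, R> result = new HashMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}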
Ex: Count Word Frequencies
• Map processes files separately & counts word frequencies in each
  • Input: Key=URL, Value=text on page
  • Output: Key’=word, Value’=count
• In the shuffle step, Maps are combined & entries sorted by key
• Reduce combines each key’s entries to compute the final output
  • (“be”, “1”), (“be”, “1”) → (“be”, “2”)
  • (“or”, “1”) → (“or”, “1”)
  • (“not”, “1”) → (“not”, “1”)
  • (“to”, “1”), (“to”, “1”) → (“to”, “2”)
Word Frequency Pseudo-code

Map(String input_key, String input_values) {
  // input_key: URL; input_values: text on the page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, "1");   // each occurrence counts as 1
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word; intermediate_values: all the "1"s emitted for it
  int result = 0;
  foreach v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(result);                 // total count for this word
}
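Assuming the toy MiniMapReduce skeleton sketched earlier (again, hypothetical names, not Google’s actual API), the same job becomes a small subclass, which is exactly the template-method payoff:

import java.util.*;

public class WordCount extends MiniMapReduce<String, String, String, Integer, Integer> {
    @Override
    protected List<Map.Entry<String, Integer>> map(String url, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split(" ")) {
            out.add(Map.entry(w, 1));         // one entry per occurrence
        }
        return out;
    }

    @Override
    protected Integer reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;      // sum the 1s for this word
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of("page1", "to be or not to be");
        System.out.println(new WordCount().run(pages));
        // e.g. {not=1, be=2, or=1, to=2}
    }
}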
Ex: Build Search Index
• Map processes files separately & records the words found on each
  • Input: Key=URL, Value=text on page
  • Output: Key’=word, Value’=URL
• To get the search Map, Reduce combines each key’s results
  • Output: Key=word, Value=URLs containing that word
Search Index Pseudo-code

Map(String input_key, String input_values) {
  // input_key: URL; input_values: text on the page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, input_key);   // word -> URL it appears on
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word; intermediate_values: URLs containing it
  List result = new ArrayList();
  foreach v in intermediate_values {
    result.add(v);                    // collect every URL for this word
  }
  Emit(result);
}
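For contrast with the word-count job, the same hypothetical skeleton could express the index builder; only the hook methods change, which is the whole point of the template:

import java.util.*;

public class IndexBuilder extends MiniMapReduce<String, String, String, String, List<String>> {
    @Override
    protected List<Map.Entry<String, String>> map(String url, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : text.split(" ")) {
            out.add(Map.entry(w, url));       // word -> URL it appears on
        }
        return out;
    }

    @Override
    protected List<String> reduce(String word, List<String> urls) {
        return new ArrayList<>(urls);         // all URLs containing this word
    }
}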
Ex: Page Rank Computation
• Google’s algorithm for ranking pages’ relevance
• Map: input is Key=<URL, rank>, Value=links on page
  • For each of a page’s N links, emit Key’=link, Value’=<URL, rank/N>
• Reduce: for each URL, sum the incoming Value’=<src, rank/N> contributions to compute its new rank
• Iterate until the ranks converge (a sketch of one iteration follows below)
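A minimal sketch of one iteration as a map/reduce-style computation, assuming an in-memory graph and ignoring the damping factor and dangling links for brevity; all names are hypothetical, and the reduce (summing) is folded into the same loop via merge().

import java.util.*;

public class PageRankStep {
    public static Map<String, Double> step(Map<String, List<String>> links,
                                           Map<String, Double> ranks) {
        Map<String, Double> incoming = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            // Map: each page sends rank/N to every page it links to.
            double share = ranks.getOrDefault(e.getKey(), 0.0) / e.getValue().size();
            for (String target : e.getValue()) {
                // Reduce: sum the shares arriving at each page.
                incoming.merge(target, share, Double::sum);
            }
        }
        return incoming;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> ranks = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        for (int i = 0; i < 10; i++) ranks = step(links, ranks);
        System.out.println(ranks);   // ranks shift toward heavily linked pages
    }
}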