„Big data” Benczúr András MTA SZTAKI
Big Data – the new hype • “big data” is when the size of the data itself becomes part of the problem • “big data” is data that becomes large enough that it cannot be processed using conventional methods • Google sorts 1 PB in 33 minutes (07-09-2011) • Amazon S3 store contains 499B objects (19-07-2011) • New Relic: 20B+ application metrics/day (18-07-2011) • Walmart monitors 100M entities in real time (12-09-2011) Source: The Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT-2011.4.4 Info day in Luxembourg on 26 September 2011
Big Data Planes [Figure: applications and tools placed along two axes, Speed (batch to realtime) and Size (MegaByte to PetaByte). Labels: Big Data Services, Big analytics, Fast data. Applications: fraud detection, media pricing, power station sensors, Web content, navigation, mobility, news curation, online reputation, IT logs; custom hardware, custom software. Tools: Matlab, Revolution, GraphLab, SPSS, SAS, R, MOA, SciPy, KDB, Mahout, Vertica, S4, Netezza, Esper, HBase, Greenplum, MapR, InfoBright, Progress, MySql, Hadoop]
Overview • Introduction – Buzzwords • Part I: Background • Examples • Mobility and navigation traces • Sensors, smart city • IT logs • Wind power • Scientific and Business Relevance • Part II: Infrastructures • NoSQL, Key-value store, Hadoop, H*, Pregel, … • Part III: Algorithms • Brief History of Algorithms • Web processing, PageRank with algorithms • Streaming algorithms • Entity resolution – detailed comparison (QDB 2011)
Navigation and Mobility Traces • Streaming data at mobile base stations • Privacy issues • Regulations let only anonymized data leave beyond network operations and billing • Do regulation policy makers know about deanonymization attacks? • What your wife/husband will not know – your mobile provider will
Sensors – smart home, city, country, … • Road and parking slot sensors • Mobile parking traces • Public transport, Oyster cards • Bike hire schemes Source: Internet of Things Comic Book, http://www.smartsantander.eu/images/IoT_Comic_Book.pdf
Corporate IT log processing • Traditional methods fail • Aggregation into Data Warehouse? • Our experience: 30-100+ GB/day, 3-60 M events • Identify bottlenecks • Optimize procedures • Detect misuse, fraud, attacks
Scientific and business relevance • VLDB 2011 (~100 papers): • 6 papers on MapReduce/Hadoop, 10 on big data (+keynote), 11 NoSQL architectures, 6 GPS/sensory data • tutorials, demos (Microsoft, SAP, IBM NoSQL tools) • session: Big Data Analysis, MapReduce, Scalable Infrastructures • EWEA 2011: 28% of papers on wind power raise data size issues • SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics • Gartner 2011 trend No. 5: Next Generation Analytics - „significant changes to existing operational and business intelligence infrastructures” • The Economist 2010.02.27: „Monstrous amounts of data … Information is transforming traditional businesses” • News special issue on Big Data this April
New challenges in database technologies Question of research and practice: Applicability to a specific problem? Applicability as a general technique?
Overview • Part I: Background • Examples • Scientific and Business Relevance • Part II: Infrastructures • NoSQL • Key-value stores • Hadoop and tools built on Hadoop • Bulk Synchronous Parallel, Pregel • Streaming, S4 • Part III: Algorithms, Examples
Now come many external slide shows … • NoSQL introduction – www.intertech.com/resource/usergroup/NoSQL.ppt • Key-value stores • BerkeleyDB – not distributed • Voldemort – behemoth.strlen.net/~alex/voldemort-nosql_live.ppt • Cassandra, Dynamo, … • Also exists on top of Hadoop (below): HBase • Hadoop – slides by Miki Erdélyi • HBase – datasearch.ruc.edu.cn/course/cloudcomputing20102/slides/Lec07.ppt • Cascading – will not be covered • Mahout – cwiki.apache.org/MAHOUT/faq.data/Mahout%20Overview.ppt • Why do we need something else? What else do we need? • Bulk Synchronous Parallel • GraphLab – Danny Bickson's slides • MOA – http://www.slideshare.net/abifet/moa-5636332/download • Streaming • S4 – http://www.slideshare.net/alekbr/s4-stream-computing-platform
Bulk Synchronous Parallel architecture HAMA: a Pregel clone
Use of large matrices • Main step in all distributed algorithms • Network based features in classification • Partitioning for efficient algorithms • Exploring the data, navigation (e.g. ranking to select a nice compact subgraph) • Hadoop apps (e.g. PageRank) move the entire data around in each iteration • Baseline C++ code keeps data local [Figure: comparison of Hadoop, Hadoop + KeyValue store, and best C++ custom code]
BSP vs. MapReduce • MapReduce: Data locality not preserved between Map and Reduce invocations or MapReduce iterations. • BSP: Tailored towards processing data with locality. • Proprietary: Google Pregel • Open source (promising, but with several flaws at present): HAMA • Home developed C++ code base • Both: Easy parallelization and distribution. Sidlo et al., Infrastructures and bounds for distributed entity resolution. QDB 2011
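To make the BSP contrast concrete, here is a minimal single-process sketch of superstep execution. It is not HAMA or Pregel code: the function name `bsp_max_propagation` and the max-propagation task are assumptions chosen for illustration. Each vertex keeps its own state across supersteps (the data locality that MapReduce iterations lose), and messages sent in one superstep become visible only after the barrier.

```python
from collections import defaultdict

def bsp_max_propagation(graph, values):
    """Spread the largest vertex value through a graph in BSP supersteps.

    graph:  dict vertex -> list of neighbours (hypothetical toy input)
    values: dict vertex -> initial number
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = defaultdict(list)
    for v, nbrs in graph.items():
        for n in nbrs:
            inbox[n].append(values[v])
    while inbox:
        outbox = defaultdict(list)
        # Computation: a vertex reads only its own state and its inbox.
        for v, msgs in inbox.items():
            best = max(msgs)
            if best > values[v]:
                values[v] = best
                # Communication: messages are buffered, not delivered yet.
                for n in graph[v]:
                    outbox[n].append(best)
        # Barrier: buffered messages become the next superstep's inbox.
        inbox = outbox
    return values

path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(bsp_max_propagation(path, {"a": 3, "b": 6, "c": 2, "d": 1}))
```

The loop ends when no vertex sends a message, which plays the role of the "vote to halt" step in Pregel-style systems.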
Overview • Part I: Background • Part II: Infrastructures • NoSQL, Key-value store, Hadoop, H*, Pregel, … • Part III: Algorithms, Examples, Comparison • Data and computation intense tasks, architectures • History of Algorithms • Web processing, PageRank with algorithms • Entity resolution – detailed with algorithms • Summary, conclusions
Types of Big Data problems • Data intense: • Web processing, info retrieval, classification • Log processing (telco, IT, supermarket, …) • Compute intense: • Expectation Maximization, Gaussian mixture decomposition, image retrieval, … • Genome matching, phylogenetic trees, … • Data AND compute intense: • Network (Web, friendship, …) partitioning, finding similarities, centers, hubs, … • Singular value decomposition
Hardware • Data intense: Map-reduce (Hadoop), cloud, … • Compute intense: • Shared memory • Message passing • Processor arrays, … • → became an affordable choice recently, as graphics co-processors! • Data AND compute intense??
Big data: Why now? • Hardware is just getting better, cheaper? • But data is getting larger, easier to access • Bad news for algorithms slower than ~ linear
Moore’s Law: doubling in 18 months But in a key aspect, the trend has changed! From speed to number of cores
„Numbers Everyone Should Know” Capacity: • Disk: 10+ TB • RAM: 100+ GB • CPU: L2 1+ MB, L1 10+ KB • GPU onboard memory: global 4-8 GB, block shared 10+ KB Latency: • RAM • L1 cache reference 0.5 ns • L2 cache reference 7 ns • Main memory reference 100 ns • Read 1 MB sequentially from memory 250,000 ns • Intra-process communication • Mutex lock/unlock 100 ns • Read 1 MB sequentially from network 10,000,000 ns • Disk • Disk seek 10,000,000 ns • Read 1 MB sequentially from disk 30,000,000 ns Jeff Dean, Google
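As a rough illustration of what these latencies mean at scale, the sketch below extrapolates the three "read 1 MB sequentially" figures from the slide to 1 TB on a single reader. The linear extrapolation and the helper name `sequential_read_seconds` are assumptions; real systems add seeks, contention and parallelism.

```python
# Per-megabyte sequential read costs in nanoseconds, as quoted on the slide.
NS_PER_MB = {"memory": 250_000, "network": 10_000_000, "disk": 30_000_000}

def sequential_read_seconds(megabytes, medium):
    """Time to stream `megabytes` from one medium, assuming linear scaling."""
    return megabytes * NS_PER_MB[medium] / 1e9

for medium in NS_PER_MB:
    secs = sequential_read_seconds(1_000_000, medium)  # 1 TB = 10^6 MB
    print(f"{medium:8s} {secs:8.0f} s  ({secs / 3600:.2f} h)")
```

Under these assumptions a single disk needs more than eight hours for one pass over a terabyte, which is why data-intense tasks are spread over many disks.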
Back to Databases, this means … • Read 1 MB sequentially… • memory 250,000 ns • network 10,000,000 ns • disk 30,000,000 ns [Figure: number of transactions/second versus number of CPUs, with a diagram of CPUs and memory (M) units; linear speed-up (ideal) compared with sub-linear speed-up; data labels 1000/sec, 1600/sec, 2000/sec at 5, 10, 16 CPUs] • Cost • Security • Integrity control more difficult • Lack of standards • Lack of experience • Complexity of management and control • Increased storage requirements • Increased training cost Connolly, Begg: Database systems: a practical approach to design, implementation, and management, International computer science series, Pearson Education, 2005
“The brief history of Algorithms” [Timeline figure; labels: P, NP • Thinking Machines: hypercube, … • PRAM theoretic models • External memory algs • SIMD, MIMD, message passing • CM-5: many vector procs • Map-reduce (Google) • Multi-core • Many-core • Cloud • Flash disk • Cray: vector processors]
Earliest history: P, NP [Figure: example graphs with edge weights] • P: Graph traversal, Spanning tree • NP: Steiner trees
Why do we care about graphs, trees? • Image segmentation • Entity Resolution [Figure: records with name, ID and e-mail fields]
History of algs: spanning trees in parallel • iterative minimum spanning forest • every node is a tree at start; every iteration merges trees Bentley: A parallel algorithm for constructing minimum spanning trees, 1980. Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU, 2009 [Figure: example graph]
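The "every node is a tree at start; every iteration merges trees" scheme can be sketched sequentially as follows. This is a Borůvka-style loop, which matches the description on the slide; whether the cited papers use exactly this formulation is an assumption, and `boruvka_msf` is a name chosen for illustration. In a parallel setting the search for each tree's cheapest outgoing edge is the part that runs independently per tree.

```python
def boruvka_msf(num_nodes, edges):
    """Minimum spanning forest; edges is a list of (weight, u, v) tuples."""
    parent = list(range(num_nodes))  # every node starts as its own tree

    def find(x):
        # Root of the tree containing x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        # Step 1: each tree picks its cheapest edge leaving the tree.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                # Tuple comparison gives a consistent tie-break on equal weights.
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            return forest  # no edge connects two different trees any more
        # Step 2: merge trees along the selected edges.
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((w, u, v))

edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
print(sorted(boruvka_msf(4, edges)))
```

Each round at least halves the number of trees in every connected component, so the number of rounds is logarithmic in the number of nodes.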
Overview • Part I: Background • Part II: Infrastructures • NoSQL, Key-value store, Hadoop, H*, Pregel, … • Part III: Algorithms, Examples, Comparison • Data and computation intense tasks, architectures • History of Algorithms • Web processing, PageRank with algorithms • Streaming algorithms • Entity resolution – detailed with algorithms • Summary, conclusions
Posted by John Klossner on Aug 03, 2009 • WEB 1.0 (browsers) – Users find data • WEB 2.0 (social networks) – Users find each other • WEB 3.0 (semantic Web) – Data find each other • WEB 4.0 – Data create their own Facebook page, restrict friends. • WEB 5.0 – Data decide they can work without humans, create their own language. • WEB 6.0 – Human users realize that they no longer can find data unless invited by data. • WEB 7.0 – Data get cheaper cell phone rates. • WEB 8.0 – Data horde all the good YouTube videos, leaving human users with access to bad ’80s music videos only. • WEB 9.0 – Data create and maintain own blogs, are more popular than human blogs. • WEB 10.0 – All episodes of Battlestar Galactica will now be shown from the Cylons’ point of view. The Web is about data too Big Data interpretation: recommenders, personalization, info extraction
Building a Virtual Web Observatory on large temporal data of Internet archives
Partner approaches to hardware • Hanzo Archives (UK): Amazon EC2 cloud + S3 • Internet Memory Foundation: 50 low-end servers • We: indexing 3 TB compressed, 0.5B pages • Open source tools not yet mature • One week of processing on 50 old dual cores • Hardware worth approx. €10,000; Amazon price around €5000
Documents stored in HBase tables over the Hadoop file system (HDFS) • Indexing: • 200 instances of our own C++ search engine • 40 Lucene instances, then our own search engine on the top 50,000 hits • SolR? Katta? Do they really work in real time? Ranking? • Realistic: even spam is important! • Text REtrieval Conference measurement • Spam • Obvious parallelization: each node processes all pages of one host • Link features (e.g. PageRank) cannot be computed in this way M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)
Distributed storage: HBase vs WARC files • WARC • Many, many medium-sized files, very inefficient w/ Hadoop • Either huge block size wasting space • Or data locality lost as blocks may continue at non-local HDFS node • HBase • Data locality preserving ranges – cooperation w/ Hadoop • Experiments up to 3 TB compressed Web data • WARC to HBase • One-time expensive step, no data locality • One-by-one inserts fail, very low performance • MapReduce jobs to create HFiles, the native HBase format • HFile transfer • HBase insertion rate 100,000 per hour
The Random Surfer Model Starts at a random page, arrives at quality page u Nodes = Web pages Edges = hyperlinks
PageRank: The Random Surfer Model Chooses random neighbor with probability $1 - \varepsilon$
The Random Surfer Model Or with probability $\varepsilon$ “teleports” to random page: gets bored and types a new URL
The Random Surfer Model And continues with the random walk …
The Random Surfer Model Until convergence … ? [Brin, Page 98]
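PageRank as Quality A quality page is pointed to by several quality pages $PR^{(k+1)} = PR^{(k)}\,\bigl((1-\varepsilon)\,M + \varepsilon\,U\bigr) = PR^{(1)}\,\bigl((1-\varepsilon)\,M + \varepsilon\,U\bigr)^{k}$, where $M$ is the random-walk transition matrix, $U$ the uniform teleport matrix and $\varepsilon$ the teleport probability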
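The iteration on this slide can be sketched in a few lines of plain Python. This is an illustrative power iteration, not the code used in the talk: the teleport probability 0.15, the fixed iteration count and the uniform handling of pages without out-links are all assumptions.

```python
def pagerank(out_links, eps=0.15, iterations=50):
    """Power iteration for PR <- PR * ((1 - eps) * M + eps * U).

    out_links: dict page -> list of pages it links to (every page is a key).
    eps: teleport probability (0.15 is an assumed, commonly used value).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # eps * PR * U: the teleport mass is spread uniformly over all pages.
        nxt = {v: eps / n for v in nodes}
        for u in nodes:
            # Assumption: a page without out-links jumps to a uniform page.
            targets = out_links[u] or nodes
            share = (1 - eps) * pr[u] / len(targets)
            for v in targets:
                nxt[v] += share
        pr = nxt
    return pr

web = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(web).items()):
    print(page, round(score, 3))
```

In this toy graph page c is pointed to by two pages and ends up with the highest score, while b, which nobody links to, keeps only its teleport share.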
Personalized PageRank Or with probability $\varepsilon$ “teleports” to random page, selected from her bookmarks
Algorithmics • Estimated 10+ billion Web pages worldwide • PageRank (as floats) • fits into 40 GB storage • Personalization just to single pages: • 10 billion PageRank scores for each page • Storage exceeds several exabytes! • NB single-page personalization is enough: by linearity, the vector for any set of preferred pages is a weighted combination of single-page vectors
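The storage figures can be checked with a short calculation. It assumes 4-byte floats and exactly 10 billion pages, which is what makes the 40 GB figure come out; both are simplifying assumptions.

```python
pages = 10**10           # assumed: exactly 10 billion Web pages
bytes_per_score = 4      # assumed: one 4-byte float per PageRank score

global_pagerank_bytes = pages * bytes_per_score
full_personalized_bytes = pages * pages * bytes_per_score  # one vector per page

print(global_pagerank_bytes / 10**9, "GB")     # one global PageRank vector
print(full_personalized_bytes / 10**18, "EB")  # all single-page vectors
```

Under these assumptions the full set of single-page vectors needs on the order of hundreds of exabytes, so exact precomputation is out of reach.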
For certain things are just too big? • For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years. • The record for hitch hiking this distance is just under five years, but you don't get to see much on the way. D Adams, The Hitchhiker's Guide to the Galaxy. 1979
Markov Chain Monte Carlo • Reformulation by simple tricks of linear algebra • From u simulate N independent random walks • Database of fingerprints: ending vertices of the walks from all vertices • Query • PPR(u,v) := # ( walks u→v ) / N • N ≈ 1000 approximates top 100 well Fogaras–Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
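A minimal sketch of the fingerprint idea follows, assuming each walk stops with the teleport probability at every step so that its ending vertex is a sample from the personalized PageRank distribution of the start page. The function name, the value 0.15 and the rule for pages without out-links are assumptions for illustration, not details taken from the cited paper.

```python
import random
from collections import Counter

def ppr_fingerprints(out_links, u, num_walks=1000, eps=0.15, rng=None):
    """Estimate PPR(u, v) as the fraction of walks from u that end at v."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    endpoints = Counter()
    for _ in range(num_walks):
        v = u
        # Continue with probability 1 - eps, stop with probability eps.
        while rng.random() >= eps:
            nbrs = out_links[v]
            if not nbrs:
                break  # assumption: a walk ends at a page without out-links
            v = rng.choice(nbrs)
        endpoints[v] += 1  # the ending vertex is the stored fingerprint
    return {v: count / num_walks for v, count in endpoints.items()}

web = {"u": ["v", "w"], "v": ["u"], "w": ["v"]}
estimate = ppr_fingerprints(web, "u")
print(sorted(estimate.items(), key=lambda kv: -kv[1]))
```

Only the N ending vertices per start page need to be stored, which replaces a full score vector per page with a fixed-size fingerprint list.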
SimRank: similarity in graphs “Two pages are similar if pointed to by similar pages” [Jeh–Widom KDD 2002]: • Same trick: path pair summation (can be sampled [Fogaras–Rácz WWW 2005]) over $u = w_0, w_1, \ldots, w_{k-1}, w_k = v_2$ and $u = w'_0, w'_1, \ldots, w'_{k-1}, w'_k = v_1$ • DB application e.g.: Yin, Han, Yu. LinkClus: efficient clustering via heterogeneous semantic links, VLDB '06
Communication complexity bounding • Bit-vector probing (BVP): Alice has a bit vector, input $x = (x_1, x_2, \ldots, x_m)$; Bob has a number, input $1 \le k \le m$; Bob has to find $x_k$ after a communication of $B$ bits • Theorem: $B \ge m$ for any protocol • Reduction from BVP to Exact-PPR-compare: Alice has $x = (x_1, x_2, \ldots, x_m)$, builds a graph $G$ with $V$ vertices, where $V^2 = m$, and pre-computes an Exact PPR data of size $D$; Bob has $1 \le k \le m$, chooses vertices $u, v, w$ and finds $x_k$ by comparing PPR(u,v) with PPR(u,w); the communication is the Exact PPR data, $D$ bits • Thus $D = B \ge m = V^2$
Theory of Streaming algorithms • Distinct values example – Motwani slides • Sequential, RAM algorithms • External memory algorithms • Sampling: negative result • “Sketching” technique
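As one possible illustration of sketching for the distinct values problem, the block below keeps only the k smallest hash values seen in the stream (a "k minimum values" sketch). This particular sketch is chosen here for illustration and is not necessarily the one covered in the referenced slides; the function name and k = 256 are assumptions.

```python
import hashlib
import heapq

def kmv_distinct_estimate(stream, k=256):
    """Estimate the number of distinct items using O(k) memory."""
    heap = []        # negated hashes: a max-heap holding the k smallest values
    members = set()  # hash values currently stored in the heap
    for item in stream:
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform in [0, 1)
        if h in members:
            continue  # repeated item: already represented in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    # If the k-th smallest of n uniform hashes is x, then n is about (k - 1) / x.
    return int((k - 1) / -heap[0])

stream = (i % 5000 for i in range(50_000))  # 5000 distinct values, repeated
print(kmv_distinct_estimate(stream))
```

Memory stays proportional to k regardless of stream length, and the relative error shrinks roughly as one over the square root of k.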
Overview • Part I: Background • Examples • Scientific and Business Relevance • Part II: Foundations Illustrated • Data and computation intense tasks, architectures • History of Algorithms • Web processing, PageRank with algorithms • Streaming algorithms • Entity resolution – detailed with algorithms • Summary, conclusions
Distributed Computing Paradigms and Tools • Distributed Key-Value Stores: • distributed B-tree index for all attributes • Project Voldemort • MapReduce: • map → reduce operations • Apache Hadoop • Bulk Synchronous Parallel: • supersteps: computation → communication → barrier sync • Apache Hama
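The map → reduce pattern listed above can be illustrated with an in-memory word count. This is a single-process sketch of the programming model only, not Hadoop API code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)

def word_count(documents):
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

print(word_count(["big data", "Big analytics"]))
```

Each map call sees one document and each reduce call sees one key, which is what lets a framework distribute both phases; the grouping step in between is where data moves across the network.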