
Hadoop and its Real-world Applications



  1. Hadoop and its Real-world Applications Xiaoxiao Shi, Guan Wang Experience: worked at Yahoo! in summer 2010, developing Hadoop-based machine learning models.

  2. Contents • Motivation of Hadoop • History of Hadoop • The current applications of Hadoop • Programming examples • Research with Hadoop • Conclusions

  3. Motivation of Hadoop • How do you scale up applications? • Run jobs processing hundreds of terabytes of data • It takes 11 days to read that much on 1 computer • Need lots of cheap computers • Fixes the speed problem (15 minutes on 1000 computers), but… • Reliability problems • In large clusters, computers fail every day • Cluster size is not fixed • Need common infrastructure • Must be efficient and reliable

  4. Motivation of Hadoop • Open Source Apache Project • Hadoop Core includes: • Distributed File System - distributes data • Map/Reduce - distributes the application • Written in Java • Runs on • Linux, Mac OS X, Windows, and Solaris • Commodity hardware

  5. Fun Fact of Hadoop "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term." ---- Doug Cutting, Hadoop project creator

  6. History of Hadoop • 2004: Doug Cutting reads the “Map-Reduce” paper: “It is an important technique!” • He extends Apache Nutch with it • 2006: he joins Yahoo! • The great journey begins…

  7. History of Hadoop • Yahoo! became the primary contributor in 2006

  8. History of Hadoop • Yahoo! deployed large scale science clusters in 2007. • Tons of Yahoo! Research papers emerge: • WWW • CIKM • SIGIR • VLDB • …… • Yahoo! began running major production jobs in Q1 2008. • Nowadays…

  9. Nowadays… • When you visit Yahoo!, you are interacting with data processed with Hadoop!

  10. Nowadays… • When you visit Yahoo!, you are interacting with data processed with Hadoop! • Content Optimization • Search Index • Ads Optimization • Content Feed Processing

  11. Nowadays… • When you visit Yahoo!, you are interacting with data processed with Hadoop! • Content Optimization • Search Index • Machine Learning (e.g. spam filters) • Ads Optimization • Content Feed Processing

  12. Nowadays… • Yahoo! has ~20,000 machines running Hadoop • The largest clusters are currently 2000 nodes • Several petabytes of user data (compressed, unreplicated) • Yahoo! runs hundreds of thousands of jobs every month

  13. Nowadays… • Who uses Hadoop? • Amazon/A9 • AOL • Facebook • Fox Interactive Media • Google • IBM • New York Times • Powerset (now Microsoft) • Quantcast • Rackspace/Mailtrust • Veoh • Yahoo! • More at http://wiki.apache.org/hadoop/PoweredBy

  14. Nowadays (job market on Nov 15th)… • Software Developer Intern - IBM - Somers, NY (+3 locations): Agile development; big data / Hadoop / data analytics a plus • Software Developer - IBM - San Jose, CA (+4 locations): includes a Hadoop-powered distributed parallel data processing system, big data analytics … multiple technologies, including Hadoop

  15. It is important • Details…

  16. Nowadays… • Hadoop Core • Distributed File System • MapReduce Framework • Pig (initiated by Yahoo!) • Parallel programming language and runtime • HBase (initiated by Powerset) • Table storage for semi-structured data • ZooKeeper (initiated by Yahoo!) • Coordinating distributed systems • Hive (initiated by Facebook) • SQL-like query language and metastore

  17. HDFS Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance, and the block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time. Hadoop Distributed File System – Goals: • Store large data sets • Cope with hardware failure • Emphasize streaming data access
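  To make the per-file block size and replication factor concrete, here is a minimal Java sketch against the HDFS FileSystem API. The path /user/demo/log.txt and the chosen values are illustrative assumptions, not details from the slides:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // Per-file settings: 3-way replication and a 128 MB block size
      // (both illustrative; the cluster defaults apply if unspecified).
      short replication = 3;
      long blockSize = 128L * 1024 * 1024;
      int bufferSize = 4096;

      // create(path, overwrite, bufferSize, replication, blockSize)
      FSDataOutputStream out = fs.create(
          new Path("/user/demo/log.txt"), true, bufferSize, replication, blockSize);
      out.writeUTF("hello HDFS");  // single writer; HDFS files are write-once
      out.close();
    }
  }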

  18. Typical Hadoop Structure • Commodity hardware • Linux PCs with 4 local disks • Typically a 2-level architecture • 40 nodes/rack • Uplink from each rack is 8 gigabit • Rack-internal is 1 gigabit all-to-all

  19. Hadoop structure • Single namespace for the entire cluster • Managed by a single namenode • Files are single-writer and append-only • Optimized for streaming reads of large files • Files are broken into large blocks • Typically 128 MB • Replicated to several datanodes for reliability • Clients talk to both the namenode and the datanodes • Data is not sent through the namenode • File system throughput scales nearly linearly with the number of nodes • Access from Java, C, or the command line

  20. Hadoop Structure • Java and C++ APIs • The Java API works with objects, the C++ API with raw bytes • Each task can process data sets larger than RAM • Automatic re-execution on failure • In a large cluster, some nodes are always slow or flaky • The framework re-executes failed tasks • Locality optimizations • Map-Reduce queries HDFS for the locations of input data (see the sketch below) • Map tasks are scheduled close to their inputs when possible
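  As a sketch of that locality query (reusing the hypothetical file from the HDFS example above), a client can ask the namenode which datanodes hold each block of a file via FileSystem.getFileBlockLocations; the scheduler relies on exactly this kind of information to place map tasks near their input:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FileStatus status = fs.getFileStatus(new Path("/user/demo/log.txt"));

      // Ask the namenode for the datanodes holding each block of the file.
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        for (String host : block.getHosts()) {  // datanodes with a replica
          System.out.println("block @ " + block.getOffset() + " -> " + host);
        }
      }
    }
  }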

  21. Example of Hadoop Programming • Word Count: • “I like parallel computing. I also took courses on parallel computing… …” • parallel: 2 • computing: 2 • I: 2 • like: 1 • ……

  22. Example of Hadoop Programming • Intuition: design <key, value> • Assume each node will process a paragraph… • Map: • What is the key? • What is the value? • Reduce: • What to collect? • What to reduce?

  23. Word Count Example

  // Imports shared by the code on these three slides (old org.apache.hadoop.mapred API):
  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // Mapper: for each line of input, emit <word, 1> for every token.
  public class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        out.collect(new Text(itr.nextToken()), ONE);
      }
    }
  }

  24. Word Count Example

  // Reducer: sum the 1s emitted for each word.
  public class ReduceClass extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  25. Word Count Example

  // Driver: wire the mapper, combiner, and reducer together and run the job.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);  // local pre-aggregation on each map node
    conf.setReducerClass(ReduceClass.class);

    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setOutputKeyClass(Text.class);           // output keys are words (strings)
    conf.setOutputValueClass(IntWritable.class);  // output values are counts

    JobClient.runJob(conf);
  }
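  A typical invocation, assuming the classes above are packaged into a jar named wordcount.jar (the jar name is an illustrative assumption): hadoop jar wordcount.jar WordCount <input dir> <output dir>. One design choice worth noting: setCombinerClass(ReduceClass.class) reuses the reducer as a combiner, which is safe here because summing counts is associative and commutative; counts get pre-aggregated on each map node, shrinking the data shuffled across the network.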

  26. Hadoop in Yahoo! • The database for Search Assist™ is built using Hadoop • 3 years of log data • A 20-step chain of map-reduce jobs

  27. Related research of Hadoop All just from this year, 2011! • Conference tutorials: • KDD tutorial: “Modeling with Hadoop”, KDD 2011 (a top conference in data mining) • Strata tutorial: “How to Develop Big Data Applications for Hadoop” • OSCON tutorial: “Introduction to Hadoop” • Papers: • Scalable distributed inference of dynamic user interests for behavioral targeting. KDD 2011: 114-122 • Yucheng Low, Deepak Agarwal, Alexander J. Smola: Multiple domain user personalization. KDD 2011: 123-131 • Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, Zhaohui Zheng: Collaborative competitive filtering: learning recommender using context of user choice. SIGIR 2011: 295-304 • Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, Zhaohui Zheng: Scalable clustering of news search results. WSDM 2011: 675-684 • Shuang-Hong Yang, Bo Long, Alexander J. Smola, Narayanan Sadagopan, Zhaohui Zheng, Hongyuan Zha: Like like alike: joint friendship and interest propagation in social networks. WWW 2011: 537-546 • Amr Ahmed, Alexander J. Smola: WWW 2011 invited tutorial overview: latent variable models on the internet. WWW (Companion Volume) 2011: 281-282 • Daniel Hsu, Nikos Karampatziakis, John Langford, Alexander J. Smola: Parallel Online Learning. CoRR abs/1103.4204 (2011) • Neethu Mohandas, Sabu M. Thampi: Improving Hadoop Performance in Handling Small Files. ACC 2011: 187-194 • Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong: Performance Analysis of Hadoop for Query Processing. AINA Workshops 2011: 507-513 • ……

  28. For more information: • http://hadoop.apache.org/ • http://developer.yahoo.com/hadoop/ • Who uses Hadoop?: • http://wiki.apache.org/hadoop/PoweredBy
