
Berkeley Data Analysis Stack


Presentation Transcript


  1. Berkeley Data Analysis Stack: Shark, Bagel

  2. Previous Presentation Summary • Mesos, Spark, Spark Streaming; new apps: AMP-Genomics, Carat, …
     [BDAS stack diagram, top to bottom:]
     • Application: the new apps (AMP-Genomics, Carat, …)
     • Data Processing: in-memory processing; trade between time, quality, and cost
     • Storage / Data Management: efficient data sharing across frameworks
     • Infrastructure / Resource Management: share infrastructure across frameworks (multi-programming for datacenters)

  3. Previous Presentation Summary • Mesos, Spark, Spark Streaming

  4. Spark Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns
     • lines = spark.textFile("hdfs://...")           // base RDD
     • errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
     • messages = errors.map(_.split('\t')(2))
     • cachedMsgs = messages.cache()
     • cachedMsgs.filter(_.contains("foo")).count     // parallel operation: driver ships tasks to workers
     • cachedMsgs.filter(_.contains("bar")).count     // later queries are served from the cached RDD
     • . . .
     [Diagram: the driver sends tasks to workers; each worker reads an HDFS block (Block 1-3) and keeps its cached partition (Cache 1-3) in memory, returning results to the driver]
     Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
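     The fragments above are slide shorthand; a minimal, self-contained version of the same log-mining pattern is sketched below. It assumes the modern org.apache.spark package names, a local master, and an illustrative tab-separated log path, none of which come from the slide (the slide's `spark` variable corresponds to the SparkContext).

         import org.apache.spark.{SparkConf, SparkContext}

         object LogMiningSketch {
           def main(args: Array[String]): Unit = {
             // Assumed setup: local master and an illustrative input path.
             val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[*]"))

             val lines      = sc.textFile("hdfs://namenode/logs/app.log")   // base RDD (path is illustrative)
             val errors     = lines.filter(_.startsWith("ERROR"))           // transformed RDD
             val messages   = errors.map(_.split('\t')(2))                  // assumes >= 3 tab-separated fields
             val cachedMsgs = messages.cache()                              // keep partitions in memory

             // Parallel operations: the driver ships tasks to the workers; only the
             // first action reads from HDFS, later ones hit the workers' caches.
             println(cachedMsgs.filter(_.contains("foo")).count())
             println(cachedMsgs.filter(_.contains("bar")).count())

             sc.stop()
           }
         }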

  5. Logistic Regression Performance
     val data = spark.textFile(...).map(readPoint).cache()
     var w = Vector.random(D)
     for (i <- 1 to ITERATIONS) { … }
     println("Final w: " + w)
     • 127 s / iteration
     • first iteration 174 s
     • further iterations 6 s
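     The slide elides the loop body; the sketch below fills it in with the classic Spark logistic-regression gradient step, using plain arrays instead of the slide's Vector helper. Point, readPoint, the input path and format, D, and ITERATIONS are illustrative stand-ins, not taken from the slide.

         import org.apache.spark.{SparkConf, SparkContext}
         import scala.math.exp
         import scala.util.Random

         object LogisticRegressionSketch {
           // Illustrative stand-ins for the slide's readPoint / Vector helpers.
           case class Point(x: Array[Double], y: Double)

           def dot(a: Array[Double], b: Array[Double]): Double =
             a.zip(b).map { case (u, v) => u * v }.sum

           def main(args: Array[String]): Unit = {
             val sc = new SparkContext(new SparkConf().setAppName("SparkLR").setMaster("local[*]"))
             val D = 10            // feature dimension (assumed)
             val ITERATIONS = 10   // iteration count (assumed)

             // Parse "label f1 f2 ... fD" lines; this input format is an assumption.
             def readPoint(line: String): Point = {
               val tok = line.trim.split("\\s+")
               Point(tok.tail.map(_.toDouble), tok.head.toDouble)
             }

             val data = sc.textFile("hdfs://namenode/lr/points.txt").map(readPoint).cache()
             var w = Array.fill(D)(2 * Random.nextDouble() - 1)   // stands in for Vector.random(D)

             for (i <- 1 to ITERATIONS) {
               // The gradient step the slide elides: standard logistic-regression update.
               val gradient = data.map { p =>
                 val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
                 p.x.map(_ * scale)
               }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
               w = w.zip(gradient).map { case (wi, gi) => wi - gi }
             }
             println("Final w: " + w.mkString(" "))
             sc.stop()
           }
         }

     Because data is cached, only the first pass over the input pays the HDFS read cost, which is what the 174 s vs 6 s figures on the slide illustrate.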

  6. HIVE: Components [diagram]
     • Interfaces: Hive CLI (browsing, queries, DDL), Mgmt. Web UI, Thrift API
     • Hive QL engine: parser, planner, execution
     • SerDe: Thrift, Jute, JSON, …
     • MetaStore
     • Runs on Map Reduce and HDFS

  7. Data Model

  8. Hive/Shark flowchart (Insert into table) Two ways to do this:
     1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
     2. Load "buckets" directly: the user is responsible for creating the buckets.
     CREATE TABLE page_view(viewTime INT, userid BIGINT,
         page_url STRING, referrer_url STRING,
         ip STRING COMMENT 'IP Address of the User')
     COMMENT 'This is the page view table'
     PARTITIONED BY(dt STRING, country STRING)
     STORED AS SEQUENCEFILE;
     This creates the table directory.

  9. Hive/Shark flowchart (Insert into table) Two ways to do this. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
     Step 1:
     CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
         page_url STRING, referrer_url STRING,
         ip STRING COMMENT 'IP Address of the User',
         country STRING COMMENT 'country of origination')
     COMMENT 'This is the staging page view table'
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
     STORED AS TEXTFILE
     LOCATION '/user/data/staging/page_view';
     Step 2:
     hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

  10. Hive/Shark flowchart (Insert into table) Two ways to do this. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
     Step 3:
     FROM page_view_stg pvs
     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
     SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
     WHERE pvs.country = 'US';

  11. Hive [SerDe data-flow diagram]
     • Hive operators run inside the Mapper and Reducer and see each row as a Hierarchical Object, accessed through an ObjectInspector.
     • Hierarchical Object variants: Standard Object (ArrayList for struct and array, HashMap for map), LazyObject (lazily-deserialized Java object), or an object of a plain Java class.
     • The SerDe converts between Writables (e.g. Text('1.0 3 54'), UTF-8 encoded, or BytesWritable(\x3F\x64\x72\x00)) and Hierarchical Objects; user-defined SerDes operate per ROW.
     • FileFormat / Hadoop serialization moves Writables to and from map output files, files on HDFS (e.g. thrift_record<…>), and streams to and from user scripts (rows such as "1.0 3 54", "0.2 1 33", "2.2 8 212", "0.7 2 22").

  12. SerDe, ObjectInspector and TypeInfo [diagram]
     • Example row class: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } with class ClassC { Integer a; Integer b; }
     • Example Hierarchical Object: List( HashMap("a" -> "av", "b" -> "bv"), 23, List(List(1,null), List(2,4), List(5,null)), "abcd" )
     • SerDe: deserialize turns a Writable (e.g. Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00)) into the Hierarchical Object, serialize goes the other way, and getOI returns the row's ObjectInspector.
     • ObjectInspectors (ObjectInspector1/2/3 in the diagram) navigate the object: the struct inspector's getStructField and getFieldOI reach field "a", the map inspector's getMapValue and getMapValueOI reach the value "av" for key "a", and getType maps each inspector to its TypeInfo (struct, map<string, string>, int, list, string).
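     As a rough illustration of that navigation, here is a sketch in Scala against Hive's serde2 Java API. The diagram's getStructField/getMapValue/getFieldOI/getMapValueOI labels are shorthand; the methods used below (getStructFieldRef, getStructFieldData, getFieldObjectInspector, getMapValueElement) are the serde2 equivalents. The deserialized row, its StructObjectInspector, and the key's runtime representation all depend on whichever SerDe the table is configured with, so treat this as a sketch rather than working table code.

         import org.apache.hadoop.hive.serde2.objectinspector.{
           MapObjectInspector, ObjectInspector, StructObjectInspector}

         object OINavigationSketch {
           // Look up map key "a" inside struct field "a" of a deserialized row,
           // mirroring the slide's HO example (expected result: "av").
           // `row` is whatever the table's SerDe.deserialize() returned; `rowOI`
           // is the StructObjectInspector from the SerDe's getObjectInspector().
           def mapValueOfFieldA(row: AnyRef, rowOI: StructObjectInspector, key: AnyRef): AnyRef = {
             val fieldA  = rowOI.getStructFieldRef("a")            // struct field "a"
             val fieldOI = fieldA.getFieldObjectInspector           // its ObjectInspector (a map OI here)
               .asInstanceOf[MapObjectInspector]
             val mapData = rowOI.getStructFieldData(row, fieldA)    // the map object, accessed without copying
             fieldOI.getMapValueElement(mapData, key)               // value for `key`, e.g. "av"
           }

           // The TypeInfo side of the diagram: every ObjectInspector reports its type.
           def describe(oi: ObjectInspector): String =
             s"${oi.getCategory}: ${oi.getTypeName}"                // e.g. MAP: map<string,string>
         }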
