
Berkeley Data Analysis Stack


Presentation Transcript


  1. Berkeley Data Analysis Stack: Shark, Bagel

  2. Previous Presentation Summary • Mesos, Spark, Spark Streaming; new apps: AMP-Genomics, Carat, …
     [BDAS stack diagram, top to bottom:]
     • Application: the new apps (AMP-Genomics, Carat, …)
     • Data Processing: in-memory processing; trade between time, quality, and cost
     • Storage / Data Management: efficient data sharing across frameworks
     • Infrastructure / Resource Management: share infrastructure across frameworks (multi-programming for datacenters)

  3. Previous Presentation Summary • Mesos, Spark, Spark Streaming

  4. Spark Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns
     • lines = spark.textFile("hdfs://...")           // base RDD
     • errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
     • messages = errors.map(_.split('\t')(2))
     • cachedMsgs = messages.cache()
     • cachedMsgs.filter(_.contains("foo")).count     // parallel operation: driver ships tasks to workers
     • cachedMsgs.filter(_.contains("bar")).count     // later queries are served from the cached RDD
     • . . .
     [Diagram: the driver sends tasks to workers; each worker reads an HDFS block (Block 1-3) and keeps its cached partition (Cache 1-3) in memory, returning results to the driver]
     Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
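     The fragments above are slide shorthand; a minimal, self-contained version of the same log-mining pattern is sketched below. It assumes the modern org.apache.spark package names, a local master, and an illustrative tab-separated log path, none of which come from the slide (the slide's `spark` variable corresponds to the SparkContext).

         import org.apache.spark.{SparkConf, SparkContext}

         object LogMiningSketch {
           def main(args: Array[String]): Unit = {
             // Assumed setup: local master and an illustrative input path.
             val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[*]"))

             val lines      = sc.textFile("hdfs://namenode/logs/app.log")   // base RDD (path is illustrative)
             val errors     = lines.filter(_.startsWith("ERROR"))           // transformed RDD
             val messages   = errors.map(_.split('\t')(2))                  // assumes >= 3 tab-separated fields
             val cachedMsgs = messages.cache()                              // keep partitions in memory

             // Parallel operations: the driver ships tasks to the workers; only the
             // first action reads from HDFS, later ones hit the workers' caches.
             println(cachedMsgs.filter(_.contains("foo")).count())
             println(cachedMsgs.filter(_.contains("bar")).count())

             sc.stop()
           }
         }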

  5. Logistic Regression Performance
     val data = spark.textFile(...).map(readPoint).cache()
     var w = Vector.random(D)
     for (i <- 1 to ITERATIONS) { … }
     println("Final w: " + w)
     • 127 s / iteration
     • first iteration 174 s
     • further iterations 6 s
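     The slide elides the loop body; the sketch below fills it in with the classic Spark logistic-regression gradient step, using plain arrays instead of the slide's Vector helper. Point, readPoint, the input path and format, D, and ITERATIONS are illustrative stand-ins, not taken from the slide.

         import org.apache.spark.{SparkConf, SparkContext}
         import scala.math.exp
         import scala.util.Random

         object LogisticRegressionSketch {
           // Illustrative stand-ins for the slide's readPoint / Vector helpers.
           case class Point(x: Array[Double], y: Double)

           def dot(a: Array[Double], b: Array[Double]): Double =
             a.zip(b).map { case (u, v) => u * v }.sum

           def main(args: Array[String]): Unit = {
             val sc = new SparkContext(new SparkConf().setAppName("SparkLR").setMaster("local[*]"))
             val D = 10            // feature dimension (assumed)
             val ITERATIONS = 10   // iteration count (assumed)

             // Parse "label f1 f2 ... fD" lines; this input format is an assumption.
             def readPoint(line: String): Point = {
               val tok = line.trim.split("\\s+")
               Point(tok.tail.map(_.toDouble), tok.head.toDouble)
             }

             val data = sc.textFile("hdfs://namenode/lr/points.txt").map(readPoint).cache()
             var w = Array.fill(D)(2 * Random.nextDouble() - 1)   // stands in for Vector.random(D)

             for (i <- 1 to ITERATIONS) {
               // The gradient step the slide elides: standard logistic-regression update.
               val gradient = data.map { p =>
                 val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
                 p.x.map(_ * scale)
               }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
               w = w.zip(gradient).map { case (wi, gi) => wi - gi }
             }
             println("Final w: " + w.mkString(" "))
             sc.stop()
           }
         }

     Because data is cached, only the first pass over the input pays the HDFS read cost, which is what the 174 s vs 6 s figures on the slide illustrate.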

  6. HIVE: Components [diagram]
     • Interfaces: Hive CLI (browsing, queries, DDL), Mgmt. Web UI, Thrift API
     • Hive QL engine: parser, planner, execution
     • SerDe: Thrift, Jute, JSON, …
     • MetaStore
     • Runs on Map Reduce and HDFS

  7. Data Model

  8. Hive/Shark flowchart (Insert into table) Two ways to do this:
     1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
     2. Load "buckets" directly: the user is responsible for creating the buckets.
     CREATE TABLE page_view(viewTime INT, userid BIGINT,
         page_url STRING, referrer_url STRING,
         ip STRING COMMENT 'IP Address of the User')
     COMMENT 'This is the page view table'
     PARTITIONED BY(dt STRING, country STRING)
     STORED AS SEQUENCEFILE;
     This creates the table directory.

  9. Hive/Shark flowchart (Insert into table) Two ways to do this. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
     Step 1:
     CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
         page_url STRING, referrer_url STRING,
         ip STRING COMMENT 'IP Address of the User',
         country STRING COMMENT 'country of origination')
     COMMENT 'This is the staging page view table'
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
     STORED AS TEXTFILE
     LOCATION '/user/data/staging/page_view';
     Step 2:
     hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

  10. Hive/Shark flowchart (Insert into table) Two ways to do this. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
     Step 3:
     FROM page_view_stg pvs
     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
     SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
     WHERE pvs.country = 'US';

  11. Hive [SerDe data-flow diagram]
     • Hive operators run inside the Mapper and Reducer and see each row as a Hierarchical Object, accessed through an ObjectInspector.
     • Hierarchical Object variants: Standard Object (ArrayList for struct and array, HashMap for map), LazyObject (lazily-deserialized Java object), or an object of a plain Java class.
     • The SerDe converts between Writables (e.g. Text('1.0 3 54'), UTF-8 encoded, or BytesWritable(\x3F\x64\x72\x00)) and Hierarchical Objects; user-defined SerDes operate per ROW.
     • FileFormat / Hadoop serialization moves Writables to and from map output files, files on HDFS (e.g. thrift_record<…>), and streams to and from user scripts (rows such as "1.0 3 54", "0.2 1 33", "2.2 8 212", "0.7 2 22").

  12. SerDe, ObjectInspector and TypeInfo [diagram]
     • Example row class: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } with class ClassC { Integer a; Integer b; }
     • Example Hierarchical Object: List( HashMap("a" -> "av", "b" -> "bv"), 23, List(List(1,null), List(2,4), List(5,null)), "abcd" )
     • SerDe: deserialize turns a Writable (e.g. Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00)) into the Hierarchical Object, serialize goes the other way, and getOI returns the row's ObjectInspector.
     • ObjectInspectors (ObjectInspector1/2/3 in the diagram) navigate the object: the struct inspector's getStructField and getFieldOI reach field "a", the map inspector's getMapValue and getMapValueOI reach the value "av" for key "a", and getType maps each inspector to its TypeInfo (struct, map<string, string>, int, list, string).
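     As a rough illustration of that navigation, here is a sketch in Scala against Hive's serde2 Java API. The diagram's getStructField/getMapValue/getFieldOI/getMapValueOI labels are shorthand; the methods used below (getStructFieldRef, getStructFieldData, getFieldObjectInspector, getMapValueElement) are the serde2 equivalents. The deserialized row, its StructObjectInspector, and the key's runtime representation all depend on whichever SerDe the table is configured with, so treat this as a sketch rather than working table code.

         import org.apache.hadoop.hive.serde2.objectinspector.{
           MapObjectInspector, ObjectInspector, StructObjectInspector}

         object OINavigationSketch {
           // Look up map key "a" inside struct field "a" of a deserialized row,
           // mirroring the slide's HO example (expected result: "av").
           // `row` is whatever the table's SerDe.deserialize() returned; `rowOI`
           // is the StructObjectInspector from the SerDe's getObjectInspector().
           def mapValueOfFieldA(row: AnyRef, rowOI: StructObjectInspector, key: AnyRef): AnyRef = {
             val fieldA  = rowOI.getStructFieldRef("a")            // struct field "a"
             val fieldOI = fieldA.getFieldObjectInspector           // its ObjectInspector (a map OI here)
               .asInstanceOf[MapObjectInspector]
             val mapData = rowOI.getStructFieldData(row, fieldA)    // the map object, accessed without copying
             fieldOI.getMapValueElement(mapData, key)               // value for `key`, e.g. "av"
           }

           // The TypeInfo side of the diagram: every ObjectInspector reports its type.
           def describe(oi: ObjectInspector): String =
             s"${oi.getCategory}: ${oi.getTypeName}"                // e.g. MAP: map<string,string>
         }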
