
Bluemix + Next-generation Analytics

Learn how to set up a development environment and create Spark applications using IBM Bluemix and Spark. Discover the advantages and disadvantages of local development, and explore the features and benefits of Spark.



Presentation Transcript


  1. Bluemix + Next-generation Analytics

  2. Agenda • Introductions • Round table • Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Break • Use case discussion • Introduction to Spark Streaming • Build an application with Spark Streaming: Sentiment analysis with Twitter and Watson Tone Analyzer

  3. Introductions • David Taieb • David_taieb@us.ibm.com • Developer Advocate • IBM Cloud Data Services • Chetna Warade • warade@us.ibm.com • Developer Advocate • IBM Cloud Data Services • https://developer.ibm.com/clouddataservices/connect/

  4. Introductions • Our mission: • We are here to help developers realize their most ambitious projects. • Goals for today’s session: • Setup a local development environment via Scala Eclipse IDE. • Write a hello world Scala project to run Spark. Build a custom library. • Run locally on Spark. • Deploy on Jupyter notebook and Apache Spark on Bluemix.

  5. What is our motivation? • Local or cloud development and deployment • Advantages of local development • Rapid development • Productivity • Excellent for proof of concept • Disadvantages of local development • Time-consuming to reproduce at larger scale • Difficult to share quickly • Heavy on local hardware resources

  6. What is Spark? Spark is an open-source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes.
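
A minimal Scala sketch of the in-memory, iterative style this definition refers to (the local master and the collection used here are illustrative assumptions, not part of the original slides): the data set is cached once and then reused by several computations without being re-read from disk.

    import org.apache.spark.{SparkConf, SparkContext}

    object InMemoryDemo {
      def main(args: Array[String]): Unit = {
        // Local SparkContext for illustration only
        val sc = new SparkContext(new SparkConf().setAppName("In-memory demo").setMaster("local[*]"))

        // Distributed collection, cached in memory so repeated passes avoid recomputation
        val numbers = sc.parallelize(1 to 1000000).cache()

        // Several iterative passes over the same cached data
        println("count = " + numbers.count())
        println("sum   = " + numbers.sum())
        println("mean  = " + numbers.mean())

        sc.stop()
      }
    }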

  7. Spark Core Libraries
  • Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
  • Spark SQL: executes SQL statements
  • Spark Streaming: performs streaming analytics using micro-batches
  • MLlib (machine learning): common machine learning and statistical algorithms
  • GraphX (graph): distributed graph processing framework
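
All of these libraries run on top of the same SparkContext. The sketch below is a hedged illustration of Spark Core and Spark SQL working on the same data; the Tweet case class and its contents are invented for this example.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical case class used only for this illustration
    case class Tweet(author: String, lang: String)

    object CoreLibrariesDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Libraries demo").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Spark Core: a plain RDD
        val rdd = sc.parallelize(Seq(Tweet("a", "en"), Tweet("b", "fr")))

        // Spark SQL: the same data as a DataFrame, queryable with SQL
        rdd.toDF().registerTempTable("tweets")
        sqlContext.sql("SELECT lang, count(*) FROM tweets GROUP BY lang").show()

        sc.stop()
      }
    }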

  8. Key reasons for interest in Spark • Open Source: largest project and one of the most active on Apache; a vibrant, growing community of developers continuously improves the code base and extends its capabilities; fast adoption in the enterprise (IBM, Databricks, etc.) • Fast distributed data processing: in-memory storage greatly reduces disk I/O; up to 100x faster in memory, 10x faster on disk; fault tolerant, seamlessly recomputes data lost to hardware failure; scalable: easily increase the number of worker nodes; flexible job execution: batch, streaming, interactive • Productive: unified programming model across a range of use cases; rich and expressive APIs hide the complexities of parallel computing and worker-node management; support for Java, Scala, Python and R means less code written; includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX • Web Scale: easily handles petabytes of data without special code handling; compatible with the existing Hadoop ecosystem

  9. Ecosystem of IBM Analytics for Apache Spark as a service

  10. Notebook walkthrough • https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/ • Sign up on Bluemix: https://console.ng.bluemix.net/registration/ • Create an Apache Starter boilerplate application • Create notebooks in Python, Scala, or both • Run basic commands and get familiar with notebooks

  11. Set up the local development environment • http://velocityconf.com/devops-web-performance-ny-2015/public/schedule/detail/45890 • Pre-requisites: • Scala runtime 2.10.4: http://www.scala-lang.org/download/2.10.4.html • Homebrew: http://brew.sh/ • Scala sbt: http://www.scala-sbt.org/download.html • Spark 1.3.1: http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz

  12. Set up the local development environment (contd.)
  • Create a Scala project using sbt
  • Create directories to start from scratch:
      mkdir helloSpark && cd helloSpark
      mkdir -p src/main/scala
      mkdir -p src/main/java
      mkdir -p src/main/resources
  • Create a subdirectory under the src/main/scala directory:
      mkdir -p com/ibm/cds/spark/samples
  • GitHub URL for the same project: https://github.com/ibm-cds-labs/spark.samples

  13. Set up the local development environment (contd.)
  • Create HelloSpark.scala using an IDE or a text editor
  • Copy and paste this code snippet:

      package com.ibm.cds.spark.samples

      import org.apache.spark._

      object HelloSpark {
          //main method invoked when running as a standalone Spark Application
          def main(args: Array[String]) {
              val conf = new SparkConf().setAppName("Hello Spark")
              val spark = new SparkContext(conf)

              println("Hello Spark Demo. Compute the mean and variance of a collection")
              val stats = computeStatsForCollection(spark)
              println(">>> Results: ")
              println(">>>>>>>Mean: " + stats._1)
              println(">>>>>>>Variance: " + stats._2)
              spark.stop()
          }

          //Library method that can be invoked from a Jupyter Notebook
          def computeStatsForCollection(spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int = 5): (Double, Double) = {
              val totalNumber = math.min(countPerPartitions * partitions, Long.MaxValue).toInt
              val rdd = spark.parallelize(1 until totalNumber, partitions)
              (rdd.mean(), rdd.variance())
          }
      }

  14. Set up the local development environment (contd.)
  • Create a file build.sbt under the project root directory:

      name := "helloSpark"

      version := "1.0"

      scalaVersion := "2.10.4"

      libraryDependencies ++= {
          val sparkVersion = "1.3.1"
          Seq(
              "org.apache.spark" %% "spark-core" % sparkVersion,
              "org.apache.spark" %% "spark-sql" % sparkVersion,
              "org.apache.spark" %% "spark-repl" % sparkVersion
          )
      }

  • Under the project root directory run:
      • Download all dependencies: $ sbt update
      • Compile: $ sbt compile
      • Package an application jar file: $ sbt package
  • Check for helloSpark_2.10-1.0.jar under target/scala-2.10 in the project directory

  15. Hello World application on Bluemix Apache Starter

  16. Break Join us in 15 minutes

  17. Use-cases
  • Network Performance Optimization (IT - any industry): diagnose real-time device issues, …
  • Churn Reduction (Telco, Cable, Schools): predict customer drop-offs/drop-outs, …
  • Cyber Security (IT - any industry): network intrusion detection, fraud detection, …
  • Predictive Maintenance (IoT): predict system failure before it happens
  • Customer Behavior Analytics (Retail & Merchandising): refine strategy based on customer behaviour data, …

  18. Use-cases • SETI use case for astronomers, data scientists, mathematicians and algorithm designers.

  19. IBM Spark @ SETI - Application Architecture
  • Collaborative environment for project team data scientists (NASA, SETI Institute, Penn State, IBM Research)
  • Actively analyzing over 4 TB of signal data; results have already been used by SETI to re-program the radio telescope observation sequence to include "new targets of interest"
  • Spark@SETI GitHub repository: Python code modules for data access and analytics, Jupyter notebooks, documentation and links to other relevant GitHub repos, standard GitHub collaboration functions
  • Import of signal data from ~10 years of SETI radio telescope data archives
  • Shared repository of SETI data in Object Store: 200M rows of signal event data, 15M binary recordings of "signals of interest"

  20. Spark Streaming • "Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams" (http://spark.apache.org/docs/latest/streaming-programming-guide.html) • Breaks the streaming data down into smaller micro-batches, which are then sent to the Spark engine (see the sketch below)
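
A minimal Scala sketch of the micro-batching model (the socket source, the 5-second batch interval and the word count are illustrative assumptions, not from the slides): every batch interval of received data becomes a small RDD that the Spark engine processes.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Streaming sketch").setMaster("local[2]")
        // Each micro-batch covers 5 seconds of incoming data
        val ssc = new StreamingContext(conf, Seconds(5))

        // Lines read from a TCP socket (e.g. `nc -lk 9999`), processed batch by batch
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.flatMap(_.split("\\s+"))
             .map(word => (word, 1))
             .reduceByKey(_ + _)
             .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }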

  21. Spark Streaming • Provides connectors for multiple data sources: • Kafka • Flume • Twitter • MQTT • ZeroMQ • Provides an API to create custom connectors (a minimal receiver sketch follows); lots of examples are available on GitHub and spark-packages.org
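
The custom-connector API is the Receiver class. Below is a hedged, minimal sketch of a custom receiver that reads lines from a TCP socket; the socket source and the class name are invented for illustration, and error handling and restart logic are deliberately omitted.

    import java.net.Socket
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Minimal custom connector: reads lines from a socket and pushes them into Spark Streaming
    class SocketLineReceiver(host: String, port: Int)
        extends Receiver[String](StorageLevel.MEMORY_ONLY) {

      def onStart(): Unit = {
        // Receive on a separate thread so onStart() returns immediately
        new Thread("Socket Line Receiver") {
          override def run(): Unit = {
            val socket = new Socket(host, port)
            try {
              Source.fromInputStream(socket.getInputStream, "UTF-8")
                    .getLines()
                    .foreach(store)   // hand each line to Spark Streaming
            } finally {
              socket.close()
            }
          }
        }.start()
      }

      def onStop(): Unit = {
        // The receiving thread exits when the socket is closed or the stream ends
      }
    }

    // Usage: ssc.receiverStream(new SocketLineReceiver("localhost", 9999))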

  22. Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer • Section 1: Setup the dev environment • Create a new scala project • Configure the sbt dependencies • Create the Application boilerplate code • Run the code • Using an Eclipse launch configuration • Using spark-submit command line

  23. A word about the Scala programming language • Scala is object-oriented but also supports a functional programming style (see the short example below) • Bi-directional interoperability with Java • Resources: • Official web site: http://scala-lang.org • Excellent first-steps site: http://www.artima.com/scalazine/articles/steps.html • Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
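
A tiny, self-contained Scala example (all names are invented for illustration) showing the object-oriented and functional styles side by side, plus direct use of a Java collection.

    import java.util.{ArrayList => JArrayList}
    import scala.collection.JavaConverters._

    // Object-oriented: a simple class with a method
    class Greeter(greeting: String) {
      def greet(name: String): String = s"$greeting, $name"
    }

    object ScalaTour {
      def main(args: Array[String]): Unit = {
        val greeter = new Greeter("Hello")

        // Functional style: higher-order functions over an immutable collection
        val names = List("Spark", "Bluemix", "Watson")
        names.map(greeter.greet).foreach(println)

        // Bi-directional Java interoperability: use a java.util collection directly
        val javaList = new JArrayList[String]()
        javaList.add("works with Java APIs too")
        javaList.asScala.foreach(println)
      }
    }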

  24. Section 1.1: Create a new scala project • Refer to “Set up development environment” section earlier in this presentation • Resource: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/

  25. Section 1.2: Configure the sbt dependencies • Spark dependencies are resolved with "sbt update" • Extra dependencies needed by this app are declared in build.sbt (see the sketch below)
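
The slide shows the project's build.sbt, which is not reproduced in this transcript. The sketch below only illustrates the pattern; the coordinates for the Twitter connector and the HTTP client are assumptions inferred from the APIs used later (TwitterUtils and PooledHttp1Client), and the version numbers are examples rather than the exact ones used in the workshop.

    name := "tutorial-streaming-twitter-watson"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies ++= {
      val sparkVersion = "1.3.1"
      Seq(
        // Core Spark dependencies, marked "provided" because the cluster supplies them
        "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
        "org.apache.spark" %% "spark-sql"       % sparkVersion % "provided",
        "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
        // Extra dependencies needed by this app (artifact names and versions are examples)
        "org.apache.spark" %% "spark-streaming-twitter" % sparkVersion,
        "org.http4s"       %% "http4s-blaze-client"     % "0.8.2"
      )
    }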

  26. Section 1.2: Configure the sbt dependencies • Run "sbt update" to resolve all the dependencies and download them into your local Apache Ivy repository (in <home>/.ivy2/cache) • Optional: if you are using Scala IDE for Eclipse, run "sbt eclipse" to generate the Eclipse project and the associated classpath that reflects the project dependencies • Run "sbt assembly" to generate an uber jar that contains your code and all the required dependencies as defined in build.sbt ("sbt eclipse" and "sbt assembly" are provided by sbt plugins; see the sketch below)
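
The "sbt eclipse" and "sbt assembly" commands require plugins declared in project/plugins.sbt. A minimal sketch, assuming the standard sbteclipse and sbt-assembly plugins (the version numbers are examples, not necessarily the ones used in the workshop):

    // project/plugins.sbt
    addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

By default, sbt writes the assembled uber jar under target/scala-2.10/.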

  27. Section 1.3: Create the Application boilerplate code • Boilerplate code that creates a Twitter stream (an illustrative sketch follows)
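
The boilerplate itself appears as an image on the original slide. The sketch below is a plausible reconstruction based on the code shown later in Section 3 (the batch interval and the object name are assumptions): create a SparkContext, wrap it in a StreamingContext, and open a Twitter DStream.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    object StreamingTwitterBoilerplate {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Spark Streaming Twitter Demo")
        val sc = new SparkContext(conf)
        // 5-second micro-batches; the real application may use a different interval
        val ssc = new StreamingContext(sc, Seconds(5))

        // Twitter4j reads its OAuth credentials from system properties (see Section 3.1)
        val twitterStream = TwitterUtils.createStream(ssc, None)
        twitterStream.map(_.getText).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }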

  28. Section 1.4.a: Run the code using an Eclipse launch configuration • SparkSubmit is the main class that runs this job • Tell SparkSubmit which class to run

  29. Section 1.4.b: Run the code using the spark-submit command line • Package the code as a jar: "sbt assembly" • This generates tutorial-streaming-twitter-watson-assembly-1.0.jar • Run the job using the spark-submit script available in the Spark distribution: • $SPARK_HOME/bin/spark-submit --class com.ibm.cds.spark.samples.StreamingTwitter --jars <path>/tutorial-streaming-twitter-watson-assembly-1.0.jar

  30. Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer • Section 2: Configure Twitter and Watson Tone Analyzer • Configure OAuth credentials for Twitter • Create a Watson Tone Analyzer Service on Bluemix

  31. Section 2.1: Configure OAuth credentials for Twitter • You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter

  32. Section 2.2: Create a Watson Tone Analyzer Service on Bluemix • You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix

  33. Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer • Section 3: Work with Twitter data • Create a Twitter Stream • Enrich the data with sentiment analysis from Watson Tone Analyzer • Aggregate data into RDD with enriched Data model • Create SparkSQL DataFrame and register Table

  34. Section 3.1: Create a Twitter Stream
  • Create a map that stores the credentials for the Twitter and Watson services
  • Twitter4j requires the credentials to be stored in System properties

      //Hold configuration key/value pairs
      val config = Map[String, String](
          ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull),
          ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull),
          ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull),
          ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull),
          ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
          ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull),
          ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull),
          ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull)
      )

      config.foreach( (t: (String, String)) =>
          if ( t._1.startsWith("twitter4j") ) System.setProperty(t._1, t._2)
      )

  35. Section 3.1: Create a Twitter Stream
  • Filter the tweets to keep only the ones with English as the language
  • twitterStream is a discretized stream (DStream) of twitter4j Status objects

      var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
          .filter { status =>
              Option(status.getUser).flatMap[String] { u => Option(u.getLang) }.getOrElse("").startsWith("en") &&  //Allow only tweets that use "en" as the language
              CharMatcher.ASCII.matchesAllOf(status.getText) &&              //Only pick text that is ASCII
              ( keys.isEmpty || keys.exists{ status.getText.contains(_) } )  //If the user specified #hashtags to monitor
          }

  (Diagram: initial DStream of Status objects → filtered DStream with English tweets)

  36. Section 3.2: Enrich the data with sentiment analysis from Watson Tone Analyzer

      //Broadcast the config to each worker node
      val broadcastVar = sc.broadcast(config)

      val rowTweets = twitterStream.map(status => {
          lazy val client = PooledHttp1Client()
          val sentiment = callToneAnalyzer(client, status,
              broadcastVar.value.get("watson.tone.url").get,
              broadcastVar.value.get("watson.tone.username").get,
              broadcastVar.value.get("watson.tone.password").get
          )
          …
      })

  (Diagram: initial DStream of Status objects → filtered DStream with English tweets)

  37. Section 3.2: Enrich the data with sentiment analysis from Watson Tone Analyzer

      //Prepare data for creating a SparkSQL Row
      //Add the data from the tweets first
      var colValues = Array[Any](
          status.getUser.getName,                                              //author
          status.getCreatedAt.toString,                                        //date
          status.getUser.getLang,                                              //lang
          status.getText,                                                      //text
          Option(status.getGeoLocation).map{ _.getLatitude }.getOrElse(0.0),   //lat
          Option(status.getGeoLocation).map{ _.getLongitude }.getOrElse(0.0)   //long
      )

      //Append the scores for each emotional tone
      colValues = colValues ++ sentimentFactors.map { f =>
          (BigDecimal(scoreMap.get(f._2).getOrElse(0.0))
              .setScale(2, BigDecimal.RoundingMode.HALF_UP).toDouble) * 100.0
      }

      //Return a tuple composed of a SparkSQL Row that contains the tweet data + sentiment scores.
      //The second element keeps the original sentiment and status data
      (Row(colValues.toArray:_*), (sentiment, status))

  Data model:
      |-- author: string (nullable = true)
      |-- date: string (nullable = true)
      |-- lang: string (nullable = true)
      |-- text: string (nullable = true)
      |-- lat: integer (nullable = true)
      |-- long: integer (nullable = true)
      |-- Cheerfulness: double (nullable = true)
      |-- Negative: double (nullable = true)
      |-- Anger: double (nullable = true)
      |-- Analytical: double (nullable = true)
      |-- Confident: double (nullable = true)
      |-- Tentative: double (nullable = true)
      |-- Openness: double (nullable = true)
      |-- Agreeableness: double (nullable = true)
      |-- Conscientiousness: double (nullable = true)

  (Diagram: initial DStream of Status objects → filtered DStream with English tweets → DStream of key/value pairs)

  38. Section 3.3: Aggregate data into RDD with enriched data model

      …
      //Aggregate the data from each DStream into the working RDD
      rowTweets.foreachRDD( rdd => {
          if ( rdd.count() > 0 ) {
              workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect() ).union( workingRDD )
          }
      })

  workingRDD data model: same schema as the previous slide (author, date, lang, text, lat, long, plus the nine sentiment score columns)

  (Diagram: for each micro-batch, initial DStream → filtered DStream → rowTweets, accumulated into workingRDD)

  39. Section 3.4: Create SparkSQL DataFrame and register Table

      //Create a SparkSQL DataFrame from the aggregated workingRDD
      val df = sqlContext.createDataFrame( workingRDD, schemaTweets )

      //Register a temporary table using the name "tweets"
      df.registerTempTable("tweets")

      println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
      println("Here's the schema for tweets")
      df.printSchema()

      (sqlContext, df)

  (Diagram: workingRDD → relational SparkSQL table)

  40. Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer • Section 4: IPython Notebook analysis • Load the data into an IPython Notebook • Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60% • Analytic 2: Compute the top 10 hashtags contained in the tweets • Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags

  41. Introduction to Notebooks • Notebooks allow the creation of interactive, executable documents that combine rich text with Markdown, executable code in Scala, Python or R, and graphics with matplotlib • Apache Spark provides APIs in multiple languages that can be executed with a REPL shell: Scala, Python (PySpark), R • Multiple open-source implementations are available: • Jupyter: https://jupyter.org • Apache Zeppelin: http://zeppelin-project.org

  42. Section 4.1: Load the data into an IPython Notebook • You can follow along with the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb • Create a SQLContext from a SparkContext • Load from the parquet file and create a DataFrame • Create a SQL table and start executing SQL queries

  43. Section 4.2: Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

      #create an array that will hold the count for each sentiment
      sentimentDistribution = [0] * 9

      #For each sentiment, run a SQL query that counts the number of tweets for which
      #the sentiment score is greater than 60%, and store the result in the array
      for i, sentiment in enumerate(tweets.columns[-9:]):
          sentimentDistribution[i] = sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
              .collect()[0].sentCount

  44. Section 4.2: Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Use matplotlib to create a bar chart

  45. Section 4.2: Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Bar Chart Visualization

  46. Section 4.3: Analytic 2 - Compute the top 10 hashtags contained in the tweets • RDD transformation pipeline: initial tweets RDD → flatMap → bag-of-words RDD → filter (keep hashtags) → map → key/value pair RDD → reduceByKey → reduced map with counts → sortByKey → map sorted by key (a sketch of the same pipeline follows)
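
A hedged Scala sketch of this pipeline (the notebook itself does this in Python; the `tweets` DataFrame registered in Section 3.4, the whitespace tokenization and the lower-casing are assumptions made for illustration):

    // Scala equivalent of the notebook pipeline
    val topHashtags = tweets.select("text").rdd
      .flatMap(row => row.getString(0).split("\\s+"))      // bag of words
      .filter(word => word.startsWith("#"))                // keep only hashtags
      .map(tag => (tag.toLowerCase, 1))                    // key/value pairs
      .reduceByKey(_ + _)                                  // counts per hashtag
      .map { case (tag, count) => (count, tag) }           // swap so we can sort by count
      .sortByKey(ascending = false)                        // highest counts first
      .take(10)

    topHashtags.foreach { case (count, tag) => println(s"$tag: $count") }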

  47. Section 4.3: Analytic 2: Compute the top 10 hashtags contained in the tweets

  48. Section 4.3: Analytic 2: Compute the top 10 hashtags contained in the tweets

  49. Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags • Problem: • Compute the mean of every emotion score across the tweets for each of the top 10 hashtags • Format the data in a way that can be consumed by the plot script (a possible approach is sketched below)
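
One possible Scala approach, sketched under assumptions (the notebook does this in Python; `sqlContext`, the registered "tweets" table and the `topHashtags` result from the previous sketch are reused, and matching tweets to a hashtag with a SQL LIKE filter is a simplification):

    // Average every sentiment column for tweets that mention a given hashtag
    val sentimentCols = Seq("Cheerfulness", "Negative", "Anger", "Analytical", "Confident",
                            "Tentative", "Openness", "Agreeableness", "Conscientiousness")

    val meansByTag = topHashtags.map { case (count, tag) =>
      val avgExprs = sentimentCols.map(c => s"avg($c) as $c").mkString(", ")
      val row = sqlContext
        .sql(s"SELECT $avgExprs FROM tweets WHERE text LIKE '%$tag%'")
        .collect()(0)
      // One (hashtag, scores) pair per tag, ready to be handed to a plotting routine
      tag -> sentimentCols.zipWithIndex.map { case (c, i) => c -> row.getDouble(i) }.toMap
    }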
