
Big data analytics

Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928), http://project-first.eu. Big data analytics. Miha Grčar (Jožef Stefan Institute; Sowa Labs GmbH).


Presentation Transcript


  1. Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu Big data analytics. Miha Grčar (1 Jožef Stefan Institute, 2 Sowa Labs GmbH)

  2. Outline • What is big data? What caused it? Who should care? • Solving big data problems • Examples

  3. What is big data? • “How many terabytes?” • We deliberately avoid being specific • Big data refers to datasets that cannot be captured, stored, managed, and/or analyzed by the mainstream storage and processing devices

  4. What is big data?

  5. What caused big data? Storage capacity and processing power. Source: Hilbert and López, “The world’s technological capacity to store, communicate, and compute information,” Science, 2011

  6. What caused big data? Data availability (industry). Source: IDC; US Bureau of Labor Statistics; McKinsey Global Institute analysis

  7. What caused big data? Data availability (social media and mobile devices). Source: www.creotivo.com

  8. What caused big data? Data availability (sensors). Source: Analyst interviews; McKinsey Global Institute analysis

  9. What caused big data? Maturity of technologies & tools (figure labels: emerging, hyped; mature). Source: Gartner (July 2012)

  10. Who should care about big data? Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis

  11. Solving big data problems • Distributed infrastructure • Cloud: Amazon Elastic Compute Cloud (EC2) • Distributed processing • MapReduce / batches: Hadoop • Distributed workflows / streams: Twitter Storm • Distributed storage • Distributed FS/DB • NoSQL

  12. Solving big data problems • Distributed infrastructure • Cloud: Amazon EC2, Windows Azure, Google Cloud Platform, Cloudwatt… • Distributed processing • MapReduce / batches: Hadoop, MS DryadLINQ, Disco, Misco, Phoenix, Cloud MapReduce, bashreduce, Qizmt… • Distributed workflows / streams: Storm (Twitter), S4 (Yahoo), “Real-time Hadoops”: Impala, HFlame, Spark… • Distributed storage • Distributed FS/DB, NoSQL: Google File System, HDFS, Google Big Table, HBase, Cassandra, MongoDB, CouchDB, Hive…

  13. Amazon EC2 • EC2 = ECC = Elastic Compute Cloud • Central part of Amazon.com’s cloud computing service • ~500,000 physical Linux machines • Elastic: servers can be started / stopped according to demand; pay only for running servers • Instance types (several examples) • Micro: 1 ECU, 1 core, 613 MiB • High-Memory XL: 6.5 ECUs, 2 cores, 17.1 GiB • High-CPU XL: 20 ECUs, 8 cores, 7 GiB • OS • Windows • Linux • FreeBSD • Storage • Temporary instance storage • Persistent Elastic Block Store (EBS)

  14. MapReduce (Hadoop)

  15. A bunch of ballots, all mixed up… → Map → piles A B C | A B C | A B C (still mixed up…) → Reduce → Election results: A: 321,015; B: 179,539; C: 201,734

  16. MapReduce (Hadoop) • Input data: 195005150700+0000, 195005151200+0022, 195005151800-0011, 194903241200+0111, 194903241800+0078 • Map output: (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78) • After sort, copy, and merge: 1950 → [0, 22, -11]; 1949 → [111, 78] • Reduce output: 1950 → 22; 1949 → 111 • Source: Tom White: Hadoop, The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)
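
The data flow on slide 16 can be traced in a few lines of plain Python. This is only a single-process sketch of the map, sort/merge, and reduce steps, not Hadoop code: the function names are made up for illustration, and the record layout (year in the first four characters, signed temperature from character 12 onward) is an assumption read off the sample records.

```python
from collections import defaultdict

# Sample records from the slide: YYYYMMDDhhmm followed by a signed temperature.
RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]

def map_record(record):
    """Map step: emit one (year, temperature) pair per input record."""
    return record[:4], int(record[12:])

def shuffle(pairs):
    """Sort/merge step: collect all values that share a key, keys sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_group(year, temperatures):
    """Reduce step: keep the maximum temperature seen for the year."""
    return year, max(temperatures)

mapped = [map_record(r) for r in RECORDS]
grouped = shuffle(mapped)
result = [reduce_group(year, temps) for year, temps in grouped.items()]

print(mapped)   # [('1950', 0), ('1950', 22), ('1950', -11), ('1949', 111), ('1949', 78)]
print(grouped)  # {'1949': [111, 78], '1950': [0, 22, -11]}
print(result)   # [('1949', 111), ('1950', 22)]
```

In a real cluster the map calls run on many machines in parallel, and the grouped values are copied over the network to the reducers; the sketch keeps only the logic of each step.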

  17. MapReduce (Hadoop) Source: Tom White: Hadoop, The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)

  18. Twitter Storm (figure labels: Produce report, Print, Collate & bind, Sign, Send; Spout = data source, Bolts = data processors, last Bolt = data sink)

  19. Twitter Storm: Basic principle • Spout (data source) emits records: 195005150700+0000, 195005151200+0022, 195005151800-0011, 194903241200+0111, 194903241800+0078 • Bolt (data processor) extracts the value: 194903241200+0111 → 111 • Bolt (data sink/writer): Received: 111; Current max: 22; New max: 111; Overwrite 22 with 111
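
The spout-and-bolt chain on slide 19 can be imitated with Python generators. This is a hedged, single-process sketch and does not use the Storm API: `spout`, `parse_bolt`, and `max_bolt` are hypothetical names, the record layout is assumed as above, and a plain list stands in for the storage the sink writes to.

```python
RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]

def spout(records):
    """Data source: emit raw records one at a time."""
    for record in records:
        yield record

def parse_bolt(stream):
    """Data processor: extract the signed temperature from each record."""
    for record in stream:
        yield int(record[12:])

def max_bolt(stream, sink):
    """Data sink/writer: keep a running maximum, overwriting it on a new max."""
    current_max = None
    for received in stream:
        if current_max is None or received > current_max:
            current_max = received
            sink.append(current_max)  # stands in for overwriting the stored value
    return current_max

written = []
final_max = max_bolt(parse_bolt(spout(RECORDS)), written)

print(written)    # [0, 22, 111]
print(final_max)  # 111
```

Unlike the batch example, each record is processed as soon as it arrives, so the stored maximum is always current. In Storm the spout and bolts would run concurrently on different workers; the generators here are evaluated one record at a time in a single thread.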

  20. Twitter Storm: Topology

  21. Twitter Storm: Pipelining and parallelization (figure labels: stream, parallelization, pipelining)

  22. Examples • Twitter sentiment and volume • Elections • Stock trading • News cohesiveness, volume, and sentiment • Correlation with VIX, CDS • Correlation with big events • Vocabulary in news & blogs • Pump & dump use case

  23. Slovene elections • 3 candidates, 3 live debates • Sentiment analysis provider: Gama System & our team at JSI • Streamed live, in real time, in prime time during the debates on POP TV • During and after the debates (3 broadcasts), the sentiment chart was shown 5 times (with commentary)

  24. Figure labels: first live debate; second live debate; third live debate; elections (first round)

  25. Figure annotations: candidates joined by their wives; candidates justifying their wealth; criticizing a questionable pardoning of a criminal; criticizing the government; supporting the government; justifying it

  26. “Democratic.” Zver: “What kind of a political party leader were you if they (party members) didn’t follow your lead?” Pahor: “Democratic.”

  27. Polls vs. sentiment vs. outcome • Actual outcome (November 11, 2012) • Borut Pahor 40% (+4%) • Danilo Türk 36% • Milan Zver 24% • Polls • Delo Stik (Delo, 9.11.): 44 / 31 / 25 • Mediana (Slovenske novice, 9.11.): 41.67 / 34.72 / 23.61 • Ninamedia (Mladina, 9.11.): 43.8 / 33.6 / 22.6 • Twitter sentiment: “Borut Pahor will win”

  28. Twitter volume and election results • There’s no such thing as bad publicity. • “We believe that Twitter and other social media reflect the underlying trend in a political race that goes beyond a district’s fundamental geographic and demographic composition. If people must talk about you, even in negative ways, it is a signal that a candidate is on the verge of victory. The attention given to winners creates a situation in which all publicity is good publicity.” • (DiGrazia, McKelvey, Bollen, Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior, February 2013) Source: Smailović, Kranjc, Juršič, Grčar, Gačnik, Mozetič: Monitoring the Twitter sentiment during the Bulgarian elections (2013; to appear)

  29. We’re looking at the stock of Amazon.com during 2012. The blue line shows the stock price. The red line shows the related Twitter sentiment. The black line is the 7-day moving average (MA) of the sentiment. An MA zero cross-over serves as a buy or sell signal. The green-red line shows whether we profited (green) or not (red) from blindly following the social signals. Source: Sowa Labs GmbH
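
The signal rule on slide 29 can be sketched as follows. The slides do not give the exact formulas, so several things here are assumptions: a trailing simple moving average, a signed daily sentiment score where positive means favorable, "buy" when the average crosses zero upward and "sell" when it crosses downward. The function names and the sample series are made up for illustration, and a window of 3 is used instead of 7 to keep the example short.

```python
def moving_average(values, window=7):
    """Trailing simple moving average; None until a full window is available."""
    averages = []
    for i in range(len(values)):
        if i < window - 1:
            averages.append(None)
        else:
            averages.append(sum(values[i - window + 1 : i + 1]) / window)
    return averages

def crossover_signals(averages):
    """Return (day, 'buy' | 'sell') each time the average crosses zero."""
    signals = []
    previous = None
    for day, current in enumerate(averages):
        if current is None:
            continue
        if previous is not None:
            if previous <= 0 < current:
                signals.append((day, "buy"))   # crossed zero upward
            elif previous >= 0 > current:
                signals.append((day, "sell"))  # crossed zero downward
        previous = current
    return signals

# Hypothetical daily sentiment scores, not real data.
sentiment = [1, 1, 1, -3, -3, -3, 3, 3, 3]
averages = moving_average(sentiment, window=3)
signals = crossover_signals(averages)

print(signals)  # [(3, 'sell'), (7, 'buy')]
```

Smoothing with a moving average keeps a single noisy day from flipping the signal; a longer window reacts more slowly but produces fewer false cross-overs.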

  30. On April 26, 2012, Amazon announced financial results for its first quarter ended March 31, 2012. Amazon had been spending heavily on expanding its operations, so analysts expected a huge drop in profit for the quarter. However, Amazon blew analysts’ estimates away: earnings did fall, but not nearly as much as analysts had feared. • Amazon earned $130 million, or 28 cents per share, for the quarter that ended March 31. That was a 35% decline from a year earlier, but much better than the 7 cents per share forecast by analysts polled by Thomson Reuters. • Based on this news, Amazon shares surged nearly 16% on Friday morning, April 27, 2012. Chart markers: Q4/’11, Q1, Q2, and Q3 results. Source: Sowa Labs GmbH

  31. The sentiment MA cross-over happens well before the price jump. Source: Sowa Labs GmbH

  32. We’re looking at the stock of Google during 2012. Chart markers: Q4/’11, Q1, Q2, and Q3 results. • On October 18, 2012, Google’s shares plunged by 9% after the search giant’s third-quarter earnings came in considerably lower than expected. • The results were accidentally released several hours earlier than planned, leading to a temporary halt in trading of the shares. Source: Sowa Labs GmbH

  33. The sentiment MA cross-over happens well before the price plunge. Source: Sowa Labs GmbH

  34. Source: Sowa Labs GmbH

  35. Sentiment in news: Spain, Greece, Italy, Germany

  36. News cohesiveness and VIX • VIX: implied volatility of the S&P 500 (aka fear index) • Source: Rudjer Boskovic Institute, Boston University, Jožef Stefan Institute

  37. News cohesiveness and CDS • CDS: Credit Default Swaps (insurance against default) • Source: Rudjer Boskovic Institute, Boston University, Jožef Stefan Institute

  38. Pump & dump Source: b-next, Goethe Universität, JSI (FIRST)

  39. Pump & dump (model node labels from the figure): Country Black List, Industry Black List, Black List, Company Black List, Company Age, History, Bankrupt, Comp_FinInst, Pump & Dump, Market Segment, Market, Market Capitalization, Financial Instrument, Trading Volume, Trading, Number of Trades, Sentiment, News, Content. Source: b-next, Goethe Universität, JSI (FIRST)

  40. Quick recap (1/3) • Big data: volume, velocity, variety • Enablers • Storage capacity & processing power • Maturity of technologies • Availability of data, e.g., social networks and mobile devices • Mindset • Financial domain: one of the biggest gainers

  41. Quick recap (2/3) Solving big data problems • Distributed infrastructure • Amazon EC2 • Distributed processing capacity • MapReduce (Hadoop) • Twitter Storm • Distributed storage

  42. Quick recap (3/3) Examples • Elections • No such thing as bad publicity • Stock trading • Sentiment vs. price, Twitter volume vs. trading volume • News & blogs • Volume & sentiment expose big events • Cohesiveness vs. VIX & CDS • Content and sentiment as inputs into a pump & dump detection model

  43. Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu http://www.sowalabs.de (coming really soon!)
