
Platform for Big Data, NoSQL and Relational Data. What makes sense for me? (+Azure)

Michael Epprecht, Technology Evangelist, michael.epprecht@microsoft.com, @fastflame. Agenda: Big Data; AllSQL, NoSQL, NewSQL, SomeSQL; Windows Azure.





Presentation Transcript


  1. Platform for Big Data, NoSQL and Relational Data. What makes sense for me? (+Azure)

    Michael Epprecht Technology Evangelist michael.epprecht@microsoft.com @fastflame
  2. Agenda: Big Data; AllSQL, NoSQL, NewSQL, SomeSQL; Windows Azure
  3. Big Data
  4. WHAT IS BIG DATA? Chart: data volume (megabytes to petabytes) against data complexity (variety and velocity). ERP/CRM (gigabytes): payables, payroll, inventory, contacts, deal tracking, sales pipeline. Web 2.0 (terabytes): advertising, mobile, collaboration, eCommerce, web logs, digital marketing, search marketing, recommendations. Big Data (petabytes): click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video, log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image.
  5. Original Gartner three V's (Feb 2001): http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf Volume (think data tiering): size of the data; manageability. Velocity (think CEP): speed at which data is received; latency to deliver data analysis. Variety (think ETL, ODS, email, social networks): differing formats of data; disparate source systems.
  6. Big Data to Data Analytics. Variety: dealing with un/semi-structured and structured data. How do you mix oranges and apples? Compare textual data with relational data. Tooling: accessing the "variety" of different data sources. Determining "value". Big Data = proxy for doing more with existing data. Perspective: what you are doing. Hardware innovations over time: spinning disk vs. flash; GPGPU vs. CPU.
  7. Replacing BI? Single version of the truth? Conformed dimensions (standardised data reporting): four different operational systems ETL'd into a single dimension. Does Big Data change that? NO! YES! Unstructured data is unstructured: can it be conformed? Report on detail or aggregations? No: this is analytics, we are data mining. It still needs standardisation and thought, i.e. a formal design process.
  8. All data has structure; not all data has context. Data stored [in structure]: image -> png, jpg, bmp etc.; free text -> ASCII, Unicode, .docx, .xls etc.; sound -> mp3, mpeg. Data queried: image -> (?) face recognition, Kinect; free text -> grammar; sound -> pitch, note etc. Context? Image -> polygon; free text -> ??; sound -> bars in the music??
  9. has Structure? A1 difficulties
  10. Has context? Stored in normal form (relational); stored in Unicode. "A1" could mean anything. "Difficulties": the word itself has meaning. Notes: with normal form (relational), context is provided by the schema. New term time: "uncontexted data" (115 Bing references). Context gives data structure only when applied.
  11. Big Data Processing
  12. We've been hyped. The bandwagon is rolling. If you hear a new term, research it; it is probably nothing new.
  13. Finally: what is Big Data (really)? Data analytics (stuff we already do). What is new? New toolsets to help with variety of data; industry waking up to the power of commodity kit; data science as a field (a combination of BI analyst, business analyst and BI developer). It's still all about insights into our data. Hadoop: the platform of the next generation? Look out for the name change: Big Data will become Data Analytics.
  14. A NEW SET OF QUESTIONS. How do I better predict future outcomes? How do I optimize my fleet based on weather and traffic patterns? What's the social sentiment for my brand or products? Areas: advanced analytics; social & web analytics; live data feeds.
  15. Common Big Data customer scenarios. Gain competitive advantage by moving first and fast in your industry: IT infrastructure optimization, legal discovery, social network analysis, traffic flow optimization, web app optimization, weather forecasting, healthcare outcomes, natural resource exploration, churn analysis, fraud detection, life sciences research, advertising analysis, equipment monitoring, smart meter monitoring.
  16. What is Hadoop?
  17. Massively Parallel Processing (MPP): chop a task up across multiple physical machines. High Performance Clustering (HPC). Distributed Data Processing (DDP): processing done locally on the data. MapReduce is based on something we know already.
  18. Why MPP? Because enterprise kit for this performance is way too expensive: 100 machines with cheap DAS cost a fraction of a scale-up machine with expensive SAN infrastructure. Most NoSQL and NewSQL products are built with MPP and commodity kit as a design feature, as is the cloud computing model. Network connectivity is a key component (hence take the processing to the data!). This follows the design paradigm that processing should move to the data, not the data to the processing.
  19. What is Hadoop? Open source project coordinated by Apache. Analogous to an OS; core components: Utilities, HDFS, MapReduce. Lots of other projects sit within the ecosphere: Mahout, Sqoop, Flume, Scribe, Oozie, Jaql, Hue, Hiho, Hive, Pig, HBase, … and more and more… V1.0.0 and V2.0.0 code branches.
  20. HBase: persistent | distributed. In memory. Efficient at random reads/writes. Distributed, large-scale data store. Utilizes Hadoop for persistence. Both HBase and Hadoop are distributed.
  21. In Hadoop MapReduce speak. Map: parse each input line to get the data you want; the output is a key (presented to a single reducer) and value pair (what we will likely aggregate). Shuffle: sort and move the same "keys" to the same node for reduction (can be expensive, so plan your data partitions properly). Reduce: aggregate values. Output. http://developer.yahoo.com/hadoop/tutorial/module4.html
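
The three phases on this slide can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: parse one input line and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: sort and group values so each key reaches a single reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is why the slide warns that it can be expensive.
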
  22. MapReduce as SQL: Map = SELECT … FROM … WHERE; Reduce = GROUP BY
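
To make the analogy concrete, the sketch below computes the same aggregate twice: once with a SQL query (via Python's built-in `sqlite3`) and once with explicit map and reduce steps. The table name `hits`, its columns and the sample rows are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (source, bytes) pairs.
ROWS = [("web", 120), ("mobile", 80), ("web", 30), ("batch", 5)]


def sql_version(rows, min_bytes):
    """SELECT ... FROM ... WHERE ... GROUP BY, run in an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE hits (source TEXT, bytes INTEGER)")
    con.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    result = con.execute(
        "SELECT source, SUM(bytes) FROM hits WHERE bytes >= ? GROUP BY source",
        (min_bytes,),
    ).fetchall()
    con.close()
    return dict(result)


def mapreduce_version(rows, min_bytes):
    """The same query expressed as map (select/filter) and reduce (group/aggregate)."""
    # Map: pick the columns we need and apply the WHERE filter.
    mapped = [(source, size) for source, size in rows if size >= min_bytes]
    # Shuffle: bring equal keys together.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: GROUP BY key with SUM as the aggregate.
    return {key: sum(values) for key, values in grouped.items()}
```

Both functions return the same dictionary for the same input, which is the point of the slide: the map step plays the role of SELECT/WHERE and the reduce step plays the role of GROUP BY.
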
  23. AllSQL, NoSQL, NewSQL and SomeSQL
  24. AllSQL: data stored in normal form; ACID for consistency and durability; queries done using ANSI SQL. Basically what the majority of folk do. The majority of reporting products use SQL as an interface. Everybody knows SQL (despite its sins). Easy to understand and get going with.
  25. NoSQL (Not Only SQL). Led by developers wanting: more flexible data structures (dynamic schema); the ability to store non-tabular data; higher scalability (scale out); lower hardware cost (build on commodity kit); durability and consistency not a primary concern; open source (move away from proprietary products); data resilience built into the product through replicas rather than expensive hardware and software solutions. Examples (see http://nosql-database.org/, there are hundreds!): Azure Table Store, Google's BigTable, Hadoop MapReduce, Cassandra, RavenDB, CouchDB, MongoDB.
  26. NoSQL momentum. RDBMS cannot scale because of ACID (Atomicity, Consistency, Isolation, Durability). Swathe of new open source products. Data captured has value but is not readily accessible. NewSQL: will it "cure" the NoSQL problem?
  27. NewSQL. Existing AllSQL products do not scale out well: single machine design; design is several decades old; expensive to create a DR/HA environment. Realisation: folk do not want to learn Java in order to report off their data; most toolsets use SQL as a method for reporting. Examples: VoltDB, NuoDB, Azure DB.
  28. AllSQL, NoSQL, NewSQL and SomeSQL. The days when everything is in SQL Server are going. BI/BA/DA {whatever you want to call it} is done across different data sources: semi/un/fully structured. Understand the non-relational world. The SQL language isn't going anywhere. This isn't about enterprise only; this affects us all.
  29. Windows Azure
  30. MANAGE any data, any size, anywhere. Unified monitoring, management & security. Non-relational; streaming; relational. Data movement.
  31. HADOOP INTEGRATED INTO THE DATA PLATFORM (non-relational). Microsoft HDInsight Server for on-premises; Windows Azure HDInsight Service for cloud. Enterprise-class security, HA & management. Seamlessly integrated with Microsoft BI tools. Windows simplicity and manageability. Provisioned in minutes on Windows Azure. Built on Hortonworks Data Platform (HDP).
  32. Hadoop architecture: Business Intelligence (Excel, Power View…); Active Directory (security); System Center; pipeline / workflow (Oozie); metadata (HCatalog); graph (Pegasus); stats processing (RHadoop); data integration (ODBC / Sqoop / REST); scripting (Pig); query (Hive); machine learning (Mahout); NoSQL database (HBase); log file aggregation (Flume); distributed processing (MapReduce); distributed storage (HDFS).
  33. Insights for all users through familiar tools (GB, TB, PB). BI professionals, business analysts, data scientists. Advanced analytics from Microsoft and 3rd parties. Self-service analysis with PowerPivot & Power View. Interactivity & exploration with Hadoop data in Excel.
  34. Azure SQL Database
  35. SQL Database Architecture
  36. Architecture. Federation: an object contained within a user database; defines the scheme for the federation; represents the database being sharded. Federation root: the database that houses the federation object. Federation member: system-managed SQL databases; each contains part, or a "slice", of the data. Diagram: database SalesDB is the federation root for Orders_Fed, whose data lives in the federation members. Syntax: CREATE FEDERATION fed_name (fed_key_label fed_key_type distribution_type)
  37. Architecture cont. Federation key: the key used for data distribution (int, bigint, guid, varbinary). Atomic unit: represents a single instance of a federation key, i.e. all rows in all federated tables with the same federation key value. Diagram: in SalesDB, federation Orders_Fed; atomic units AUPK=5, AUPK=25 and AUPK=35 sit in one federation member, while AUPK=1005, AUPK=1025 and AUPK=1035 sit in the member with range [1000, 2000).
  38. Architecture cont. Federated table: contains only atomic units for the member's key range. Reference table: a non-federated table.
  39. Repartitioning: dynamic partitioning. SPLIT members to spread workloads over more nodes; DROP members to shrink back to fewer nodes. Example: ALTER FEDERATION Orders_Fed SPLIT AT (tenant_id=7500). Diagram: in SalesDB, the Orders_Fed member [5000, 10000) becomes [5000, 7500) & [7500, 10000).
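
The effect of the SPLIT on the slide can be modelled with half-open key ranges. This is a conceptual sketch only, not how SQL Database implements it; `split_member` and `rows_for_member` are hypothetical helpers, and the rows are plain dictionaries keyed by `tenant_id`.

```python
def split_member(members, split_at):
    """Split the half-open range [low, high) that contains split_at into two ranges.

    members is a list of (low, high) tuples; ranges not containing the split
    point (or starting exactly at it) are returned unchanged.
    """
    result = []
    for low, high in members:
        if low < split_at < high:
            result.extend([(low, split_at), (split_at, high)])
        else:
            result.append((low, high))
    return result


def rows_for_member(rows, member):
    """Return the rows (atomic units) whose tenant_id falls inside a member's range."""
    low, high = member
    return [row for row in rows if low <= row["tenant_id"] < high]
```

For the slide's example, `split_member([(5000, 10000)], 7500)` yields the two ranges `[5000, 7500)` and `[7500, 10000)`, and every row with the same `tenant_id` stays together in exactly one of them.
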
  40. Reliable routing: built-in Data-Dependent Routing (DDR). Ensures apps can discover where the data is just-in-time. No "shard map" caching. Guaranteed member routing. Example: USE FEDERATION Orders_Fed (tenant_id=7509). Diagram: in SalesDB, Orders_Fed members [5000, 7500) & [7500, 10000).
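
Data-dependent routing amounts to finding the member whose key range contains the requested federation key. The sketch below shows that lookup with a binary search; the member names and the module-level constants are assumptions for the example, and in SQL Database the service performs this lookup for you when `USE FEDERATION` runs.

```python
from bisect import bisect_right

# Hypothetical layout matching the slide: two members covering [5000, 10000).
MEMBER_LOWS = [5000, 7500]          # inclusive lower bound of each member, ascending
MEMBER_NAMES = ["member_5000_7500", "member_7500_10000"]
FEDERATION_HIGH = 10000             # exclusive upper bound of the last member


def route(tenant_id):
    """Return the name of the member whose half-open range contains tenant_id."""
    if not MEMBER_LOWS[0] <= tenant_id < FEDERATION_HIGH:
        raise KeyError(f"tenant_id {tenant_id} is outside all member ranges")
    # bisect_right counts the lower bounds <= tenant_id; the last of them wins.
    return MEMBER_NAMES[bisect_right(MEMBER_LOWS, tenant_id) - 1]
```

With this layout, `route(7509)` resolves to the second member, mirroring `USE FEDERATION Orders_Fed (tenant_id=7509)` on the slide.
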
  41. Azure NoSQL (Azure Table Storage)
  42. Table Storage concepts: Account -> Table -> Entity. Example: account contoso has a table customers (entities: Name = …, Email = …; Name = …, EMailAdd = …) and a table photos (entities: Photo ID = …, Date = …; Photo ID = …, Date = …).
  43. Table details. Tables: create, query, delete; tables can have metadata. Not an RDBMS! Entities: insert; update (Merge: partial update; Replace: update entire entity); upsert; delete; query. Entity Group Transactions: multiple CUD operations in a single atomic transaction.
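
The difference between the two update modes is easiest to see on plain dictionaries. This is an in-memory illustration of the semantics only, not a call into any Azure SDK; `merge_entity` and `replace_entity` are hypothetical helpers.

```python
KEY_PROPERTIES = ("PartitionKey", "RowKey")


def merge_entity(stored, update):
    """Merge: partial update. Properties missing from `update` are kept."""
    return {**stored, **update}


def replace_entity(stored, update):
    """Replace: the whole entity is overwritten. Only the keys carry over."""
    keys = {name: stored[name] for name in KEY_PROPERTIES}
    return {**keys, **update}
```

Merging `{"Name": "B"}` into an entity that also has an `Email` property keeps the email; replacing with the same payload drops it.
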
  44. Entity properties. An entity can have up to 255 properties and be up to 1 MB in size. Mandatory properties for every entity: PartitionKey & RowKey (the only indexed properties), which uniquely identify an entity and define the sort order; Timestamp, used for optimistic concurrency and exposed as an HTTP ETag. No fixed schema for other properties: each property is stored as a <name, typed value> pair, and no schema is stored for a table. Properties can be the standard .NET types: String, binary, bool, DateTime, GUID, int, int64, and double.
  45. No Fixed Schema FAV SPORT Canoeing
  46. Querying: ?$filter=Last eq 'Wegner'
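
A query string like the one on this slide has to be URL-encoded before it goes on the wire. The helper below is a small sketch of that step; `build_filter` is a hypothetical name, it handles string values only, and it assumes the OData convention of doubling a single quote inside a string literal.

```python
from urllib.parse import quote

# OData comparison operators used in $filter expressions.
ALLOWED_OPERATORS = {"eq", "ne", "gt", "ge", "lt", "le"}


def build_filter(prop, operator, value):
    """Build a URL-encoded ?$filter= query string for a string-valued property."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    # Escape embedded single quotes by doubling them, then wrap in quotes.
    literal = "'" + value.replace("'", "''") + "'"
    expression = f"{prop} {operator} {literal}"
    return "?$filter=" + quote(expression, safe="")
```

Queries that filter on a property other than PartitionKey and RowKey, such as `Last` here, cannot use the index, since the slides note those two are the only indexed properties.
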
  47. Purpose of the PartitionKey. Entity locality: entities in the same partition are stored together, giving efficient querying and cache locality; endeavour to include the partition key in all queries. Entity Group Transactions: atomic multiple insert/update/delete in the same partition in a single transaction. Table scalability: target throughput is 500 tps per partition and several thousand tps per account; Windows Azure monitors the usage patterns of partitions and automatically load balances them; each partition can be served by a different storage node, scaling to meet the traffic needs of your table.
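
Because an Entity Group Transaction only spans one partition, a client that wants atomic batches first has to group its pending operations by PartitionKey. A minimal sketch of that grouping, with `batch_by_partition` as a hypothetical helper and operations modelled as plain dictionaries:

```python
from collections import defaultdict


def batch_by_partition(operations):
    """Group pending operations so each batch touches exactly one partition.

    Each operation is a dict with at least a "PartitionKey" entry; the order
    of operations within a partition is preserved.
    """
    batches = defaultdict(list)
    for operation in operations:
        batches[operation["PartitionKey"]].append(operation)
    return dict(batches)
```

Each resulting batch could then be submitted as one atomic transaction, while operations in different partitions succeed or fail independently of each other.
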
  48. Partitions and partition ranges. Diagram: table Products starts on Server A; after load balancing, Server A serves the partition range [MinKey - Canoes) and Server B serves [Canoes - MaxKey).
  49. MANAGE ANY DATA, ANY SIZE, ANYWHERE. Unified monitoring, management & security. Non-relational: Hadoop on Windows, Hadoop on Azure. Relational: SQL Server Database & Parallel Data Warehouse. Streaming: StreamInsight. Data movement: Hadoop connectors & ETL.
  50. Windows Azure platform layers. Frameworks and services: caching, identity, service bus, media, cdn, big data, commerce, integration, analytics, hpc, mobile. Fabric: compute (virtual machines, web sites, cloud services); storage (SQL database, noSQL database, blob storage); networking (connect, virtual network, traffic manager). Global physical infrastructure: servers / network / datacenters. Automated; managed resources; elastic; usage based. Infrastructure: N Central US, S Central US, N Europe, W Europe, E Asia, SE Asia + 24 edge CDN locations.
  51. www.microsoft.ch/shape
  52. Questions?