
Introduction to Big Data and NoSQL

Introduction to Big Data and NoSQL. SQL Azure Saturday, April 21, 2012. Don Demsak, Advisory Solutions Architect, EMC Consulting, www.donxml.com. Meet Don: Advisory Solutions Architect, EMC Consulting; Application Architecture, Development & Design; DonXml.com, Twitter: donxml


Presentation Transcript


  1. Introduction to Big Data and NoSQL • SQL Azure Saturday • April 21, 2012 • Don Demsak • Advisory Solutions Architect • EMC Consulting • www.donxml.com

  2. Meet Don • Advisory Solutions Architect • EMC Consulting • Application Architecture, Development & Design • DonXml.com, Twitter: donxml • Email – don@donxml.com • SlideShare - http://www.slideshare.net/dondemsak

  3. The era of Big Data

  4. How did we get here? • Expensive • Processors • Disk space • Memory • Operating Systems • Software • Programmers • Monoculture • Limit CPU cycles • Limit disk space • Limit memory • Limited OS Development • Limited Software • Programmers • Mono-lingual • Mono-persistence

  5. Typical RDBMS Implementations • Fixed table schemas • Small but frequent reads/writes • Large batch transactions • Focus on ACID • Atomicity • Consistency • Isolation • Durability

  6. How we scale RDBMS implementations

  7. 1st Step – Build a relational database [Diagram: a single Database]

  8. 2nd Step – Table Partitioning [Diagram: one Database whose table is split into partitions p1, p2, p3]

  9. 3rd Step – Database Partitioning [Diagram, repeated for Customer #1, #2 and #3: Browser, Web Tier, B/L Tier, Database]
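
One way to picture the per-customer split on this slide is a lookup that sends each customer's requests to that customer's own database. The sketch below is only an illustration: the customer IDs, the connection strings, and the `resolveDatabase` helper are all hypothetical and do not come from the talk.

```javascript
// Hypothetical routing table: one database per customer, as in the diagram.
// The IDs and connection strings are made up for illustration.
const customerDatabases = new Map([
  ["customer-1", "Server=db1.example.local;Database=Customer1"],
  ["customer-2", "Server=db2.example.local;Database=Customer2"],
  ["customer-3", "Server=db3.example.local;Database=Customer3"],
]);

// Return the connection string for the database that holds this customer's data.
function resolveDatabase(customerId) {
  const connectionString = customerDatabases.get(customerId);
  if (connectionString === undefined) {
    throw new Error(`No database registered for ${customerId}`);
  }
  return connectionString;
}
```

The web and business-logic tiers would call something like `resolveDatabase("customer-2")` before opening a connection, so no query ever spans two customers' databases.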

  10. 4th Step – Move to the cloud? [Diagram, repeated for Customer #1, #2 and #3: Browser, Web Tier, B/L Tier, SQL Azure Federation]

  11. There have to be other ways

  12. Polyglot Persistence

  13. Polyglot Programmer

  14. Where Did NoSQL Originate? • 1998 - Carlo Strozzi • NoSQL project - lightweight open-source relational DB with no SQL interface • 2009 - Eric Evans & Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases

  15. NoSQL (loose) Definition • (often) Open source • Non-relational • Distributed • (often) don’t guarantee ACID

  16. Atlanta 2009 • No:sql(east) conference • select fun, profit from real_world where relational=false • Billed as “conference of no-rel datastores”

  17. Types Of NoSQL Data Stores

  18. 5 Groups of Data Models

  19. Document Store • Apache Jackrabbit • CouchDB • MongoDB • SimpleDB • XML Databases • MarkLogic Server • eXist.

  20. Document? • Okay think of a web page... • Relational model requires column/tag • Lots of empty columns • Wasted space • Document model just stores the pages as is • Saves on space • Very flexible.
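
The "lots of empty columns" point can be made concrete with a small sketch. It assumes two made-up pages and a made-up column list (none of these names are from the talk) and counts how many empty cells a fixed schema would carry, compared with storing each page as is.

```javascript
// Hypothetical column set a fixed relational schema would need to cover every page.
const columns = ["title", "author", "video", "comments", "price"];

// Two hypothetical pages with different shapes, stored "as is" as documents.
const pages = [
  { title: "Home", author: "Ann" },
  { title: "Product", price: 9.99, comments: ["Nice"] },
];

// Relational-style row: every column is present, missing values become null.
function toRow(doc) {
  return columns.map((column) => (column in doc ? doc[column] : null));
}

// Count the empty cells the fixed schema would store for a set of documents.
function countEmptyCells(docs) {
  return docs.reduce(
    (total, doc) => total + toRow(doc).filter((value) => value === null).length,
    0
  );
}
```

For these two pages, `countEmptyCells(pages)` is 5 out of 10 cells, while the document form stores only the five values that exist.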

  21. Graph Storage • AllegroGraph • Core Data • Neo4j • DEX • FlockDB • Microsoft Trinity (research project) • http://research.microsoft.com/en-us/projects/trinity/

  22. What’s a graph? • Graph consists of • Node (‘stations’ of the graph) • Edges (lines between them) • FlockDB • Created by the Twitter folks • Nodes = Users • Edges = Nature of relationship between nodes.
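
As a rough illustration of nodes and edges, the sketch below models users as nodes and "follows" relationships as edges, in the spirit of the FlockDB bullets above. The user names and helper functions are hypothetical, and this plain in-memory list is not how FlockDB stores data.

```javascript
// Nodes = users (hypothetical names).
const nodes = ["ann", "bob", "cat"];

// Edges = nature of the relationship between two nodes.
const edges = [
  { from: "ann", to: "bob", type: "follows" },
  { from: "cat", to: "bob", type: "follows" },
  { from: "bob", to: "ann", type: "follows" },
];

// Who follows this user? (edges pointing at the node)
function followersOf(user) {
  return edges
    .filter((edge) => edge.type === "follows" && edge.to === user)
    .map((edge) => edge.from);
}

// Who does this user follow? (edges leaving the node)
function followingOf(user) {
  return edges
    .filter((edge) => edge.type === "follows" && edge.from === user)
    .map((edge) => edge.to);
}
```

Here `followersOf("bob")` returns `["ann", "cat"]` and `followingOf("bob")` returns `["ann"]`.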

  23. Key/Value Stores • On disk • Cache in RAM • Eventually Consistent • Weak Definition • “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent” • Strong Definition • “for a given update and a given replica eventually either the update reaches the replica or the replica retires” • Ordered • Distributed Hash Table allows lexicographical processing
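
The weak definition of eventual consistency can be sketched as a toy simulation. Everything here is an assumption made for illustration: the `Replica` and `Cluster` classes, the single-writer version counter, and the explicit `propagate` step stand in for whatever replication mechanism a real store uses.

```javascript
// One copy of the key/value data. Keeps the highest-versioned value per key.
class Replica {
  constructor() {
    this.data = new Map();
  }
  apply(update) {
    const current = this.data.get(update.key);
    if (!current || update.version > current.version) {
      this.data.set(update.key, { value: update.value, version: update.version });
    }
  }
  read(key) {
    const entry = this.data.get(key);
    return entry ? entry.value : undefined;
  }
}

// Toy cluster: writes land on replica 0 and are queued for the others.
class Cluster {
  constructor(replicaCount) {
    this.replicas = Array.from({ length: replicaCount }, () => new Replica());
    this.pending = this.replicas.map(() => []);
    this.version = 0;
  }
  write(key, value) {
    const update = { key, value, version: ++this.version };
    this.replicas[0].apply(update);
    for (let i = 1; i < this.replicas.length; i++) this.pending[i].push(update);
  }
  // Deliver every queued update. With no new writes, replicas converge.
  propagate() {
    this.pending.forEach((queue, i) => {
      while (queue.length > 0) this.replicas[i].apply(queue.shift());
    });
  }
  isConsistent(key) {
    const first = this.replicas[0].read(key);
    return this.replicas.every((replica) => replica.read(key) === first);
  }
}
```

Between `write` and `propagate` a reader may see stale or missing data on some replicas; once updates stop and propagation finishes, every replica returns the same value.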

  24. Key/Value Examples • Azure AppFabric Cache • memcached • VMware vFabric GemFire

  25. Object Databases • db4o • GemStone/S • InterSystems Caché • Objectivity/DB • ZODB

  26. Tabular • BigTable • Mnesia • HBase • Hypertable • Azure Table Storage • SQL Server 2012

  27. Azure Table Storage Demo

  28. Big Data

  29. Big Data Definition • Volumes & volumes of data • Unstructured • Semi-structured • Not suited for Relational Databases • Often utilizes MapReduce frameworks

  30. Big Data Examples • Cassandra • Hadoop • Greenplum • Azure Storage • EMC Atmos • Amazon S3 • SQL Azure (with Federations support)

  31. Real World Example • Twitter • The challenges • Needs to store many graphs • Who you are following • Who’s following you • Who you receive phone notifications from etc • To deliver a tweet requires rapid paging of followers • Heavy write load as followers are added and removed • Set arithmetic for @mentions (intersection of users).

  32. What did they try? • Started with Relational Databases • Tried Key-Value storage of denormalized lists • Did it work? • Nope • Either good at • Handling the write load • Or paging large amounts of data • But not both

  33. What did they need? • Simplest possible thing that would work • Allow for horizontal partitioning • Allow write operations to • Arrive out of order • Or be processed more than once • Failures should result in redundant work • Not lost work!

  34. The Result was FlockDB • Stores graph data • Not optimized for graph traversal operations • Optimized for large adjacency lists • List of all edges in a graph • Key is the edge, value is a set of the node end points • Optimized for fast read and write • Optimized for page-able set arithmetic.
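
"Page-able set arithmetic" can be illustrated with an intersection of two sorted ID lists that is returned one page at a time. This is only a sketch under stated assumptions: the function name, the cursor convention, and the sample IDs are hypothetical, and it rescans both lists on every call, which a real system would avoid.

```javascript
// Intersect two ascending-sorted ID lists and return at most pageSize IDs
// that are greater than the cursor. Pass -Infinity as the cursor for the first page.
function intersectPage(a, b, cursor, pageSize) {
  const page = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length && page.length < pageSize) {
    if (a[i] < b[j]) {
      i++;
    } else if (a[i] > b[j]) {
      j++;
    } else {
      // Same ID in both lists: part of the intersection.
      if (a[i] > cursor) page.push(a[i]);
      i++;
      j++;
    }
  }
  // A full page may have more results after it; a short page is the last one.
  const nextCursor = page.length === pageSize ? page[page.length - 1] : null;
  return { page, nextCursor };
}

// Hypothetical follower ID lists for two users.
const followersOfA = [1, 3, 5, 7, 9];
const followersOfB = [3, 4, 5, 9, 10];
const firstPage = intersectPage(followersOfA, followersOfB, -Infinity, 2);
const secondPage = intersectPage(followersOfA, followersOfB, firstPage.nextCursor, 2);
```

With these lists, `firstPage.page` is `[3, 5]` and `secondPage.page` is `[9]`, with `secondPage.nextCursor` set to `null` to signal the end.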

  35. How Does it Work? • Stores graphs as sets of edges between nodes • Data is partitioned by node • All queries can be answered by a single partition • Write operations are idempotent • Can be applied multiple times without changing the result • And commutative • Changing the order of operands doesn’t change the result.
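
A small sketch can show what idempotent and commutative writes buy you. It assumes each write carries a timestamp and that the newest write per edge wins, with a fixed tie-break; this is an illustrative scheme, not FlockDB's actual code, and all names in it are hypothetical.

```javascript
// Apply one edge write to the store. Newest timestamp wins; on a tie the
// operation name breaks it deterministically ("remove" sorts after "add").
function applyWrite(store, write) {
  const key = `${write.from}->${write.to}`;
  const current = store.get(key);
  const isNewer =
    !current ||
    write.ts > current.ts ||
    (write.ts === current.ts && write.op > current.op);
  if (isNewer) store.set(key, { op: write.op, ts: write.ts });
  return store;
}

// Apply a batch of writes to an empty store.
function applyAll(writes) {
  return writes.reduce((store, write) => applyWrite(store, write), new Map());
}

// Edges whose latest operation is "add".
function activeEdges(store) {
  return [...store]
    .filter(([, state]) => state.op === "add")
    .map(([key]) => key)
    .sort();
}

// Hypothetical writes: ann follows bob, then unfollows; cat follows bob.
const writes = [
  { from: "ann", to: "bob", op: "add", ts: 1 },
  { from: "ann", to: "bob", op: "remove", ts: 2 },
  { from: "cat", to: "bob", op: "add", ts: 3 },
];

const inOrder = activeEdges(applyAll(writes));
// Reversed order, with every write delivered twice.
const shuffledWithDuplicates = activeEdges(applyAll([...writes].reverse().concat(writes)));
```

Both `inOrder` and `shuffledWithDuplicates` come out as `["cat->bob"]`, which is why out-of-order arrival and retries cost redundant work rather than lost or corrupted data.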

  36. Working With Big Data

  37. ACID • Atomicity • All or Nothing • Consistency • Valid according to all defined rules • Isolation • No transaction should be able to interfere with another transaction • Durability • Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors

  38. BASE • Basically Available • High availability but not always consistent • Soft state • Background cleanup mechanism • Eventual consistency • Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.

  39. Traditional (relational) Approach Transactional Data Store Data Warehouse

  40. Big Data Approach • MapReduce Pattern/Framework • an Input Reader • Map Function – To transform to a common shape (format) • a partition function • a compare function • Reduce Function • an Output Writer
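
The six parts listed on this slide can be wired together in a few lines. The sketch below is a single-process toy, assuming a hypothetical `mapReduce` helper and a word-count job; a real framework would run the map and reduce steps on many machines.

```javascript
// Minimal single-process MapReduce driver built from the six listed parts.
function mapReduce({ inputReader, map, partition, compare, reduce, outputWriter, partitions }) {
  // Map phase: each record may emit any number of [key, value] pairs,
  // and the partition function picks the bucket for each key.
  const buckets = Array.from({ length: partitions }, () => []);
  for (const record of inputReader()) {
    for (const [key, value] of map(record)) {
      buckets[partition(key, partitions)].push([key, value]);
    }
  }
  // Reduce phase: sort each bucket with the compare function, group equal
  // keys, reduce each group, and hand the result to the output writer.
  for (const bucket of buckets) {
    bucket.sort((x, y) => compare(x[0], y[0]));
    let i = 0;
    while (i < bucket.length) {
      const key = bucket[i][0];
      const values = [];
      while (i < bucket.length && compare(bucket[i][0], key) === 0) {
        values.push(bucket[i][1]);
        i++;
      }
      outputWriter(key, reduce(key, values));
    }
  }
}

// Hypothetical word-count job.
const wordCounts = {};
mapReduce({
  inputReader: () => ["big data", "big graphs"],
  map: (line) => line.split(" ").map((word) => [word, 1]),
  partition: (key, n) => key.length % n,
  compare: (a, b) => (a < b ? -1 : a > b ? 1 : 0),
  reduce: (key, values) => values.reduce((sum, v) => sum + v, 0),
  outputWriter: (key, value) => {
    wordCounts[key] = value;
  },
  partitions: 2,
});
```

After the call, `wordCounts` holds `{ data: 1, graphs: 1, big: 2 }`. Because every occurrence of a key lands in the same partition, each partition could be reduced by a different worker.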

  41. MongoDB Example
      > // map function
      > m = function(){
      ...     this.tags.forEach(
      ...         function(z){
      ...             emit( z , { count : 1 } );
      ...         }
      ...     );
      ... };
      > // reduce function
      > r = function( key , values ){
      ...     var total = 0;
      ...     for ( var i=0; i<values.length; i++ )
      ...         total += values[i].count;
      ...     return { count : total };
      ... };
      > // execute
      > res = db.things.mapReduce(m, r, { out : "myoutput" } );
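
To see what the map and reduce functions on this slide compute without a running server, they can be exercised against a tiny in-memory stand-in. The `emit` and `localMapReduce` helpers and the sample documents below are hypothetical scaffolding, not MongoDB's implementation; in particular, this stand-in calls reduce even for keys with a single value.

```javascript
// Same map and reduce logic as on the slide.
const m = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};
const r = function (key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i].count;
  return { count: total };
};

// Stand-in for the server-side emit(): collect values per key.
let emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

// Run map over every document (bound as `this`), then reduce each key.
function localMapReduce(docs, map, reduce) {
  emitted = new Map();
  docs.forEach((doc) => map.call(doc));
  const out = {};
  for (const [key, values] of emitted) out[key] = reduce(key, values);
  return out;
}

// Hypothetical contents of the `things` collection.
const things = [
  { tags: ["dog", "cat"] },
  { tags: ["cat"] },
  { tags: ["mouse", "cat", "dog"] },
];
const tagCounts = localMapReduce(things, m, r);
```

For these documents `tagCounts` is `{ dog: { count: 2 }, cat: { count: 3 }, mouse: { count: 1 } }`, which is the kind of per-tag tally the shell command would write to the output collection.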

  42. MongoDB Demo

  43. Big Data on Azure • Azure Table Storage • Azure Service Bus • SQL Azure Federations • MongoDB on Azure • http://www.mongodb.org/display/DOCS/MongoDB+on+Azure • Hadoop on Azure • https://www.hadooponazure.com/

  44. Using Azure for Computing [Diagram labels: Client, Master, Sockets, Job/Task Scheduler, three Workers, Data]

  45. Moving to Event Based Architecture • Monitor queue length against user’s expectations [Diagram labels: Web Roles, requests (Req), Queue, Worker Roles]
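
One possible reading of "monitor queue length against user's expectations" is a rule that turns the current backlog into a worker count. The sketch below assumes a hypothetical `workersNeeded` helper, a known average processing time per message, and a maximum wait users will tolerate; none of these come from the talk.

```javascript
// Estimate how many workers are needed so the current backlog drains within
// the wait time users expect. All three inputs are assumptions of this sketch.
function workersNeeded(queueLength, secondsPerMessage, maxWaitSeconds) {
  const totalWorkSeconds = queueLength * secondsPerMessage;
  // Always keep at least one worker running, even with an empty queue.
  return Math.max(1, Math.ceil(totalWorkSeconds / maxWaitSeconds));
}
```

For example, 120 queued messages at 2 seconds each with a 60 second expectation gives `workersNeeded(120, 2, 60)`, which is 4. A monitoring loop could compare that number with the running worker role count and scale out or in.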

  46. Aggregate Stores

  47. Visualizing Aggregates [Diagram of relational tables: Orders (ID: 1001), Customers (Customer: Ann), Order Lines (Line Items), Credit Cards (Payment Details: Card: AmEx, CC#: 12343, Expiration: 07/2015)]

  48. Visualizing Aggregates { “SalesOrdersView”: { ID: 1001, Customer: Ann, LineItems: [] …………….. ……………. …………….. } } [Diagram: a single aggregate holding ID: 1001, Customer: Ann, Line Items, and Payment Details (Card: AmEx, CC#: 12343, Expiration: 07/2015)]
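
The move from separate tables to one aggregate can be sketched as a function that gathers the related rows for an order into a single document. The order, customer, and card values are taken from the slides; the line items, field names, and the `buildSalesOrderView` helper are hypothetical, since the slide elides those details.

```javascript
// Relational-style rows. Order and card values follow the slides;
// the line items are made up for illustration.
const orders = [{ id: 1001, customer: "Ann" }];
const orderLines = [
  { orderId: 1001, product: "hypothetical-widget", quantity: 2 },
  { orderId: 1001, product: "hypothetical-gadget", quantity: 1 },
];
const creditCards = [
  { orderId: 1001, card: "AmEx", ccNumber: "12343", expiration: "07/2015" },
];

// Gather everything that belongs to one order into a single aggregate document.
function buildSalesOrderView(orderId) {
  const order = orders.find((o) => o.id === orderId);
  if (!order) return null;
  const payment = creditCards.find((c) => c.orderId === orderId);
  return {
    SalesOrdersView: {
      ID: order.id,
      Customer: order.customer,
      LineItems: orderLines
        .filter((line) => line.orderId === orderId)
        .map(({ product, quantity }) => ({ product, quantity })),
      PaymentDetails: payment
        ? { Card: payment.card, CCNumber: payment.ccNumber, Expiration: payment.expiration }
        : null,
    },
  };
}

const salesOrder = buildSalesOrderView(1001);
```

The resulting object can be stored and fetched as one unit in a document store, instead of being reassembled with joins on every read.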

  49. MongoDB on Azure Demo
