NoSQL by Michael Britton, Mark McGregor, and Sam Howard

Simplicity, Speed, Scalability

What is NoSQL?
Next-generation databases that mostly address some of the following points: being non-relational, distributed, open-source, and horizontally scalable.
The term “NoSQL” is misleading. A more appropriate reading is “Not Only SQL”.

Origins
• 1998 - Carlo Strozzi
• Used the name for a database that still followed the relational model but had no SQL interface
• Strozzi later suggested that the current movement would more accurately be called “NoREL”
• 2009 - Eric Evans and Johan Oskarsson
• Organized an event to discuss open-source distributed databases
• Originally a term to label non-ACID databases
• Meant to be a Twitter hashtag, but it went viral and stuck

Why NoSQL

What You Are Giving Up With NoSQL
• Relationships between entities are basically non-existent
• Limited ACID transactions
• No standard language for queries (SQL)
• Less structured

RDBMS vs. NoSQL
RDBMS
• Structured and organized data
• Structured Query Language (SQL)
• Data and its relationships stored in separate tables
• Data Manipulation Language, Data Definition Language
• Tight consistency
• ACID transactions
NoSQL
• No standard declarative query language
• No predefined schema
• Key-value pair storage, column store, document store, graph databases
• Eventual consistency rather than ACID properties
• Unstructured and unpredictable data
• CAP theorem
• BASE transactions
• Prioritizes high performance, high availability, and scalability

SQL vs. NoSQL Queries
[Slide images: a NoSQL query and a SQL query; not included in this transcript]
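
The queries shown on this slide did not survive in the transcript, so the following is only an illustrative sketch, not the presenters' example. It runs the same lookup two ways: as a declarative SQL statement (using Python's built-in `sqlite3`) and as a document-style filter expressed as data. The `users` sample data and the `matches`/`find` helpers are hypothetical; the `{"age": {"$gt": 30}}` filter shape is loosely modeled on MongoDB-style query documents.

```python
import sqlite3

# Hypothetical sample data, not taken from the slides.
users = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Linus", "age": 28, "city": "Helsinki"},
    {"name": "Grace", "age": 45, "city": "New York"},
]

# Relational side: a fixed schema and a declarative SQL query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO users VALUES (:name, :age, :city)", users)
sql_rows = conn.execute(
    "SELECT name FROM users WHERE age > ? ORDER BY name", (30,)
).fetchall()
sql_names = [row[0] for row in sql_rows]

# Document side: the query is itself a piece of data (a dict).
OPERATORS = {
    "$gt": lambda value, bound: value > bound,
    "$lt": lambda value, bound: value < bound,
}

def matches(doc, query):
    """Return True if the document satisfies every condition in the query."""
    for field, condition in query.items():
        if field not in doc:
            return False
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 30}
            if not all(OPERATORS[op](doc[field], bound)
                       for op, bound in condition.items()):
                return False
        elif doc[field] != condition:
            # Plain form, e.g. {"city": "London"} means equality
            return False
    return True

def find(collection, query):
    """Toy stand-in for a document store's find operation."""
    return [doc for doc in collection if matches(doc, query)]

doc_names = sorted(doc["name"] for doc in find(users, {"age": {"$gt": 30}}))
```

Both `sql_names` and `doc_names` end up holding the same two names; the difference is that the SQL side relies on a standard language and a declared schema, while the document side depends on whatever filter format the particular store defines.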

NoSQL vs. MySQL
• MySQL, > 50 GB of data
• Average write: ~300 ms
• Average read: ~350 ms
• Cassandra, > 50 GB of data
• Average write: 0.12 ms
• Average read: 15 ms

NoSQL Pros/Cons
Pros
• High scalability
• Distributed computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated relationships
Cons
• No standardization
• Limited query capabilities
• The eventually consistent model is not intuitive to program for

Non-Relational: The concept of joining tables together by relations is non-existent.
Distributed: A network of interconnected computers, controlled by a central database management system.
Open-Source: Anyone can make changes to the original source code.
Horizontally Scalable: Using multiple computers as one unit to increase capacity.

Non-Relational
• Relational databases join tables together using primary key / foreign key relationships
• Non-relational databases have no such structure
• Items are aggregated into one file, much like a giant Excel spreadsheet
• Prone to data duplication
• Difficult to update records

Distributed
• Non-relational databases can easily be spread across multiple machines on the same network
• Each machine in the distributed network can carry the information most relevant to its area
• Controlled by the DDBMS (Distributed Database Management System)

Open-Source
• Source code is generally available to the public
• Improve the software as needed
• Share with the community

Horizontally Scalable
[Figure: horizontal scaling compared with vertical scaling]

Other Important Terms
• Denormalization - adding redundant data or grouping data in order to improve read performance and scalability
• Does NOT mean that the data was never normalized
• Should ideally take place after 3NF has been achieved
• Constraints are used to keep redundant copies of data synchronized
• Materialized View - a database object that contains the results of a query
• The query result is cached, but can be refreshed from the original query as necessary
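
A minimal sketch of both terms, using plain Python structures rather than any real database. All names (`customers`, `orders`, `denormalize`, `MaterializedView`) are hypothetical and exist only for illustration. The denormalized rows carry a redundant copy of the customer name so that reads need no join, and the view caches a query result until it is explicitly refreshed.

```python
# Normalized form: each fact is stored once and joined at read time.
customers = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 2, "total": 40.0},
    {"order_id": 102, "customer_id": 1, "total": 15.0},
]

def denormalize(orders, customers):
    """Copy the customer name into every order so reads need no join."""
    return [
        dict(order, customer_name=customers[order["customer_id"]]["name"])
        for order in orders
    ]

def totals_by_customer(wide_rows):
    """The query whose result the materialized view will cache."""
    totals = {}
    for row in wide_rows:
        name = row["customer_name"]
        totals[name] = totals.get(name, 0.0) + row["total"]
    return totals

class MaterializedView:
    """Caches a query result; refresh() re-runs the original query."""

    def __init__(self, query):
        self._query = query
        self.rows = query()

    def refresh(self):
        self.rows = self._query()

wide_orders = denormalize(orders, customers)
view = MaterializedView(lambda: totals_by_customer(wide_orders))

# A new order arrives; the cached view is stale until it is refreshed.
wide_orders.append(
    {"order_id": 103, "customer_id": 2, "customer_name": "Grace", "total": 10.0}
)
stale = dict(view.rows)
view.refresh()
fresh = dict(view.rows)
```

The trade-off is visible in the data: renaming a customer would now require touching every order row that carries the redundant name, which is why the synchronization constraints mentioned above matter.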

Other Important Terms
• Keyspace - object that holds together all column families of a design
• Outermost grouping of data in the data store
• Resembles a schema in an RDBMS
• Column Family - a key-value pair in which the key is mapped to a value that is a set of columns
• Object that contains columns of related data
• Resembles a table in an RDBMS

Other Important Terms
• Super Column Family - a key-value pair in which the key is mapped to a value that is a set of column families
• Similar to a view in an RDBMS
• Column (data store) - a triplet consisting of a unique name, a value, and a timestamp
• The timestamp distinguishes old data from new data
• Not to be confused with a standard relational database column
• Lowest-level object in a keyspace
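
The nesting described on these two slides (keyspace, column family, row key, column) can be sketched with nested dictionaries. This is an illustrative model only, not the storage format of Cassandra or any other product; the `Column` class, the `write` helper, and the sample data are hypothetical. The timestamp is used the way the slide describes: the newest write for a column wins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    """Lowest-level object: a (name, value, timestamp) triplet."""
    name: str
    value: str
    timestamp: int  # larger means newer

# keyspace -> column family -> row key -> column name -> Column
keyspace = {"users": {}}

def write(keyspace, family, row_key, column):
    """Store a column, keeping only the version with the newest timestamp."""
    row = keyspace[family].setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp > current.timestamp:
        row[column.name] = column

write(keyspace, "users", "u1", Column("email", "ada@example.com", 1))
write(keyspace, "users", "u1", Column("email", "ada@new.example.com", 3))
# A late-arriving older write is ignored because its timestamp is lower.
write(keyspace, "users", "u1", Column("email", "ada@old.example.com", 2))
# Rows in the same column family do not need the same set of columns.
write(keyspace, "users", "u2", Column("city", "Helsinki", 1))
```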

Other Important Terms
• Database Shard - a horizontal partition in a database or search engine. Each partition is a separate shard.
• Shards can be distributed to separate hardware, reducing the number of rows in each table
• Goes a step beyond plain horizontal partitioning, which splits one or more tables by rows within a single schema or database server
• Sharding - the process of forming shards within the distributed database system
• Traditionally done by hand coding
• Auto-sharding code is highly sought after
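
As a rough sketch of hand-coded sharding, the following routes each row to one of several shards by hashing its key. The shard count, the `shard_for`/`put`/`get` helpers, and the in-memory dictionaries standing in for separate servers are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of shard servers
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one server

def shard_for(key: str) -> int:
    """Pick a shard from a stable hash of the row key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def put(key, row):
    shards[shard_for(key)][key] = row

def get(key):
    return shards[shard_for(key)].get(key)

for i in range(1000):
    put(f"user:{i}", {"id": i})

rows_per_shard = [len(shard) for shard in shards]
```

Each shard ends up with only a portion of the rows. A weakness of this simple modulo scheme is that changing `NUM_SHARDS` changes the target shard for most keys, which is part of why automatic sharding is attractive.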

Other Important Terms
• Consistent Hashing - a special kind of hashing in which, when the hash table is resized, only K / n keys need to be remapped on average
• K is the number of keys
• n is the number of slots
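
One common way to realize consistent hashing is a hash ring with virtual nodes. The sketch below is a minimal version under that assumption; the `HashRing` class, the node names, and the choice of 100 virtual points per node are hypothetical. It places nodes and keys on the same ring and assigns each key to the next node point clockwise, then adds a node and records which keys changed owner.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node, smooths the distribution
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        # First node point clockwise from the key, wrapping around the ring.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that now belong to `node-d` change owner; every other key stays where it was. With four nodes this is expected to be roughly a quarter of the keys, in line with the K / n figure above.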

All your BASE are belonging to NoSQL
• A BASE system gives up strict, immediate consistency.
• Basically Available indicates that the system guarantees availability.
• Soft state indicates that the state of the system may change over time, even without input.
• Eventual consistency indicates that the system will become consistent over time, provided it receives no new input during that time.
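
Eventual consistency can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not how any particular product replicates data: the `Replica` class is hypothetical, conflicts are resolved by "last write wins" on a timestamp, and synchronization is a single round in which every replica pulls from every other.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Accept a write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Pull every entry from another replica (last write wins)."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]

# Two clients write to different replicas; nothing has propagated yet.
replicas[0].write("cart", ["book"], timestamp=1)
replicas[2].write("cart", ["book", "pen"], timestamp=2)
reads_before = [r.read("cart") for r in replicas]

# With no further input, one synchronization round brings all replicas into agreement.
for target in replicas:
    for source in replicas:
        if source is not target:
            target.sync_from(source)
reads_after = [r.read("cart") for r in replicas]
```

Before the round, the three replicas give three different answers for the same key (soft state); afterwards they all return the newest value.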

CAP Theorem (Brewer’s Theorem)
• Three basic requirements exist in a special relation when designing for a distributed architecture.
• Consistency ‘C’ - the data in the database remains consistent after the execution of an operation
• Availability ‘A’ - the system is always on, with no downtime
• Partition Tolerance ‘P’ - the system continues to function even if communication among the servers is unreliable

CAP Theorem Cont.
• The CAP theorem states that a distributed system can satisfy at most 2 of the 3 requirements at the same time. Current NoSQL databases therefore follow different combinations of C, A, and P.
• CA - Single-site cluster, so all the nodes are always in contact
• CP - Some data may not be accessible, but the rest is still consistent/accurate
• AP - The system is still available under partitioning, but some of the data may be inaccurate

Challenges of NoSQL
• Maturity - In comparison, RDBMS systems have been around for a long time. Most NoSQL alternatives are in pre-production versions with many key features yet to be implemented.
• Support - Most NoSQL systems are open-source projects, and the companies that offer support are small start-ups without the global reach, support services, or credibility of Oracle, Microsoft, or IBM.

Challenges of NoSQL
• Analytics and Business Intelligence - NoSQL databases have evolved to meet the scaling demands of Web 2.0 applications, so they offer few facilities for ad hoc query and analysis.
• Administration - The design goal for NoSQL is to provide a zero-admin solution, but as of today it requires a lot of skill to install and a lot of effort to maintain.
• Expertise - Almost all NoSQL developers are still learning how to use and develop for NoSQL.

Advantages of NoSQL
• Elastic Scaling - NoSQL databases are designed to expand transparently to take advantage of new nodes, and they are usually designed with low-cost commodity hardware in mind.
• Big Data - The volumes of data that can be handled by NoSQL systems are greater than what can be handled by the biggest RDBMS.
• No DBA - NoSQL databases are designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.

Advantages of NoSQL
• Economic - NoSQL databases typically use clusters of cheap commodity servers to manage the ever-expanding amount of data and transactions.
• Flexible Data Models - NoSQL databases have more relaxed data model restrictions. Key-value stores and document databases allow the application to store virtually any structure it wants in a data element.

Taxonomy (Data Models)
• Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
• Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
• Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.
• Column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

Key-Value Stores
• Examples - Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB
• Typical applications - Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.
• Strengths - Fast lookups
• Weaknesses - Stored data has no schema

Oracle Embraces NoSQL

Oracle Embraces NoSQL
• Distributed key-value database
• Designed to provide highly reliable, scalable, and available data storage across a configurable set of systems that function as storage nodes
• Data is stored as key-value pairs, which are written to particular storage node(s) based on the hashed value of the primary key
• Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure, and optimal load balancing of queries
• Customer applications are written using an easy-to-use Java/C API to read and write data

Oracle Embraces NoSQL
• Utilizes storage nodes
• More storage nodes provide greater throughput
• Storage Node Agent (SNA) monitors each node's behavior
• Replication nodes work in groups to serve the same data
• Replication factor of 3
• Single-master architecture
• Master node replicates to replication nodes
• Election system elects a new master in case of failure
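
The single-master pattern with failover listed above can be sketched in a few lines. This is a deliberately simplified toy model and does not reflect Oracle NoSQL Database's actual election protocol or API; the `ReplicationGroup` class, the node names, and the election rule (the live node with the longest log wins, ties broken by name) are assumptions made for illustration.

```python
class ReplicationGroup:
    """Toy single-master replication group with a replication factor of 3."""

    def __init__(self, node_names):
        self.logs = {name: [] for name in node_names}  # each node's copy of the data
        self.alive = set(node_names)
        self.master = node_names[0]

    def write(self, record):
        # All writes go through the master, which replicates to live replicas.
        if self.master not in self.alive:
            raise RuntimeError("no master available")
        for name in self.alive:
            self.logs[name].append(record)

    def fail(self, name):
        self.alive.discard(name)
        if name == self.master:
            self.elect()

    def elect(self):
        # Assumed rule: the most up-to-date live node becomes the new master.
        if not self.alive:
            raise RuntimeError("no live nodes left to elect")
        self.master = max(sorted(self.alive), key=lambda n: len(self.logs[n]))

group = ReplicationGroup(["sn1", "sn2", "sn3"])
group.write("a")
group.fail("sn1")   # the master fails, so an election runs
new_master = group.master
group.write("b")    # writes continue through the new master
```

After the failure, the group keeps accepting writes, while the failed node's log stops at the last record it received.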

Column Stores
• Examples - Cassandra, HBase
• Typical applications - Distributed file systems
• Data model - Columns → column families
• Strengths - Fast lookups, good distributed storage of data
• Weaknesses - Very low-level API

Apache Cassandra Project
• Scalability and high availability without compromising performance
• Uses column indexes
• Denormalization
• Materialized views
• Built-in caching

Apache Cassandra Project
• Used in over 1500 companies with large, active data sets
• Largest cluster has 300 TB of data on over 400 machines
• Replication across multiple data centers allows failed nodes to be replaced with no downtime
• Every node is identical, so there is no single point of failure
• Users can choose between synchronous and asynchronous replication

Document Databases
• Examples - CouchDB, MongoDB
• Typical applications - Web applications (similar to key-value stores, but the DB knows what the value is)
• Data model - Collections of key-value collections
• Strengths - Tolerant of incomplete data
• Weaknesses - Query performance, no standard query syntax

Hu - MongoDB - us
• Stores data as BSON (Binary JSON) documents with dynamic schemas, making the integration of data in certain types of applications easy and fast.
• The most talked-about NoSQL DBMS technology because it features auto-sharding, replication, schemaless design, scalability, and more.

Hu - MongoDB - us
• Full indexing support - index on any attribute
• Replicable - mirror across WAN and LAN
• Auto-sharding
• Document-based querying
• Flexible aggregation
• GridFS allows for storage of data files larger than the BSON document size limit

Graph Databases
• Graph databases store data in graphs to represent it easily
• Graphs record data in nodes with properties
• Nodes can have unlimited properties, but are generally broken up into multiple nodes
• Useful for answering questions based on related information
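
To make "answering questions based on related information" concrete, here is a small in-memory sketch: nodes with properties, labeled relationships between them, and a breadth-first traversal that finds everyone within two hops of a person. The data, the `KNOWS` relationship, and the `neighbors`/`hops_from` helpers are hypothetical, and a real graph database would use its own storage and query language instead.

```python
from collections import deque

# Nodes carry properties; relationships connect nodes and carry a type.
nodes = {
    "alice": {"label": "Person", "city": "Oslo"},
    "bob": {"label": "Person", "city": "Bergen"},
    "carol": {"label": "Person", "city": "Oslo"},
    "dave": {"label": "Person", "city": "Tromso"},
    "erin": {"label": "Person", "city": "Bergen"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "KNOWS", "dave"),
    ("alice", "KNOWS", "erin"),
]

def neighbors(node, relationship):
    """Nodes connected to `node` by the given relationship, in either direction."""
    found = set()
    for source, rel, target in edges:
        if rel != relationship:
            continue
        if source == node:
            found.add(target)
        elif target == node:
            found.add(source)
    return found

def hops_from(start, relationship, max_hops):
    """Breadth-first traversal: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in neighbors(current, relationship):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    del distance[start]
    return distance

reachable = hops_from("alice", "KNOWS", 2)
friends_of_friends = sorted(n for n, d in reachable.items() if d == 2)
```

A "friends of friends" question like this would need repeated self-joins in a relational schema, whereas here it is a direct walk along relationships.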

Neo4J
• Highly scalable
• Fully ACID
• Intuitive graphical models
• Custom disk-based native storage engine
• Massively scalable, with potential for BILLIONS of nodes
• Highly available

Neo4J
• Expressive, powerful, human-readable graph query language (Cypher)
• Example: MATCH (a:Actor { name:"Keanu Reeves" }) RETURN a

Other NoSQL DBMS Products Cont.
• CouchDB - stores data as a collection of documents. Each document is a set of ‘keys’ and corresponding ‘values’. CouchDB supports indices, queries, and views. It uses JSON to store data, JavaScript as its query language (via MapReduce), and HTTP for its API.
• Redis - an in-memory key-value data store. Mostly used as a caching mechanism because it keeps data in RAM, which makes retrieval extremely fast. It is a data structure server, not a replacement for a traditional database. Used in combination with products like MySQL to deliver high performance when data must be delivered rapidly.

Other NoSQL DBMS Products Cont.
• Hadoop - an open-source framework, written in Java, that supports data-intensive distributed applications. It supports applications running on large clusters of computers and allows data to be analyzed across many machines. Designed to scale up from single servers to thousands of machines.
• There are currently 150 different NoSQL databases

Companies That Implement NoSQL
• Google - Bigtable
• Facebook - Cassandra
• Mozilla - HBase
• Adobe - HBase
• Foursquare - MongoDB
• LinkedIn - Voldemort
• Digg - Redis
• Twitter - Hadoop, Pig, Cassandra

Questions? Tough!

Sources:
• http://nosql-database.org/
• http://www.ignoredbydinosaurs.com/2013/05/explaining-non-relational-databases-my-mom
• http://en.wikipedia.org/wiki/NoSQL
• http://greendatacenterconference.com/blog/the-five-key-advantages-and-disadvantages-of-nosql
• http://www.tutorialindustry.com/nosql-tutorial-for-beginners
• http://www.techrepublic.com/blog/10-things/10-things-you-should-know-about-nosql-databases
• http://readwrite.com/2011/10/24/oracle-formally-embraces-nosql#awesm=~oCvdI8zKkJmAiZ
• http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html
• http://cassandra.apache.org/
• http://www.neo4j.org/learn/nosql
• http://www.w3resource.com/mongodb/nosql.php
• http://architects.dzone.com/articles/putting-nosql-perspective
• http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
• http://en.wikipedia.org/wiki/Consistent_hashing
• https://www.mongodb.org/
