NoSQL by Michael Britton, Mark McGregor, and Sam Howard. Simplicity, Speed, Scalability. What is NoSQL?.  Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable
NoSQL by Michael Britton, Mark McGregor, and Sam Howard Simplicity, Speed, Scalability
What is NoSQL? Next-generation databases that mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable. The term “NoSQL” is misleading; a more accurate reading is “Not Only SQL”.
Origins • 1998 - Carlo Strozzi • Still used Relational model • More accurately called “NoRel” • 2009 – Eric Evans and Johan Oskarsson • Organized event to discuss open-source distributed databases • Originally a term to label Non-ACID databases • meant to be a Twitter hashtag but went viral and stuck
What You Are Giving Up With NoSQL • Relationships between entities are basically non-existent • Limited ACID transactions • No standard language for queries (SQL) • Less structured
RDBMS vs. NoSQL RDBMS • Structured and organized data • Structured Query Language (SQL) • Data and its relationships stored in separate tables • Data Manipulation Language, Data Definition Language • Tight consistency NoSQL • No declarative query language • No predefined schema • Key-value pair storage, column store, document store, graph databases • Eventual consistency rather than ACID properties • Unstructured and unpredictable data • CAP Theorem • Prioritizes high performance, high availability, and scalability • BASE transactions
SQL vs. NoSQL Queries [Figure: a NoSQL query shown beside a SQL query]
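The query examples from this slide did not survive as text. As a rough illustration only (not the authors' original example), the sketch below runs the same lookup two ways: declarative SQL against a fixed schema through Python's built-in `sqlite3` module, and a filter over schemaless documents held as plain dicts. The `users` data and the `find` helper are assumptions made for this sketch; `find` is not a call from any real NoSQL driver.

```python
import sqlite3

# Relational side: a predefined schema queried with declarative SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Ann", 31), ("Bob", 24), ("Cy", 45)],
)
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM users WHERE age > ? ORDER BY name", (30,)
    )
]

# Document side: no schema, so records may carry different fields.
users = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 24, "nickname": "bobby"},  # extra field is allowed
    {"name": "Cy", "age": 45},
    {"name": "Dee"},  # missing "age" is allowed too
]


def find(collection, field, minimum):
    """Hypothetical helper: documents where `field` exists and exceeds `minimum`."""
    return [doc for doc in collection if field in doc and doc[field] > minimum]


doc_names = sorted(doc["name"] for doc in find(users, "age", 30))
```

Both `sql_names` and `doc_names` hold the same two names, but only the SQL side had to declare its columns up front.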
NoSQL vs. MySQL • MySQL > 50 GB Data • Writes Average: ~300 ms • Reads Average: ~350 ms • Cassandra > 50 GB Data • Writes Average: 0.12 ms • Reads Average: 15 ms
NoSQL Pros/Cons Pros • High Scalability • Distributed Computing • Lower Cost • Schema Flexibility, Semi-Structured Data • No Complicated Relationships Cons • No Standardization • Limited query capabilities • The eventually consistent model is not intuitive to program for
Non-Relational: The concept of joining tables together by relations is non-existent. Distributed: A network of interconnected computers, controlled by a central Database Management System Open-Source: Anyone can make changes to the original source code. Horizontally Scalable: Using multiple computers as one unit to increase productivity
Non-Relational • Relational databases join tables together using Primary Key / Foreign Key relationships • Non-Relational databases have no such structure • Items are aggregated into one file, much like a giant Excel spreadsheet • Prone to data duplication • Difficult to update records
Distributed • Non-relational databases can easily be spread across multiple machines on the same network • Each machine in the distributed network can carry the information most relevant to its area • Controlled by the DDBMS – Distributed Database Management System
Open-Source • Source code is generally available to the open public • Improve the software as needed • Share with the community
Horizontally Scalable [Figure: horizontal vs. vertical scaling]
Other Important Terms • Denormalization - optimizing read performance by adding redundant data or grouping data in order to improve scalability and performance • does NOT mean that the data has not been normalized • Denormalization should ideally take place after 3NF has been achieved • Constraints are used to ensure that redundant copies of data are synchronized • Materialized View - a database object that contains the results of a query. • query result is cached but can be updated from the original query as necessary
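A materialized view can be approximated in a few lines. The sketch below is an illustration under stated assumptions, not a feature of any particular product: SQLite has no native materialized views, so the cached result is kept in an ordinary table that a hypothetical `refresh_totals` function rebuilds from the original query. The table and function names are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("ann", 5), ("bob", 7)],
)


def refresh_totals(conn):
    """Re-run the original query and cache its result in a plain table."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    )


def read_totals(conn):
    """Read the cached result without touching the orders table."""
    return dict(conn.execute("SELECT customer, total FROM customer_totals"))


refresh_totals(conn)
fresh = read_totals(conn)

# A new order makes the cached copy stale until the next refresh.
conn.execute("INSERT INTO orders VALUES ('bob', 3)")
stale = read_totals(conn)
refresh_totals(conn)
updated = read_totals(conn)
```

Reads against `customer_totals` are cheap because the aggregation is already done; the cost is that the copy is redundant data that must be refreshed.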
Other Important Terms • Keyspace - object that holds together all column families of a design • outermost grouping of data in the datastore • resembles a schema in an RDBMS • Column Family - tuple (pair) consisting of a key-value pair, where the key is mapped to a value that is a set of columns • object that contains columns of related data • resembles a table in an RDBMS
Other Important Terms • Super Column Family - tuple (pair) consisting of a key-value pair, where the key is mapped to a value that is a set of column families • similar to a view in an RDBMS • Column (data store) - tuple (triplet) consisting of a unique name, a value, and a timestamp • the timestamp distinguishes old data from new data • not to be confused with a standard relational database column • lowest-level object in a keyspace
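The nesting of keyspace, column family, row key, and column can be sketched with plain Python structures. This is a simplified model under assumptions, not how any specific column store lays out data: the `Column` class, the `write` helper, and the rule that the newest timestamp wins are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Lowest-level object: a (name, value, timestamp) triplet."""
    name: str
    value: str
    timestamp: int


def write(column_family, row_key, column):
    """Store a column under a row key; the newest timestamp wins."""
    row = column_family.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


# Keyspace -> column family -> row key -> column name -> Column
keyspace = {"users": {}}
write(keyspace["users"], "u1", Column("email", "old@example.com", 1))
write(keyspace["users"], "u1", Column("email", "new@example.com", 2))
write(keyspace["users"], "u1", Column("email", "late@example.com", 1))  # ignored: older

email = keyspace["users"]["u1"]["email"].value
```

The third write arrives last but carries an older timestamp, so the value written at timestamp 2 is kept.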
Other Important Terms • Database Shard - a horizontal partition of data in a database or search engine. Each partition is a separate shard. • shards can be distributed to separate hardware, reducing the number of rows in each table • goes beyond plain horizontal partitioning, which refers to splitting one or more tables by rows within a single schema or database server • Sharding - the process of forming shards within the distributed database system. • traditionally done by hand coding • auto-sharding code is highly sought after
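One common way to decide which shard holds a row is to hash its key. The sketch below assumes four in-memory shards and simple hash-modulo routing; the `shard_for`, `put`, and `get` names are hypothetical, and real systems often use range-based or directory-based schemes instead.

```python
import hashlib

SHARD_COUNT = 4
shards = [dict() for _ in range(SHARD_COUNT)]  # each dict stands in for one server


def shard_for(key, shard_count=SHARD_COUNT):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    return shards[shard_for(key)].get(key)


for i in range(100):
    put(f"user:{i}", {"id": i})

rows_per_shard = [len(shard) for shard in shards]
```

A drawback of plain modulo routing is that changing `SHARD_COUNT` reassigns most keys to a different shard.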
Other Important Terms • Consistent Hashing - a special kind of hashing in which, when the hash table is resized, only K / n keys need to be remapped on average • K is the number of keys • n is the number of slots
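A minimal sketch of a consistent hash ring follows, assuming SHA-256 for placement and 100 virtual points per node; the `HashRing` class and its method names are illustrative rather than taken from any library. It shows the property described above: when a node is added, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # virtual points per node (an assumed default)
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place the node's virtual points on the ring."""
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next point."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[index]]


keys = [f"key{i}" for i in range(1000)]
ring = HashRing(["a", "b", "c"])
before = {key: ring.node_for(key) for key in keys}
ring.add("d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now belongs to `"d"`; keys owned by the other nodes stay where they were, which is what keeps remapping close to K / n.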
All your BASE are belonging to NoSQL • A BASE system gives up strong (immediate) consistency. • Basically Available indicates the system does guarantee availability. • Soft state indicates that the state of the system may change over time, even without input. • Eventual consistency indicates that the system will become consistent over time, given the system doesn’t receive input during that time.
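Eventual consistency can be illustrated with two replicas that accept writes independently and reconcile later. The sketch below assumes last-write-wins by timestamp and a naive all-pairs exchange; `Replica` and `anti_entropy` are hypothetical names, and production systems use more careful conflict resolution.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Keep the write only if it is newer than what is stored."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Push every replica's entries to every other replica (last write wins)."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (timestamp, value) in source.data.items():
                    target.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("cart", "1 item", timestamp=1)
b.write("cart", "2 items", timestamp=2)

stale_read = a.read("cart")   # replica a has not seen the newer write yet
anti_entropy([a, b])          # no new input arrives while the replicas sync
converged = (a.read("cart"), b.read("cart"))
```

Before the exchange, a client reading from `a` sees the older value; once no new writes arrive and the replicas sync, both return the same answer.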
CAP Theorem (Brewer’s Theorem) • There are three basic requirements which exist in a special relation when designing for a distributed architecture. • Consistency ‘C’ - the data in the database remains consistent after the execution of the operation • Availability ‘A’ - the system is always on, no downtime. • Partition Tolerance ‘P’ - the system continues to function even if the communication among the servers is unreliable.
CAP Theorem Cont. • The CAP theorem states that a distributed system can satisfy at most 2 of the 3 requirements at the same time. Current NoSQL databases follow different combinations of C, A, and P. • CA - Single-site cluster, therefore all the nodes are always in contact. • CP - Some data may not be accessible, but the rest is still consistent/accurate. • AP - System is still available under partitioning, but some of the data may be inaccurate.
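The difference between CP and AP behavior shows up only while the network is partitioned. The toy model below is a deliberately simplified sketch with two nodes and one value; the `Node` class, `make_pair`, and `set_partition` are assumptions for illustration and do not describe how any real database handles partitions.

```python
class Node:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.peer = None
        self.partitioned = False  # True while the link to the peer is down

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse rather than let the copies diverge.
                raise RuntimeError("unavailable: cannot reach peer")
            self.value = value    # AP: accept locally; the peer is now stale
            return
        self.value = value
        self.peer.value = value   # healthy link: replicate immediately

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.value


def make_pair(mode):
    a, b = Node(mode), Node(mode)
    a.peer, b.peer = b, a
    return a, b


def set_partition(a, b, down):
    a.partitioned = b.partitioned = down


# AP pair: stays available during the partition but serves stale data.
ap_a, ap_b = make_pair("AP")
ap_a.write("v1")
set_partition(ap_a, ap_b, True)
ap_a.write("v2")
ap_reads = (ap_a.read(), ap_b.read())

# CP pair: stays consistent by rejecting requests during the partition.
cp_a, cp_b = make_pair("CP")
cp_a.write("v1")
set_partition(cp_a, cp_b, True)
try:
    cp_a.write("v2")
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```

During the partition the AP pair answers every request but its two nodes disagree, while the CP pair never disagrees because it stops answering.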
Challenges of NoSQL • Maturity - In comparison RDBMS systems have been around for a long time. Most NoSQL alternatives are in pre-production versions with many key features yet to be implemented. • Support - Most NoSQL systems are Open Source projects, and the companies that offer support are small start-ups without global reach, support services, or the credibility of Oracle, Microsoft, or IBM.
Challenges of NoSQL • Analytics and Business Intelligence - NoSQL databases have evolved to meet the scaling demands of Web 2.0 applications. • Administration - The design goal for NoSQL is to provide a zero-admin solution, but as of today it requires a lot of skill to install and a lot of effort to maintain. • Expertise - Almost all NoSQL developers are still learning how to use and develop for NoSQL.
Advantages of NoSQL • Elastic Scaling - NoSQL databases are designed to expand transparently to take advantage of new nodes, and they are usually designed with low-cost commodity hardware in mind. • Big Data - The volumes of data that can be handled by NoSQL systems are greater than what can be handled by the biggest RDBMS. • No DBA - NoSQL databases are designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.
Advantages of NoSQL • Economic - NoSQL databases typically use clusters of cheap commodity servers to manage the ever-expanding amount of data and transactions. • Flexible Data Models - NoSQL databases have more relaxed data model restrictions. Key-value stores and document databases allow the application to store virtually any structure it wants in a data element.
Taxonomy (Data Models) • Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality. • Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. • Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB. • Column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
Key-Value stores • Examples-Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB • Typical Application- Content caching (Focus on scaling to huge amounts of data, designed to handle massive load), logging, etc. • Strengths- Fast Lookups • Weaknesses- Stored data has no schema
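A key-value store reduces to a lookup table with a very small API. The sketch below is an in-memory stand-in written for illustration, not the interface of Redis or any other listed product; the `KeyValueStore` class and its `incr` method are assumptions that show how knowing a value's type (here, integer) enables extra operations.

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store with one typed operation."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # No schema: any value can be stored under any key.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def incr(self, key, amount=1):
        """Increment an integer value, treating a missing key as 0."""
        value = self._data.get(key, 0)
        if not isinstance(value, int):
            raise TypeError(f"value stored at {key!r} is not an integer")
        self._data[key] = value + amount
        return self._data[key]


store = KeyValueStore()
store.set("page:/home", "<html>cached page</html>")  # content caching
store.incr("hits:/home")
store.incr("hits:/home")
hits = store.get("hits:/home")
```

Lookups are fast because they go straight to the key; the weakness noted above also shows, since nothing stops a caller from storing unrelated shapes of data under similar keys.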
Oracle Embraces NoSQL • Distributed key-value database • Designed to provide highly reliable, scalable, and available data storage across a configurable set of systems that function as storage nodes • Data is stored as key-value pairs, which are written to particular storage node(s), based on the hashed value of the primary key. • Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure and optimal load balancing of queries. • Customer applications are written using an easy-to-use Java/C API to read and write data.
Oracle Embraces NoSQL • Utilizes storage nodes • more storage nodes provide greater throughput • Storage Node Agent (SNA) monitors each node's behavior • Replication nodes work in groups to serve the same data • Replication factor of 3 • Single-master architecture • Master node replicates to replication nodes • Election system elects new master in case of failure
Column Stores • Examples-Cassandra, HBase • Typical applications-Distributed file systems • Data model-Columns → column families • Strengths-Fast lookups, good distributed storage of data • Weaknesses-Very low-level API
Apache Cassandra Project • Scalability and high availability without compromising performance • Uses column indexes • Denormalization • Materialized Views • Built-in caching
Apache Cassandra Project • Used in over 1500 companies with large, active data sets • Largest cluster has 300 TB of data on over 400 machines • Replication across multiple data centers allows failed nodes to be replaced with no downtime • Every node is identical, allowing no single point of failure • Users can choose between synchronous and asynchronous replication
Document Databases • Examples-CouchDB, MongoDB • Typical applications-Web applications (Similar to Key-Value stores, but the DB knows what the Value is) • Data model-Collections of Key-Value collections • Strengths-Tolerant of incomplete data • Weaknesses-Query performance, no standard query syntax
Hu - MongoDB - us • Stores data in the form of BSON (Binary JSON) documents with dynamic schemas, making the integration of data in certain types of applications easy and fast. • Most talked about NoSQL DBMS technology because it features auto-sharding, replication, schemaless design, scalability, and more.
Hu - MongoDB - us • Full indexing support - index on any attribute • Replicable - mirror across WAN and LAN • Auto Sharding • Document-based querying • Flexible aggregation • GridFS allows for storage of data files larger than BSON allows
Graph Databases • Graph databases store data in graphs to easily represent relationships • Graphs record data in nodes with properties • Nodes can have unlimited properties, but are generally broken up into multiple nodes • Useful for answering questions based on related information
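The node-and-relationship model can be sketched with two dictionaries and a breadth-first traversal. This is a minimal illustration, assuming undirected relationships and an in-memory `Graph` class invented for the example; it is not how Neo4J or any other graph database stores or queries data.

```python
from collections import deque


class Graph:
    def __init__(self):
        self.nodes = {}  # node id -> dict of properties
        self.edges = {}  # node id -> set of related node ids

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, set())

    def relate(self, a, b):
        """Record an undirected relationship between two existing nodes."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def within(self, start, max_hops):
        """Return ids reachable from `start` in at most `max_hops` steps."""
        hops = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if hops[current] == max_hops:
                continue
            for neighbour in self.edges[current]:
                if neighbour not in hops:
                    hops[neighbour] = hops[current] + 1
                    queue.append(neighbour)
        return {node_id for node_id in hops if node_id != start}


social = Graph()
for name, city in [("ann", "Oslo"), ("bob", "Rome"), ("cy", "Lima"), ("dee", "Kyiv")]:
    social.add_node(name, city=city)
social.relate("ann", "bob")
social.relate("bob", "cy")
social.relate("cy", "dee")

friends_of_friends = social.within("ann", 2)
```

A question such as "who is within two connections of Ann?" becomes a short traversal here, whereas a relational schema would typically need repeated self-joins.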
Neo4J • Highly Scalable • Fully ACID • Intuitive graphical models • Custom disk-based native storage engine • Massively scalable, with potential for BILLIONS of nodes • Highly available
Neo4J • Expressive, powerful, human readable graph query language • EX: MATCH (a:Actor { name:"Keanu Reeves" }) RETURN a
Other NoSQL DBMS Products Cont. • CouchDB - stores data as a collection of documents. Each document is a set of ‘keys’ and corresponding ‘values’. CouchDB supports indices, queries, and views. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for the API. • Redis - An in-memory key-value data store. Mostly used as a caching mechanism because it stores data in RAM, making it extremely fast when retrieving data. It is a data structure server and not a replacement for the traditional database. Used in combination with products like MySQL to deliver high performance when data needs to be delivered rapidly.
Other NoSQL DBMS Products Cont. • Hadoop - An open-source framework, written in Java, that supports data-intensive distributed applications. Supports applications running on large clusters of computers and allows analyzing data across many different computers. Designed to scale up from single servers to thousands of machines. • There are currently 150 different NoSQL databases
Companies That Implement NoSQL • Google - BIGTABLE • Facebook - CASSANDRA • Mozilla - HBASE • Adobe - HBASE • Foursquare - MongoDB • LinkedIn - VOLDEMORT • Digg - REDIS • Twitter - HADOOP, PIG, CASSANDRA
Questions? Tough!
Sources: • http://nosql-database.org/ • http://www.ignoredbydinosaurs.com/2013/05/explaining-non-relational-databases-my-mom • http://en.wikipedia.org/wiki/NoSQL • http://greendatacenterconference.com/blog/the-five-key-advantages-and-disadvantages-of-nosql • http://www.tutorialindustry.com/nosql-tutorial-for-beginners • http://www.techrepublic.com/blog/10-things/10-things-you-should-know-about-nosql-databases • http://readwrite.com/2011/10/24/oracle-formally-embraces-nosql#awesm=~oCvdI8zKkJmAiZ • http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html • http://cassandra.apache.org/ • http://www.neo4j.org/learn/nosql • http://www.w3resource.com/mongodb/nosql.php • http://architects.dzone.com/articles/putting-nosql-perspective • http://en.wikipedia.org/wiki/Shard_%28database_architecture%29 • http://en.wikipedia.org/wiki/Consistent_hashing • https://www.mongodb.org/