
Being a Data Scientist with Oracle Big Data


Presentation Transcript


  1. Being a Data Scientist with Oracle Big Data • Tang Tao, Oracle University Principal Instructor

  2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

  3. Program Agenda • Big Data • Big Data Appliance • Hadoop • HDFS / NoSQL • Big Data Connectors

  4. Large Volume of Data • Data Storage • Years of data on: • Sales History • Financial Statements • Vitals of patients • Weather patterns

  5. Data versus Information • What we have / What we want • Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random, and useless until it is organized. • When data is processed, organized, structured, or presented in a given context so as to make it useful, it becomes information.

  6. Real-World Use Cases

  7. Relational Databases • Can handle terabytes of data • What if the data is not located only in relational databases, but also in files, documents, email, web traffic, audio, video, and social media networks? • Entity relationship modeling is useful to communicate information about the attributes of the data and relationships between the data. • What if there is no structure or relationship between all the data?

  8. What Is Big Data? • Big data is defined as voluminous unstructured data from many different sources, such as: • Social networks • Banking and financial services • E-commerce services • Web-centric services • Internet search indexes • Scientific searches • Document searches • Medical records • Weblogs

  9. How Did Big Data Evolve? • More people interacting with data • Smartphones • Internet • Greater volumes of data being generated (machine-to-machine generation) • Sensors • General Packet Radio Service (GPRS)

  10. Characteristics of Big Data • The four Vs: Volume, Velocity, Variety, Value (diagram also shows example sources: social networks, microblogs, RSS feeds)

  11. The Four Phases of Data Conversion (diagram: phases 1 to 4, covered on the following slides: Acquire, Organize, Analyze, Decide)

  12. The 4 Phases of Big Data • Acquire • The Oracle Big Data Appliance combines leading open source technologies with software developed by Oracle to meet big data requirements. • A full rack configuration contains 18 Sun servers from Oracle, for a total storage capacity of 648 terabytes. • Oracle NoSQL Database and the Hadoop Distributed File System (HDFS) provide the mechanisms to acquire and organize massive volumes of unstructured data in the context of an enterprise architecture.

  13. The 4 Phases of Big Data (cont.) • Organize • Once Oracle Big Data Appliance acquires the data, it can process the information using Hadoop's MapReduce capabilities. • The output can then be loaded into Oracle Database via Oracle Loader for Hadoop, or it can be accessed from the database via Oracle Direct Connector for HDFS. • Once the results are in Oracle Database, the information can be merged with, and enhanced by, other structured data that may have come from existing traditional business systems.

  14. The 4 Phases of Big Data (cont.) • Analyze • Oracle Exadata Database Machine provides outstanding performance for hosting data warehouses and OLTP databases. • Once the data is in a mass-consumption format, Oracle Exalytics provides high-speed data access for analysts; it is optimized to run Oracle Business Intelligence Enterprise Edition and Oracle Essbase.

  15. The 4 Phases of Big Data (cont.) • Decide • Oracle's Business Intelligence solutions help organization personnel discover insights from the wide variety of data. • Oracle R Enterprise, which integrates the open-source statistical environment R with Oracle Database 11g, provides deep statistical analysis for statisticians and data scientists. • Oracle's Endeca solutions provide tools for ad hoc data discovery and exploration by business users. • Oracle Real Time Decisions can integrate with big data to automatically take action as certain results are discovered, while Oracle Business Intelligence Foundation Suite facilitates enterprise-wide reporting and dashboard summaries.

  16. Oracle Big Data Appliance: Introduction • Oracle Big Data Appliance is an engineered system containing both hardware and software components. Oracle Big Data Appliance delivers: • A complete and optimized solution for big data • Single-vendor support for both hardware and software • An easy-to-deploy solution • Tight integration with Oracle Database

  17. Benefits of Using Oracle Big Data Appliance • Optimized and complete • Integrated with Oracle Exadata • Easy to deploy • Deep analytics • High agility • Massive scalability • Low latency • Single vendor support

  18. Oracle Big Data Appliance: Where It Stands? (diagram: data variety versus information density; Big Data Appliance handles the Acquire and Organize phases for unstructured, schema-less data, feeding the Analyze phase, where data has schema and higher information density)

  19. Oracle Big Data Appliance: Hardware Components • 18 Sun X4270 M2 nodes • 48 GB memory per node • 12 Intel cores per node • 24 TB storage per node

  20. Oracle Big Data Appliance: Software Components • Oracle NoSQL Database • Oracle Big Data Connectors • Open Source R Distribution • Cloudera Manager and Cloudera's Distribution including Apache Hadoop (CDH) • Oracle Linux 5.6 and Java HotSpot VM

  21. Road Map: Phases of Big Data

  22. Mapping the Phases with Software • Acquire Phase • Hadoop Distributed File System • Oracle NoSQL Database • Organize Phase • Hadoop Software Framework • Oracle Data Integrator • Analyze Phase • R Statistical Programming Environment • Oracle Data Warehouse

  23. What Is Hadoop? • Hadoop is a framework for executing applications on large clusters built of commodity hardware. Hadoop: • Was developed by Apache • Is open source • Is written in Java • Incorporates the MapReduce paradigm for splitting data • Uses the Hadoop Distributed File System (HDFS) to store and replicate data across nodes

  24. Hadoop: Components

  25. Hadoop Components • Hadoop client - The terminal or machine from which processing inside the Big Data Appliance is initiated. • NameNode - A single node that manages the metadata and access control. It is often made redundant with exactly one secondary NameNode. • Job Tracker - Hands out the tasks to the Task Trackers. • Data Nodes - Store and process the data. Data is replicated and stored across multiple Data Nodes. • Hadoop Distributed File System (HDFS) - Stores input and output data.

  26. What Is HDFS? • Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant, distributed file system that is designed to handle very large data files on commodity hardware. HDFS has the following features: • Write-once and read-many • Intelligent client • Hierarchical directories • Single namespace for the entire appliance • Data replicated without RAID • Computation possible where data resides • High aggregate bandwidth

  27. Hadoop Distributed File System • Hadoop uses the Hadoop Distributed File System (HDFS) to store large data files as smaller chunks that are replicated across the servers to ensure data availability.

  28. HDFS: Architecture • HDFS is typically a two-level hierarchical construction. • Level 1: NameNode (master) • Level 2: DataNodes (slaves) • (diagram: a client reads and writes data blocks through HDFS, with the NameNode coordinating DataNode1 through DataNodeN)
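
  To make the client, NameNode, and DataNode interaction concrete, here is a minimal Java sketch (added for illustration; it is not part of the original slides) that writes a file to HDFS and reads it back through the standard Hadoop FileSystem API. The NameNode address and the file path are hypothetical placeholders.

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.nio.charset.StandardCharsets;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReadWriteExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Hypothetical NameNode address; on a real cluster this comes from core-site.xml.
              conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

              FileSystem fs = FileSystem.get(conf);
              Path path = new Path("/user/demo/sample.txt");

              // Write once: the client asks the NameNode where to place blocks,
              // then streams the data directly to the DataNodes.
              try (FSDataOutputStream out = fs.create(path, true)) {
                  out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
              }

              // Read many: blocks are fetched from whichever DataNodes hold replicas.
              try (BufferedReader in = new BufferedReader(
                      new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                  System.out.println(in.readLine());
              }

              fs.close();
          }
      }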

  29. Functions of NameNode • The NameNode provides several functions: • Acts as the repository for all HDFS metadata • Executes the directives for opening, closing, and renaming files and directories • Stores HDFS state in an image file (fsimage) • Stores file system modifications in an edit log file (edits) • On startup, merges fsimage and edits files, then empties edits • Places replicas of blocks on multiple racks for fault tolerance

  30. Functions of Checkpoint Node • The Checkpoint Node is a secondary NameNode whose state can be imported by the primary NameNode, if necessary. The Checkpoint Node: • Runs on a different machine, because it needs the same amount of memory as the NameNode • Has a directory structure identical to the NameNode directory, enabling it to replace the primary NameNode at any time • Periodically merges the fsimage and edits log files to keep the edits file size within a limit

  31. Functions of DataNodes • DataNodes reside in each node of the Big Data Appliance and manage the storage attached to the node. DataNodes perform the following functions: • Serve read and write requests from the file system clients • Perform block creation, deletion, and replication based on instructions from the NameNode • Provide simultaneous send/receive operations to DataNodes during replication (replication pipelining)

  32. A Typical HDFS Cluster (diagram: 1. the client contacts the NameNode, which acts as query coordinator; 2. the client then reads and writes data on DataNode1 through DataNodeN, which also run the Task Trackers driven by the Job Tracker; 3. the Checkpoint Node, a passive secondary NameNode, downloads periodic checkpoints from the NameNode)

  33. What Is MapReduce? • MapReduce is the framework for distributed computation on large clusters. It is a set of code and infrastructure for parsing and building large data sets. • MapReduce operates exclusively on key-value pairs. • Reducers use the data provided by the Mappers. • The master schedules the tasks, monitors them, and re-executes failed tasks. • The slaves execute the tasks directed by the master.
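
  The key-value flow described above is easiest to see in the canonical word-count job. The sketch below uses the standard Hadoop Java MapReduce API and is added purely for illustration; the input and output HDFS paths are placeholders.

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

          // Mapper: emits a (word, 1) pair for every word in its input split.
          public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              public void map(Object key, Text value, Context context)
                      throws IOException, InterruptedException {
                  StringTokenizer itr = new StringTokenizer(value.toString());
                  while (itr.hasMoreTokens()) {
                      word.set(itr.nextToken());
                      context.write(word, ONE);
                  }
              }
          }

          // Reducer: sums the counts emitted by the mappers for each word.
          public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              private final IntWritable result = new IntWritable();

              @Override
              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable val : values) {
                      sum += val.get();
                  }
                  result.set(sum);
                  context.write(key, result);
              }
          }

          // Driver: the master schedules these map and reduce tasks across the slaves.
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenizerMapper.class);
              job.setCombinerClass(IntSumReducer.class);
              job.setReducerClass(IntSumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              // Placeholder HDFS paths.
              FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
              FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }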

  34. Oracle NoSQL Database • Oracle NoSQL Database is a distributed key-value database, built on the proven storage technology of Berkeley DB Java Edition. • It is designed to provide highly reliable, scalable, predictable, and available data storage. • The key-value pairs are stored in partitions, replicated across multiple storage nodes to ensure high availability. • Oracle NoSQL Database supports fast querying of the data, typically by key lookup and indexing.

  35. About Oracle NoSQL Database • Oracle NoSQL Database is: • A distributed, highly scalable, key-value database • Built on the Oracle Berkeley DB Java Edition • Accessible using Java APIs • One of the Oracle solutions for acquiring Big Data • Available in Community and Enterprise editions

  36. Supported Data Types • Text • Numeric • Video • Image • Voice • Spatial • XML • Document

  37. Oracle NoSQL Database: Architecture (diagram: web servers and application servers access storage nodes that are replicated across Data Center A and Data Center B)

  38. What Is a Key-Value Store? • A KV Store is essentially a two-column table consisting of a key and a value associated with the key. • The key acts as the index and is used to look up the value. • Records are stored as opaque data structures.

  39. Accessing the KVStore • You access the KVStore for two different needs. • For access to key-value data: • Use Java APIs. • For administrative actions: • Use the command-line interface. • Use the graphical web console.
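
  As a minimal sketch of the Java API access mentioned above, the following example stores one key-value pair and reads it back. It is illustrative only: the store name "kvstore" and the helper host "node01:5000" are assumed placeholders for an actual deployment.

      import oracle.kv.Key;
      import oracle.kv.KVStore;
      import oracle.kv.KVStoreConfig;
      import oracle.kv.KVStoreFactory;
      import oracle.kv.Value;
      import oracle.kv.ValueVersion;

      public class KVStoreExample {
          public static void main(String[] args) {
              // Placeholder store name and helper host:port.
              KVStoreConfig config = new KVStoreConfig("kvstore", "node01:5000");
              KVStore store = KVStoreFactory.getStore(config);

              // The key acts as the index; the value is stored as an opaque byte array.
              Key key = Key.createKey("customer", "42");
              Value value = Value.createValue("Jane Doe".getBytes());
              store.put(key, value);

              // Look the value up by its key.
              ValueVersion vv = store.get(key);
              if (vv != null) {
                  System.out.println(new String(vv.getValue().getValue()));
              }

              store.close();
          }
      }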

  40. Road Map: Phases of Big Data

  41. Mapping the Phases with Software • Acquire Phase • Hadoop Distributed File System • Oracle NoSQL Database • Organize Phase • Hadoop Software Framework • Oracle Data Integrator • Analyze Phase • R Statistical Programming Environment • Oracle Data Warehouse

  42. Oracle Big Data Connectors • Connectors are used to provide data communication between big data and Oracle Database. • Oracle Direct Connector for HDFS - Enables Oracle Database to access big data on a Hadoop cluster without loading the data • Oracle Loader for Hadoop - Loads big data into tables in Oracle Database • Oracle Data Integrator Application Adapter for Hadoop - Extracts, transforms, and loads big data into partitioned tables in Oracle Database through a simple graphical user interface • Oracle R Connector for Hadoop - Provides an interface between a local R environment, Oracle Database, and Hadoop

  43. What Is Oracle Direct Connector for HDFS? • Oracle Direct Connector for HDFS (ODCH) is a connector that enables Oracle Database to read data stored in HDFS through external tables. • It uses the ORACLE_LOADER access driver. • It enables you to: • Access big data without loading the data • Access the data stored in HDFS files • Access CSV (comma-separated values) files and Data Pump files generated by Oracle Loader for Hadoop • Load data extracted and transformed by Oracle Data Integrator

  44. Using External Tables in ODCH • An external table is an Oracle Database object that identifies the location of data residing outside the database. • When creating an external table for HDFS, you must specify the PREPROCESSOR clause as shown below: • PREPROCESSOR "HDFS_BIN_PATH:hdfs_stream" • You can confirm the table creation by executing the following SQL statement: SELECT count(*) FROM external_table; • The query should execute without errors; it returns a count of 0 until HDFS data is published to the external table's location files.
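
  Because the external table behaves like any other table once created, the HDFS content can also be queried from client code. The following Java sketch (not from the slides) uses plain JDBC; the connection URL, credentials, and table name are placeholders, and the Oracle JDBC driver must be on the classpath.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class QueryHdfsExternalTable {
          public static void main(String[] args) throws Exception {
              // Placeholder connection details for an Oracle Database instance.
              String url = "jdbc:oracle:thin:@//dbhost:1521/orcl";
              try (Connection conn = DriverManager.getConnection(url, "demo_user", "demo_password");
                   Statement stmt = conn.createStatement();
                   // The external table maps to files in HDFS through ODCH; no data is loaded into the database.
                   ResultSet rs = stmt.executeQuery("SELECT count(*) FROM external_table")) {
                  while (rs.next()) {
                      System.out.println("Rows visible through HDFS: " + rs.getLong(1));
                  }
              }
          }
      }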

  45. Oracle Direct Connector for HDFS: Benefits • It creates external tables for retrieving content stored in HDFS. • You can query the data in HDFS files directly from Oracle Database. • You can import the data into the database whenever needed.

  46. Oracle Loader for Hadoop • Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from Hadoop into a table in Oracle Database. • The loader partitions the data and transforms it into an Oracle-ready format. • It optionally sorts records by primary key before loading the data into the table. • The loader can perform online loads using JDBC or OCI Direct Path, and offline loads by writing Oracle Data Pump format files that can be loaded using External Tables.

  47. OLH: Online Database Mode (diagram: a MapReduce job with map, shuffle/sort, and reduce stages) • Oracle Loader for Hadoop reads the target table metadata from the database. • The map and reduce tasks perform partitioning, sorting, and data conversion. • The reducer nodes connect to the database and load into database partitions in parallel.

  48. OLH: Offline Database Mode (diagram: a MapReduce job with map, shuffle/sort, and reduce stages) • Oracle Loader for Hadoop reads the target table metadata from the database. • The map and reduce tasks perform partitioning, sorting, and data conversion. • The reducer nodes write Oracle Data Pump files. • The files are copied from HDFS to a location the database can access. • The data is imported into the database in parallel using the external table mechanism.

  49. ODI and Hadoop (diagram: Oracle Data Integrator connects to Hadoop through the ODI Application Adapter for Hadoop)
