Wrangling Data

Presentation Transcript


  1. Wrangling Data With Oracle Connectors for Hadoop Gwen Shapira, Solutions Architect gshapira@cloudera.com @gwenshap

  2. Data Has Changed in the Last 30 Years • Drivers: end-user applications, the Internet, mobile devices, sophisticated machines • Data growth, 1980–2013: unstructured data – 90%, structured data – 10%

  3. Data is Messy

  4. Data Science is OSEMN (pronounced AWESOME): Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. Mostly Scrubbing, actually.

  5. Data Wrangling (n): The process of converting “raw” data into a format that allows convenient consumption

  6. Hadoop rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: • Acquire • Clean • Transform • Especially unstructured data • Flexibility where you need it most

  7. Got unstructured data? • Data Warehouse: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, Protobuf, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • MIDI, MP3 • JPEG, TIFF • Java Classes • Mbox, RFC822 • AutoCAD • TrueType Parser • HDF / NetCDF

  8. But eventually, you need your data in your DWH. Oracle Connectors for Hadoop Rock Data Loading

  9. What Does Data Wrangling Look Like?

  10. Data Sources • Internal • OLTP • Log files • Documents • Sensors / network events • External: • Geo-location • Demographics • Public data sets • Websites

  11. Free External Data

  12. Data for Sale

  13. Getting Data into Hadoop • Sqoop • Flume • Copy • Write • Scraping • Data APIs

  14. Sqoop Import Examples • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 • The --split-by column must be indexed or partitioned to avoid 16 full table scans

  15. Or… • hadoop fs -put myfile.txt /bigdata/project/myfile.txt • curl -i list_of_urls.txt • curl https://api.twitter.com/1/users/show.json?screen_name=cloudera { "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com", "followers_count":11359 }

  16. And even… $ cat scraper.py

      import urllib
      from BeautifulSoup import BeautifulSoup

      # fetch the page, parse it, and print every <h2> heading
      txt = urllib.urlopen("http://www.example.com/")
      soup = BeautifulSoup(txt)
      headings = soup.findAll("h2")
      for heading in headings:
          print heading.string

  17. Data Quality Issues • Given enough data – quality issues are inevitable • Main issues: • Inconsistent – “99” instead of “1999” • Invalid – last_update: 2036 • Corrupt - #$%&@*%@

  18. Inconsistencies are endless • Upper vs. lower case • Date formats • Times, timezones, 24h… • Missing values • NULL vs. empty string vs. NA… • Variation in free format input • 1 PATCH EVERY 24 HOURS • Replace patches on skin daily
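
A minimal cleanup sketch in Python showing how these inconsistencies are typically normalized before analysis (the date formats, NA spellings, and example values are assumptions for illustration, not from the original deck):

      # normalize.py - fold case, unify date formats, and map missing-value spellings to None
      from datetime import datetime

      NA_VALUES = {"", "na", "n/a", "null", "none"}            # assumed missing-value spellings
      DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")      # assumed formats seen in the feed

      def normalize_text(value):
          """Trim and lower-case free-format input; map NA variants to None."""
          value = value.strip().lower()
          return None if value in NA_VALUES else value

      def normalize_date(value):
          """Try each known date format and return ISO-8601, or None if nothing matches."""
          for fmt in DATE_FORMATS:
              try:
                  return datetime.strptime(value.strip(), fmt).date().isoformat()
              except ValueError:
                  continue
          return None

      print(normalize_text("  1 PATCH EVERY 24 HOURS "))   # -> "1 patch every 24 hours"
      print(normalize_date("01/31/2013"))                   # -> "2013-01-31"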

  19. Hadoop Strategies • Hive does not validate data on LOAD • Validation script is ALWAYS the first step • But not always enough • We have known unknowns • And unknown unknowns

  20. Known Unknowns • Map-only script to: • Check number of columns per row • Validate not-null • Validate data type (“is number”) • Date constraints • Other business logic
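
A sketch of such a map-only validation pass as a Hadoop Streaming mapper in Python (the column count, not-null positions, and numeric positions are made-up assumptions; plug in your own schema and business rules):

      # validate_mapper.py - map-only check: pass clean rows to stdout, flag the rest
      import sys

      EXPECTED_COLUMNS = 5        # assumed schema width
      NOT_NULL = [0, 1]           # assumed columns that must not be empty
      NUMERIC = [2]               # assumed columns that must parse as numbers

      def is_valid(fields):
          if len(fields) != EXPECTED_COLUMNS:
              return False
          if any(fields[i].strip() == "" for i in NOT_NULL):
              return False
          for i in NUMERIC:
              try:
                  float(fields[i])
              except ValueError:
                  return False
          return True

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          if is_valid(fields):
              sys.stdout.write(line)                      # clean record: pass through
          else:
              sys.stderr.write("bad record: " + line)     # see the next slide for handling these

Run it with Hadoop Streaming and zero reducers over the raw directory, writing the clean output to the location your downstream jobs read from.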

  21. Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in Hadoop job to count bad records • Log errors • Write bad records to re-loadable file
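
A minimal sketch of those points in a Streaming mapper: Hadoop Streaming increments job counters when a task writes lines of the form reporter:counter:&lt;group&gt;,&lt;counter&gt;,&lt;amount&gt; to stderr, so bad rows can be counted, logged, and set aside without failing the job (the 5-column check and the side-file name are just illustrative):

      # skip_bad_records.py - count, log, and divert bad rows; keep the job moving
      import sys

      bad_out = open("bad_records.txt", "a")    # task-local side file, re-loadable later

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          if len(fields) != 5:                                                   # illustrative check
              sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")    # bump the job counter
              bad_out.write(line)                                                # keep it for reloading
              continue                                                           # and move on
          sys.stdout.write(line)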

  22. Solving Bad Data • Can be done at many levels: • Fix at source • Improve acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? • 0, 1, many, lots

  23. Anonymization • Remove PII data • Names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data
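
A minimal sketch of the hashing option in Python: replacing direct identifiers with salted hashes keeps rows joinable while hiding the raw values (the salt value and field positions are assumptions; a real pipeline would manage the salt as a secret):

      # anonymize.py - drop the name column, pseudonymize the ID column
      import hashlib
      import sys

      SALT = "replace-with-a-secret-salt"     # assumption: stored outside the data set

      def pseudonymize(value):
          return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          fields[0] = pseudonymize(fields[0])          # assumed: column 0 is a customer/patient ID
          fields[1] = ""                               # assumed: column 1 is a name - remove it
          sys.stdout.write("\t".join(fields) + "\n")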

  24. 87% of the US population can be identified from gender, zip code and date of birth

  25. Joins • Do at source if possible • Can be done with MapReduce • Or with Hive (Hadoop SQL) • Joins are expensive: • Do once and store results • De-aggregate aggressively • Everything a hospital knows about a patient
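
When the join does have to run in Hadoop rather than at the source, the basic MapReduce pattern is a reduce-side join; a Python Streaming sketch follows (the "patients"/"visits" table names, input paths, and the key-in-column-0 layout are invented for illustration):

      # join_mapper.py - tag each record with its source table and emit the join key first
      import os
      import sys

      source = os.environ.get("map_input_file", "")     # set by Streaming per input split
      tag = "P" if "patients" in source else "V"        # assumed: input paths contain the table name

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          sys.stdout.write(fields[0] + "\t" + tag + "\t" + "\t".join(fields[1:]) + "\n")

      # join_reducer.py - records arrive sorted by key; merge the two sources per key
      import sys
      from itertools import groupby

      def keyed(stream):
          for line in stream:
              key, tag, rest = line.rstrip("\n").split("\t", 2)
              yield key, tag, rest

      for key, group in groupby(keyed(sys.stdin), key=lambda rec: rec[0]):
          patients, visits = [], []
          for _, tag, rest in group:
              (patients if tag == "P" else visits).append(rest)
          for p in patients:                             # cross-product of matches = joined rows
              for v in visits:
                  sys.stdout.write(key + "\t" + p + "\t" + v + "\n")

As the slide says, this is expensive, so do it once and store the joined result rather than repeating it per analysis.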

  26. DataWrangler

  27. Process Tips • Keep track of data lineage • Keep track of all changes to data • Use source control for code

  28. Sqoop sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data

  29. FUSE-DFS • Mount HDFS on the Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load the data into Oracle

  30. That’s nice. • But can you load data FAST?

  31. Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R Connector for Hadoop • No, you don't need BDA

  32. Oracle Loader for Hadoop • Kinda like SQL Loader • Data is on HDFS • Runs as Map-Reduce job • Partitions, sorts, converts format to Oracle Blocks • Appended to database tables • Or written to Data Pump files for later load

  33. Oracle SQL Connector for HDFS • Data is in HDFS • Connector creates external table • That automatically matches Hadoop data • Control degree of parallelism • You know External Tables, right?

  34. Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats

  35. Main Benefit: • Processing is done in Hadoop

  36. Benefits • High performance • Reduce CPU usage on Database • Automatic optimizations: • Partitions • Sort • Load balance

  37. Measuring Data Load

  38. I Know What This Means:

  39. What does this mean?

  40. Measuring Data Load • Disks: ~300 MB/s each • SSD: ~1.6 GB/s each • Network: • ~100 MB/s (1GbE) • ~1 GB/s (10GbE) • ~4 GB/s (IB) • CPU: 1 CPU-second per second per core • Need to know: CPU-seconds per GB

  41. Let's walk through this… We have 5TB to load, and each core delivers 3,600 CPU-seconds per hour. Loading 5,000 GB will take: • With FUSE: 5,000 GB * 150 CPU-sec/GB = 750,000 CPU-sec ≈ 208 CPU-hours • With SQL Connector: 5,000 GB * 40 CPU-sec/GB = 200,000 CPU-sec ≈ 55 CPU-hours • Our X2-3 half rack has 84 cores, so around 30 minutes to load 5TB at 100% CPU • Assuming you use Exadata (InfiniBand + SSD = 8TB/h load rate) and use all CPUs for loading
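
The same back-of-the-envelope arithmetic as a small Python helper; the 150 and 40 CPU-seconds/GB figures and the 84-core count are the slide's numbers, not measurements:

      # load_estimate.py - CPU-hours and wall-clock time for a CPU-bound load
      def load_estimate(gigabytes, cpu_sec_per_gb, cores):
          cpu_hours = gigabytes * cpu_sec_per_gb / 3600.0     # total CPU work
          return cpu_hours, cpu_hours / cores                 # wall clock if every core is loading

      for name, cost in [("FUSE", 150), ("SQL Connector", 40)]:
          cpu_hours, wall_hours = load_estimate(5000, cost, 84)
          print("%s: %.0f CPU-hours, ~%.0f minutes on 84 cores" % (name, cpu_hours, wall_hours * 60))

      # FUSE: 208 CPU-hours, ~149 minutes on 84 cores
      # SQL Connector: 56 CPU-hours, ~40 minutes on 84 cores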

  42. Given fast enough network and disks, data loading will take all available CPU • This is a good thing
