Wrangling Data

Presentation Transcript


  1. Wrangling Data With Oracle Connectors for Hadoop Gwen Shapira, Solutions Architect gshapira@cloudera.com @gwenshap

  2. Data Has Changed in the Last 30 Years • Drivers: end-user applications, the Internet, mobile devices, sophisticated machines • Data growth, 1980–2013: unstructured data – 90%, structured data – 10%

  3. Data is Messy

  4. Data Science is OSEMN (pronounced AWESOME): Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. Mostly Scrubbing, actually.

  5. Data Wrangling (n): The process of converting “raw” data into a format that allows convenient consumption

  6. Hadoop rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: • Acquire • Clean • Transform • Especially unstructured data • Flexibility where you need it most

  7. Got unstructured data? • Data Warehouse: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, Protobuf, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • MIDI, MP3 • JPEG, TIFF • Java Classes • Mbox, RFC822 • AutoCAD • TrueType Parser • HDF / NetCDF

  8. But eventually, you need your data in your DWH. Oracle Connectors for Hadoop Rock Data Loading

  9. What Does Data Wrangling Look Like?

  10. Data Sources • Internal • OLTP • Log files • Documents • Sensors / network events • External: • Geo-location • Demographics • Public data sets • Websites

  11. Free External Data

  12. Data for Sale

  13. Getting Data into Hadoop • Sqoop • Flume • Copy • Write • Scraping • Data APIs

  14. Sqoop Import Examples • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 • The --split-by column must be indexed or partitioned to avoid 16 full table scans

  15. Or… • hadoop fs -put myfile.txt /bigdata/project/myfile.txt • curl -i list_of_urls.txt • curl https://api.twitter.com/1/users/show.json?screen_name=cloudera { "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com", "followers_count":11359 }

  16. And even… $ cat scraper.py

      import urllib
      from BeautifulSoup import BeautifulSoup

      # fetch the page, parse it, and print every <h2> heading
      txt = urllib.urlopen("http://www.example.com/")
      soup = BeautifulSoup(txt)
      headings = soup.findAll("h2")
      for heading in headings:
          print heading.string

  17. Data Quality Issues • Given enough data – quality issues are inevitable • Main issues: • Inconsistent – “99” instead of “1999” • Invalid – last_update: 2036 • Corrupt - #$%&@*%@

  18. Inconsistencies are endless • Upper vs. lower case • Date formats • Times, timezones, 24h… • Missing values • NULL vs. empty string vs. NA… • Variation in free format input • 1 PATCH EVERY 24 HOURS • Replace patches on skin daily
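
A minimal cleanup sketch in Python showing how these inconsistencies are typically normalized before analysis (the date formats, NA spellings, and example values are assumptions for illustration, not from the original deck):

      # normalize.py - fold case, unify date formats, and map missing-value spellings to None
      from datetime import datetime

      NA_VALUES = {"", "na", "n/a", "null", "none"}            # assumed missing-value spellings
      DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")      # assumed formats seen in the feed

      def normalize_text(value):
          """Trim and lower-case free-format input; map NA variants to None."""
          value = value.strip().lower()
          return None if value in NA_VALUES else value

      def normalize_date(value):
          """Try each known date format and return ISO-8601, or None if nothing matches."""
          for fmt in DATE_FORMATS:
              try:
                  return datetime.strptime(value.strip(), fmt).date().isoformat()
              except ValueError:
                  continue
          return None

      print(normalize_text("  1 PATCH EVERY 24 HOURS "))   # -> "1 patch every 24 hours"
      print(normalize_date("01/31/2013"))                   # -> "2013-01-31"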

  19. Hadoop Strategies • Hive does not validate data on LOAD • Validation script is ALWAYS the first step • But not always enough • We have known unknowns • And unknown unknowns

  20. Known Unknowns • Map-only script to: • Check number of columns per row • Validate not-null • Validate data type (“is number”) • Date constraints • Other business logic
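
A sketch of such a map-only validation pass as a Hadoop Streaming mapper in Python (the column count, not-null positions, and numeric positions are made-up assumptions; plug in your own schema and business rules):

      # validate_mapper.py - map-only check: pass clean rows to stdout, flag the rest
      import sys

      EXPECTED_COLUMNS = 5        # assumed schema width
      NOT_NULL = [0, 1]           # assumed columns that must not be empty
      NUMERIC = [2]               # assumed columns that must parse as numbers

      def is_valid(fields):
          if len(fields) != EXPECTED_COLUMNS:
              return False
          if any(fields[i].strip() == "" for i in NOT_NULL):
              return False
          for i in NUMERIC:
              try:
                  float(fields[i])
              except ValueError:
                  return False
          return True

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          if is_valid(fields):
              sys.stdout.write(line)                      # clean record: pass through
          else:
              sys.stderr.write("bad record: " + line)     # see the next slide for handling these

Run it with Hadoop Streaming and zero reducers over the raw directory, writing the clean output to the location your downstream jobs read from.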

  21. Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in Hadoop job to count bad records • Log errors • Write bad records to re-loadable file
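
A minimal sketch of those points in a Streaming mapper: Hadoop Streaming increments job counters when a task writes lines of the form reporter:counter:&lt;group&gt;,&lt;counter&gt;,&lt;amount&gt; to stderr, so bad rows can be counted, logged, and set aside without failing the job (the 5-column check and the side-file name are just illustrative):

      # skip_bad_records.py - count, log, and divert bad rows; keep the job moving
      import sys

      bad_out = open("bad_records.txt", "a")    # task-local side file, re-loadable later

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          if len(fields) != 5:                                                   # illustrative check
              sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")    # bump the job counter
              bad_out.write(line)                                                # keep it for reloading
              continue                                                           # and move on
          sys.stdout.write(line)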

  22. Solving Bad Data • Can be done at many levels: • Fix at source • Improve acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? • 0, 1, many, lots

  23. Anonymization • Remove PII data • Names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data
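
A minimal sketch of the hashing option in Python: replacing direct identifiers with salted hashes keeps rows joinable while hiding the raw values (the salt value and field positions are assumptions; a real pipeline would manage the salt as a secret):

      # anonymize.py - drop the name column, pseudonymize the ID column
      import hashlib
      import sys

      SALT = "replace-with-a-secret-salt"     # assumption: stored outside the data set

      def pseudonymize(value):
          return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          fields[0] = pseudonymize(fields[0])          # assumed: column 0 is a customer/patient ID
          fields[1] = ""                               # assumed: column 1 is a name - remove it
          sys.stdout.write("\t".join(fields) + "\n")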

  24. 87% of the US population can be identified from gender, zip code and date of birth

  25. Joins • Do at source if possible • Can be done with MapReduce • Or with Hive (Hadoop SQL) • Joins are expensive: • Do once and store results • De-aggregate aggressively • Everything a hospital knows about a patient
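
When the join does have to run in Hadoop rather than at the source, the basic MapReduce pattern is a reduce-side join; a Python Streaming sketch follows (the "patients"/"visits" table names, input paths, and the key-in-column-0 layout are invented for illustration):

      # join_mapper.py - tag each record with its source table and emit the join key first
      import os
      import sys

      source = os.environ.get("map_input_file", "")     # set by Streaming per input split
      tag = "P" if "patients" in source else "V"        # assumed: input paths contain the table name

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          sys.stdout.write(fields[0] + "\t" + tag + "\t" + "\t".join(fields[1:]) + "\n")

      # join_reducer.py - records arrive sorted by key; merge the two sources per key
      import sys
      from itertools import groupby

      def keyed(stream):
          for line in stream:
              key, tag, rest = line.rstrip("\n").split("\t", 2)
              yield key, tag, rest

      for key, group in groupby(keyed(sys.stdin), key=lambda rec: rec[0]):
          patients, visits = [], []
          for _, tag, rest in group:
              (patients if tag == "P" else visits).append(rest)
          for p in patients:                             # cross-product of matches = joined rows
              for v in visits:
                  sys.stdout.write(key + "\t" + p + "\t" + v + "\n")

As the slide says, this is expensive, so do it once and store the joined result rather than repeating it per analysis.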

  26. DataWrangler

  27. Process Tips • Keep track of data lineage • Keep track of all changes to data • Use source control for code

  28. Sqoop sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data

  29. FUSE-DFS • Mount HDFS on the Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load the data into Oracle

  30. That’s nice. • But can you load data FAST?

  31. Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R Connector for Hadoop • No, you don't need BDA

  32. Oracle Loader for Hadoop • Kinda like SQL Loader • Data is on HDFS • Runs as Map-Reduce job • Partitions, sorts, converts format to Oracle Blocks • Appended to database tables • Or written to Data Pump files for later load

  33. Oracle SQL Connector for HDFS • Data is in HDFS • Connector creates external table • That automatically matches Hadoop data • Control degree of parallelism • You know External Tables, right?

  34. Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats

  35. Main Benefit: • Processing is done in Hadoop

  36. Benefits • High performance • Reduce CPU usage on Database • Automatic optimizations: • Partitions • Sort • Load balance

  37. Measuring Data Load

  38. I Know What This Means:

  39. What does this mean?

  40. Measuring Data Load • Disks: ~300 MB/s each • SSD: ~1.6 GB/s each • Network: • ~100 MB/s (1GbE) • ~1 GB/s (10GbE) • ~4 GB/s (IB) • CPU: 1 CPU-second per second per core • Need to know: CPU-seconds per GB

  41. Let's walk through this… We have 5TB to load, and each core delivers 3,600 CPU-seconds per hour. Loading 5,000 GB will take: • With FUSE: 5,000 GB * 150 CPU-sec/GB = 750,000 CPU-sec ≈ 208 CPU-hours • With SQL Connector: 5,000 GB * 40 CPU-sec/GB = 200,000 CPU-sec ≈ 55 CPU-hours • Our X2-3 half rack has 84 cores, so around 30 minutes to load 5TB at 100% CPU • Assuming you use Exadata (InfiniBand + SSD = 8TB/h load rate) and use all CPUs for loading
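
The same back-of-the-envelope arithmetic as a small Python helper; the 150 and 40 CPU-seconds/GB figures and the 84-core count are the slide's numbers, not measurements:

      # load_estimate.py - CPU-hours and wall-clock time for a CPU-bound load
      def load_estimate(gigabytes, cpu_sec_per_gb, cores):
          cpu_hours = gigabytes * cpu_sec_per_gb / 3600.0     # total CPU work
          return cpu_hours, cpu_hours / cores                 # wall clock if every core is loading

      for name, cost in [("FUSE", 150), ("SQL Connector", 40)]:
          cpu_hours, wall_hours = load_estimate(5000, cost, 84)
          print("%s: %.0f CPU-hours, ~%.0f minutes on 84 cores" % (name, cpu_hours, wall_hours * 60))

      # FUSE: 208 CPU-hours, ~149 minutes on 84 cores
      # SQL Connector: 56 CPU-hours, ~40 minutes on 84 cores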

  42. Given fast enough network and disks, data loading will take all available CPU • This is a good thing
