Big Data Distilled Separating the hype from reality
Big Data Distilled Separating the hype from reality. Mike King Technical Fellow Fedex Services November 8, 2012 Midsouth DAMA. What is Big Data?.
Big Data Distilled: Separating the hype from reality • Mike King, Technical Fellow, FedEx Services • November 8, 2012 • Midsouth DAMA
What is Big Data? • Applying analytics to construct a model to predict an outcome where two or more of the dimensions (the V’s and the C listed here) exist AND your existing solutions can’t solve it. • The dimensions - 4 V’s, 1 C • Volume • Velocity • Variety • Variability • Complexity
The Market • Growing fast • Lots of players • Small and nimble • Large • Changing fast • Hype • Contenders and pretenders • Commercials are deceiving
Why do we need it? • Competitive Intelligence • Joining dissimilar data • Linking data • Adding context to data • Discovery • Diapers • Pregnancy • To supplement our BI/DW • Table stakes
Use Cases? • Customer analysis • Sentiment • Defection • Cannibalization • Cross selling • Network analysis • M2M • Fraud detection • Risk management • Text analytics • Social media analytics • Log analysis
Apache Hadoop • Batch • Open Source • Components • HDFS • DB • HBase • Cassandra • Map/Reduce • Hive • Pig • Mahout • Chukwa • Avro • ZooKeeper
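To make the Map/Reduce component concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases as a word count in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are illustrative assumptions, not something from the deck.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped counts for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, purely for illustration.
    sample = ["big data big hype", "data beats hype"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data across the network, which is part of why the deck labels Hadoop as batch.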
Solutions • Which stack/distribution? • Varying components • Apples & oranges • Types • Partial • Overlapping • Complementary • Substitute • Fast pace of change • Flux of partnerships
Dealing with vendors, choices • Decide what your requirements are • Don’t let them tell you what you need • Beware bait and switch • Extras • Some are looking to sell • Professional Services • Other Software • All solutions are incomplete • Many solutions are lacking • Multiple…is one enough? • Switching is possible • Low cost? • Beware • Proprietary components • Solutions that have already been fixed….Apache nn • Hammer and nail
My Big Data Vendors • MapR • Kaggle • Karmasphere • Hadapt • Datameer • Lucid Works • 1010data? • Splunk • SAS • IBM • Oracle • Hortonworks • Cloudera • EMC • Teradata • Amazon • Microsoft • HP
Not My Big Data Vendors • Pentaho • Palantir • Kalido • Composite • Couchbase • MarkLogic • StoredIQ • Syncsort • Datastax? • IBI • Informatica • SAP • 10gen • Talend • Denodo • Tableau • Tibco • ParAccel
What’s missing? • Collaboration • Directory, dictionary, metadata • Context • Relevance, value • DQ • Search • Security • Performance • Monitoring • Management tools • Governance • Backup
Counterintuitive & Anti-dogma Notions • Size matters • But not unitarily • Smaller is better • Sampling • Quality matters • GIGO • All data must have structure to be consumed • There is no unstructured data!
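The "Smaller is better • Sampling" point can be illustrated with a short sketch. The deck does not name a sampling method; reservoir sampling (Algorithm R) is used here as one assumed option because it draws a fixed-size uniform sample from a stream without knowing its length in advance. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses Algorithm R: keep the first k items, then replace a random
    slot with decreasing probability as more items arrive.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j ranges over 0..i.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 records from a hypothetical stream of one million ids.
    print(reservoir_sample(range(1_000_000), 5, seed=1))
```

Only `k` items are held in memory at any time, so the sample can be analyzed with ordinary tools even when the source is far too large for them.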
Myths • You don’t need a DBA • Schemaless • B.D. is just for unstructured data • Your unstructured data has lots of value • It’s separate from your other BI stuff incl. • OLAP • DW • Datamarts • Analytics • Nosql
Prerequisites • Many varied skill sets are needed • DBA • Sysadmin • BI analytics • Math (statistics) • Programming • Reading • Training • Scope
Training options • Read books • Add some blogs to your feeds • Follow some of the right people on twitter • Search #bigdata #nosql #datascience …. • Online training • Big Data University (free) • EMC, Hortonworks, Cloudera, Karmasphere • Tutorials • Conferences • Get a degree • NC State, Stanford, Northwestern, Syracuse, UCSD
Suggestions • Start small • Conduct triage on your possible sources • It should be integrated w/ the DW • Silos are bad….think spread marts • Grow your own Data Scientists • Move disparate LOB analysts into a single org • Train and cross train • Limit the BD user population • Design is still required • Mind and mine your structured data first • Get more training
Don’t • Make your nosqldb the system of record • Put all your data in hadoop…to start • Ignore open source • Connect your garden variety query tools to hadoop • Open it up to everyone • Keep data indefinitely • Get heavy handed on security
Other items • Cloud • SIEM • Tools to complement your solution(s) • Which db(s) to use? • For what? • External tables • Nosqldbs • Persist map reduce results in your db • Storage • Servers • X86 linux • External data sources
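The "Persist map reduce results in your db" item can be sketched as follows. This uses Python's built-in `sqlite3` module purely as a stand-in for whatever database is in use; the table name `word_counts`, the column names, and the helper functions are hypothetical, and the counts are made-up sample values rather than output of a real job.

```python
import sqlite3


def persist_counts(conn, counts):
    """Write a {key: count} result set into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    # INSERT OR REPLACE lets a rerun of the batch job overwrite old totals.
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (word, n) VALUES (?, ?)",
        counts.items(),
    )
    conn.commit()


def top_words(conn, limit):
    """Query the persisted results with ordinary SQL."""
    cur = conn.execute(
        "SELECT word, n FROM word_counts ORDER BY n DESC, word ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical aggregated output of a batch job.
    persist_counts(conn, {"big": 2, "data": 3, "hype": 1})
    print(top_words(conn, 2))
```

Once the aggregated results sit in a regular table, existing query and BI tools can read them without being pointed directly at Hadoop.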
Trends • March 2012 article by Munish Gupta • SaaS for analytics • Crowdsourcing • Data analysis libraries • Nosql market shakeup • Additionally from the article • RDBMS’s will not make a comeback • Other • More diverse sources • More data • More jobs • More choices, solutions, products, services, etc… • Query tools - yek
Links of interest • http://wikibon.org/wiki/v/Enterprise_Big-data • My diigo bookmarks on Big Data • http://www.diigo.com/user/morpheus/bigdata 266 • Curt Monash’s Blog … http://www.dbms2.com • http://www.keithrozario.com/2012/07/opensource-gold-the-greatest-crowdsourcing-story-ever-told.html • http://www.analyticbridge.com/ • http://gigaom.com/data/ • This deck • http://92lobos.wikispaces.com/file/detail/Big+Data+Distilled.pptx • Future B.D. items • http://92lobos.wikispaces.com/bigdata
Contact • Mike.King@fedex.com • mikeking60@gmail.com • @redleg60 • Feel free to drop me a note with any questions