
Lifecycle Seminar Series


justus

Presentation Transcript


  1. Lifecycle Seminar Series Welcome to the Community! Live Tweet to #DSSS2

  2. The Lifecycle Series • #1: July 10 The Scientist, The Team and The Purpose • #2: July 31 Organizing and Feeling Out Your Data Dates and Topics not Finalized, but roughly: • #3: Data / Analytics Preparation • #4: Modeling, Classification, and Decision-Making • #5: The Data Science Team • #6: Telling The Story: Visualizing Results

  3. We Want Contributors! • Looking for people willing to lead one of the Topics in given seminars • Looking for people who have an interesting anecdote or challenge to offer • Want to try integrating with main speaker or kick off networking session • Particularly interested in experiences/anecdotes for Session II (July 31) : Organizing and Feeling Out Your Data

  4. Data Lifecycle = Where we are But!...

  5. Data Science Lifecycle • Tonight, Focus is on Feeling Out Data • Primarily early-stage skill, but a part of all stages • Something everyone can do, increasingly so with modern tools Organizing and Feeling Out your Data

  6. Tonight’s Agenda • The Data Scientist Seminar Series • Followup from Seminar 1 • Participation opportunities • Jason Sroka: “Organizing and Feeling Out your Data” • Wrap-up & Announcements • Networking Session – Buy Jason Tequila!

  7. MarketMeSuite – Our Venue Sponsor MarketMeSuite’s Inbox For Social is how small businesses convert leads and market on social media

  8. Approach & Goals • Walk through steps of organizing and feeling out data • Focus on Data Scientist Survey • Use Survey data and anecdotes to touch on Data Science topics • Not going deep, but trying to give a real feel • Tool Discussion • Tableau and Google Refine

  9. Data Setup • We are all getting our data from somewhere • Personal data • Private data • Public data • Need tool(s) to look at it with • Will see Tableau here, many others available • Focus is on feeling out the data, not managing it • Will only mention some data management challenges • Not dealing with Big Data tonight (when we go international…) • These are topics that will be more central to future Meetup Seminars

  10. What I did • Quick scan of source • Excel File • Nulls in Beige • True flags in Green • 84 Data Rows • Import the data • Tableau reads straight from Excel [Diagram labels: Source(s), Import, Analytics Tool]

  11. What a Quick Scan Shows • Organization of Raw Data • Nulls in Beige • True flags in Green

  12. Start with the Basics • The first question • How much data is there? • 85 records imported • Move to things you know/understand • Simple categories (gender, age, ..) • Check assumptions (e.g. more males than females)

  13. Gender • Simple category • Binary • Meaningful to everyone • Data not quite so simple • What is a Null, compared to a Blank

  14. Message #1: Data is Messy! • Data Scientists have gender issues! • We have a Null and 3 blanks • Back to the source… • Null is a bad record (header?) • Blanks were user option • Clean it up • Don’t re-discover and re-implement • Someone needs to track these! • Null filtered in Tableau • Count now at 84 • Blank relabeled to “N/A” in Excel • Tools Discussion and Seminar 3 will go into Data Cleansing in more detail [Figure panels: Before Cleaning, After Cleaning]
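
The cleanup described on this slide can be sketched in a few lines of Python. This is only an illustration, not the speaker's workflow (the talk used Tableau and Excel). The sample rows and the helper name `clean_gender` are hypothetical: `None` stands in for a Null record and an empty string for a blank answer.

```python
# Hypothetical sample rows, invented for illustration.
# None models a Null (bad record); "" models a blank the respondent left empty.
rows = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": ""},
    {"id": 3, "gender": None},
    {"id": 4, "gender": "Female"},
]

def clean_gender(records):
    """Drop Null records and relabel blank answers as "N/A"."""
    cleaned = []
    for rec in records:
        value = rec["gender"]
        if value is None:
            continue  # Null: a bad record (e.g. a stray header row), filter it out
        label = value.strip() or "N/A"  # blank: a valid non-answer, relabel it
        cleaned.append({**rec, "gender": label})
    return cleaned

cleaned = clean_gender(rows)
print(len(rows), "->", len(cleaned))
print([r["gender"] for r in cleaned])
```

Keeping the rule in one function is one way to follow the "don't re-discover and re-implement" advice: the decision about Nulls and blanks is recorded once and reused.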

  15. Handedness • Didn’t we just fix the NULL thing? • Yes – this is a new Null • Excel had a cut-and-paste error! • Formula wasn’t used in column – values were hard-coded • Fixed formula, copied throughout [Figure panels: Before Cleaning, After Cleaning]

  16. Data Scientist Ethic • Don’t ignore the warts! • Most warts are meaningless • Of those that aren’t, most are easy to figure out • Of those that aren’t, most are at least easy to fix once you figure it out • Of those that aren’t, most times you can get someone else to help you fix it • Of those that aren’t, you’ll usually improve your implementation skills when you resolve it • Sometimes this line of work sucks • The ones that aren’t help you understand the data • In this case, a problem with the data process • In other cases, interesting quirks and potential insights!

  17. Age • Survey question: Birth Year • Seeing old and new issues • Blanks • Number ranges • Survey did not constrain to YYYY

  18. Age • Survey question: Birth Year • Seeing old and new issues • Blanks • Number ranges • Survey did not constrain to YYYY • Fixed these three entries

  19. Age • Survey question: Birth Year • Seeing old and new issues • Nulls • Turn out to be blanks – valid option in Survey • Number ranges • Survey did not constrain to YYYY • Fixed these three entries
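
A free-text Birth Year field can be screened before analysis. The sketch below is an assumption about how such a check might look, not the fix applied in the talk. The function name `parse_birth_year`, the year bounds, and the sample answers are all hypothetical. Entries that are not a plausible four-digit year are flagged for manual review rather than guessed at.

```python
import re
from datetime import date

YEAR_PATTERN = re.compile(r"\d{4}")

def parse_birth_year(raw, min_year=1900, max_year=date.today().year):
    """Return (year, status) for one raw survey answer.

    status is "ok", "blank" (a valid non-answer), or "needs_review".
    The year bounds are illustrative assumptions, not survey rules.
    """
    text = (raw or "").strip()
    if not text:
        return None, "blank"
    if YEAR_PATTERN.fullmatch(text) and min_year <= int(text) <= max_year:
        return int(text), "ok"
    return None, "needs_review"

# Hypothetical raw answers, invented for illustration.
answers = ["1975", "", "75", "1982 ", "19800"]
results = [parse_birth_year(a) for a in answers]
for raw, (year, status) in zip(answers, results):
    print(repr(raw), year, status)
```

Blanks are kept as their own status because the survey allowed them, so they are not treated as errors.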

  20. Age, as Age • Birth Year isn’t our interest, Age is • Transform your data to suit your needs • Be as direct between the data and the context as you can [Figure panels: Age, Birth Year, Decade]
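
The Birth Year to Age transformation is simple arithmetic. The sketch below assumes a reference year and buckets ages by decade. The function names, the sample birth years, and the reference year 2015 are arbitrary choices for illustration, and the slide does not say whether its "Decade" view grouped by age or by birth decade.

```python
def age_from_birth_year(birth_year, reference_year):
    """Approximate age; can be off by one year because only the year is known."""
    return reference_year - birth_year

def decade_label(age):
    """Bucket an age into a decade label such as "30s"."""
    return f"{(age // 10) * 10}s"

# Hypothetical birth years; the reference year is an arbitrary assumption.
birth_years = [1975, 1982, 1990]
ages = [age_from_birth_year(y, reference_year=2015) for y in birth_years]
decades = [decade_label(a) for a in ages]
print(list(zip(birth_years, ages, decades)))
```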

  21. The Art of Data Science • Message #2: Connect the Data to the Context • Transform the data to suit your needs • Easy investigation/understanding • Analytics goals • Operational goals • This is where Telling the Story feeds back • Effective plots help the data tell their story to you • Try things out!

  22. Favorite Color • Here, I’ve assigned colors near the named color • Sorting by most prevalent to least • Blank isn’t adding anything • Removing

  23. Favorite Color • Now, let’s add Gender • Okay – I see differences! • Something to form an impression from • Something to come back to • Blue is now the Official Data Scientist color!

  24. Check Assumptions • Assumption 1: More Males than Females • Assumption 2: 10-15% Lefties • Underestimate! • Assumption 3: Different color preferences by Gender
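
Assumptions like these can be written down as explicit checks. The sketch below uses made-up records, not the survey responses, and the helper name `share` is hypothetical. It only shows the shape of such a check.

```python
from collections import Counter

# Hypothetical records, not the actual survey responses.
records = [
    {"gender": "Male", "hand": "Right"},
    {"gender": "Male", "hand": "Left"},
    {"gender": "Female", "hand": "Right"},
    {"gender": "Male", "hand": "Right"},
    {"gender": "Female", "hand": "Left"},
]

def share(records, field, value):
    """Fraction of records whose `field` equals `value`."""
    return sum(1 for r in records if r[field] == value) / len(records)

gender_counts = Counter(r["gender"] for r in records)
left_share = share(records, "hand", "Left")

# Assumption 1: more males than females.
more_males = gender_counts["Male"] > gender_counts["Female"]
# Assumption 2: 10-15% of respondents are left-handed.
lefties_in_expected_range = 0.10 <= left_share <= 0.15

print(gender_counts, left_share, more_males, lefties_in_expected_range)
```

A check that fails is as useful as one that passes: it either points to a data issue or updates the intuition behind the assumption.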

  25. Checking Assumptions… • Familiarizes You with the Data • Identifies data issues • Tests your assumptions • Gives you Confidence in the Data… • Confidence in the initial source • Confidence in Extraction, Transformation, Load • …and Your Assumptions • Confidence in your Intuition where it was right • Updates to your Intuition where it was off

  26. Building a Data Model • Data comes in different types • Categorical • Gender, Handedness, Favorite Color, any true/false • Scalar • Age, height, weight • Label/identifier • … • These data types often associate with the purpose to which it will be applied • Categories are dimensions along which we might divide the records • Measurements (Scalars) are facts about specific instances of what we’re modeling • A good data model allows for rapid analytics • Modular construction of sets of dimensions and measurements • Automated investigation of cross-relationships
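
One way to read the dimension/measurement split is as a group-by: categorical fields pick the groups, scalar fields get aggregated within them. The sketch below is a minimal illustration with hypothetical records and a hypothetical helper `summarize`. It is not a description of how Tableau implements this.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: categorical fields act as dimensions,
# the numeric field acts as a measurement.
records = [
    {"gender": "Male", "hand": "Right", "age": 40},
    {"gender": "Male", "hand": "Left", "age": 30},
    {"gender": "Female", "hand": "Right", "age": 35},
    {"gender": "Female", "hand": "Right", "age": 25},
]

def summarize(records, dimensions, measure):
    """Average `measure` for each combination of values of `dimensions`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        groups[key].append(rec[measure])
    return {key: mean(values) for key, values in groups.items()}

by_gender = summarize(records, ["gender"], "age")
by_gender_and_hand = summarize(records, ["gender", "hand"], "age")
print(by_gender)
print(by_gender_and_hand)
```

Because the dimensions are passed in as a list, the same function covers any combination of categories, which is the "modular construction" idea in miniature.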

  27. Survey Duration • Another processed ‘field’ • End Time – Start Time • Plotting it all: sparse info • A lot of short times • A few long times • Outliers are hiding the data! • After filtering out extremely high values, a different picture emerges… Same Data, Different Lenses
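
The derived field and the outlier filter can be sketched as follows. The timestamps, the timestamp format, and the 600-second cutoff are assumptions made for illustration. The talk does not state the cutoff it used.

```python
from datetime import datetime

def duration_seconds(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Survey duration in seconds, computed as end time minus start time."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

# Hypothetical timestamps; the last respondent came back a day later.
sessions = [
    ("2000-01-01 10:00:00", "2000-01-01 10:01:00"),
    ("2000-01-01 11:00:00", "2000-01-01 11:01:30"),
    ("2000-01-01 12:00:00", "2000-01-02 12:00:00"),
]
durations = [duration_seconds(start, end) for start, end in sessions]

# The cutoff is an illustrative assumption; choose it by looking at the data.
CUTOFF_SECONDS = 600
typical = [d for d in durations if d <= CUTOFF_SECONDS]
outliers = [d for d in durations if d > CUTOFF_SECONDS]
print(typical, outliers)
```

The outliers are kept in their own list instead of being discarded, so they can be examined later as a separate group.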

  28. Playing with Plots 1: Beware Bad Binners! • How you choose bins and plot a histogram can impact your interpretation • Same Data, Different Axes [Figure captions: “Very flat; one entry per bin”; “Still flat, but the voids in the X-axis have meaning”]
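
The effect of bin width can be reproduced without a plotting tool. The sketch below counts the same hypothetical durations with three bin widths. The helper name `histogram` and the data are invented for illustration.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical durations in seconds.
durations = [41, 52, 55, 58, 61, 63, 64, 95]

fine = histogram(durations, 1)       # one entry per bin: looks flat
medium = histogram(durations, 10)    # a cluster becomes visible
coarse = histogram(durations, 1000)  # everything collapses into one bin

print(fine)
print(medium)
print(coarse)
```

With 1-second bins every count is 1, with 10-second bins a cluster appears between 50 and 70 seconds, and with 1,000-second bins all structure disappears.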

  29. Survey Duration: 1 Second Bins

  30. Survey Duration: 3 Second Bins

  31. Survey Duration: 5 Second Bins

  32. Survey Duration: 10 Second Bins

  33. Survey Duration: 15 Second Bins

  34. Survey Duration: 20 Second Bins

  35. Survey Duration: 30 Second Bins

  36. Survey Duration: 45 Second Bins

  37. Survey Duration: 60 Second Bins

  38. Survey Duration: 1,000 Second Bins

  39. The Practice of Data Science • I just tricked you into looking at a bunch of data! • That is Data Science in action • It is a skill like many others • We all have some ability • We get better with practice • It’s pattern recognition [Figure: Survey Duration histograms by Bin Size (seconds): 1, 3, 5, 10, 15, 20, 30, 45, 60, 1,000]

  40. The Science of Data • Distributions have meaning • Flat: random, fixed • Normal distributions: repeated processes • Exponential: cumulative processes • Over time, we interpret data in terms of known distributions • Survey Duration: Gaussian + Exponential [Images: Wikipedia.org]

  41. Survey Duration • Another processed ‘field’ • End Time – Start Time • Plotting it all: sparse info • A lot of short times • A few long times • Outliers are hiding the data! • After filtering out extremely high values, a different picture emerges • Normal Distribution plus sparse tail • People who start, complete, end • People who start, stop, return, <repeat>, end Same Data, Different Lenses
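
The "normal distribution plus sparse tail" reading can be explored with a small simulation. Everything numeric below is an assumption chosen for illustration (group sizes, typical duration, spread, mean break length), not a fit to the survey data. The sketch draws one group from a Gaussian and a second, smaller group whose extra time away follows an exponential distribution.

```python
import random
from statistics import median

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed parameters, chosen only for illustration.
N_STRAIGHT_THROUGH = 900    # people who start, complete, end
N_INTERRUPTED = 100         # people who start, stop, return, end
TYPICAL_SECONDS = 60
SPREAD_SECONDS = 15
MEAN_BREAK_SECONDS = 3600

# Gaussian bulk, clamped so no simulated duration is below one second.
straight_through = [
    max(1.0, rng.gauss(TYPICAL_SECONDS, SPREAD_SECONDS))
    for _ in range(N_STRAIGHT_THROUGH)
]
# Sparse tail: a typical completion time plus an exponentially distributed break.
interrupted = [
    TYPICAL_SECONDS + rng.expovariate(1 / MEAN_BREAK_SECONDS)
    for _ in range(N_INTERRUPTED)
]
durations = straight_through + interrupted

print("median:", round(median(durations)))
print("max:", round(max(durations)))
```

The median stays near the typical completion time while the maximum is far larger, which is why plotting all values on one axis hides the bulk of the data.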

  42. Tools • I used Tableau here • A lot can be done directly in Excel • Google Refine looks impressive: http://www.youtube.com/watch?v=B70J_H_zAWM&feature=player_embedded • Highlights cleansing issues, supports resolution [Diagram labels: Source(s), Import, Analytics Tool]

  43. Data Science Lifecycle • Tonight, Focus is on Feeling Out Data • Primarily early-stage skill, but a part of all stages • Something everyone can do, increasingly so with modern tools Organizing and Feeling Out your Data

  44. Closing Thoughts • Message #1: Data is Messy • Don’t ignore the warts • Message #2: Connect the Data to the Context • Translate data so it is expressed in your terms • Message #3: Check Your Assumptions • Explore the data for insights • Message #4: Develop Your Intuition • Look at a lot of data in a lot of ways

  45. Who Rocks? • A HUGE thanks to Peggy Sue for executing the survey and organizing the results! • Super thanks to Tammy for live tweeting and sponsoring us at CIC!

  46. The Lifecycle Series Quick Note: • #6: Telling The Story: Visualizing Results • Speaker: • Hjalmar Gislason • CEO of DataMarket.com • Conference Speaker • Currently writing a book for O’Reilly called Effective Data Visualization

  47. Connect with us!
