Dirty Data - Can You Afford It?

Faron Kincheloe, Baylor University


Presentation Transcript


  1. Dirty Data - Can You Afford It? Faron Kincheloe, Baylor University

  2. Types of Dirty Data • “Too many wrong mistakes” – Yogi Berra • Nonidentical duplicates (names & addresses) • Missing data (gender) • Non-standard entities (church names)

  3. Identity Crisis • Ever Get 2 of the Same Piece of Mail? • Laura • Lauren • 2 Girls • 5 Viewbooks • Daughters of Baylor administrators

  4. Lauren & Laura

  5. Duplication Dirt Devils • Character recognition • Misspelling • Marriage/Divorce • Middle name preference • Nicknames • First & last name reversal • Electronic downloads • Delimiters in data • Variable field lengths

  6. How Bad Was the Problem? • 1.15% – 725 out of 62,000 (1450 pairs) • $2175 at $3 per viewbook • 3% yield on 725 prospects = 21 students • $15,000 per year per student not enrolled • $315,000 upper limit • Misapplied data • Lost credibility
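The cost figures on this slide follow from simple arithmetic. The sketch below reproduces them from the numbers given; the variable names are illustrative and not from the presentation, and the truncation of 21.75 to 21 students is an assumption made to match the slide.

```python
# Figures taken from the slide; variable names are illustrative only.
duplicate_prospects = 725       # duplicated prospects out of 62,000
viewbook_cost = 3               # dollars per viewbook
yield_rate = 0.03               # share of prospects who enroll
revenue_per_student = 15_000    # dollars per year per student

# Cost of mailing one extra viewbook to each duplicated prospect.
wasted_mailing = duplicate_prospects * viewbook_cost

# 3% of 725 is 21.75; truncating gives the 21 students quoted on the slide.
lost_students = int(duplicate_prospects * yield_rate)

# Upper limit on yearly revenue at risk if none of those students enroll.
upper_limit = lost_students * revenue_per_student

print(wasted_mailing, lost_students, upper_limit)
```

The mailing waste is small next to the enrollment figure, which is why the slide calls the latter an upper limit rather than an expected loss.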

  7. The Little Brwon Chruch

  8. Top Ten List

  9. Where’s My Church? How many ways can you say, “First Baptist?”

  10. …Let Me Count the Ways

  11. Faron’s New Top Ten List

  12. It’s Not My Mess!!! • Why should I clean it up? • Overall financial impact • Data expertise • “Data Mine” instead of “Not Mine” • Improved accuracy • Partner with data owners

  13. Cleaning Tools • DataFlux dfPower Suite (GUI) • SAS Data Quality Server (Code module)

  14. Data Knowledge Definitions • ACCOUNT NUMBER • ADDRESS • CITY • DATE • E-MAIL • NAME • ORGANIZATION • PHONE • STATE • TEXT • ZIP

  15. Cleaning Functions • PARSE – Creates delimited text string • GENDER – Estimates gender based on name • MATCH – Creates index for matching • SCHEME – Standardizes data
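To make "creates delimited text string" concrete, here is a toy Python sketch of the idea behind PARSE for a personal name. It is not the DataFlux or SAS algorithm; the `parse_name` function, the prefix and suffix lists, and the `/` delimiter are all assumptions chosen for illustration.

```python
# Toy stand-in for a name parser; the token lists are hypothetical.
PREFIXES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def parse_name(raw, delimiter="/"):
    """Return prefix, given, middle, family and suffix joined by a delimiter."""
    # Split on whitespace and drop trailing periods and commas.
    tokens = [t.strip(".,") for t in raw.split()]
    tokens = [t for t in tokens if t]

    prefix = tokens.pop(0) if tokens and tokens[0].lower() in PREFIXES else ""
    suffix = tokens.pop() if tokens and tokens[-1].lower() in SUFFIXES else ""
    given = tokens.pop(0) if tokens else ""
    family = tokens.pop() if tokens else ""
    middle = " ".join(tokens)  # whatever remains between given and family

    return delimiter.join([prefix, given, middle, family, suffix])

print(parse_name("Dr. Laura Ann Smith Jr."))  # Dr/Laura/Ann/Smith/Jr
print(parse_name("Lauren Smith"))             # /Lauren//Smith/
```

Empty positions are kept so that every parsed string has the same number of fields, which makes the pieces easy to split apart again later.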

  16. Parsing

  17. Gender

  18. A Sneak Peek at Match Codes

  19. Using Match Codes • Prepare the data • Create multiple match codes • Create groups to target specific matches • Merge groups together • Remove repeated rows and sort by clusters • Print list for cleanup • Mark records for future match tests
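The steps above can be sketched in Python with a deliberately simple stand-in for a match code. The real match codes come from the vendor tools and are not reproduced here; `match_code`, `find_clusters`, `candidate_duplicates`, and the sample records are hypothetical.

```python
import re
from collections import defaultdict

def match_code(text):
    """Toy match code: keep letters and digits, drop vowels after the
    first character, and collapse repeated characters."""
    cleaned = re.sub(r"[^A-Z0-9]", "", text.upper())
    if not cleaned:
        return ""
    skeleton = cleaned[0] + re.sub(r"[AEIOUY]", "", cleaned[1:])
    return re.sub(r"(.)\1+", r"\1", skeleton)

def find_clusters(records, fields):
    """Group record ids whose match codes agree on every listed field."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(match_code(rec[f]) for f in fields)
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

def candidate_duplicates(records, field_sets):
    """Merge the groups from several field combinations, drop repeated
    clusters, and return them sorted for a cleanup list."""
    merged = set()
    for fields in field_sets:
        for ids in find_clusters(records, fields):
            merged.add(tuple(sorted(ids)))
    return sorted(merged)

# Made-up sample data for illustration.
records = [
    {"id": 1, "name": "Kathryn Smith", "address": "123 Main St."},
    {"id": 2, "name": "Katherine Smyth", "address": "123 Main St"},
    {"id": 3, "name": "Laura Jones", "address": "9 Oak Ave"},
    {"id": 4, "name": "Lauren Jones", "address": "9 Oak Ave"},
]

clusters = candidate_duplicates(records, [("name", "address"), ("name",)])
print(clusters)  # [(1, 2)]
```

Records 3 and 4 share an address but keep distinct name codes, so two different people in one household are not flagged, while the misspelled pair is listed for manual cleanup.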

  20. Scheming Against the Data • Create scheme (church names, cities) • Customize & finalize schemes • Apply schemes to data • Use SAS code to override exceptions • Compare with original entries • Update student information system • Automate the process
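A minimal Python sketch of the apply, override, and compare steps follows. The presentation uses SAS and DataFlux schemes for this; the lookup table, the record ids, and the hand-confirmed override below are invented for illustration.

```python
import re

def normalize_key(raw):
    """Lowercase, strip punctuation, and collapse whitespace for lookup."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", raw.lower())
    return " ".join(cleaned.split())

# Hypothetical scheme: many spellings map to one standard value.
SCHEME = {
    "first baptist": "First Baptist Church",
    "1st baptist": "First Baptist Church",
    "fbc": "First Baptist Church",
}

def apply_scheme(entries, scheme, overrides):
    """Standardize each entry; an override for a record id wins over the scheme."""
    result = {}
    for record_id, raw in entries.items():
        if record_id in overrides:
            result[record_id] = overrides[record_id]
        else:
            result[record_id] = scheme.get(normalize_key(raw), raw)
    return result

def changed_entries(entries, standardized):
    """Pair original and standardized values wherever they differ."""
    return {
        record_id: (entries[record_id], standardized[record_id])
        for record_id in entries
        if entries[record_id] != standardized[record_id]
    }

# Made-up entries keyed by record id.
entries = {
    101: "1st Baptist",
    102: "First Baptist Church",
    103: "F.B.C.",
    104: "Little Brown Church",
}
overrides = {103: "Faith Bible Church"}  # hypothetical exception confirmed by hand

standardized = apply_scheme(entries, SCHEME, overrides)
print(changed_entries(entries, standardized))
```

Reviewing only the changed pairs keeps the comparison with the original entries short before anything is written back to the student information system.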

  21. Questions? Faron_Kincheloe@baylor.edu
