1 / 26

Unsupervised Inference of Data Formats in Human-Readable Notation

Unsupervised Inference of Data Formats in Human-Readable Notation. Christopher Scaffidi Carnegie Mellon University. Target audience. In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.

Télécharger la présentation

Unsupervised Inference of Data Formats in Human-Readable Notation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Unsupervised Inference of Data Formats inHuman-Readable Notation Christopher Scaffidi Carnegie Mellon University

  2. Target audience • In 2012, we project that there will be 90 millioncomputer end users (“EUs”) in American workplaces. • Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). • For professional programmers, programs are a deliverable. • For EUPs, programs are a means to an end. motivation ●overview ● algorithm ●evaluation ● conclusion

  3. Current practice:Storing data as strings • Typical tools of EUPs: • Excel “text” cells • Access/MSSQL “varchar” fields • FrontPage/Dreamweaver “textfield” inputs • Validation involves… • Learning an exotic new notation (VBScript, regexps, etc) • Writing cumbersome expressions in that notation • Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations. motivation ●overview ● algorithm ●evaluation ● conclusion

  4. Our Topes system to date… • Formats are presented in human-readable notationin our format editor • Format = sequence of parts with constraints on parts • Constraints can be “often” true (rather than “always”) • The format is automatically converted to a context-free grammar, with constraints attached to productions. • At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value. motivation ●overview ● algorithm ●evaluation ● conclusion

  5. Needed: Format inference • Problem: To date, the system offers limited support for helping users to get started. • Although users do not need to learn specialized notation, there is still the cognitive work of… • examining data • breaking it into parts • representing parts in the format • Solution: Infer a boilerplate format from examples motivation ●overview ● algorithm ●evaluation ● conclusion

  6. Talk Outline • Motivation / Problem • Solution • Overview of Topei • Inference algorithm • Evaluation • Conclusion motivation ●overview ● algorithm ●evaluation ● conclusion

  7. PrototypeTask flow diagram User creates a format from scratch or User highlights spreadsheet cells Plug-in flags cells that don’t match format User loads an existing format from a file or Algorithm infers a format from cell values User reviews and customizes format [1][6] motivation ●overview ● algorithm ●evaluation ● conclusion

  8. Sample task: validating a spreadsheetwith the prototype we have built • The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in. motivation ●overview ● algorithm ●evaluation ● conclusion

  9. Sample task: validating a spreadsheetCustomizing an inferred format • Inferred format is presented in editor with sentence-like prompts to improve human-readability • User can specify meaningful names for parts motivation ●overview ● algorithm ●evaluation ● conclusion

  10. Sample task: validating a spreadsheetCustomizing constraints in our prototype • User can add/edit constraints motivation ●overview ● algorithm ●evaluation ● conclusion

  11. Sample task: validating a spreadsheetFlagging potential errors • A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message motivation ●overview ● algorithm ●evaluation ● conclusion

  12. Our algorithm has 2 phases Input: An array of strings Phase 1: Identify format parts Phase 2: Identify constraints on each part of each format Output: An array of formats • Sorted according to how many examples they match motivation ●overview ● algorithm●evaluation ● conclusion

  13. Phase 1: Identify format parts • For each string, replace each character with its class, then collapse runs, generating a string “signature” • Supported character classes: A uppercase letter a lowercase letter 0 digit • Example: apple@gmail.com a@a.a banana1@hotmail.com a0@a.a carrot@company.com a@a.a DATE@UNIVERSITY.edu A@A.a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion

  14. Phase 1: Identify format parts • Pack strings with identical signatures (often leads to significant performance improvement) • Example: apple@gmail.com a@a.a banana1@hotmail.com a0@a.a carrot@company.com <<packed with 1st, above>> DATE@UNIVERSITY.edu A@A.a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion

  15. Phase 1: Identify format parts • Align signatures based on separators • Example: apple@gmail.com a @ a . a banana1@hotmail.com a0 @ a . a DATE@UNIVERSITY.edu A @ A . a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion

  16. Phase 1: Identify format parts • Abstract to least general composite character class, yielding the parts of each format. • Example (3 formats below): apple@gmail.com a @ a . a banana1@hotmail.com a0 @ a . a DATE@UNIVERSITY.edu A @ A . a a0A aA a eggplant@firm-name.com a @ a - a . a fig.plant@mail.univ.edu a . a @ a . a . a motivation ●overview ● algorithm●evaluation ● conclusion

  17. Phase 2: Identify constraints on each part • Constrain each part’s contents to the character classes • Require indicated separators before/after parts • Infer an additional content constraint that is “often” true: • Must be in a set of 3 or fewer literals? • Must be in a numeric range? • Must start with or end with certain characters? • A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature. • Afterward, the user can review/customize format. motivation ●overview ● algorithm●evaluation ● conclusion

  18. Evaluation as an outlier finder • Outlier finding: • Infer a format from example values • Use the inferred format to check the examples  Reveals “outliers” that might contain typos or other errors • Comparison algorithm: Lapis Lapis example @DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing @ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing @ShortYear is Number equal to /\d\d/ ignoring nothing Date is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation motivation ●overview ● algorithm ●evaluation ● conclusion

  19. Evaluation dataDrawn from EUSES spreadsheet corpus • 6288 US phone numbers in 37 columns • First cell in column contains “phone” • And at least 20 cells have exactly 10 digits • And at least 2/3 of cells have exactly 10 digits • 1124 country names in 7 columns • First cell in column contains “country” • And there are at least 20 cells • And at least one cell contains “Portugal” motivation ●overview ● algorithm ●evaluation ● conclusion

  20. Run each algorithm (Topei & Lapis)and compare their output to hand-labeling • For determining “true outliers” in calculating accuracy: • Outlier phone numbers have an area code that is not in service, or if they contain errant separators such as spaces not shared by most cells in the column. • Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English-speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy] motivation ●overview ● algorithm ●evaluation ● conclusion

  21. Results: Topei’s precision/recall exceed Lapis’s • Standard machine learning measures for outlier finding • Precision = # outliers found / # outliers claimed • Recall = # outliers found / # true outliers motivation ●overview ● algorithm ●evaluation ● conclusion

  22. Limitations & future work • Topei still makes mistakes: • Doesn’t infer constraints aggressively enough • Doesn’t recognize non-ASCII chars in character classes • Doesn’t handle formats with repeating parts • Need deeper integration with EUPs’ tools • Computational complexity: • Is intended to be O(# examples), seems to be true • More careful verification needed • Usability has not yet been evaluated in user study motivation ●overview ● algorithm ●evaluation ● conclusion

  23. Thank You… • …to you for your interest and attention • …to INSTICC for the opportunity to present • …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929) motivation ●overview ● algorithm ●evaluation ● conclusion

  24. Another example:Carnegie Mellon University phone #: 8-1234 motivation ●overview ● algorithm ●evaluation ● conclusion

  25. Integration • Formats can be inferred from… • Spreadsheet cells • Database queries (e.g.: Access/MSSQL) • Arbitrary collection of text strings (via C#) • Formats can then be used without modification in other venues, as well. • E.g.: infer a format from spreadsheet cells, then use it to create a trigger for a database table motivation ●overview ● algorithm ●evaluation ● conclusion

  26. Related Work • Many algorithms train a recognizer to notice features • See (Mitchell, 1997) for a summary • Such algorithms do not infer a human-editable format. • Others generate formats in specialized notation. • (Miller, 2001) (Blackwell, 2001) (Lerman, 2000), (Lieberman, 2001) (Nardi, 1998) • Regular expressions and CFGs have limited readability. • Several tools recognize or manipulate some of the same kinds of data as Topei. • (Hong, 2006) (Pandit, 1997) (Stylos 2004) • Custom formats are unsupported (only hardcoded formats) motivation ●overview ● algorithm ●evaluation ● conclusion

More Related