
Statistical Magic: Progress in Automatic Tool Generation for Ad Hoc Data


Presentation Transcript


  1. Statistical Magic: Progress in Automatic Tool Generation for Ad Hoc Data
  Qian Xi, 2008/5/13
  Joint work with Professor David Walker, Kathleen Fisher (AT&T) & Kenny Zhu

  2. Ad Hoc Data
  • Standardized data formats: HTML, XML
  • Data processing tools for them: visualizers (HTML browsers), XQuery
  • Ad hoc data: non-standard, semi-structured
  • Not many data processing tools
  • Examples: web server logs (CLF), phone call provisioning data, train schedules, stock trading info…

  A comma-separated transit table:
    Table 1-9: ADA-Accessible Rail Transit Stations by Agency,,,,,,,,,,,,,,,
    Type of rail transit / agency,Primary city served,Number of stations,,,,,,,Number of ADA-accessible stations,,,,,,
    ,,1996,1997,1998,1999,2000,2001,2002,1996,1997,1998,1999,2000,2001,2002
    Heavy rail,,,,,,,,,,,,,,,
    Bay Area Rapid Transit,"San Francisco, CA",36,39,39,39,39,39,39,36,39,39,39,39,39,39
    Los Angeles County Metropolitan Transportation Authority ,"Los Angeles, CA",5,8,8,13,16,16,16,5,8,8,13,16,16,16

  Pipe-delimited phone call provisioning records:
    9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
    9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001

  Stock trading info:
    YHOO YAHOO INC 23.93 10:15AM ET 4.74 (16.53%) 103,880,260
    MSFT MICROSOFT CP 29.93 10:15AM ET 0.69 (2.36%) 39,165,715
    SPY S&P DEP RECEIPTS 141.52 10:10AM ET 0.01 (0.01%) 20,723,717
    QQQQ POWERSHARES QQQ TR 148.88 10:15AM ET 0.11 (0.23%) 22,074,278
    CSCO CISCO SYS INC 26.68 10:15AM ET 0.07 (0.26%) 13,934,552
    CFC COUNTRYWIDE FNL CP 5.36 10:10AM ET 0.62 (10.37%) 13,019,603

  Web server log (CLF):
    207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
    244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941

  3. Analytical Tasks
  • Format converters: XML converter
  • Statistical analyzers: e.g., which pages on the website are visited most frequently?
  • Visualizers (graph from http://www.data360.org)

  4. learnPADS Goal
  • Automatically generate a description of the format
  • Automatically generate a suite of data processing tools (XML converter, grapher, …) from raw records such as “0,24”, “bar,end”, “foo,16”

  5. learnPADS Architecture
  Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (Profiler, XML converter) → XML, Analysis Report
  Slide credit: Kenny Zhu, POPL ’08 talk

  6. learnPADS Architecture
  Chunking & Tokenization turns the raw chunks “0, 24”, “bar, end”, “foo, 16” into token sequences:
    Quote Int Comma Int Quote
    Quote String Comma String Quote
    Quote String Comma Int Quote
  Structure Discovery then produces:
    Quote (String | Int) Comma (String | Int) Quote
  followed by Format Refinement, the PADS Compiler, and the final Data Description.

  7. Motivation: Token Ambiguity Problem (TAP)
  • Given a string, there are multiple ways to tokenize it.
  • Example 1: 127.0.0.1 (enumerated in the sketch below)
    • IP
    • Float Dot Float
    • Int Dot Int Dot Int Dot Int
  • Example 2: a free-form log line
    • Message
    • Word White Word White Word White ... White URL
    • Word White Quote Filepath Quote White Word White ...
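  The ambiguity is easy to reproduce. Below is a minimal Python sketch (not part of learnPADS; the four regular expressions are hypothetical stand-ins for the PADS base tokens) that enumerates every valid tokenization of 127.0.0.1:

    import re

    # Hypothetical token definitions standing in for the PADS base types.
    TOKENS = [
        ("IP",    re.compile(r"\d{1,3}(\.\d{1,3}){3}")),
        ("Float", re.compile(r"\d+\.\d+")),
        ("Int",   re.compile(r"\d+")),
        ("Dot",   re.compile(r"\.")),
    ]

    def tokenizations(s, i=0):
        """Yield every token sequence that parses s[i:], where each token
        consumes its longest (greedy) match at its starting position."""
        if i == len(s):
            yield []
            return
        for name, rx in TOKENS:
            m = rx.match(s, i)
            if m and m.end() > i:
                for rest in tokenizations(s, m.end()):
                    yield [name] + rest

    for seq in tokenizations("127.0.0.1"):
        print(seq)
    # 6 sequences, including ['IP'], ['Float', 'Dot', 'Float'],
    # and ['Int', 'Dot', 'Int', 'Dot', 'Int', 'Dot', 'Int']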

  8. How Does learnPADS Deal with TAP?
  • Tokenization phase: take the first, longest match (e.g., Float rather than Int, ID, or Path; see the sketch below).
  • A fixed token order is assigned by the end user; we have no principled order to pick.
  As a result, the current learning system:
  • can’t have ambiguous base tokens — Message, Text, ID;
  • sometimes produces descriptions that are too precise.
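  For contrast, a sketch of that lex-style rule: first, longest match under a fixed priority order. The priority list here is hypothetical; in learnPADS the token set and its ordering come from the end user.

    import re

    # Hypothetical priority order, highest first.
    PRIORITY = [
        ("Float", re.compile(r"\d+\.\d+")),
        ("Int",   re.compile(r"\d+")),
        ("Dot",   re.compile(r"\.")),
        ("Word",  re.compile(r"[A-Za-z]+")),
        ("Other", re.compile(r"[\s\S]")),   # catch-all: any single character
    ]

    def lex(s):
        """At each position take the longest match; on a tie, the token
        listed earlier in PRIORITY wins (max() keeps the first maximum)."""
        i, out = 0, []
        while i < len(s):
            name, m = max(
                ((n, rx.match(s, i)) for n, rx in PRIORITY),
                key=lambda nm: nm[1].end() if nm[1] else -1,
            )
            out.append((name, m.group()))
            i = m.end()
        return out

    print(lex("127.0.0.1"))
    # [('Float', '127.0'), ('Dot', '.'), ('Float', '0.1')] -- never IP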

  9. A Concrete Example: Tokenization by Lex
  Raw: Sat Jun 24 06:38:46 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port
  Lex: date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]

  10. Inspiration
  • Humans distinguish tokens by their background knowledge:
    • Purl: usually starts with “http://”
    • Pdate: “March” with “12:30:55” nearby
    • Ptext and Pmessage: long, sometimes run to the end of a line
  • This knowledge can be encoded in statistical models.
  • Statistical models are very successful in natural language processing and speech recognition.

  11. learnPADS Architecture Recall
  Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (Profiler, XML converter) → XML, Analysis Report

  12. Tokenization Problem Specification
  • Inputs:
    • a set of tokens with regular expression definitions
    • a collection of strings annotated with token sequences (a tool labels the chunks automatically, given a description)
    • a test string
  • Output: a valid, best token sequence for the test string
  • This is a supervised learning problem.

  13. Token Sequence Representation: Seqset
  • A valid token sequence for a given string:
    • can parse the string
    • obeys the longest-match rule.
  • An example: 0.1
    • valid: Int Dot Int
    • invalid: Float Dot Int (Float at position 0 must take its longest match, “0.1”, so it cannot stop at “0”)
  • Seqset: a directed acyclic graph (DAG) representing all possible token sequences of a chunk (constructed in the sketch below).
    • Vertices: characters / positions
    • Edges: tokens, running from the position where the token starts to the position where it ends
  • A string with 26 characters can have 14,521,680 token sequences.
  (DAG for “0.1”: an Int–Dot–Int path plus a single Float edge spanning the whole string.)
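  A minimal sketch of seqset construction, again over a hypothetical three-token set; counting the 0 → n paths in the DAG gives the number of valid token sequences without ever materializing them:

    import re
    from functools import lru_cache

    TOKENS = [("Float", re.compile(r"\d+\.\d+")),
              ("Int",   re.compile(r"\d+")),
              ("Dot",   re.compile(r"\."))]

    def build_seqset(s):
        """Edges (i, j, tok): token tok spans s[i:j], taking its longest
        match at position i. Vertices are character positions 0..len(s)."""
        return [(i, m.end(), name)
                for i in range(len(s))
                for name, rx in TOKENS
                for m in [rx.match(s, i)] if m]

    def count_sequences(s, edges):
        """Valid token sequences = paths from vertex 0 to vertex len(s)."""
        @lru_cache(None)
        def paths(i):
            if i == len(s):
                return 1
            return sum(paths(j) for a, j, _ in edges if a == i)
        return paths(0)

    edges = build_seqset("0.1")
    print(edges)  # [(0, 3, 'Float'), (0, 1, 'Int'), (1, 2, 'Dot'), (2, 3, 'Int')]
    print(count_sequences("0.1", edges))   # 2: [Float] and [Int, Dot, Int]

  The same DAG compactly encodes the 14,521,680 sequences mentioned above for a 26-character string.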

  14. 2-Step Tokenization Algorithm
  • Given token definitions as regular expressions, construct the Seqsets.
  • For each record, find the most likely token sequence, using either:
    • a Hidden Markov Model, or
    • a Hierarchical Maximum-Entropy Model.

  15. Hidden Markov Model (HMM)
  • Observation ci: a character feature vector rather than the raw character (see the sketch below)
    • character features: upper/lower case, digit, punctuation...
  • Hidden state si: a token (or partial token)
  Example “foo,16”: hidden states Quote Word Word Word Comma Int Int Quote over the characters “ f o o , 1 6 ”
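  A sketch of the character-feature observation, using the feature classes named on the slide (the exact feature set in learnPADS may differ):

    def char_features(c):
        """Map a character to the feature vector the HMM observes,
        instead of the raw character itself."""
        return (c.isupper(),    # upper-case letter
                c.islower(),    # lower-case letter
                c.isdigit(),    # digit
                c.isspace(),    # white space
                not (c.isalnum() or c.isspace()))  # punctuation / other

    # The hidden state per character is a (partial) token:
    states = ["Quote", "Word", "Word", "Word", "Comma", "Int", "Int", "Quote"]
    for c, state in zip('"foo,16"', states):
        print(state, char_features(c))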

  16. Hidden Markov Model Parameters

  17. Test Data Sources

  18. HMM Discussion
  • Error rate: percentage of tokens not identified w.r.t. the labeled token sequences.
  • lex baseline: fixed token priority; take the first, longest match.

  19. More HMM Tokenization Results

  20. Hierarchical Maximum-Entropy Model
  (Tree over the chunk “foo,16”: the token sequence Quote Word Comma Int Quote above the characters.)

  21. Hierarchical Max-Ent Model Discussion
  Average log probability = log likelihood / length of token sequence
  s = “qxi qxi@cs.princeton.edu 1.63”
  P( ID | “qxi” ) = 0.8   P( White | “ ” ) = 1.0   P( Email | “qxi@cs.princeton.edu” ) = 0.9   P( Float | “1.63” ) = 0.9
  P( White | Others ) = 0.7   P( Others | White ) = 0.8
  Normal (product of probabilities):
  P( ID White Email White Float | s ) = 0.8 × 1.0 × 0.9 × 1.0 × 0.9 × 0.7 × 0.8 × 0.7 × 0.8 = 0.203
  P( Blob | s ) = 0.3, so the raw product prefers the vague Blob.
  Average log:
  P( ID White Email White Float | s ) = [ (-0.097) + (-0.046) + (-0.046) + (-0.155) + (-0.097) + (-0.155) + (-0.097) ] / 5 = -0.139
  P( Blob | s ) = -0.523, so normalizing by length lets the detailed sequence win.
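  The slide’s arithmetic, reproduced as a quick Python check (base-10 logs, matching the numbers above):

    import math

    emit  = [0.8, 1.0, 0.9, 1.0, 0.9]   # ID, White, Email, White, Float
    trans = [0.7, 0.8, 0.7, 0.8]        # ID->White, White->Email, Email->White, White->Float

    print(round(math.prod(emit + trans), 3))                       # 0.203
    print(round(sum(math.log10(p) for p in emit + trans) / 5, 3))  # -0.139 (5 tokens)

    # The one-token alternative Blob:
    print(0.3)                          # 0.300 > 0.203 -- raw product prefers Blob
    print(round(math.log10(0.3), 3))    # -0.523 < -0.139 -- average log prefers the detail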

  22. Hierarchical Max-Ent Model Discussion
  • Error rate: percentage of tokens not identified w.r.t. the labeled token sequences.
  • Compared: lex; normal emission probabilities; average-log emission probabilities.

  23. Hierarchical Max-Ent Model Results

  24. Lex vs. HMM vs. Hierarchical Max-Ent
  • Admitting ambiguous base tokens increases ambiguity.
  • The training corpus is not large enough.

  25. learnPADS Architecture Recall
  Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (Profiler, XML converter) → XML, Analysis Report

  26. Structure Discovery Phase
  “0,24” “bar,end” “foo,16” → { Struct, Union, Array }
  Token sequences of the three chunks:
    Quote Int Comma Int Quote
    Quote String Comma String Quote
    Quote String Comma Int Quote
  Discovered structure: a Struct of Quote, Union(String | Int), Comma, Union(String | Int), Quote.
  Struct: classify chunks by token counts { (Quote, 2), (Comma, 1) } (see the sketch below).
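  A sketch of that classification step; the real structure-discovery phase computes histograms over all tokens, but the idea is the same (the separator tokens here are chosen by hand):

    from collections import Counter, defaultdict

    chunks = [["Quote", "Int", "Comma", "Int", "Quote"],        # "0,24"
              ["Quote", "String", "Comma", "String", "Quote"],  # "bar,end"
              ["Quote", "String", "Comma", "Int", "Quote"]]     # "foo,16"

    def signature(seq, separators=("Quote", "Comma")):
        """Chunks whose separator-token counts agree go into one Struct."""
        return tuple(sorted(Counter(t for t in seq if t in separators).items()))

    groups = defaultdict(list)
    for seq in chunks:
        groups[signature(seq)].append(seq)

    print(dict(groups))
    # All three chunks share {(Comma, 1), (Quote, 2)}, so one Struct is
    # proposed: Quote (String|Int) Comma (String|Int) Quote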

  27. Extended Viterbi Algorithm
  Token counts for the chunk: { (Quote, 2), (Comma, 1) }.
  (Lattice figure: positions 0 … n across the top; each lattice state pairs a candidate token with the counts consumed so far, growing from {(Quote, 0), (Comma, 0)} at position 0 to {(Quote, 2), (Comma, 1)} at position n; edges carry probabilities such as P_(0,0)_Msg, P_(1,0)_Quote, P_(2,1)_Quote.)

  28. Evaluation 1: Qualitative Judgment by Humans

  29. Evaluation 2: Complexity Scores
  • Minimum Description Length (MDL) principle: minimize the total cost (in bits) of transmitting the data, i.e. (toy illustration below)
    • the cost (in bits) to transmit the description: CT
    • plus the cost (in bits) to transmit the data given the description: CD
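  A toy illustration of the CT + CD trade-off; the bit costs below are invented for the example, while the real encoding is defined over PADS descriptions:

    import math

    records = ["24", "16", "103880260", "39165715"]

    # CD under an Int description: about log2(v) + 1 bits per value.
    cd_int = sum(math.ceil(math.log2(int(r) + 1)) for r in records)
    # CD under a vague Blob description: 8 bits per raw character.
    cd_blob = sum(8 * len(r) for r in records)

    # Hypothetical description costs CT: Int is costlier to state than Blob.
    print("Int:  CT + CD =", 10 + cd_int)    # 73 bits
    print("Blob: CT + CD =", 5 + cd_blob)    # 173 bits -- the precise description wins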

  30. Evaluation 3: Execution Time

  31. Evaluation 4: Success Rates

  32. Related Work
  • Grammar induction & structure discovery without the token ambiguity problem:
    • Arasu & Garcia-Molina ’03: extracting structure from web pages
    • Garofalakis et al. ’00: XTRACT for inferring DTDs
    • Kushmerick et al. ’97: wrapper induction
  • Detecting table row components with Hidden Markov Models & Conditional Random Fields: Pinto et al. ’03
  • Extracting certain fields in records from text: Borkar et al. ’01
  • Predicting exons and introns in DNA sequences using generalized HMMs: Kulp ’96
  • Part-of-speech tagging in natural language processing: Heeman ’99 (decision trees)
  • Speech recognition: Rabiner ’89

  33. Future Work
  • Statistical model accuracy:
    • HMM: parameter re-estimation by the Baum-Welch algorithm
    • Hierarchical Max-Ent model: token generating model P(S|T)
  • How to make use of “vertical” information:
    • one record is not independent of others
    • not suitable for large data sets
    • key: alignment
    • Conditional Random Fields
  • Online learning: old description + old data + new data → new description

  34. Contributions
  • Resolve the Token Ambiguity Problem with statistical approaches that use all possible token sequences.
  • Integrate two statistical approaches into the learnPADS framework:
    • Hidden Markov Model
    • Hierarchical Maximum-Entropy Model
  • Improve chunk partitioning in structure discovery with the help of Seqsets.
  • Evaluate correctness and performance by a number of measures.
  • Results show that multiple token sequences and statistical methods achieve partial success.

  35. End

  36. Extended Viterbi Algorithm with Token Counts
  Input: a hidden Markov model H, a string C = c1...cn, token counts tc = [(s1, f1), ..., (sl, fl)]
  Output: the best token sequence that satisfies tc (a sketch follows).
  Define δk(s, (o1, ..., ol)) as the probability of the most likely token sequence parsing c1...ck that ends in token s, where each counted token si occurs oi times up to position k; the answer is the sequence achieving the best δn(s, (f1, ..., fl)).
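  A minimal sketch of the dynamic program, over a hand-built seqset for the chunk “foo,16”; the edge probabilities are invented, whereas in learnPADS they come from the HMM:

    import math
    from functools import lru_cache

    # Seqset edges (start, end, token, prob) for "foo,16".
    EDGES = [(0, 1, "Quote", 0.9), (1, 4, "Word", 0.8), (1, 4, "Id", 0.3),
             (4, 5, "Comma", 0.9), (5, 7, "Int", 0.8), (5, 7, "Word", 0.2),
             (7, 8, "Quote", 0.9)]
    N, TC = 8, (("Comma", 1), ("Quote", 2))   # required token counts

    @lru_cache(None)
    def best(pos, remaining):
        """Highest-log-prob path pos -> N that uses up exactly the
        'remaining' counts; returns (log prob, tokens) or None."""
        if pos == N:
            return (0.0, ()) if all(c == 0 for _, c in remaining) else None
        answer = None
        for i, j, tok, p in EDGES:
            if i != pos:
                continue
            rem = dict(remaining)
            if tok in rem:
                if rem[tok] == 0:          # would exceed the required count
                    continue
                rem[tok] -= 1
            tail = best(j, tuple(sorted(rem.items())))
            if tail is not None:
                cand = (math.log(p) + tail[0], (tok,) + tail[1])
                answer = max(answer, cand) if answer else cand
        return answer

    print(best(0, TC))
    # (-0.76..., ('Quote', 'Word', 'Comma', 'Int', 'Quote'))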

  37. Proof

  38. Reduce Execution Time
  • Parallel computing:
    • seqset construction
    • most likely token sequence finder
    • both are “embarrassingly parallel”
  • Learn the description from a portion of the data:
    • how much data is needed to learn a good description?

  39. Better Training Set?
  • lex
  • HMM: 19/20 data sources as training data
  • HMM: 19/20 data sources + 5% of the test data source as training data

  40. Find Common Initial Tokens
  Union: classify chunks into different branches by the first token of each chunk.
  Given S = { chunk_i | i = 1..n }, T = { token_i | i = 1..k }, and init(chunk_i) = { token_i1, ..., token_iz } ⊆ T, find the smallest subset I of T such that for all i = 1..n, init(chunk_i) ∩ I ≠ ∅.
  Set Cover Problem: with S(token_i) = { chunk_j | token_i ∈ init(chunk_j) }, select a minimum number of these sets whose union contains every chunk.
  NP-complete; approximation: greedy algorithm (sketched below).
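  A sketch of the greedy approximation, with hypothetical init() sets:

    def greedy_cover(init_sets):
        """Repeatedly pick the token whose branch covers the most chunks
        that are still uncovered (the classic greedy approximation)."""
        covers = {}
        for chunk, tokens in enumerate(init_sets):
            for t in tokens:
                covers.setdefault(t, set()).add(chunk)
        uncovered, picked = set(range(len(init_sets))), []
        while uncovered:
            t = max(covers, key=lambda t: len(covers[t] & uncovered))
            picked.append(t)
            uncovered -= covers[t]
        return picked

    # Hypothetical init() sets: the tokens each chunk may start with.
    print(greedy_cover([{"Int", "Float"}, {"Float"}, {"Word"}]))
    # ['Float', 'Word'] -- two union branches cover all three chunks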

  41. Evaluation 1: Qualitative Judgment by Humans

  42. Evaluation 2: Complexity Score

  43. A Concrete Example: Tokenization by HMM
  Raw: Sat Jun 24 06:38:46 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port
  Lex: date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]
  HMM: word[Sat] white[ ] word[Jun] white[ ] int[24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation:[[[] int[120] punctuation:][]] punctuation::[:] message[mach_msg() reply failed] punctuation::[:] message[(ipc/send) invalid destination port]

  44. A Concrete Example: Tokenization by Max-Ent
  Raw: Sat Jun 24 06:38:46 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port
  Lex: date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]
  HMM: word[Sat] white[ ] word[Jun] white[ ] int[24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation:[[[] int[120] punctuation:][]] punctuation::[:] message[mach_msg() reply failed] punctuation::[:] message[(ipc/send) invalid destination port]
  Max-Ent: date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation:[[[] int[120] punctuation:][]] punctuation::[:] message[mach_msg() reply failed] punctuation::[:] message[(ipc/send) invalid destination port]
