
Grammar Induction


oliana

Presentation Transcript


  1. Grammar Induction An ADIOS Review

  2. ADIOS in outline • Composed of three main elements • A representational data structure • A segmentation criterion (MEX) • A generalization ability • We will consider each of these in turn

  3. The Model: Graph representation with words as vertices and sentences as paths. [Figure: pseudograph with BEGIN and END nodes and word vertices (is, that, a, the, where, and, cat, dog, horse, ?); numbered edges trace paths 101–104, one per sentence.] Example sentences: And is that a horse? Is that a dog? Where is the dog? Is that a cat?
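The graph representation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the ADIOS implementation: the function name `build_pseudograph`, the tuple-keyed edge dictionary, and the decision to treat "?" as its own token are all assumptions made for the example.

```python
from collections import defaultdict

BEGIN, END = "BEGIN", "END"

def build_pseudograph(sentences):
    """Return (paths, edges) for a list of whitespace-tokenized sentences.

    paths: one vertex list per sentence, wrapped in BEGIN/END nodes.
    edges: maps (source, target) to the number of path steps that use it.
    """
    paths = []
    edges = defaultdict(int)
    for sentence in sentences:
        path = [BEGIN] + sentence.lower().split() + [END]
        paths.append(path)
        # Each consecutive vertex pair along a path contributes one edge traversal.
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return paths, dict(edges)

corpus = [
    "and is that a horse ?",
    "is that a dog ?",
    "where is the dog ?",
    "is that a cat ?",
]
paths, edges = build_pseudograph(corpus)
```

Because every sentence shares the same vertices, common sub-paths such as "is that a" show up as heavily used edges, which is what makes pattern detection on the graph convenient.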

  4. Detecting significant patterns • Identifying patterns becomes easier on a graph • Sub-paths are automatically aligned

  5. Rewiring the graph Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.

  6. Motif EXtraction

  7. The Markov Matrix • The top right triangle defines the PL probabilities, bottom left triangle the PR probabilities • Matrix is path-dependent

  8. Pattern significance • Say we found a potential pattern-edge from nodes 1 to n. Define • m - the number of paths from 1 to n • r – the number of paths from 1 to n+1 • Because it’s a pattern edge, we know that • Let’s suppose that the true probability for n+1 given 1 through n is • r/m is our best estimate, but just an estimate • What are the odds of getting r and m but still have ?

  9. Pattern significance • Assume • The odds of getting result r and m or better are then given by • If this is smaller than a predetermined α, we say the pattern-edge candidate is significant
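The formulas on the two "Pattern significance" slides are not present in the transcript. One plausible reading, stated here as an assumption, is a one-sided binomial test: a pattern edge is a candidate when r/m falls below the MEX threshold eta, and the candidate is significant when the probability of seeing r or fewer continuations out of m, under a true continuation probability equal to eta, is below alpha. The sketch below implements that reading; the function names are hypothetical, and the defaults 0.8 and 0.01 are taken from the parameter slide later in the deck.

```python
from math import comb

def pattern_edge_p_value(r, m, eta):
    """P(X <= r) for X ~ Binomial(m, eta).

    Assumed reading of the slide: the chance of observing r or fewer
    continuations among m paths if the true probability were eta.
    """
    if m <= 0 or not 0 <= r <= m:
        raise ValueError("need m > 0 and 0 <= r <= m")
    return sum(comb(m, i) * eta**i * (1 - eta) ** (m - i) for i in range(r + 1))

def is_significant(r, m, eta=0.8, alpha=0.01):
    """Candidate edge (r/m < eta) whose binomial tail is below alpha."""
    return r / m < eta and pattern_edge_p_value(r, m, eta) < alpha
```

Under this reading, 2 continuations out of 10 paths would count as a significant drop, while 7 out of 10 would not, even though both ratios are below eta.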

  10. The algorithm so far • Initialization – load data into pseudograph • Until no more patterns are found do • For each path detect all sub-paths that live up to the MEX criterion • Pick best pattern, add it to graph and rewire paths

  11. How to choose patterns • Obviously, the more significant the pattern the better • It turns out that choosing longer patterns first helps when segmenting text • This lowers the probability of accidentally linking words • It also turns out that gradually increasing alpha helps

  12. ADIOS at work

  13. Syntagmatic and Paradigmatic relations • Words can take part in two kinds of relations with other words – • Syntagmatic relations – indicating that the words appear together in some contexts • Paradigmatic relations – indicating that the words can replace one another in a given context • Syntagmatic relations are discovered by MEX • Candidates for paradigmatic relations are established during a preprocessing step for each search path

  14. Generalization – defining an equivalence class • Corpus paths: • show me flights from philadelphia to san francisco on wednesdays • list all flights from boston to san francisco with the maximum number of stops • may i see the flights from denver to san francisco please • show flights from dallas to san francisco • Generalized search path: show me flights from {boston, philadelphia, denver, dallas} to san francisco on wednesdays

  15. Generalization • Generalized search path: show me flights from {boston, philadelphia, denver, dallas} to san francisco on wednesdays • Resulting pattern – P1: from _E1 to, where _E1 = {boston, philadelphia, denver, dallas} • Corpus paths containing the pattern: • list all flights going from boston to atlanta on wednesday… • i need to fly from boston to baltimore please give me… • which airlines fly from dallas to denver • please give me a flight from philadelphia to atlanta before ten a m in the morning

  16. Context-sensitive generalization • Slide a context window of size L across current search path • For each 1≤i≤L • look at all paths that are identical with the search path for 1≤k≤L, except for k=i • Define an equivalence class containing the nodes at index i for these paths • Replace i’th node with equivalence class • Find significant patterns using MEX criterion
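The equivalence-class step above can be illustrated with a small sketch. It is a simplified reading of the slide, not the ADIOS code: the function name `equivalence_class` is hypothetical, paths are plain token lists, and slots are indexed from 0 rather than from 1 as on the slide.

```python
def equivalence_class(window, paths, i):
    """Collect the nodes seen at slot i in sub-paths that match `window`
    at every other slot.

    window: the L nodes currently under the context window.
    paths:  all corpus paths, each a list of nodes.
    i:      0-based index of the slot being generalized.
    """
    length = len(window)
    members = {window[i]}
    for path in paths:
        for start in range(len(path) - length + 1):
            candidate = path[start:start + length]
            # Identical to the window for all k except k == i.
            if all(candidate[k] == window[k] for k in range(length) if k != i):
                members.add(candidate[i])
    return members

corpus = [
    "show me flights from philadelphia to san francisco on wednesdays".split(),
    "list all flights from boston to san francisco with the maximum number of stops".split(),
    "may i see the flights from denver to san francisco please".split(),
    "show flights from dallas to san francisco".split(),
]
window = ["from", "philadelphia", "to", "san", "francisco"]
cities = equivalence_class(window, corpus, 1)
```

With the flight sentences from the earlier slide, generalizing the second slot of the window "from philadelphia to san francisco" collects the four departure cities, after which MEX would be run on the generalized path.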

  17. Determining L • Involves a tradeoff • Larger L will demand more context sensitivity in the inference • Will hamper generalization • Smaller L will detect more patterns • But many might be spurious

  18. The effects of context window width

  19. When it all goes wrong • john believes that to please is easy • john thinks that to please is fun • jack and john believe that to please is hard • Generalized search path: john {believes, thinks, believe} that to please is easy

  20. Bootstrapping • Search path: what are the cheapest flights from denver to boston that stop in atlanta • A pre-existing equivalence class: {boston, philadelphia, denver, dallas} • Generalized search path I: what are the cheapest flights from {boston, philadelphia, denver, dallas} to {boston, philadelphia, denver, dallas} that stop in atlanta

  21. Bootstrapping • Generalized search path I: what are the cheapest flights from {boston, philadelphia, denver, dallas} to {boston, philadelphia, denver, dallas} that stop in atlanta • Matching corpus paths: • what is the cheapest fare from denver to philadelphia and from pittsburgh to atlanta • i would… like the cheapest airfare from boston to denver december twenty sixth • show me the cheapest flight from philadelphia to dallas which arrives…

  22. Bootstrapping • Generalized search path II: what are the cheapest {flight, flights, airfare, fare} from {boston, philadelphia, denver} to {denver, philadelphia, dallas} that stop in atlanta • Resulting pattern – _P2: the cheapest _E2 from _E3 to _E4 • _E2 = {flight, flights, airfare, fare} • _E3 = {boston, philadelphia, denver} • _E4 = {denver, philadelphia, dallas}

  23. Bootstrapping • Slide a context window of length L along the current search path • Consider all sub-paths of length L that begin in a1 and end in aL • These are the candidate paths • For each 1≤i≤L • For each 1≤k≤L, k≠i • Replace node k with the EC that contains node k and maximally overlaps the set of nodes at index k of the candidate paths • Continue as before

  24. The ADIOS algorithm • Initialization – load all data into a pseudograph • Until no more patterns are found • For each path P • Create generalized search paths from P • Detect significant patterns using MEX • If found, add best new pattern and equivalence classes and rewire the graph

  25. Alternative rewiring tacks • Single mode • As just described: the best pattern is selected and added to the graph • Multiple mode • All patterns from the current search path are added to the graph in order of significance • Batch mode • The search is conducted over all paths, and the best patterns are added at the end

  26. Another example

  27. More Patterns

  28. Evaluating performance • Define • Recall – the probability of ADIOS recognizing an unseen grammatical sentence • Precision – the proportion of grammatical ADIOS productions • Recall can be assessed by leaving out some of the training corpus • Precision is trickier • Unless we’re learning a known CFG

  29. An ADIOS drawback • ADIOS is inherently a heuristic and greedy algorithm • Once a pattern is created it remains forever, so errors compound • Sentence ordering affects the outcome • Running ADIOS with different orderings gives patterns that ‘cover’ different parts of the grammar

  30. An ad-hoc solution • Train multiple learners on the corpus • Each on a different sentence ordering • Create a ‘forest’ of learners • To create a new sentence • Pick one learner at random • Use it to produce sentence • To check grammaticality of given sentence • If any learner accepts sentence, declare as grammatical
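The forest idea can be sketched as follows. Everything here is illustrative: `ToyLearner` is a hypothetical stand-in for a trained ADIOS learner (it simply memorizes a fixed sentence set), and the `accepts` / `generate` interface is an assumption of this example, not part of the ADIOS executables.

```python
import random

class ToyLearner:
    """Hypothetical stand-in for one trained learner: knows a fixed sentence set."""

    def __init__(self, sentences):
        self.sentences = list(sentences)

    def accepts(self, sentence):
        return sentence in self.sentences

    def generate(self, rng):
        return rng.choice(self.sentences)

class LearnerForest:
    """Combine learners trained on different orderings of the same corpus."""

    def __init__(self, learners, seed=None):
        self.learners = list(learners)
        self.rng = random.Random(seed)

    def generate(self):
        # Pick one learner at random and let it produce the sentence.
        learner = self.rng.choice(self.learners)
        return learner.generate(self.rng)

    def is_grammatical(self, sentence):
        # A sentence is declared grammatical if any learner accepts it.
        return any(learner.accepts(sentence) for learner in self.learners)

forest = LearnerForest(
    [ToyLearner(["is that a dog ?"]), ToyLearner(["where is the dog ?"])],
    seed=0,
)
```

Accepting on any single vote favors recall: each learner covers a different part of the grammar, so the union accepts more sentences than any one learner, at the cost of also inheriting every learner's false positives.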

  31. The Real Deal http://www.tau.ac.il/~zsolan/adios/algorithm.html

  32. The ADIOS executables • A C++/Linux implementation • There are 4 relevant executables – • adios.exe • The actual implementation of the algorithm • create_graph.exe • Loads a corpus into the ADIOS pseudograph • scrambler.exe • Randomizes the order of sentences in a corpus • convert_grammar.exe • Converts a CFG to an ADIOS representation

  33. Preparing the corpus • Each path should be on a line of its own • Each line starts with a ‘*’ and ends with a ‘#’ • These represent the BEGIN and END nodes, respectively • Words (nodes) are separated by spaces • Example corpus:
      * Jim and Cindy have a winning personality #
      * Beth won't be released until Friday #
      * a horse barked #
      * the dog loved a cat #
      * the cats are living very far away #
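A raw list of sentences can be brought into this format with a short helper. This is a sketch under the formatting rules stated on the slide; the function names and the choice to skip empty sentences are assumptions of the example.

```python
def to_adios_line(sentence):
    """Wrap a whitespace-tokenized sentence in the '*' (BEGIN) and '#' (END) markers."""
    words = sentence.split()
    return " ".join(["*"] + words + ["#"])

def write_corpus(sentences, path):
    """Write one path per line, skipping sentences that contain no words."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            if sentence.strip():
                handle.write(to_adios_line(sentence) + "\n")
```

Splitting and re-joining on whitespace also normalizes stray double spaces, so every word ends up separated by exactly one space.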

  34. Creating the graph • Done by create_graph.exe – ./create_graph.exe -f corpus_file -o proj_name • Two files will be created – • proj_name.idx – an index file containing the list of nodes (the lexicon) and a numeric code for each node • proj_name.grp – a text file describing the pseudograph

  35. Running ADIOS • General usage – ./adios.exe [-options] -o proj_name • ADIOS continuously updates and saves the current graph and pattern files – • graph.dat • patterns.dat • sysparams.dat • These files, along with the index file, are important for all other ADIOS operations

  36. Training ADIOS • To train, the following parameters are usually used – ./adios.exe -a train -i proj.idx -g proj.grp -E 0.8 -S 0.01 -o proj • Some parameters – • -a – the action to perform (train / test / generate / print) • -i – the index file name • -g – the graph file name • -E – eta (the threshold used by MEX) – default 0.8 • -S – alpha (the significance level required by MEX) – default 0.01 • -o – the project name, which will be used for output and log files

  37. Some additional parameters • -W – the context window width – default 5 (use 1000 for no ECs) • -r – rewiring mode • 0 – no rewiring • 1 – single (the most commonly used) • 2 – multiple • 3 – batch (used for text segmentation) • -A – largest pattern size; all patterns above this size will be treated as equal in the rewiring process (default 1)

  38. Result files • proj.trace.log – a summary of the algorithm’s run • Includes several statistics throughout the processing of the corpus • proj.results.txt – the set of patterns the algorithm has detected, along with a ‘pattern spectra’ analysis • Best viewed with Excel

  39. Resuming training • If ADIOS stalls for some reason, or if you want to continue a run with different parameters (e.g. when incrementing alpha), use – ./adios.exe -a train -i proj_name.idx • ADIOS will use the existing graph.dat, patterns.dat and sysparams.dat files to resume its operation

  40. Testing ADIOS • ./adios.exe -a test -i idx_file -I test_file -R 10 -o proj • -I – the file containing the test sentences • -R – determines the maximum depth of the parse trees. Paths that require deeper parse trees will not be accepted. Default value – 10. • Assumes graph.dat and patterns.dat are in the same directory

  41. Testing ADIOS • Output files – • proj.test.results.txt • a detailed text file listing the partial parses of each test path • proj.test.summary.txt • a summary file, listing for each test path the patterns accepted on it and whether it’s accepted as a whole • proj.test.classify.txt • a text file with a 0/1 result for each test path (number of accepted sentences = number of lines with a ‘1’ in this file)

  42. Testing multiple learners • Running adios.exe on a second learner will not overwrite proj.classify.txt • Instead, each line will contain the number of learners that accepted the corresponding sentence
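The classification file can be tallied with a few lines of Python. The sketch assumes that each non-empty line holds a single integer (0/1 for one learner, or a count of accepting learners after several runs); the exact file layout is not shown in the slides, and `count_accepted` is a hypothetical helper name.

```python
def count_accepted(lines):
    """Return (accepted, total) from the lines of a classification file.

    Assumes one integer per non-empty line; a sentence counts as accepted
    when at least one learner accepted it.
    """
    votes = [int(line.strip()) for line in lines if line.strip()]
    accepted = sum(1 for vote in votes if vote >= 1)
    return accepted, len(votes)
```

Typical usage would be `with open(classify_file) as fh: accepted, total = count_accepted(fh)`, where `classify_file` is the path to the classification output. If the test file holds grammatical sentences left out of training, `accepted / total` gives the recall estimate described on the evaluation slide.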

  43. Generating new sentences • ./adios.exe -a generate -i proj.idx -n 100 -R 10 -o proj_name • -i – the index file • -n – number of sentences to generate • -R – maximum parse depth • -o – project name

  44. The generator’s output • The output file is proj.generate.txt • Will contain the new sentences in the ADIOS format • Some sentences may be ‘incomplete’ because of the –R option • In these, a ~ symbol will appear • Before using the generated sentences, these should be removed • Use the ‘sed’ command as explained on the webpage
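The sed command from the webpage is not reproduced in these slides. As a stand-in, the filtering step can be sketched in Python, assuming each generated sentence occupies one line and that any line containing a ~ is incomplete; `drop_incomplete` is a hypothetical helper name.

```python
def drop_incomplete(lines):
    """Keep only generated sentences that contain no '~' marker."""
    return [line for line in lines if "~" not in line]
```

For example, `drop_incomplete(open(generated_file))` keeps only the complete sentences, where `generated_file` is the path to the generator's output.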

  45. Scrambling sentences • Before creating the graph, the sentences in the input corpus can be scrambled using scrambler.exe. • Usage – ./scrambler.exe -f input_file -o output_file
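What scrambling amounts to can be shown with a short sketch. This is not scrambler.exe itself, only an assumed equivalent that shuffles corpus lines; the `seed` parameter is an addition of this example, useful when each learner in a forest should get its own reproducible ordering.

```python
import random

def scramble(lines, seed=None):
    """Return the corpus lines in a random order, leaving the input untouched."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Calling `scramble(corpus_lines, seed=k)` for k = 0, 1, 2, … yields one ordering per learner while keeping every run repeatable.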

  46. Using an artificial CFG • An artificial CFG in a proper format can be converted to an ADIOS representation • For testing precision/recall • Using convert_grammar.exe • The CFG should be stored in two files • CFG_lex.txt – a lexicon file • E.g. TA1_lex.txt • CFG_grammar.txt – the rewrite rules • E.g. TA1_grammar.txt

  47. Convert grammar • Usage – ./convert_grammar.exe -l lex_file -g grammar_file -o proj_name • Output files – • proj_name.idx – index file • graph.dat – the graph • patterns.dat – the patterns

  48. Displaying patterns • First print the ADIOS learner’s results • ./adios.exe -a print -i proj.idx • Open Matlab and set its workspace to the ADIOS directory • Use the pattern.m script • pattern(123, 'proj_name') will graphically display the pattern/EC whose ID is 123 from the project named proj_name
