
CSCI 3130: Automata theory and formal languages

Fall 2010. The Chinese University of Hong Kong. CSCI 3130: Automata theory and formal languages. Ambiguity; parsing algorithm for CFGs. Andrej Bogdanov, http://www.cse.cuhk.edu.hk/~andrejb/csc3130. A grammar is ambiguous if some string has more than one parse tree.


Presentation Transcript


  1. CSCI 3130: Automata theory and formal languages
     Ambiguity; parsing algorithm for CFGs
     Andrej Bogdanov, The Chinese University of Hong Kong, Fall 2010
     http://www.cse.cuhk.edu.hk/~andrejb/csc3130

  2. Ambiguity
     • A grammar is ambiguous if some string has more than one parse tree
       E → E + E | E * E | (E) | N
       N → 1N | 2N | 1 | 2
     • Example: 1+2*2 has two parse trees. One groups the string as 1 + (2 * 2) = 5, the other as (1 + 2) * 2 = 6.
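The two readings of 1+2*2 can be checked mechanically. The following is a minimal sketch, assuming the input is already split into alternating number and operator tokens with no parentheses; the function name `parse_values` is illustrative and not from the slides. It returns the value of every parse tree under E → E + E | E * E | N.

```python
def parse_values(tokens):
    """Return the value of every parse tree of `tokens` under
    E -> E + E | E * E | N (assumption: no parentheses, tokens alternate
    between numbers and operators)."""
    if len(tokens) == 1:
        return [int(tokens[0])]
    values = []
    # Every operator position can serve as the root of a parse tree.
    for i in range(1, len(tokens), 2):
        op = tokens[i]
        for left in parse_values(tokens[:i]):
            for right in parse_values(tokens[i + 1:]):
                values.append(left + right if op == "+" else left * right)
    return values


print(parse_values(["1", "+", "2", "*", "2"]))  # one value per parse tree
```

A string is unambiguous under this grammar exactly when the list has one entry, so a single number yields one value while 1+2*2 yields two.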

  3. Disambiguation
     • Sometimes we can rewrite the grammar to remove the ambiguity
       E → E + E | E * E | (E) | N
       N → 1N | 2N | 1 | 2
     • In this grammar, + and * have the same precedence!
     • Divide the expression into terms and factors
     [Figure: the expression 2 * (1 + 2 * 2) with its terms (T) and factors (F) marked]

  4. Disambiguation
     Original grammar:
       E → E + E | E * E | (E) | N
       N → 1N | 2N | 1 | 2
     • An expression is a sum of one or more terms: E → T | E + T
     • Each term is a product of one or more factors: T → F | T * F
     • Each factor is a parenthesized expression or a number: F → (E) | 1 | 2
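To see how the term/factor grammar fixes precedence, here is a minimal sketch of an evaluator for it, assuming the single-digit numbers 1 and 2 from F → (E) | 1 | 2. The name `evaluate` and its helpers are illustrative. The left-recursive rules E → E + T and T → T * F are implemented as loops, which is a common substitution and not something the slides prescribe.

```python
def evaluate(text):
    """Evaluate a string over E -> T | E + T, T -> F | T * F, F -> (E) | 1 | 2."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_expr():
        # E -> T | E + T: a sum of one or more terms.
        nonlocal pos
        value = parse_term()
        while peek() == "+":
            pos += 1
            value += parse_term()
        return value

    def parse_term():
        # T -> F | T * F: a product of one or more factors.
        nonlocal pos
        value = parse_factor()
        while peek() == "*":
            pos += 1
            value *= parse_factor()
        return value

    def parse_factor():
        # F -> (E) | 1 | 2
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1
            value = parse_expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1
            return value
        if tok in ("1", "2"):
            pos += 1
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result


print(evaluate("1+2*2"))
print(evaluate("2 * (1 + 1 + 2 * 2) + 1"))
```

Because a term is fully parsed before the next + is considered, multiplication always binds tighter than addition, so each input has only one reading.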

  5. Parsing example
       E → T | E + T
       T → F | T * F
       F → (E) | 1 | 2
     [Figure: parse tree for 2 * (1 + 1 + 2 * 2) + 1]

  6. Disambiguation
     • Disambiguation is not always possible
     • There exist inherently ambiguous languages
     • There is no general procedure for disambiguation
     • In programming languages, ambiguity comes from precedence rules, and we can resolve it as in the example
     • In English, ambiguity is sometimes a problem: "He ate the cookies on the floor"

  7. Parsing
     • Do we have a method for building a parse tree?
     • Can we tell if the parse tree is unique?
       S → 0S1 | 1S0S1 | T
       T → S | ε
       input: 00111

  8. First attempt
     • Maybe we can try all possible derivations:
       S → 0S1 | 1S0S1 | T
       T → S | ε
       x = 00111
       S ⇒ 0S1 ⇒ 00S11, 01S0S11, 0T1
       S ⇒ 1S0S1 ⇒ 10S10S1, ...
       S ⇒ T ⇒ S, ε
     • When do we stop?

  9. Problems
     • How do we know when to stop?
       S → 0S1 | 1S0S1 | T
       T → S | ε
       x = 00111
       S ⇒ 0S1 ⇒ 00S11, 01S0S11, 0T1
       S ⇒ 1S0S1 ⇒ 10S10S1, ...

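  10. Problems
      • Idea: Stop a derivation when its length exceeds |x|
      • Not right because of ε-productions: a sentential form can grow longer than x and then shrink again
        S → 0S1 | 1S0S1 | T
        T → S | ε
        x = 01011
        S ⇒ 0S1 ⇒ 01S0S11 ⇒ 01S011 ⇒ 01011
        lengths: 1, 3, 7, 6, 5
        (each of the last two steps erases an S via S ⇒ T ⇒ ε)
      • We want to eliminate ε-productions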

  11. Problems
      • Loops among the variables (S → T → S) might make us go forever
      • We want to eliminate such loops
        S → 0S1 | 1S0S1 | T
        T → S | ε
        x = 00111

  12. Removal of ε-productions
      • A variable N is nullable if there is a derivation N ⇒* ε
      • How to remove ε-productions:
        1. Find all nullable variables N
        2. For every production of the form A → αNβ, add another production A → αβ
        3. If N → ε is a production, remove it
        4. If S is nullable, add the special production S → ε

  13. Example
      • Find all nullable variables
        grammar:
          S → ACD
          A → a
          B → ε
          C → ED | ε
          D → BC | b
          E → b
        nullable variables: B, C, D

  14. Finding nullable variables
      • To find nullable variables, we work backwards
      • First, mark all variables A such that A → ε as nullable
      • Then, as long as there are productions of the form A → A1…Ak where all of A1, …, Ak are marked as nullable, mark A as nullable
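The marking procedure is a fixed-point iteration. A minimal sketch follows, assuming a grammar is represented as a dictionary from each variable to a list of right-hand sides, each a tuple of symbols, with the empty tuple standing for ε. This representation and the name `nullable_variables` are illustrative choices, not part of the slides.

```python
def nullable_variables(grammar):
    """Return the set of nullable variables.

    Assumption: `grammar` maps each variable to a list of bodies, where a
    body is a tuple of symbols and the empty tuple () means epsilon."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in grammar.items():
            if var in nullable:
                continue
            for body in bodies:
                # An empty body is trivially all-nullable, which covers A -> epsilon.
                if all(sym in nullable for sym in body):
                    nullable.add(var)
                    changed = True
                    break
    return nullable


# Grammar from the example slide.
GRAMMAR = {
    "S": [("A", "C", "D")],
    "A": [("a",)],
    "B": [()],
    "C": [("E", "D"), ()],
    "D": [("B", "C"), ("b",)],
    "E": [("b",)],
}

print(sorted(nullable_variables(GRAMMAR)))
```

Terminals are never added to the set, so any body containing a terminal can never mark its variable as nullable.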

  15. Eliminating ε-productions
      grammar:
        S → ACD
        A → a
        B → ε
        C → ED | ε
        D → BC | b
        E → b
      nullable variables: B, C, D
      • For every production of the form A → αNβ, add another production A → αβ
        added productions: S → AD, S → AC, S → A, C → E, D → C, D → B, D → ε
      • If N → ε is a production, remove it
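The two steps can be sketched in code, assuming the same dictionary-of-tuples grammar representation as an illustration (empty tuple for ε) and a precomputed set of nullable variables. The function name `eliminate_epsilon` and its `start` parameter are hypothetical.

```python
from itertools import product


def eliminate_epsilon(grammar, nullable, start="S"):
    """Return a grammar without epsilon-productions (except start -> epsilon).

    Assumption: `grammar` maps variables to lists of tuples of symbols."""
    new_grammar = {}
    for var, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            # Each nullable symbol may be kept (True) or dropped (False).
            options = [(True, False) if sym in nullable else (True,) for sym in body]
            for keep in product(*options):
                new_body = tuple(s for s, k in zip(body, keep) if k)
                if new_body:  # never emit an epsilon-production here
                    new_bodies.add(new_body)
        new_grammar[var] = new_bodies
    if start in nullable:
        new_grammar[start].add(())  # the special production S -> epsilon
    return new_grammar


EXAMPLE = {
    "S": [("A", "C", "D")],
    "A": [("a",)],
    "B": [()],
    "C": [("E", "D"), ()],
    "D": [("B", "C"), ("b",)],
    "E": [("b",)],
}
RESULT = eliminate_epsilon(EXAMPLE, nullable={"B", "C", "D"})
for var, bodies in RESULT.items():
    print(var, sorted(bodies))
```

In this example B ends up with no productions at all, since its only production was B → ε.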

  16. Dealing with loops
      • A unit production is a production of the form A1 → A2 where A1 and A2 are both variables
      • Example
        grammar:
          S → 0S1 | 1S0S1 | T
          T → S | R | ε
          R → 0SR
        unit productions: S → T, T → S, T → R

  17. Removal of unit productions
      • If there is a cycle of unit productions A1 → A2 → ... → Ak → A1, delete the cycle and replace every variable in it with A1
      • Example
        before:
          S → 0S1 | 1S0S1 | T
          T → S | R | ε
          R → 0SR
        after:
          S → 0S1 | 1S0S1
          S → R | ε
          R → 0SR
        T is replaced by S in the {S, T} cycle

  18. Removal of unit productions
      • For other unit productions, replace every chain A1 → A2 → ... → Ak → α by productions A1 → α, ..., Ak → α
      • Example
        before:
          S → 0S1 | 1S0S1 | R | ε
          R → 0SR
        after:
          S → 0S1 | 1S0S1 | 0SR | ε
          R → 0SR
        S → R → 0SR is replaced by S → 0SR, R → 0SR
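A minimal sketch of unit-production removal follows. It uses a slightly different but common formulation from the one on the slides: instead of collapsing cycles first, it computes for each variable the set of variables reachable through unit productions and copies over their non-unit productions. The grammar representation (dictionary of tuples, empty tuple for ε) and the name `eliminate_unit` are assumptions for illustration.

```python
def eliminate_unit(grammar):
    """Return an equivalent grammar with no unit productions A -> B.

    Assumption: `grammar` maps variables to collections of tuples of symbols."""
    variables = set(grammar)

    def is_unit(body):
        return len(body) == 1 and body[0] in variables

    def unit_closure(var):
        # All variables reachable from `var` using unit productions only.
        reach, stack = {var}, [var]
        while stack:
            cur = stack.pop()
            for body in grammar[cur]:
                if is_unit(body) and body[0] not in reach:
                    reach.add(body[0])
                    stack.append(body[0])
        return reach

    return {
        var: {
            body
            for target in unit_closure(var)
            for body in grammar[target]
            if not is_unit(body)
        }
        for var in grammar
    }


# Grammar from the example: S -> 0S1 | 1S0S1 | R | epsilon, R -> 0SR
BEFORE = {
    "S": [("0", "S", "1"), ("1", "S", "0", "S", "1"), ("R",), ()],
    "R": [("0", "S", "R")],
}
AFTER = eliminate_unit(BEFORE)
for var, bodies in AFTER.items():
    print(var, sorted(bodies))
```

Because reachability is computed with a visited set, cycles such as S → T → S terminate without special handling, although the variables in a cycle are kept as separate names instead of being merged.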

  19. Recap
      • After eliminating ε-productions and unit productions, we know that every derivation S ⇒* a1…ak, where a1, …, ak are terminals, doesn't shrink in length and doesn't go into cycles
      • Exception: S → ε
        We will not use this rule at all, except to check if ε ∈ L
      • Note: ε-productions must be eliminated before unit productions

  20. Example: testing membership
      original grammar:
        S → 0S1 | 1S0S1 | T
        T → S | ε
      after eliminating unit and ε-productions:
        S → ε | 01 | 101 | 0S1 | 10S1 | 1S01 | 1S0S1
      x = 00111
      derivations from S:
        S ⇒ 01, 101
        S ⇒ 0S1 ⇒ 0011, 01011, 00S11 (only strings of length ≥ 6), other strings of length ≥ 6
        S ⇒ 10S1 ⇒ 10011, strings of length ≥ 6
        S ⇒ 1S01 ⇒ 10101, strings of length ≥ 6
        S ⇒ 1S0S1 ⇒ only strings of length ≥ 6

  21. Algorithm 1 for testing membership
      • How to check if a string x ≠ ε is in L(G):
        Eliminate all ε-productions and unit productions
        Let X := S
        While some new rule R can be applied to X
          Apply R to X
          If X = x, you have found a derivation for x
          If |X| > |x|, backtrack
        If no more rules can be applied to X, x is not in L(G)
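The backtracking search can be sketched as follows, assuming the grammar has already been cleaned of ε-productions and unit productions, that variables are single uppercase characters, and that right-hand sides are plain strings. The name `derives`, the leftmost-expansion order and the `seen` set are illustrative choices that the slide does not specify.

```python
def derives(productions, start, x):
    """Exhaustively search for a derivation of x (x must be non-empty).

    Assumption: `productions` maps each single-character variable to a list
    of strings, and contains no epsilon- or unit productions."""
    seen = set()
    stack = [start]
    while stack:
        form = stack.pop()
        if form == x:
            return True
        if form in seen:
            continue
        seen.add(form)
        # Expand the leftmost variable, if any.
        idx = next((i for i, c in enumerate(form) if c in productions), None)
        if idx is None:
            continue  # a terminal string different from x
        for body in productions[form[idx]]:
            new_form = form[:idx] + body + form[idx + 1:]
            if len(new_form) <= len(x):  # otherwise backtrack
                stack.append(new_form)
    return False


# Cleaned grammar from the membership example (S -> epsilon left out, x is non-empty).
RULES = {"S": ["01", "101", "0S1", "10S1", "1S01", "1S0S1"]}

print(derives(RULES, "S", "00111"))
print(derives(RULES, "S", "01011"))
```

The search terminates because sentential forms never shrink, so only finitely many forms of length at most |x| can ever be pushed.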

  22. Practical limitations of Algorithm 1
      • This method can be very slow if x is long
        G = CFG of the Java programming language
        x = code for a 200-line Java program
        the algorithm might take about 10^200 steps!
      • There is a faster algorithm, but it requires that we do some more transformations on the grammar

  23. Chomsky Normal Form
      • A CFG is in Chomsky Normal Form if every production (except S → ε) is of the form A → BC or A → a
      • Convert to Chomsky Normal Form:
        start with: A → BcDE
        replace terminals with new variables: A → BCDE, C → c
        break up sequences with new variables: A → BX1, X1 → CX2, X2 → DE
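The two conversion steps can be sketched in code, assuming a grammar with no ε-productions or unit productions, given as a dictionary from variables to lists of tuples of symbols. The fresh variable names `T_c` and `X1`, `X2`, ... are hypothetical and assumed not to clash with existing variables; the slide calls the new variable for the terminal c simply C.

```python
def to_cnf(grammar):
    """Convert to Chomsky Normal Form.

    Assumption: `grammar` maps variables to lists of tuples of symbols and
    has no epsilon- or unit productions (other than possibly S -> epsilon)."""
    variables = set(grammar)
    result = {var: [] for var in grammar}
    counter = 0
    for var, bodies in grammar.items():
        for body in bodies:
            if len(body) < 2:
                result[var].append(body)  # A -> a (or S -> epsilon) stays as is
                continue
            # Step 1: replace each terminal c by a new variable T_c with T_c -> c.
            symbols = []
            for sym in body:
                if sym in variables:
                    symbols.append(sym)
                else:
                    name = f"T_{sym}"
                    result[name] = [(sym,)]
                    symbols.append(name)
            # Step 2: break up sequences longer than two with new variables.
            head = var
            while len(symbols) > 2:
                counter += 1
                fresh = f"X{counter}"
                result.setdefault(head, []).append((symbols[0], fresh))
                head, symbols = fresh, symbols[1:]
            result.setdefault(head, []).append(tuple(symbols))
    return result


# The production A -> BcDE from the slide, plus placeholder rules for B, D, E.
CNF = to_cnf({
    "A": [("B", "c", "D", "E")],
    "B": [("b",)],
    "D": [("d",)],
    "E": [("e",)],
})
for var, bodies in CNF.items():
    print(var, bodies)
```

The rules for B, D and E are placeholders added only so the example grammar is complete.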

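  24. Algorithm 2 for testing membership
      grammar:
        S → AB | BC
        A → BA | a
        B → CC | b
        C → AB | a
      x = baaba
      • Idea: We generate each substring of x bottom up
      table (row r from the bottom holds the substrings of length r; column i holds the substrings starting at the i-th symbol; ∅ marks an empty cell):
        {S,A,C}
        ∅        {S,A,C}
        ∅        {B}      {B}
        {S,A}    {B}      {S,C}    {S,A}
        {B}      {A,C}    {A,C}    {B}      {A,C}
        b        a        a        b        a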

  25. Parse tree reconstruction
      grammar:
        S → AB | BC
        A → BA | a
        B → CC | b
        C → AB | a
      x = baaba
      table (∅ marks an empty cell):
        {S,A,C}
        ∅        {S,A,C}
        ∅        {B}      {B}
        {S,A}    {B}      {S,C}    {S,A}
        {B}      {A,C}    {A,C}    {B}      {A,C}
        b        a        a        b        a
      • Tracing back the derivations, we obtain the parse tree
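Tracing back can be sketched by storing, for each variable in each cell, one way it was produced. This is a minimal illustration, assuming a CNF grammar given as two lists of tuples and 0-based positions; the names `cyk_parse` and `leaves` and the nested-tuple tree format are hypothetical. When several parse trees exist, only the first one found is returned.

```python
def cyk_parse(unit_rules, pair_rules, x, start="S"):
    """Return one parse tree of x as nested tuples, or None if x is not derivable.

    Assumption: unit_rules holds (A, a) for A -> a, pair_rules holds
    (A, B, C) for A -> BC."""
    k = len(x)
    if k == 0:
        return None
    back = {}  # back[(s, t)][A] records how A derives x[s..t]
    for i in range(k):
        back[(i, i)] = {A: x[i] for A, a in unit_rules if a == x[i]}
    for length in range(2, k + 1):
        for s in range(k - length + 1):
            t = s + length - 1
            cell = {}
            for j in range(s, t):  # split into x[s..j] and x[j+1..t]
                for A, B, C in pair_rules:
                    if A not in cell and B in back[(s, j)] and C in back[(j + 1, t)]:
                        cell[A] = (j, B, C)
            back[(s, t)] = cell

    def build(A, s, t):
        entry = back[(s, t)][A]
        if s == t:
            return (A, entry)  # leaf: variable and its terminal
        j, B, C = entry
        return (A, build(B, s, j), build(C, j + 1, t))

    if start not in back[(0, k - 1)]:
        return None
    return build(start, 0, k - 1)


def leaves(tree):
    """Read the terminals of a tree from left to right."""
    if len(tree) == 2:
        return [tree[1]]
    return leaves(tree[1]) + leaves(tree[2])


UNIT = [("A", "a"), ("B", "b"), ("C", "a")]
PAIR = [("S", "A", "B"), ("S", "B", "C"), ("A", "B", "A"), ("B", "C", "C"), ("C", "A", "B")]
TREE = cyk_parse(UNIT, PAIR, "baaba")
print(TREE)
```

Reading the leaves of the returned tree from left to right gives back the input string.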

  26. Cocke-Younger-Kasami algorithm
      Input: Grammar G in CNF, string x = x1…xk
      table cells:
        1k
        …
        12   23   …
        11   22   …   kk
        x1   x2   …   xk
      • For cells in the last row: if there is a production A → xi, put A in table cell ii
      • For cells st in other rows: if there is a production A → BC where B is in cell sj and C is in cell (j+1)t for some s ≤ j < t, put A in cell st
      • Cell ij remembers all possible derivations of substring xi…xj
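A minimal sketch of the table-filling procedure follows, assuming the CNF grammar is given as two lists of tuples and using the same 1-based cell indices as the slide. The function name `cyk_table` and the rule representation are illustrative. The grammar and input are the ones from the baaba example.

```python
def cyk_table(unit_rules, pair_rules, x):
    """Fill the CYK table for a non-empty string x.

    Assumption: unit_rules holds (A, a) for A -> a, pair_rules holds
    (A, B, C) for A -> BC. table[(s, t)] is the set of variables that
    derive x_s ... x_t, with 1-based inclusive indices."""
    k = len(x)
    table = {}
    # Last row: cell ii gets every A with A -> x_i.
    for i in range(1, k + 1):
        table[(i, i)] = {A for A, a in unit_rules if a == x[i - 1]}
    # Other rows, from shorter substrings to longer ones.
    for length in range(2, k + 1):
        for s in range(1, k - length + 2):
            t = s + length - 1
            cell = set()
            for j in range(s, t):
                for A, B, C in pair_rules:
                    if B in table[(s, j)] and C in table[(j + 1, t)]:
                        cell.add(A)
            table[(s, t)] = cell
    return table


UNIT_RULES = [("A", "a"), ("B", "b"), ("C", "a")]
PAIR_RULES = [
    ("S", "A", "B"), ("S", "B", "C"),
    ("A", "B", "A"),
    ("B", "C", "C"),
    ("C", "A", "B"),
]
TABLE = cyk_table(UNIT_RULES, PAIR_RULES, "baaba")
print(sorted(TABLE[(1, 5)]))   # variables deriving the whole string
print("S" in TABLE[(1, 5)])    # membership test
```

The three nested loops over s, t and j give a running time cubic in the length of x, times the number of productions.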
