Parsing Techniques A Practical Guide by Dick Grune and Ceriel J. H. Jacobs

  1. Parsing Techniques: A Practical Guide by Dick Grune and Ceriel J. H. Jacobs

  2. Book in slide format • Fantastic book: Parsing Techniques • I went through chapters 1 and 2 of the book and created slides of them. • That is, the following slides are chapters 1 and 2, in slide form. • Additionally, there are several slides from: • Personal correspondence with one of the authors, Dick Grune. • Material from other sources. • Slides that I created, applying the concepts to XML. Roger L. Costello June 1, 2014

  3. Why grammars, not automata? • There is a close relationship between formal grammars and other abstract notions used in computer science, such as automata and algorithms. • Indeed, since the results in one theory can often be translated into another, it seems to be an arbitrary decision as to which interpretation is primary. • In these slides formal grammars are given preferential treatment because they are probably the most commonly known of the various theories among computer scientists. This is due to the success of context-free grammars in describing the syntax of programming languages.

  4. Chapter 1: Defining Parsing and Grammars

  5. Parsing • Parsing is the process of structuring a linear representation in accordance with a given grammar. • This definition has been kept abstract on purpose to allow as wide an interpretation as possible. • The “linear representation” may be: • a sentence • a computer program • a knitting pattern • a sequence of geological strata • a piece of music • actions of ritual behavior In short, any linear sequence in which the preceding elements in some way restrict the next element. • For some of the examples the grammar is well known, for some it is an object of research, and for some our notion of a grammar is only just beginning to take shape.

  6. Parsing [diagram: a grammar and a linear representation are fed into a parser, which produces a structure] • Parsing is the process of structuring a linear representation in accordance with a given grammar. A “linear representation” is any linear sequence in which the preceding elements in some way restrict the next element.

  7. Grammar: a succinct summary • For each grammar, there are generally an infinite number of linear representations (“sentences”) that can be structured with it. • That is, a finite-sized grammar can supply structure to an infinite number of sentences. • This is the main strength of the grammar paradigm and indeed the main source of the importance of grammars: they summarize succinctly the structure of an infinite number of objects of a certain class.

  8. Reasons for parsing There are several reasons to perform this structuring process called parsing. • One reason derives from the fact that the obtained structure helps us to process the object further. When we know that a certain segment of a sentence is the subject, that information helps in understanding or translating the sentence. Once the structure of a document has been brought to the surface, it can be processed more easily. • A second reason is related to the fact that the grammar in a sense represents our understanding of the observed sentences: the better a grammar we can give for the movement of bees, the deeper our understanding of them. • A third lies in the completion of missing information that parsers, and especially error-repairing parsers, can provide. Given a reasonable grammar of the language, an error-repairing parser can suggest possible word classes for missing or unknown words on clay tablets.

  9. Grammatical inference • Grammatical inference: Given a (large) set of sentences, find the/a grammar which produces them. • Grammatical inference is also known as grammar induction or syntactic pattern recognition.

  10. XML Schema from an XML instance The XML tool oXygen XML does grammatical inference when it creates an XML Schema from an XML instance document.

  11. The science of parsing • Parsing is no longer an arcane art. • In the 1970s Aho, Ullman, Knuth, and many others put parsing techniques solidly on their theoretical feet.

  12. Mathematician vs. Computer Scientist • To a mathematician all structures are static. They have always existed and will always exist. The only time-dependence is that we have not discovered all the structures yet. • Example: the Peano axioms create the integers without reference to time. • The computer scientist is concerned with (and fascinated by) the continuous creation, combination, separation, and destruction of structures. Time is of the essence. • Example: if the computer scientist uses the Peano axioms to implement integer addition, he finds they describe a very slow process, which is why he will look for a more efficient approach.
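
A minimal sketch of the speed point (my own illustration, not from the book): if addition is defined purely through the Peano successor operation, adding b takes b successor steps, while the machine's built-in + is effectively a single operation. The function names below are assumptions made for this example.

```python
def succ(n):
    # The successor function: the only arithmetic primitive the Peano axioms provide.
    return n + 1

def peano_add(a, b):
    # Addition by the Peano recursion:  a + 0 = a,  a + succ(b) = succ(a + b).
    # Written iteratively; it still needs b successor steps, so it is O(b),
    # whereas the hardware '+' below takes (effectively) constant time.
    while b > 0:
        a = succ(a)
        b -= 1
    return a

print(peano_add(3, 100000))  # 100003, after 100000 successor steps
print(3 + 100000)            # 100003, the "more efficient approach"
```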

  13. Many uses for parsing Parsing is for anyone who has parsing to do: • The compiler writer • The linguist • The database interface writer • The geologist who wants to test grammatical descriptions of a sequence of geological strata • The musicologist who wants to test grammatical descriptions of a music piece

  14. Requirements for a parser developer Developing a parser requires a good ability to visualize, some programming experience, and the willingness and patience to follow non-trivial examples.

  15. Chapter 2: Grammars as a Generating Device

  16. Need to define some terms • In computer science as in everyday parlance, a grammar serves to describe a language. • To establish our terminology and to demarcate the universe of discourse, we shall examine these terms: • Language • Grammar • Language Descriptions

  17. Language We examine three views of the word “language”: • How the larger part of mankind views language • How the computer scientist views language • How the formal-linguist views language

  18. Layman’s view of languages • To the larger part of mankind, language is first and foremost a means of communication. • Communication is brought about by sending messages, through air vibrations or through written symbols. • Languages have three levels of composition: • Messages fall apart into sentences, • which are composed of words, • which in turn consist of symbol sequences when written.

  19. Computer scientist view of languages • A language has sentences, and these sentences possess structure. • Information may possibly be derived from the sentence’s structure; that information is called the meaning of the sentence. • Sentences consist of words called tokens, each possibly carrying a piece of information, which is its contribution to the meaning of the whole sentence.

  20. Computer scientist view of languages • A language is a probably infinitely large set of sentences, each composed of tokens in such a way that it has structure. • The tokens and structure cooperate to describe the semantics (meaning) of the sentence. • To a computer scientist, 3+4×5 is a sentence in the language of “arithmetics on single digits”. Its structure can be shown by inserting parentheses, (3+(4×5)), and its semantics is 23.
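
A small sketch to make the structure/semantics distinction concrete (my own illustration, assuming the usual convention that × binds tighter than +): a tiny evaluator for single-digit arithmetic that both parenthesizes the structure and computes the meaning; for 3+4×5 (written here with 'x' for ×) it prints (3+(4x5)) = 23.

```python
# Single-digit arithmetic with '+' and 'x' (for ×); 'x' binds tighter than '+'.
def parse_expr(s, i=0):
    # expr -> term ('+' term)*   returns (structure, value, next position)
    struct, val, i = parse_term(s, i)
    while i < len(s) and s[i] == "+":
        rstruct, rval, i = parse_term(s, i + 1)
        struct, val = f"({struct}+{rstruct})", val + rval
    return struct, val, i

def parse_term(s, i):
    # term -> digit ('x' digit)*
    struct, val, i = s[i], int(s[i]), i + 1
    while i < len(s) and s[i] == "x":
        struct, val, i = f"({struct}x{s[i+1]})", val * int(s[i + 1]), i + 2
    return struct, val, i

structure, meaning, _ = parse_expr("3+4x5")
print(structure, "=", meaning)   # (3+(4x5)) = 23
```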

  21. Formal-linguist view of languages • A language is a “set” of sentences, and each sentence is a “sequence” of “symbols”. • There is no meaning, no structure. Either a sentence belongs to the language or it does not. • The only property of a symbol is that it has an identity. • In any language there are a certain number of different symbols – the alphabet – and that number must be finite. Just for convenience we write these symbols as a, b, c, …, but ◊,▪,ⱴ, … would do equally well, as long as there are enough symbols.

  22. Formal-linguist view of languages • The word “sequence” means that the symbols in each sentence are in a fixed order and we should not shuffle them. • The word “set” means an unordered collection with all the duplicates removed. A set can be written down by writing the objects in it, surrounded by curly braces. • All this means is that to a formal-linguist the following is a language: {a, b, ab, ba} • The formal-linguist also calls a sentence a “word” and he says that “the word ab is in the language {a, b, ab, ba}”

  23. Formal-linguist vs. computer scientist • The formal-linguist holds his views of language because he wants to study the fundamental properties of languages in their naked beauty. It gives him a grip on a seemingly chaotic and perhaps infinitely complex object: natural language. • The computer scientist holds his view of language because he wants a clear, well-understood, and unambiguous means of describing objects in the computer and of communication with the computer (a most exacting communication partner).

  24. Grammars We examine three views of the word “grammar”: • How the larger part of mankind views grammar • How the formal-linguist views grammar • How the computer scientist views grammar

  25. Layman’s view of grammars A grammar is a book of rules and examples which describes and teaches the language.

  26. Formal-linguist’s view of grammars • A generative grammar is an exact, finite-size, recipe for constructing the sentences in the language. • This means that, following the recipe, it must be possible to construct each sentence of the language (in a finite number of actions) and no others. • This does not mean that, given a sentence, the recipe tells us how to construct that particular sentence, only that it is possible to do so.

  27. Computer scientist’s view of grammars The computer scientist has the same view as the formal-linguist, with the additional requirement that the recipe should imply how a sentence can be constructed.

  28. Infinite sets from finite descriptions A language is a possibly infinite set of sequences of symbols and a grammar is a finite recipe to generate those sentences.
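
A tiny illustration of such a finite recipe (my own sketch, not an example from the book): the one-rule grammar S → aSb | ε is finite, yet it generates the infinite language {ε, ab, aabb, aaabbb, …}, and following the recipe mechanically produces any of its sentences.

```python
def derive(n):
    # Follow the recipe of the grammar  S -> a S b | epsilon :
    # apply the first alternative n times, then erase S with the second.
    sentential_form = "S"
    for _ in range(n):
        sentential_form = sentential_form.replace("S", "aSb")
    return sentential_form.replace("S", "")

print([derive(n) for n in range(4)])   # ['', 'ab', 'aabb', 'aaabbb']
```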

  29. Example of an infinite set from a finite description The phrase “the set of all positive integers” is a very finite-size description of a definitely infinite-size set.

  30. Not all languages are describable • Can all languages be described by finite descriptions? • Answer: No.

  31. Outline of the proof • The proof that not all languages can be described by finite descriptions is not trivial. But it is very interesting and famous. We will present an outline of it. • The proof is based on two observations and a trick.

  32. Enumerate language descriptions The language descriptions can be listed. This is done as follows: • Take all descriptions of size one, that is, those of only one letter long, and sort them alphabetically. • Depending on what, exactly, we accept as a description, there may be zero descriptions of size one, or 27 (all letters + space), or 95 (all printable ASCII characters), or something similar. • Take all descriptions of size two, sort them alphabetically. Do the same for lengths 3, 4, and further. This is observation number one.

  33. Each description has a well-defined position • Now we have a list of descriptions. Each describes a language. • So each description has a position on the list. • Example: our description the set of all positive integers is 32 characters long. To find its position on the list, we have to calculate how many descriptions there are with less than 32 characters, say L. We then have to generate all descriptions of size 32, sort them and determine the position of our description in it, say P, and add the two numbers L and P. This will, of course, give a huge number but it does ensure that the description is on the list in a well-defined position. This is observation number two.
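
A rough sketch of how large L is (my own calculation, assuming for concreteness the 27-symbol alphabet of letters plus space mentioned on the previous slide): L is simply the number of descriptions of lengths 1 through 31, and P is then the alphabetical rank of our description among those of length 32.

```python
# Assuming a 27-symbol description alphabet (26 letters + space), one of the
# options from slide 32. L = number of descriptions shorter than 32 characters.
L = sum(27 ** k for k in range(1, 32))
print(L)   # a huge number (more than 40 digits), but perfectly well defined
```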

  34. Our example description is at position L + P [diagram: the descriptions of size 1, size 2, size 3, …, size 31 together make up the first L entries of the list; then come the descriptions of size 32, and “the set of all positive integers” appears P entries into that group]

  35. Two things to note • Note #1: Just listing all descriptions alphabetically, without reference to their lengths, would not do. There are already infinitely many descriptions starting with an “a”, so no description starting with a higher letter could get a number on the list. • Note #2: there is no need to actually do all this. It is just a thought experiment that allows us to examine and draw conclusions about the behavior of a system in a situation which we cannot possibly examine physically.

  36. Both nonsensical and meaningful descriptions There will be many nonsensical descriptions on the list. This is immaterial to the argument. The important thing is that all meaningful descriptions are on the list, and the strategy ensures that.

  37. Alphabet • The words (sentences) in a language are composed of a finite set of symbols. • This set of symbols is called the alphabet. • We will assume the symbols in the alphabet are ordered. • Then the words in the language can be ordered too. • We shall indicate the alphabet by Σ.

  38. Language that consists of all possible words • The language that consists of all possible words that can be built from an alphabet is called Σ*. • For the alphabet Σ = {a, b} we get the language { , a, b, aa, ab, ba, bb, aaa, …}. • The first element is the empty word (the word consisting of zero as and zero bs). It may be easily overlooked, so we shall write it as ε (epsilon), regardless of the alphabet. • So, Σ* = {ε, a, b, aa, ab, ba, bb, aaa, …}

  39. Words in Σ* can be enumerated • Since the symbols in the alphabet Σ are ordered, we can list the words in the language Σ*, using the same technique as in the previous slides: • First, list all words of size zero, sorted; then list all words of size one, sorted; and so on. • This is actually the order already used in our set notation for Σ*.

  40. Compare language L against Σ* • Since Σ* contains all possible words, all languages using alphabet Σ are subsets of it. • Let L be a language over Σ (the word “over” means “built out of”). • We can go through the list of words in Σ* and put checkmarks on all words that are in L. • Suppose our language L is “the set of all words that contain more as than bs”. L is {a, aa, aab, aba, baa, …} • Walking the list: ε, a ✓, b, aa ✓, ab, ba, bb, aaa ✓, aab ✓, aba ✓, abb, …

  41. Encode languages using 0 and 1 • The list of blanks and checkmarks is sufficient to identify and describe a language. • For convenience we write the blank as 0 and the checkmark as 1, as if they were bits in a computer. • We can now write L = 01010001110… • So, we have attached the infinite bit-string 01010001110… to the language description “the set of all words that contain more as than bs”. • The set of all words over an alphabet is Σ* = 1111111…
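
A short sketch combining the last three slides (my own illustration): enumerate Σ* for Σ = {a, b} by length and then alphabetically, test each word for membership in L = “more as than bs”, and the resulting bits come out as 01010001110…

```python
from itertools import count, islice, product

def sigma_star(alphabet=("a", "b")):
    # Words of Sigma*, ordered by length and then alphabetically;
    # the first word yielded is '' (the empty word, epsilon).
    for length in count(0):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)

def in_L(word):
    # Membership test for L = "words with more as than bs".
    return word.count("a") > word.count("b")

bits = "".join("1" if in_L(w) else "0" for w in islice(sigma_star(), 11))
print(bits)   # 01010001110
```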

  42. Languages are infinite bit-strings • Any language can be encoded as an infinite bit-string, be it a formal language like L, a programming language like Java, or a natural language like English. • For the English language the 1s in the bit-string will be very scarce, since hardly any arbitrary sequence of letters is a good English sentence.

  43. List of languages • We attached the infinite bit-string 01010001110… to the language description “the set of all words that contain more as than bs”. • In the same way, we can attach bit-strings to all descriptions. • Some descriptions may not yield a language, in which case we can attach an arbitrary infinite bit-string to it. • Since all descriptions can be put on a single numbered list, we get, for example, this table:
      Description #1: 000000100…
      Description #2: 110010001…
      Description #3: 011011010…
      Description #4: 110011010…
      Description #5: 100000011…
      Description #6: 111011011…
      …

  44. The list is incomplete • Many languages exist that are not on the list of languages above. • The above list is far from complete, although the list of descriptions is complete. • We shall prove this by using the diagonalization process (“Diagonalverfahren”) of Cantor.

  45. Flip the bits along the diagonal • Consider the language C = 100110…, which has the property that its n-th bit is unequal to the n-th bit of the language described by Description #n. • The first bit of C is 1 because the first bit of Description #1 is 0. The second bit of C is 0 because the second bit of Description #2 is 1. And so on.

  46. Create a language So C is created by walking the top-left to bottom-right diagonal of the language table above and copying the opposites of the bits we meet: C = 100110…
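
A sketch of the diagonal walk (my own, using the six bit-string prefixes from the table on slide 43): flipping the n-th bit of the n-th row reproduces C = 100110…

```python
# Prefixes of the six language bit-strings listed on slide 43.
table = [
    "000000100",
    "110010001",
    "011011010",
    "110011010",
    "100000011",
    "111011011",
]

# Cantor's diagonal: bit n of C is the flip of bit n of Description #n.
C = "".join("1" if row[n] == "0" else "0" for n, row in enumerate(table))
print(C)   # 100110  -- differs from every listed language in at least one bit
```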

  47. It’s a new language! • The language C cannot be on the list! • C cannot equal line 1 since its first bit differs from that line. • C cannot equal line 2 since its second bit differs from that line. • And so forth. • So, C cannot be on the list. C = 100110…

  48. Infinite number of new languages • So in spite of the fact that we exhaustively listed all possible finite descriptions, we have created a language that has no description on the list. • There are many more languages not on the list: • Construct, for example, the language whose n+5-th bit differs from the n+5-th bit in Description #n. Again, it cannot be on the list since for each Description #n the new language differs in the n+5-th bit. That means that bits 1…5 play no role, and can be chosen arbitrarily; this yields another 2^5 = 32 languages that are not on the list. C+5 = xxxxx1101…

  49. Even more new languages And there are many more languages not on the list: • Construct, for example, the language whose 2n-th bit differs from the 2n-th bit in Description #n. Again, it cannot be on the list since for each Description #n the new language differs in the 2n-th bit. That means that the odd bits play no role and can be chosen freely. 2C = x1x1x0x0…

  50. Infinitely many languages cannot be described • We can create an infinite number of languages, none of which allows a finite description. • For every language that can be described there are infinitely many that cannot.
