
LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538, Lecture 13 (10/7), Sandiway Fong. Administrivia: Homework 3 graded; Homework 4 out today, due next Tuesday by midnight. Homework 4: Article247_3400.txt. Raw data: line breaks. (ANC – American National Corpus: 100 million words)


Presentation Transcript


  1. LING/C SC/PSYC 438/538 Lecture 13 10/7 Sandiway Fong

  2. Administrivia • Homework 3 • graded • Homework 4 • out today, due next Tuesday by midnight

  3. Homework 4 • Article247_3400.txt • Raw data: line breaks • (ANC – American National Corpus: 100 million words) • Genre: journal (Slate Magazine article from 1998)

  4. Homework 4 • Write a Perl program that takes the supplied raw text and produces html-style formatted output. • Example: <s>Will There Be Life After Greenspan?</s> <p> <s>If you blinked you missed it, but for a short while yesterday morning the stock and bond markets dived after a rumor that Alan Greenspan was resigning hit the Street.</s> <s>The story was quickly ... well, it wasn't exactly refuted, since Greenspan didn't say actually say "I'm not resigning," but it was rejected as unlikely, and both markets rebounded nicely.</s> <p>

  5. Homework 4 • 2nd corpus file (also from Slate.com): Article559_10072013.txt

  6. Homework 4 • Instructions: • i.e. your program should produce reformatted raw text as • <p> • <s>sentence 1</s> • <s>sentence 2</s> • … • Use <p> as a paragraph separator. • Each <s>..</s> should occupy exactly one line of your output. • Leading and trailing spaces of a sentence should be deleted, e.g. • <s> If you blinked you missed it, … • vs. <s>If you blinked you missed it, … • Print out the total number of lines detected for each article • Print out the total number of paragraphs for each article • Submit your program and its output on Article247_3400.txt and Article559_10072013.txt
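
The output format described in the instructions can be sketched as follows. This is only a minimal illustration in Python (the assignment itself asks for Perl), and it rests on two assumptions that are not stated on the slides: that paragraphs in the raw text are separated by blank lines, and that a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase letter or opening quote. The names `format_article` and `SENTENCE_BOUNDARY` are hypothetical, and the sample text is a shortened stand-in for the article.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and then
# an uppercase letter or an opening double quote. Abbreviations such as
# "Mr. Smith" will be split incorrectly by this naive rule.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"])')

def format_article(raw):
    """Return (output_lines, sentence_count, paragraph_count)."""
    out, n_sent, n_par = [], 0, 0
    # Assumption: paragraphs are separated by one or more blank lines.
    for block in re.split(r'\n\s*\n', raw):
        text = ' '.join(block.split())  # join wrapped lines, squeeze spaces
        if not text:
            continue
        n_par += 1
        out.append('<p>')               # <p> acts as the paragraph separator
        for sent in SENTENCE_BOUNDARY.split(text):
            out.append('<s>' + sent.strip() + '</s>')  # one sentence per line
            n_sent += 1
    return out, n_sent, n_par

sample = ("Will There Be Life After Greenspan?\n\n"
          "  If you blinked you missed it. The story was quickly ... well,\n"
          "it was rejected as unlikely. Both markets rebounded nicely.\n")
lines, sentences, paragraphs = format_article(sample)
print('\n'.join(lines))
print('sentences:', sentences, 'paragraphs:', paragraphs)
```

Requiring an uppercase letter after the boundary keeps the ellipsis in "quickly ... well" from ending a sentence, but real articles will need more careful boundary rules than this.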

  7. Applications of FSA • Let’s take a look at one • Efficient String Matching

  8. String Matching • Example: • Pattern (P): aaba • Corpus (C): aaabbaabab • Naïve algorithm: • compare Pattern against Corpus from left to right a character at a time • P: aaba • C: aaabbaabab • P: _aaba • C: aaabbaabab • P: __aaba • C: aaabbaabab • P: ___aaba • C: aaabbaabab • P: ____aaba • C: aaabbaabab • P: _____aaba • C: aaabbaabab Matched!
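
The naïve algorithm traced above can be sketched in a few lines of Python. The function name `naive_search` and the comparison counter are illustrative additions, included so the cost can be compared with other algorithms.

```python
def naive_search(pattern, corpus):
    """Return (offset of first match or -1, number of character comparisons)."""
    comparisons = 0
    # Try every alignment of the pattern against the corpus, left to right.
    for offset in range(len(corpus) - len(pattern) + 1):
        matched = True
        for i, ch in enumerate(pattern):
            comparisons += 1
            if corpus[offset + i] != ch:
                matched = False
                break  # mismatch: slide the pattern one position right
        if matched:
            return offset, comparisons
    return -1, comparisons

offset, comparisons = naive_search("aaba", "aaabbaabab")
print(offset, comparisons)  # 5 15
```

The match is found at offset 5, the sixth alignment in the trace, after 15 character comparisons.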

  9. Knuth-Morris-Pratt (KMP) • can do better (i.e. use fewer comparisons) if we use an FSA to represent the Pattern • plus extra links for restarts in the event of a failure • Example • Pattern: aaba • [Figure: FSA for the pattern with states 0 to 4 and forward transitions a, a, b, a; all backpointers are restarts for the failed character, labeled ^a, ^b, ^a, ^a] • Suppose the alphabet was limited to just {a,b}, then restarts can be for the character following the failed character • [Figure: the same FSA with backpointers labeled b, a, b, b]

  10. Knuth-Morris-Pratt (KMP) • Example: • Pattern: aaba • Corpus {a,b}: aaabbaabab • KMP: [Figure: pattern FSA with states 0 to 4, forward transitions a, a, b, a, and restart links labeled b, a, b, b] • P: aaba • C: aaabbaabab • P: _aaba • C: aaabbaabab • P: _____aaba • C: aaabbaabab • Other possible algorithms: • e.g. Boyer-Moore (start match from the back of the pattern)
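
A minimal Python sketch of KMP on the same example follows. It uses the common textbook formulation with a failure table, which plays the role of the restart links, and it may differ in detail from the FSA drawn on the slide. In particular, it does not use the {a,b}-alphabet shortcut of restarting on the character after the failed one. The names `failure_table` and `kmp_search` are illustrative, and the pattern is assumed to be non-empty.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (where to restart after a mismatch)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(pattern, corpus):
    """Return (offset of first match or -1, number of character comparisons)."""
    fail = failure_table(pattern)
    comparisons = 0
    k = 0  # number of pattern characters matched so far (the FSA state)
    for pos, ch in enumerate(corpus):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1            # forward transition
                break
            if k == 0:
                break             # no shorter prefix to fall back to
            k = fail[k - 1]       # follow the restart link, retry same char
        if k == len(pattern):
            return pos - len(pattern) + 1, comparisons
    return -1, comparisons

print(failure_table("aaba"))             # [0, 1, 0, 1]
print(kmp_search("aaba", "aaabbaabab"))  # (5, 11)
```

On this corpus the sketch finds the match at offset 5 with 11 comparisons and never moves backwards in the corpus.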

  11. Regular Languages • Three formalisms, same expressive power • Regular expressions • Finite State Automata • Regular Grammars (we’ll look at this case using a logic programming language: Prolog)

  12. SWI Prolog • Install on your laptop: http://www.swi-prolog.org/download/stable • On Mac or Linux • Use terminal/shell • Executable is /opt/local/bin/swipl

  13. Chomsky Hierarchy • division of grammar into subclasses partitioned by “generative power/capacity” • Type-0 General rewrite rules • Turing-complete, powerful enough to encode any computer program • can simulate a Turing machine • anything that’s “computable” can be simulated using a Turing machine • Type-1 Context-sensitive rules • weaker, but still very powerful • a^n b^n c^n • Type-2 Context-free rules • weaker still • a^n b^n • Pushdown Automata (PDA) • Type-3 Regular grammar rules • very restricted • Regular Expressions, e.g. a+b+ • Finite State Automata (FSA) • [Figure: Turing machine with finite state machine, tape, and read/write head; artist’s conception from Wikipedia]
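
The three example languages named on this slide can be written as membership tests in Python. This only illustrates what each language contains, not the grammar or automaton of the corresponding type. The function names are hypothetical, and n >= 1 is an assumption, since the slide does not say whether n may be zero.

```python
import re

def is_a_plus_b_plus(s):
    """Type-3 example a+b+: no counting needed, a regular expression suffices."""
    return re.fullmatch(r'a+b+', s) is not None

def is_an_bn(s):
    """Type-2 example a^n b^n (assuming n >= 1): the two counts must agree."""
    n = len(s) // 2
    return n >= 1 and s == 'a' * n + 'b' * n

def is_an_bn_cn(s):
    """Type-1 example a^n b^n c^n (assuming n >= 1): three counts must agree."""
    n = len(s) // 3
    return n >= 1 and s == 'a' * n + 'b' * n + 'c' * n

for w in ['aabbb', 'aabb', 'aabbcc']:
    print(w, is_a_plus_b_plus(w), is_an_bn(w), is_an_bn_cn(w))
```

The string aabbb is in a+b+ but not in a^n b^n, which is the kind of difference that separates type-3 from type-2.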

  14. Chomsky Hierarchy • [Figure: Chomsky hierarchy diagram showing Type-3 (FSA, Regular Expressions, Regular Grammars), Type-2, Type-1, and DCG = Type-0]

  15. Prolog Grammar Rule System • known as “Definite Clause Grammars” (DCG) • based on type-2 (context-free grammars) • but with extensions • (powerful enough to encode the hierarchy all the way up to type-0) • Prolog was originally designed (1970s) to also support natural language processing • we’ll start with the bottom of the hierarchy • i.e. the least powerful • regular grammars (type-3)

  16. Definite Clause Grammars (DCG) • Background • a “typical” formal grammar contains 4 things • <N,T,P,S> • a set of non-terminal symbols (N) • these symbols will be expanded or rewritten by the rules • a set of terminal symbols (T) • these symbols cannot be expanded • production rules (P) of the form • LHS → RHS • In regular and CF grammars, LHS must be a single non-terminal symbol • RHS: a sequence of terminal and non-terminal symbols: possibly with restrictions, e.g. for regular grammars • a designated start symbol (S) • a non-terminal to start the derivation • Language • set of terminal strings generated by <N,T,P,S> • e.g. through a top-down derivation

  17. Definite Clause Grammars (DCG) • Background • a “typical” formal grammar contains 4 things • <N,T,P,S> • a set of non-terminal symbols (N) • a set of terminal symbols (T) • production rules (P) of the form • LHS → RHS • a designated start symbol (S) • Example grammar (regular) • S → aB • B → aB • B → bC • B → b • C → bC • C → b • Notes: • Start symbol: S • Non-terminals: {S,B,C} (uppercase letters) • Terminals: {a,b} (lowercase letters)
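
To make the 4-tuple concrete, the example regular grammar can be written down as plain Python data and expanded top-down. This is a sketch, not part of the lecture: the dictionary layout and the function `generate` are illustrative choices, and the length-based pruning relies on this particular grammar having no empty right-hand sides.

```python
from collections import deque

# The example regular grammar as a 4-tuple <N, T, P, S>.
N = {'S', 'B', 'C'}                      # non-terminals
T = {'a', 'b'}                           # terminals
P = {                                    # production rules, LHS -> list of RHS
    'S': [['a', 'B']],
    'B': [['a', 'B'], ['b', 'C'], ['b']],
    'C': [['b', 'C'], ['b']],
}
S = 'S'                                  # start symbol

def generate(max_len):
    """Top-down, breadth-first derivation: return all terminal strings of
    length <= max_len that can be derived from the start symbol."""
    results = set()
    queue = deque([[S]])
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal in the current sentential form.
        idx = next((i for i, sym in enumerate(form) if sym in N), None)
        if idx is None:
            results.add(''.join(form))   # only terminals left: a sentence
            continue
        for rhs in P[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # Every symbol in this grammar yields at least one terminal,
            # so forms longer than max_len can be pruned safely.
            if len(new_form) <= max_len:
                queue.append(new_form)
    return sorted(results, key=lambda w: (len(w), w))

print(generate(4))  # ['ab', 'aab', 'abb', 'aaab', 'aabb', 'abbb']
```

Listing the short strings this way is a quick check on what language the rules generate.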

  18. Definite Clause Grammars (DCG) • Example • Formal grammar | Prolog format • S → aB | s --> [a],b. • B → aB | b --> [a],b. • B → bC | b --> [b],c. • B → b | b --> [b]. • C → bC | c --> [b],c. • C → b | c --> [b]. • Notes: • Start symbol: S • Non-terminals: {S,B,C} (uppercase letters) • Terminals: {a,b} (lowercase letters) • Prolog format: • both terminals and non-terminal symbols begin with lowercase letters • variables begin with an uppercase letter (or underscore) • --> is the rewrite symbol • terminals are enclosed in square brackets (list notation) • nonterminals don’t have square brackets surrounding them • the comma (, meaning “and”) represents the concatenation symbol • a period (.) is required at the end of every DCG rule

  19. Regular Grammars • Regular or Chomsky hierarchy type-3 grammars • are a class of formal grammars with a restricted RHS • LHS → RHS “LHS rewrites/expands to RHS” • all rules contain only a single non-terminal, and (possibly) a single terminal, on the right hand side • Canonical Forms: • x --> y, [t]. and x --> [t]. (left recursive) • or x --> [t], y. and x --> [t]. (right recursive) • where x and y are non-terminal symbols and • t (enclosed in square brackets) represents a terminal symbol. • Note: • can’t mix these two forms (and still have a regular grammar)! • can’t have both left and right recursive rules in the same grammar • Terminology: also called “left/right linear”
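
The restriction on rule shapes can be checked mechanically. The sketch below classifies a rule set against the canonical forms listed above; the function `classify`, its return labels, and the second example grammar are hypothetical, and any symbol not listed as a non-terminal is assumed to be a terminal.

```python
def classify(rules, nonterminals):
    """Classify a list of (lhs, rhs) rules against the canonical forms.
    Returns 'right', 'left', 'either', 'mixed' or 'not regular'."""
    kinds = set()
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return 'not regular'          # LHS must be a single non-terminal
        shape = [sym in nonterminals for sym in rhs]
        if shape == [False]:
            continue                      # x --> [t].    fits both forms
        if shape == [False, True]:
            kinds.add('right')            # x --> [t], y.
        elif shape == [True, False]:
            kinds.add('left')             # x --> y, [t].
        else:
            return 'not regular'
    if len(kinds) > 1:
        return 'mixed'                    # both forms used: not a regular grammar
    return kinds.pop() if kinds else 'either'

# The example grammar from the earlier slides.
slide_grammar = [('s', ['a', 'b_nt']), ('b_nt', ['a', 'b_nt']),
                 ('b_nt', ['b', 'c_nt']), ('b_nt', ['b']),
                 ('c_nt', ['b', 'c_nt']), ('c_nt', ['b'])]
print(classify(slide_grammar, {'s', 'b_nt', 'c_nt'}))   # right

# Hypothetical grammar mixing both forms: s --> [a],t.  t --> s,[b].  s --> [a].
# It generates a^(n+1) b^n, which is not a regular language.
mixed_grammar = [('s', ['a', 't']), ('t', ['s', 'b']), ('s', ['a'])]
print(classify(mixed_grammar, {'s', 't'}))              # mixed
```

The non-terminals are renamed `b_nt` and `c_nt` here only to keep them distinct from the terminal symbols a and b in the Python data.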

  20. Definite Clause Grammars (DCG) • What language does our regular grammar generate? • one or more a’s followed by one or more b’s • by writing the grammar in Prolog, • we have a ready-made recognizer program • no need to write a separate grammar rule interpreter (in this case) • Grammar: • 1. s --> [a],b. • 2. b --> [a],b. • 3. b --> [b],c. • 4. b --> [b]. • 5. c --> [b],c. • 6. c --> [b]. • Example queries • ?- s([a,a,b,b,b],[]). Yes • ?- s([a,b,a],[]). No • Note: • Query uses the start symbol s with two arguments: • (1) sequence (as a list) to be recognized and • (2) the empty list [] • Prolog lists: in square brackets, separated by commas, e.g. [a] [a,b,c]
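
The two-argument query can be pictured as "input list in, leftover list out". The Python sketch below imitates that idea for the six rules: each non-terminal is a generator that yields every possible remainder of the input, and a string is accepted when some remainder is the empty list. It is an analogy for how the query behaves, not a description of how Prolog translates or executes DCG rules; the helper names `terminal`, `seq` and `recognize` are hypothetical.

```python
def terminal(t):
    """Recognizer for [t]: consume one token t from the front of the input."""
    def parse(tokens):
        if tokens and tokens[0] == t:
            yield tokens[1:]
    return parse

def seq(*parts):
    """Recognizer for p1, p2, ...: thread the remainder through each part."""
    def parse(tokens):
        remainders = [tokens]
        for part in parts:
            remainders = [rest for r in remainders for rest in part(r)]
        yield from remainders
    return parse

def s(tokens):
    yield from seq(terminal('a'), b)(tokens)      # 1. s --> [a],b.

def b(tokens):
    yield from seq(terminal('a'), b)(tokens)      # 2. b --> [a],b.
    yield from seq(terminal('b'), c)(tokens)      # 3. b --> [b],c.
    yield from terminal('b')(tokens)              # 4. b --> [b].

def c(tokens):
    yield from seq(terminal('b'), c)(tokens)      # 5. c --> [b],c.
    yield from terminal('b')(tokens)              # 6. c --> [b].

def recognize(tokens):
    """Analogue of ?- s(Tokens, []): succeed if some parse leaves nothing."""
    return any(rest == [] for rest in s(list(tokens)))

print(recognize(['a', 'a', 'b', 'b', 'b']))  # True
print(recognize(['a', 'b', 'a']))            # False
```

The second argument `[]` in the Prolog query corresponds to the `rest == []` check: the whole input must be consumed for the string to count as recognized.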

  21. Definite Clause Grammars (DCG) • Tree representations • Examples • ?-s([a,a,b,b,b],[]). • ?-s([a,a,b],[]).
