
Lecture 1 Overview

Lecture 1 Overview. CSCE 771 Natural Language Processing. Topics Overview Readings: Chapters 1,2. January 14, 2013. Overview. Pragmatic issues Course Plans Foundation for research Today Challenge of 2001’s HAL Areas of Research Examples of Language Processing.



Presentation Transcript


  1. Lecture 1 Overview CSCE 771 Natural Language Processing • Topics • Overview • Readings: Chapters 1,2 January 14, 2013

  2. Overview • Pragmatic issues • Course Plans • Foundation for research • Today • Challenge of 2001’s HAL • Areas of Research • Examples of Language Processing

  3. Slide from: Speech and Language Processing Jurafsky and Martin NLP Why Should You Care? Two trends • An enormous amount of knowledge is now available in machine readable form as natural language text • Conversational agents are becoming an important form of human-computer communication Much of human-human communication is now mediated by computers

  4. Commercial World • Lots of exciting stuff going on… Powerset Slide from: Speech and Language Processing Jurafsky and Martin

  5. Commercial World • Lots of exciting stuff going on…

  6. Google Translate Slide from: Speech and Language Processing Jurafsky and Martin

  7. Google Translate Slide from: Speech and Language Processing Jurafsky and Martin

  8. Web Q/A Slide from: Speech and Language Processing Jurafsky and Martin

  9. HAL 9000 of 2001: A Space Odyssey • A scene from Arthur Clarke and Stanley Kubrick’s 2001 • DAVE: Open the pod bay doors, HAL. • HAL: I’m sorry Dave, I’m afraid I can’t do that. • Notes on Context: • HAL is the main computer on the spaceship • HAL is paranoid and decides to kill off the crew

  10. Clarke a little too Optimistic • We don’t have a HAL today in 2009. • How close are we? • Computers replaced bank tellers (in many instances) • But the NASA computers don’t talk yet • Microsoft XP/Vista’s voice commands • Adobe Reader reading PDF documents • But can they understand spoken commands?

  11. Challenges in developing HAL • So what are the major challenges in developing HAL? • Speech recognition • Natural Language understanding • Information retrieval • Information extraction • Inference • Speech generation

  12. Samples of Language Processing • Text processing (in Unix) • wc – word count • grep regexpr files - print lines in the files that match re • find • More knowledgeable processing • spelling checking/correcting • grammar checking • Information retrieval • Find all documents on decomposition by David Parnas
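The Unix tools named on this slide can be approximated in a few lines of Python. This is only a rough sketch: the function names `word_count` and `grep`, the sample text, and the pattern `[uU]nix` are illustrative assumptions, and the real `wc` counts bytes rather than characters.

```python
import re

def word_count(text):
    """Return (lines, words, characters), roughly what `wc` reports."""
    return (text.count("\n"), len(text.split()), len(text))

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical sample input, not taken from the lecture.
sample = "Unix is old\nunix tools are small\nLinux came later\n"

print(word_count(sample))
print(grep(r"[uU]nix", sample.splitlines()))
```

Note that "Linux" is not matched: the pattern needs a "u" or "U" immediately followed by "nix".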

  13. Even More knowledgeable processing • Information extraction • Reading the “online” Wall Street Journal • What was the dividend paid by GM last year? • USC Handbook • How many hours does it take to get a PhD in CSE? • Machine translation • The spirit is willing but the body is weak. • To Russian: Sprit охотно готово но тело слабо. • Back to English: Vodka is good but the meat is rotten. (Rich 86) • Babelfish - http://world.altavista.com/tr • Back to English: Sprit is willingly prepared but body weakly.

  14. Even Deeper Understanding • Email access over the phone • Respond to commands “list all emails from Bob” • Read email message 8 • Text to speech • Assistants • Agents reading the net summarizing a topic

  15. Subcategories of Knowledge in S&L • Phonetics/phonology • Morphology – shape and behavior of words in contexts • Syntax – the legitimate sequences of words • Semantics – the meanings of words, phrases, sentences and documents • Pragmatics – the appropriate use of language – politeness, direct/indirectness • Discourse conventions – correctly structuring conversations

  16. Ambiguity: I made her duck.

  17. Word Ambiguity • Her – who is this? • Made • Verb with meanings: 1) create 2) cook 3) force • Duck • Noun: the waterfowl, the food • Verb • So how do we resolve this sentence?

  18. Turing Test • Can a computer simulate intelligence? http://en.wikipedia.org/wiki/Turing_test

  19. The Chinese room • John Searle's 1980 paper Minds, Brains, and Programs proposed an argument against the Turing Test known as the "Chinese room" thought experiment. • Searle argued that software (such as ELIZA) could pass the Turing Test simply by manipulating symbols that it does not understand. • Without understanding, such programs could not be described as "thinking" in the same sense people do. • Loebner Prize – competition held since 1991 for the best attempt at passing the Turing Test http://en.wikipedia.org/wiki/Turing_test

  20. Loebner Prize • The prizes for each year include: • $2,000 for the most human seeming of all bots for that year - awarded every year • $25,000 for the first bot that judges cannot distinguish from a real human in a text-only based Turing Test (awarded once only) • $100,000 to the first bot that judges cannot distinguish from a real human in a Turing Test that includes deciphering and understanding text, visual, auditory (and tactile?) input. • http://en.wikipedia.org/wiki/Loebner_prize • http://www.loebner.net/Prizef/loebner-prize.html

  21. Finite Automata arose in the 1950’s • 1936 Turing’s model of algorithmic computation • 1943 McCulloch-Pitts model of the neuron • 1951, 1956 Kleene first introduced finite automata and regular expressions • 1959 Rabin and Scott - Nondeterministic finite automata • 1968 Thompson first to compile regular expressions into an editor for text searching

  22. Key Concepts #1 Formal Language • A formal language is a set of finite strings over a finite alphabet. • Key Concept #1: A model that can both recognize and generate all and only the strings of a formal language acts as a definition of the language. • L(re) = L(Mnfa) • Formal languages are not the same as natural languages. • Linguists are generally more interested in generative grammars; computer scientists are more interested in recognition.

  23. Formal Languages • Alphabet: Σ (finite set of symbols) • Strings: • s = c1c2 … cn (finite sequence of characters) • Length | s | = n • Language: • a language is a set of strings • Example languages over Σ = {a, b, c}
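The definitions above can be made concrete with a short sketch. The example languages from the original slide did not survive, so the language used here (strings over Σ = {a, b, c} that contain no "b") is a hypothetical stand-in, as are the names `SIGMA`, `strings_up_to`, and `in_language`.

```python
from itertools import product

SIGMA = ("a", "b", "c")  # the alphabet from the slide

def strings_up_to(alphabet, max_len):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def in_language(s):
    """Hypothetical example language: strings that contain no 'b'."""
    return "b" not in s

# A language is just a set of strings; here we list its members up to length 2.
universe = list(strings_up_to(SIGMA, 2))
language = {s for s in universe if in_language(s)}

print(len(universe))
print(sorted(language, key=lambda s: (len(s), s)))
```

The empty string has length 0 and belongs to this language, which is easy to overlook when writing such sets by hand.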

  24. Regular Expressions

  25. Regular Expression Examples

  26. Finite Automata to recognize a Language
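The figure for this slide is not in the transcript, so the automaton below is a hypothetical example rather than the one shown in lecture: a deterministic finite automaton over {a, b} that accepts exactly the strings ending in "ab". The names `TRANSITIONS`, `START`, `ACCEPTING`, and `accepts` are assumptions made for this sketch.

```python
# State 0: no useful suffix seen; state 1: last symbol was 'a';
# state 2: last two symbols were "ab" (the only accepting state).
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
START = 0
ACCEPTING = {2}

def accepts(s):
    """Run the DFA on s; reject on any symbol outside the alphabet."""
    state = START
    for ch in s:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING

for word in ["ab", "aab", "abb", "", "abc"]:
    print(repr(word), accepts(word))
```

Recognition takes one table lookup per input symbol, which is why tools that compile a regular expression to a DFA can scan text quickly.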

  27. CSCE 531 – Overview in one slide • % flex lang.l // lex.yy.c • % bison lang.y // lang.c • % gcc lex.yy.c lang.c -o parse • % parse input • [Diagram: lang.l → FLEX → lex.yy.c containing yylex(); lang.y → BISON → lang.c containing yyparse(); input source program → executable program]

  28. Regular Expressions in Unix tools • Ken Thompson regular expressions in ed → ex → vi • Reg-expr → NFA, then simulate • Global pattern match command • g/Unix/s/Unix/UNIX/g • g/re/print == grep

  29. Grep family • Global match Regular Expression and Print (GREP) • grep [uU]nix f1 f2 … fn • egrep pat files // efficient: NFA → DFA, then execute • fgrep pat files // fixed grep for fixed strings • Find for searching directories (not really reg expr) • find dir -name pat // search for files with name matching pat • find dir -exec grep pat {} \; // search in files for the pattern pat

  30. Editing scripts • Create a script of editing commands then execute with • ex file1 < edScript • Example: • 1,$s/[uU]nix/UNIX/g • 1,$s/langauge/language/g • g/^$/d // delete empty lines ^=start of line $=end • … • w • q
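For comparison, the three substitution and deletion commands in the script above can be mirrored with Python's `re` module. This is a sketch of equivalent behavior, not how `ex` works internally; the function name `apply_edit_script` and the sample text are assumptions.

```python
import re

def apply_edit_script(text):
    """Apply the slide's three editing commands to every line of text."""
    lines = text.split("\n")
    # 1,$s/[uU]nix/UNIX/g : replace every Unix/unix with UNIX
    lines = [re.sub(r"[uU]nix", "UNIX", line) for line in lines]
    # 1,$s/langauge/language/g : fix a common misspelling
    lines = [line.replace("langauge", "language") for line in lines]
    # g/^$/d : delete empty lines
    lines = [line for line in lines if not re.match(r"^$", line)]
    return "\n".join(lines)

# Hypothetical input, not taken from the lecture.
before = "unix is a langauge-friendly system\n\nUnix tools compose"
after = apply_edit_script(before)
print(after)
```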

  31. Other Unix regular expression Based Tools • sed (stream editor) • awk • Perl – scripting language • Python • Ruby • regcomp, regexec in C

  32. Python String constants • http://docs.python.org/2/library/stdtypes.html • string.ascii_letters • string.ascii_lowercase • string.ascii_uppercase • string.digits - The string '0123456789'. • string.hexdigits - The string '0123456789abcdefABCDEF'. • string.letters - The specific value is updated when locale.setlocale() is called. • string.lowercase • string.octdigits - The string '01234567'. • string.punctuation - String of ASCII characters which are considered punctuation • string.printable • string.uppercase • string.whitespace
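A brief sketch of how these constants are typically used. It sticks to the constants that exist in both Python 2 and Python 3 (`string.letters`, `string.lowercase`, and `string.uppercase` are Python 2 only); the helper name `strip_punct` is an assumption for illustration.

```python
import string

print(string.digits)
print(string.hexdigits)
print(string.octdigits)

def strip_punct(token):
    """Remove leading and trailing ASCII punctuation from a token."""
    return token.strip(string.punctuation)

tokens = [strip_punct(t) for t in "i think 771 is going great!".split()]
print(tokens)

# The constants are ordinary strings, so membership tests work directly.
all_digits = all(ch in string.digits for ch in "771")
print(all_digits)
```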

  33. String Method Examples • s = "i think 771 is going great!" • print s.capitalize( ) • #center( width[, fillchar]) • print ':'+ s.center(44, '.') + ':' • #count( sub[, start[, end]]) • print s.count("in") • print s.count("in", 13) • print s.count("in", 3) • print s.count("in", 13, 22) • print s.count("in", 13, 15) • #decode( [encoding[, errors]]) • #encode( [encoding[,errors]]) • #endswith( suffix[, start[, end]])
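The slide's examples, collected into one runnable script. The slide uses Python 2 `print` statements; this sketch uses the `print()` function instead, on the assumption that it will be run under Python 3. The variable name `counts` is added for illustration.

```python
s = "i think 771 is going great!"

print(s.capitalize())
print(":" + s.center(44, ".") + ":")

# "in" occurs at index 4 (in "think") and index 17 (in "going").
counts = [
    s.count("in"),          # 2: whole string
    s.count("in", 13),      # 1: only "going" lies at or after index 13
    s.count("in", 3),       # 2: both occurrences lie at or after index 3
    s.count("in", 13, 22),  # 1: s[13:22] still contains "going"
    s.count("in", 13, 15),  # 0: s[13:15] is too short to contain "in"
]
print(counts)
print(s.endswith("great!"))
```

The optional start and end arguments behave like slice bounds, so an occurrence counts only if it fits entirely inside `s[start:end]`.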

  34. expandtabs( [tabsize]) • find( sub[, start[, end]]) • index( sub[, start[, end]]) Like find(), but raise ValueError when the substring is not found. • isalnum( ) • isalpha( ) • isdigit( )
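A short sketch of the difference between `find()` and `index()`, plus the character-class tests listed above. The variable names are assumptions chosen for illustration; the string is the one from the previous slide.

```python
s = "i think 771 is going great!"

pos_found = s.find("771")    # 8: index of the first occurrence
pos_missing = s.find("xyz")  # -1: find() signals failure with -1

# index() behaves like find() but raises ValueError on failure.
try:
    s.index("xyz")
    index_raised = False
except ValueError:
    index_raised = True

# isalnum() is False for "i think" because of the space.
checks = ("771".isdigit(), "think".isalpha(), "i think".isalnum())

# expandtabs(4) pads with spaces up to the next multiple of 4 columns.
expanded = "a\tb".expandtabs(4)

print(pos_found, pos_missing, index_raised, checks, repr(expanded))
```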

  35. rpartition( sep) • rsplit( [sep [,maxsplit]]) • rstrip( [chars]) • split( [sep [,maxsplit]]) • splitlines( [keepends]) • startswith( prefix[, start[, end]]) • strip( [chars]) swapcase( ) • title( ) • translate( table[, deletechars]) • upper( ) • zfill( width)
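A few of these methods in action, as a minimal sketch with made-up inputs. `translate()` is left out because its signature differs between Python 2 and Python 3.

```python
# rpartition splits at the LAST occurrence of the separator.
parts = "a.b.c".rpartition(".")

# rsplit with maxsplit=1 also works from the right.
right_split = "a,b,c".rsplit(",", 1)

# split() with no arguments splits on runs of whitespace.
words = "  i think   771  ".split()

lines = "one\ntwo".splitlines()
stripped = "  hi  ".strip()
titled = "i think".title()
swapped = "Ab".swapcase()
padded = "771".zfill(5)
starts = "grep pat files".startswith("grep")

print(parts, right_split, words, lines)
print(stripped, titled, swapped, padded, starts)
```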

  36. Python re — Regular expressions • http://docs.python.org/library/re.html • re — Regular expression module • Operators (special characters) • Lookahead / lookbehind • Search vs match • re module contents

  37. Python Regular Expressions • http://docs.python.org/2/library/re.html

  38. Fundamental Re Operators in Python

  39. Other Operators in Python

  40. Greedy Operators in Python

  41. Non Greedy Operators in Python
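The operator tables for the last two slides are not in the transcript, so here is a small sketch of the distinction using the standard `re` syntax: `*` is greedy, and appending `?` makes it non-greedy. The HTML-like sample string is an assumption.

```python
import re

text = "<b>bold</b>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? grabs as little as possible, stopping at the first ">".
lazy = re.search(r"<.*?>", text).group()

print(greedy)
print(lazy)
```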

  42. Groups • The actual text that matches a parenthesized part of a regular expression is a group, which can be referred to later • Example: (?P<frst> [a-z]{3}) (?P=frst)
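A sketch of a named group and its backreference. The pattern here drops the space inside the parentheses of the slide's example, on the assumption that the space is a formatting artifact (as written it would be matched literally); the test strings are made up.

```python
import re

# (?P<frst>...) names the group; (?P=frst) must match the same text again.
pattern = re.compile(r"(?P<frst>[a-z]{3}) (?P=frst)")

hit = pattern.search("say abc abc now")
miss = pattern.search("say abc abd now")

print(hit.group())        # the whole match
print(hit.group("frst"))  # the text captured by the named group
print(miss)
```

The backreference matches the captured text itself, not the sub-pattern, which is why "abc abd" fails even though "abd" fits `[a-z]{3}`.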

  43. Group related

  44. Positional special characters

  45. Positional special characters • \A Matches only at the start of the string. • \b Matches the empty string, but only at the beginning or end of a word. • \B • \d matches any decimal digit ---\D any non-digit character • \s matches any whitespace character, equivalent to [ \t\n\r\f\v] --- \S • \w matches any alphanumeric character and the underscore --- \W • \Z Matches only at the end of the string
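Each of these special sequences can be checked with a one-line experiment. The sample strings below are assumptions chosen to show one matching and one non-matching case.

```python
import re

# \b: "cat" as a whole word only; "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \B: "cat" only when it does NOT start at a word boundary.
inside_word = re.findall(r"\Bcat", "cat concat")

digits = re.findall(r"\d+", "CSCE 771, room 2A")
words = re.findall(r"\w+", "a_b c-d")     # underscore counts as a word character
fields = re.split(r"\s+", "a \t b\nc")

# \A and \Z anchor to the start and end of the whole string.
at_start = re.search(r"\Adog", "hotdog")
at_end = re.search(r"dog\Z", "hotdog")

print(whole_words, inside_word, digits, words, fields)
print(at_start, at_end is not None)
```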

  46. re Module - Matching vs Searching • import re • re.match(pattern, line) • re.search(pattern, line) • >>> re.match("c", "abcdef") # No match • >>> re.search("c", "abcdef") # Match • <_sre.SRE_Match object at ...>

  47. re.compile • re.compile(pattern[, flags]) • prog = re.compile(pattern) • result = prog.match(string)
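A compiled pattern exposes the same `match` and `search` operations as the module-level functions, which makes it convenient when one pattern is applied to many lines. A minimal sketch, with the pattern and sample line as assumptions:

```python
import re

prog = re.compile(r"[uU]nix")
line = "I like Unix and unix"

# match() only succeeds if the pattern matches at the start of the string.
at_start = prog.match(line)

# search() scans forward and returns the first match anywhere in the string.
first = prog.search(line)

# findall() returns every non-overlapping match.
every = prog.findall(line)

print(at_start)
print(first.group(), first.start())
print(every)
```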

  48. Python’s Raw String Format • What regular expression matches the two character pattern “\\”? • Re = “\\\\” • Sometimes it simplifies patterns to disable the ‘\’. The “raw” prefix r changes how Python interprets ‘\’ in a string literal, so backslashes reach the re module unchanged. • For instance • “\n” is a regular expression that matches one character, the newline • r“\n” is a regular expression with two characters, ‘\’ and ‘n’, which the re module itself interprets as a newline
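The two layers of backslash processing (first the Python string literal, then the `re` module) are easiest to see by checking string lengths. This is a small sketch; the variable names are assumptions.

```python
import re

plain = "\n"   # Python turns this into one character, a newline
raw = r"\n"    # two characters: a backslash and the letter n

print(len(plain), len(raw))

# Both patterns match a newline: re interprets the two-character \n itself.
both_match = (re.match(plain, "\n") is not None,
              re.match(raw, "\n") is not None)

# To match ONE literal backslash, the regex must be the two characters \\ .
# As a normal literal that takes four backslashes; as a raw literal, two.
one_backslash = "\\"   # a one-character string
normal_ok = re.match("\\\\", one_backslash) is not None
raw_ok = re.match(r"\\", one_backslash) is not None

print(both_match, normal_ok, raw_ok)
```

Raw literals are usually preferred for patterns because the text on the page is then exactly what the `re` module sees.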

  49. Natural Language Toolkit • http://nltk.org/ • Interfaces to over 50 corpora and lexical resources such as WordNet • Suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning

  50. Installing NLTK • http://nltk.org/install.html • Windows 32-bit binary installation • Install Python: http://www.python.org/download/releases/2.7.3/ • Install Numpy (optional): http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2-win32-superpack-python2.7.exe • Install NLTK: http://pypi.python.org/pypi/nltk • Install PyYAML: http://pyyaml.org/wiki/PyYAML • Test installation: Start>Python27, then type import nltk
