Lecture 1 Overview. CSCE 771 Natural Language Processing. Topics Overview Readings: Chapters 1,2. January 14, 2013. Overview. Pragmatic issues Course Plans Foundation for research Today Challenge of 2001’s HAL Areas of Research Examples of Language Processing.
Lecture 1 Overview CSCE 771 Natural Language Processing • Topics • Overview • Readings: Chapters 1,2 January 14, 2013
Overview • Pragmatic issues • Course Plans • Foundation for research • Today • Challenge of 2001’s HAL • Areas of Research • Examples of Language Processing
Slide from: Speech and Language Processing Jurafsky and Martin NLP Why Should You Care? Two trends • An enormous amount of knowledge is now available in machine readable form as natural language text • Conversational agents are becoming an important form of human-computer communication • Much of human-human communication is now mediated by computers
Commercial World • Lots of exciting stuff going on… Powerset Slide from: Speech and Language Processing Jurafsky and Martin
Google Translate Slide from: Speech and Language Processing Jurafsky and Martin
Web Q/A Slide from: Speech and Language Processing Jurafsky and Martin
HAL 9000 of 2001: A Space Odyssey • A scene from Arthur C. Clarke and Stanley Kubrick’s 2001 • DAVE: Open the pod bay doors, HAL. • HAL: I’m sorry Dave, I’m afraid I can’t do that. • Notes on Context: • HAL is the main computer on the spaceship • HAL is paranoid and decides to kill off the crew
Clarke a little too Optimistic • We don’t have a HAL today in 2009. • How close are we? • Computers replaced bank tellers (in many instances) • But the NASA computers don’t talk yet • Microsoft XP/Vista’s voice commands • Adobe Reader reading PDF documents • But can they understand spoken commands?
Challenges in developing HAL • So what are the major challenges in developing HAL? • Speech recognition • Natural Language understanding • Information retrieval • Information extraction • Inference • Speech generation
Samples of Language Processing • Text processing (in Unix) • wc – word count • grep regexpr files - print lines in the files that match re • find • More knowledgeable processing • spelling checking/correcting • grammar checking • Information retrieval • Find all documents on decomposition by David Parnas
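As a rough illustration of the simplest level of text processing listed above, here is a minimal Python sketch of what wc reports. The function name wc and the sample text are assumptions for illustration; the real tool counts bytes and handles files, which this sketch does not.

```python
def wc(text):
    """Return (lines, words, characters) for a string, in the spirit of Unix wc."""
    lines = text.count("\n")      # wc counts newline characters
    words = len(text.split())     # whitespace-separated tokens
    chars = len(text)             # characters here, not bytes
    return lines, words, chars


sample = "Open the pod bay doors, HAL.\nI'm sorry Dave, I'm afraid I can't do that.\n"
print(wc(sample))
```

Nothing here needs any knowledge of language; the more knowledgeable tasks on the slide (spelling, grammar, retrieval) need progressively more.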
Even More knowledgeable processing • Information extraction • Reading the “online” Wall Street Journal • What was the dividend paid by GM last year? • USC Handbook • How many hours does it take to get a PhD in CSE? • Machine translation • The spirit is willing but the body is weak. • To Russian: Sprit охотно готово но тело слабо. • Back to English: Vodka is good but the meat is rotten. (Rich 86) • Babelfish - http://world.altavista.com/tr • Back to English: Sprit is willingly prepared but body weakly.
Even Deeper Understanding • Email access over the phone • Respond to commands “list all emails from Bob” • Read email message 8 • Text to speech • Assistants • Agents reading the net summarizing a topic
Subcategories of Knowledge in S&L • Phonetics/phonology • Morphology – shape and behavior of words in contexts • Syntax – the legitimate sequences of words • Semantics – the meanings of words, phrases, sentences and documents • Pragmatics – the appropriate use of language – politeness, direct/indirectness • Discourse conventions – correctly structuring conversations
Ambiguity: I made her duck.
Word Ambiguity • Her – who is this? • Made • Verb with meanings: 1) create 2) cook 3) force • Duck • Noun: the waterfowl, the food • Verb • So how do we resolve this sentence?
Turing Test • Tests whether a computer can simulate intelligence http://en.wikipedia.org/wiki/Turing_test
The Chinese room • John Searle's 1980 paper Minds, Brains, and Programs proposed an argument against the Turing Test known as the "Chinese room" thought experiment. • Searle argued that software (such as ELIZA) could pass the Turing Test simply by manipulating symbols that it does not understand. • Without understanding, such software could not be described as "thinking" in the same sense people do. • Loebner Prize – annual competition since 1991 for the best attempt at passing the Turing Test http://en.wikipedia.org/wiki/Turing_test
Loebner Prize • The prizes for each year include: • $2,000 for the most human seeming of all bots for that year - awarded every year • $25,000 for the first bot that judges cannot distinguish from a real human in a text-only based Turing Test (awarded once only) • $100,000 to the first bot that judges cannot distinguish from a real human in a Turing Test that includes deciphering and understanding text, visual, auditory (and tactile?) input. • http://en.wikipedia.org/wiki/Loebner_prize • http://www.loebner.net/Prizef/loebner-prize.html
Finite Automata arose in the 1950’s • 1936 Turing’s model of algorithmic computation • 1943 McCulloch-Pitts model of the neuron • 1951, 1956 Kleene first introduced finite automata and regular expressions • 1959 Rabin and Scott - Nondeterministic finite automata • 1968 Thompson first to compile regular expressions into an editor for text searching
Key Concepts #1 Formal Language • A formal language is a set of finite strings over a finite alphabet (the set itself may be infinite). • Key Concept #1: A model that can both recognize and generate all and only the strings of a formal language acts as a definition of the language. • L(re) = L(Mnfa) • Formal languages are not the same as natural languages. • Linguists are generally more interested in generative grammars; computer scientists are more interested in recognition.
Formal Languages • Alphabet: Σ (finite set of symbols) • Strings: • s = c1c2 … cn (finite sequence of characters) • Length | s | = n • Language: • a language is a set of strings • Example languages over Σ = {a, b, c}
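The definitions above can be sketched in Python. The example language used here (strings over Σ = {a, b, c} that end in c) is an assumption chosen for illustration, since the slide's own example languages are not in the text. The sketch shows one model used both to recognize and to generate.

```python
from itertools import product

SIGMA = ("a", "b", "c")   # the finite alphabet


def recognize(s):
    """True if s is in the example language: strings over SIGMA ending in 'c'."""
    return all(ch in SIGMA for ch in s) and s.endswith("c")


def generate(max_len):
    """Enumerate the strings of the language up to max_len, shortest first."""
    for n in range(1, max_len + 1):
        for chars in product(SIGMA, repeat=n):
            s = "".join(chars)
            if recognize(s):
                yield s


short_strings = list(generate(2))
print(short_strings)   # ['c', 'ac', 'bc', 'cc']
```

The language itself is infinite; generate only enumerates a finite prefix of it.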
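CSCE 531 – Overview in one slide • % flex lang.l // produces lex.yy.c, which defines yylex() • % bison -o lang.c lang.y // produces lang.c, which defines yyparse(); without -o, bison writes lang.tab.c • % gcc lex.yy.c lang.c -o parse • % parse input • Flow: lang.l → FLEX → lex.yy.c (yylex()); lang.y → BISON → lang.c (yyparse()); both are compiled into the executable program, which reads the input source program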
Regular Expressions in Unix tools • Ken Thompson: regular expressions in ed, ex, vi • Reg-expr → NFA, then simulate • Global pattern match command • g/Unix/s/Unix/UNIX/g • g/re/print == grep
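To make "NFA, then simulate" concrete, here is a small Python sketch that simulates a hand-built NFA by tracking the set of states it could be in. The NFA is for the regular expression (a|b)*abb, chosen only as an illustration. The transition table and the names NFA, START, ACCEPT and nfa_accepts are assumptions, and the step that compiles a regular expression into the NFA is not shown.

```python
# Transition table of an NFA for (a|b)*abb: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPT = {3}


def nfa_accepts(s):
    """Simulate the NFA on s, keeping every state it could currently be in."""
    current = {START}
    for ch in s:
        # Follow every possible transition from every current state.
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
    return bool(current & ACCEPT)


for candidate in ["abb", "babb", "ab", "abba"]:
    print(candidate, nfa_accepts(candidate))
```

The simulation never backtracks: each input character is read once, and the cost per character is bounded by the number of states.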
Grep family • Global match Regular Expression and Print (GREP) • grep [uU]nix f1 f2 … fn • egrep pat files // efficient: NFA → DFA, then execute • fgrep pat files // fixed grep for fixed strings • Find for searching directories (not really reg expr) • find dir -name pat // search for files with name matching pat • find dir -exec grep pat {} \; // search in files for the pattern pat
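The core behavior of grep (print the lines that contain a match) can be sketched in a few lines of Python with the re module. The function name grep and the sample lines are assumptions for illustration; real grep reads files and supports many options this sketch ignores.

```python
import re


def grep(pattern, lines):
    """Return the lines that contain a match for pattern."""
    prog = re.compile(pattern)
    return [line for line in lines if prog.search(line)]


text = ["Unix is old", "I use unix daily", "Linux too", "Windows"]
hits = grep(r"[uU]nix", text)
print(hits)   # ['Unix is old', 'I use unix daily']
```

Note that "Linux" is not a hit: the pattern needs u or U immediately followed by nix.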
Editing scripts • Create a script of editing commands then execute with • ex file1 < edScript • Example: • 1,$s/[uU]nix/UNIX/g • 1,$s/langauge/language/g • g/^$/d // delete empty lines ^=start of line $=end • … • w • q
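For comparison, the three editing commands in the script above can be approximated in Python. This is a sketch under the assumption that the whole file fits in one string. The function name apply_edit_script is hypothetical, and it returns the edited text rather than writing the file as w does.

```python
import re


def apply_edit_script(text):
    """Apply the three ex commands from the example script to a string."""
    lines = text.split("\n")
    # 1,$s/[uU]nix/UNIX/g
    lines = [re.sub(r"[uU]nix", "UNIX", line) for line in lines]
    # 1,$s/langauge/language/g
    lines = [line.replace("langauge", "language") for line in lines]
    # g/^$/d  (delete empty lines)
    lines = [line for line in lines if line != ""]
    return "\n".join(lines)


print(apply_edit_script("unix is a langauge tool\n\nUnix rules"))
```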
Other Unix regular expression Based Tools • sed (stream editor) • awk • Perl – scripting language • Python • Ruby • regcomp, regexec in C
Python String constants • http://docs.python.org/2/library/stdtypes.html • string.ascii_letters • string.ascii_lowercase • string.ascii_uppercase • string.digits - The string '0123456789'. • string.hexdigits - The string '0123456789abcdefABCDEF'. • string.letters - The specific value is updated when locale.setlocale() is called. • string.lowercase • string.octdigits - The string '01234567'. • string.punctuation - String of ASCII characters which are considered punctuation • string.printable • string.uppercase • string.whitespace
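A short sketch of how these constants are typically used. It is written for Python 3, where the locale-dependent names string.letters, string.lowercase and string.uppercase no longer exist but the ascii_* names do. The helper strip_punctuation is a hypothetical name for illustration.

```python
import string


def strip_punctuation(text):
    """Drop every ASCII punctuation character from text."""
    return "".join(ch for ch in text if ch not in string.punctuation)


cleaned = strip_punctuation("I'm sorry, Dave!")
print(cleaned)                 # Im sorry Dave
print(string.digits)           # 0123456789
print(string.ascii_lowercase)  # abcdefghijklmnopqrstuvwxyz
```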
String Method Examples • s = "i think 771 is going great!" • print(s.capitalize()) • # center(width[, fillchar]) • print(':' + s.center(44, '.') + ':') • # count(sub[, start[, end]]) • print(s.count("in")) • print(s.count("in", 13)) • print(s.count("in", 3)) • print(s.count("in", 13, 22)) • print(s.count("in", 13, 15)) • # decode([encoding[, errors]]) • # encode([encoding[, errors]]) • # endswith(suffix[, start[, end]])
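The count calls above are easier to follow with their results next to them. This sketch repeats them in a self-contained form. The substring "in" occurs at index 4 (in "think") and index 17 (in "going"), which explains each value.

```python
s = "i think 771 is going great!"

counts = [
    s.count("in"),          # whole string: "think" and "going"
    s.count("in", 13),      # from index 13 on: only "going"
    s.count("in", 3),       # from index 3 on: still both
    s.count("in", 13, 22),  # slice s[13:22] contains "going"
    s.count("in", 13, 15),  # slice s[13:15] is too short to contain "in"
]
print(counts)               # [2, 1, 2, 1, 0]
print(s.capitalize())       # I think 771 is going great!
print(s.endswith("!"))      # True
```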
expandtabs( [tabsize]) • find( sub[, start[, end]]) • index( sub[, start[, end]]) Like find(), but raise ValueError when the substring is not found. • isalnum( ) • isalpha( ) • isdigit( )
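A brief sketch of the difference between find() and index(), plus the character-class tests, using the same example string as the earlier slide. The variable names are chosen for illustration.

```python
s = "i think 771 is going great!"

pos = s.find("771")          # index of the first occurrence
missing = s.find("HAL")      # find() signals "not found" with -1

try:
    s.index("HAL")           # index() raises instead of returning -1
    index_failed = False
except ValueError:
    index_failed = True

print(pos, missing, index_failed)   # 8 -1 True
print("771".isdigit(), "think".isalpha(), "CSCE771".isalnum())   # True True True
print("a\tb".expandtabs(4))         # tab expanded up to column 4
```

find() suits code that tests the result; index() suits code where a missing substring is a genuine error.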
rpartition( sep) • rsplit( [sep [,maxsplit]]) • rstrip( [chars]) • split( [sep [,maxsplit]]) • splitlines( [keepends]) • startswith( prefix[, start[, end]]) • strip( [chars]) swapcase( ) • title( ) • translate( table[, deletechars]) • upper( ) • zfill( width)
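A few of the methods listed above in action, as a sketch. The input strings are made up for illustration. translate() is left out because its signature differs between Python 2 and Python 3.

```python
line = "  list all emails from Bob  \n"

words = line.strip().split()                           # trim, then split on whitespace
head, sep, tail = "user@example.com".rpartition("@")   # split at the last "@"
last_field = "a,b,c".rsplit(",", 1)                    # at most one split, from the right

print(words)                         # ['list', 'all', 'emails', 'from', 'Bob']
print(head, tail)                    # user example.com
print(last_field)                    # ['a,b', 'c']
print("one\ntwo\n".splitlines())     # ['one', 'two']
print("42".zfill(5))                 # 00042
print("open the pod bay doors".title())
print(line.strip().startswith("list"))   # True
```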
Python re — Regular expressions • http://docs.python.org/library/re.html • re — Regular expression module • Operators (special characters) • Lookahead / lookbehind • Search vs match • re module contents
Python Regular Expressions • http://docs.python.org/2/library/re.html
Groups • The actual text matched by a parenthesized part of a regular expression is a group, which can be referred to later • Example: (?P<frst> [a-z]{3}) (?P=frst)
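A sketch of a named group and a back-reference to it with Python's re module. The pattern follows the slide's example but drops the space inside the group, which is assumed to be a slide-layout artifact. The test strings are made up.

```python
import re

# (?P<frst>...) names the group; (?P=frst) must match the same text again.
pattern = re.compile(r"(?P<frst>[a-z]{3}) (?P=frst)")

m = pattern.search("say abc abc now")
print(m.group("frst"))    # abc
print(m.group(0))         # abc abc

# The back-reference requires identical text, not just the same shape.
print(pattern.search("abc abd"))   # None
```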
Positional special characters • \A Matches only at the start of the string. • \b Matches the empty string, but only at the beginning or end of a word. • \B Matches the empty string, but only when it is not at the beginning or end of a word. • \d Matches any decimal digit; \D matches any non-digit character. • \s Matches any whitespace character, equivalent to [ \t\n\r\f\v]; \S matches any non-whitespace character. • \w Matches any alphanumeric character or the underscore; \W matches any other character. • \Z Matches only at the end of the string.
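The special sequences above can be tried out with a short sketch. The sample sentence is made up for illustration.

```python
import re

text = "HAL 9000 said: I'm sorry, Dave."

digits = re.findall(r"\d+", text)    # runs of decimal digits
words = re.findall(r"\w+", text)     # runs of alphanumerics/underscore
print(digits)   # ['9000']
print(words)    # ['HAL', '9000', 'said', 'I', 'm', 'sorry', 'Dave']

# \b only matches at word boundaries, so a partial word is rejected.
print(bool(re.search(r"\bsaid\b", text)))   # True
print(bool(re.search(r"\bsai\b", text)))    # False

# \A and \Z anchor to the start and end of the whole string.
print(bool(re.search(r"\AHAL", text)))      # True
print(bool(re.search(r"Dave\.\Z", text)))   # True

print(re.split(r"\s+", "a  b\tc"))          # ['a', 'b', 'c']
```

Note how \w+ splits "I'm" into two tokens: the apostrophe is not a word character.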
re Module - Matching vs Searching • import re • re.match(pattern, line) • re.search(pattern, line) • >>> re.match("c", "abcdef") # No match • >>> re.search("c", "abcdef") # Match • <_sre.SRE_Match object at ...>
re.compile • re.compile(pattern[, flags]) • prog = re.compile(pattern) • result = prog.match(string)
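A sketch of compiling a pattern once and reusing it. It also shows that match() only tries the start of the string while search() scans forward. The example strings come from the matching vs searching slide; the variable names and the re.IGNORECASE example are additions for illustration.

```python
import re

prog = re.compile(r"c")

no_match = prog.match("abcdef")    # "c" is not at position 0
found = prog.search("abcdef")      # search scans the whole string
at_start = prog.match("cdef")      # now "c" is at position 0

print(no_match)         # None
print(found.start())    # 2
print(at_start.group()) # c

# Flags are passed as the second argument to re.compile.
unix = re.compile(r"unix", re.IGNORECASE)
print(bool(unix.search("I use UNIX")))   # True
```

Compiling pays off when the same pattern is applied to many lines, as in a grep-style loop.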
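Python’s Raw String Format • What regular expression matches a literal backslash? The pattern is the two characters \\, which an ordinary Python string literal must write as "\\\\". • Sometimes it simplifies patterns to disable the ‘\’ escape. The “raw” prefix r changes how Python reads ‘\’ in the string literal, so the same pattern can be written r"\\". • For instance • "\n" is a regular expression of one character, the newline • r"\n" is a regular expression of two characters, ‘\’ and ‘n’ (the re module itself interprets this pair as a newline, so both patterns match a newline)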
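The raw-string rules can be checked directly with a small sketch. The test string "a\\b" (three characters: a, one backslash, b) is made up for illustration.

```python
import re

# String-literal level: the raw prefix keeps the backslash as a character.
print(len("\n"))     # 1  (a single newline character)
print(len(r"\n"))    # 2  (a backslash followed by n)

# Regex level: both patterns still match a newline.
print(bool(re.match("\n", "\n")), bool(re.match(r"\n", "\n")))   # True True

# Matching one literal backslash: the regex is \\ (two characters).
text = "a\\b"                      # a, backslash, b
plain = re.search("\\\\", text)    # ordinary literal: four backslashes typed
raw = re.search(r"\\", text)       # raw literal: two backslashes typed
print(plain.start(), raw.start())  # 1 1
print("\\\\" == r"\\")             # True
```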
Natural Language Toolkit • http://nltk.org/ • Interfaces to over 50 corpora and lexical resources such as WordNet • Suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
Installing NLTK • http://nltk.org/install.html • Windows 32-bit binary installation • Install Python: http://www.python.org/download/releases/2.7.3/ • Install Numpy (optional): http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2-win32-superpack-python2.7.exe • Install NLTK: http://pypi.python.org/pypi/nltk • Install PyYAML: http://pyyaml.org/wiki/PyYAML • Test installation: Start>Python27, then type import nltk