
Ch1. Introduction





  1. Foundations of Statistical Natural Language Processing Ch1. Introduction 2002. 12. 31 임성신

  2. Agenda • Why study NLP? • Why study NLP Statistically? • Subdivisions of NLP • Tools and Resources Used • Rational versus Empiricist Approaches to Language • Today’s Approach to NLP • Why is NLP Difficult? • Methods that don’t work well • What Statistical NLP can do for us • Things that can be done with Text Corpora • Word Counts • Zipf’s Law • Collocations • Concordances

  3. Why study NLP? • Natural Language Processing (NLP) is a very important current area of investigation because it is necessary for many useful applications. • These applications include • information retrieval, extraction, and filtering • intelligent Web searching • spelling and grammar checking • automatic text summarization • pseudo-understanding and generation of natural language • multi-lingual systems including machine translation

  4. Why study NLP Statistically? • Until about 5-10 years ago, NLP was investigated mainly using a rule-based approach. • However, rules appear too strict to characterize people’s use of language. • This is because people tend to stretch and bend rules in order to meet their communicative needs. • Methods that model language more accurately are needed, and statistical methods appear to provide the necessary flexibility.

  5. Subdivisions of NLP • Parts of Speech and Morphology (words, their syntactic function in sentences, and the various forms they can take). • Phrase Structure and Syntax (regularities and constraints of word order and phrase structure). • Semantics (the study of the meaning of words (lexical semantics) and of how word meanings are combined into the meaning of sentences, etc.) • Pragmatics (the study of how knowledge about the world and language conventions interact with literal meaning).

  6. Tools and Resources Used • Probability/Statistical Theory: Statistical Distributions, Bayesian Decision Theory. • Linguistics Knowledge: Morphology, Syntax, Semantics and Pragmatics. • Corpora: Bodies of marked or unmarked text to which statistical methods and current linguistic knowledge can be applied in order to discover novel linguistic theories or interesting and useful knowledge organization.

  7. Rational versus Empiricist Approaches to Language I • Question: What prior knowledge should be built into our models of NLP? • Rationalist Answer: A significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance (Chomsky: poverty of the stimulus). • Empiricist Answer: The brain is able to perform association, pattern recognition, and generalization and, thus, the structures of Natural Language can be learned.

  8. Rational versus Empiricist Approaches to Language II • Chomskyan/generative linguists seek to describe the language module of the human mind (the I-language), for which data such as text (the E-language) provide only indirect evidence, which can be supplemented by native speakers’ intuitions. • Empiricist approaches are interested in describing the E-language as it actually occurs. • Chomskyans make a distinction between linguistic competence and linguistic performance. They believe that linguistic competence can be described in isolation, while Empiricists reject this notion.

  9. Today’s Approach to NLP • From ~1970-1989, people were concerned with the science of the mind and built small (toy) systems that attempted to behave intelligently. • Recently, there has been more interest in engineering practical solutions using automatic learning (knowledge induction). • While Chomskyans tend to concentrate on categorical judgments about very rare types of sentences, statistical NLP practitioners concentrate on common types of sentences.

  10. Why is NLP Difficult?(1) • NLP is difficult because Natural Language is highly ambiguous. • Example: “Our company is training workers” has 3 parses (i.e., syntactic analyses).

  11. Why is NLP Difficult?(2) • “List the sales of the products produced in 1973 with the products produced in 1972” has 455 parses. • Therefore, a practical NLP system must be good at making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope.

  12. Methods that don’t work well • Maximizing coverage while minimizing ambiguity is inconsistent with symbolic NLP. • Furthermore, hand-coded syntactic constraints and preference rules are time consuming to build, do not scale up well, and are brittle in the face of the extensive use of metaphor in language. • Example: if we code the selectional restriction animate being --> swallow --> physical object, it is violated by sentences such as “I swallowed his story, hook, line, and sinker” and “The supernova swallowed the planet.”

  13. What Statistical NLP can do for us • Disambiguation strategies that rely on hand-coding produce a knowledge acquisition bottleneck and perform poorly on naturally occurring text. • A Statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from corpora. In particular, Statistical NLP recognizes that there is a lot of information in the relationships between words. • The use of statistics offers a good solution to the ambiguity problem: statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data.

  14. Things that can be done with Text Corpora I: Word Counts • Word Counts to find out: • What are the most common words in the text. • How many words are in the text (word tokens and word types). • What the average frequency of each word in the text is. • Limitation of word counts: Most words appear very infrequently and it is hard to predict much about the behavior of words that do not occur often in a corpus. ==> Zipf’s Law.
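A minimal sketch of these word counts in Python, assuming a hypothetical plain-text file corpus.txt and a deliberately simple regex tokenizer:

```python
from collections import Counter
import re

# Word counts over a corpus; "corpus.txt" is a hypothetical plain-text file
# and the regex tokenizer is a simplification of real tokenization.
text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z']+", text)        # word tokens
counts = Counter(tokens)                     # word type -> frequency

print(counts.most_common(10))                # the most common words
print("word tokens:", len(tokens))
print("word types: ", len(counts))
print("average frequency per type:", len(tokens) / len(counts))
```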

  15. Things that can be done with Text Corpora II: Zipf’s Law(1) • If we count up how often each word type of a language occurs in a large corpus and then list the words in order of their frequency of occurrence, we can explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r. • Zipf’s Law says that: f ∝ 1/r • Significance of Zipf’s Law: For most words, our data about their use will be exceedingly sparse. Only for a few words will we have a lot of examples.
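One quick way to see Zipf’s law in a corpus is to check that f · r stays roughly constant across ranks. The following sketch assumes a hypothetical corpus.txt and naive whitespace tokenization:

```python
from collections import Counter

# Check Zipf's law f ∝ 1/r: if it holds, f * r stays roughly constant.
# "corpus.txt" is a hypothetical text file.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
freqs = sorted(Counter(tokens).values(), reverse=True)   # frequency by rank

for r in (1, 10, 100, 1000, 10000):
    if r <= len(freqs):
        f = freqs[r - 1]
        print(f"rank {r:>6}  freq {f:>8}  f*r = {f * r}")
```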

  16. Things that can be done with Text Corpora II: Zipf’s Law(2)

  17. Things that can be done with Text Corpora II: Zipf’s Law(3) – The brown corpus • Famous early corpus. Made by W. Nelson Francis and Henry Kucera at Brown University in the 1960s. A balanced corpus of written American English in 1960 (except poetry!). • 1 million words, which seemed huge at the time. • Sorting the words to produce a word list took 17 hours of (dedicated) processing time, because the computer (an IBM 7070) had the equivalent of only about 40 kilobytes of memory, and so the sort algorithm had to store the data being sorted on tape drives. • Its significance has increased over time, but also awareness of its limitations. • Tagged for part of speech in the 1970s • The/AT General/JJ-TL Assembly/NN-TL ,/, which/WDT adjourns/VBZ today/NR ,/, has/HVZ performed/VBN

  18. Things that can be done with Text Corpora II: Zipf’s Law(4) • f ∝ 1/r • There is a constant k such that f · r = k • (Now frequently invoked for the web too! See http://linkage.rockefeller.edu/wli/zipf/) • Mandelbrot’s law: f = P(r + ρ)^(-B), i.e., log f = log P - B log(r + ρ) • (Figures: Zipf’s law for the Brown corpus; Mandelbrot’s formula for the Brown corpus)
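Mandelbrot’s formula can be fitted by least squares in log space. This sketch holds ρ fixed (a full fit would also search over ρ) and again assumes a hypothetical corpus.txt:

```python
import numpy as np
from collections import Counter

# Fit Mandelbrot's law f = P(r + rho)^(-B) in log space.
# "corpus.txt" is a hypothetical file; rho is held fixed for simplicity.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
f = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
r = np.arange(1, len(f) + 1)

rho = 2.7                                             # assumed offset, not fitted here
slope, intercept = np.polyfit(np.log(r + rho), np.log(f), 1)
B, P = -slope, np.exp(intercept)                      # log f = log P - B log(r + rho)
print(f"P = {P:.1f}, B = {B:.2f}")
```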

  19. Things that can be done with Text Corpora III: Collocations(1) • A collocation is any turn of phrase or accepted usage where somehow the whole is perceived as having an existence beyond the sum of its parts (e.g., disk drive, make up, bacon and eggs). • Collocations are important for machine translation. • Collocations can be extracted from a text (for example, the most common bigrams can be extracted). However, since these bigrams are often insignificant (e.g., “at the”, “of a”), they can be filtered, as sketched below.
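A rough sketch of bigram extraction and filtering; a stopword list stands in for the part-of-speech filter, and corpus.txt is a hypothetical file:

```python
from collections import Counter

# Extract the commonest bigrams, then filter out pairs made of function words.
# "corpus.txt" is a hypothetical text file; the stopword list is illustrative.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "for",
             "is", "was", "that", "by", "with", "he", "she", "it"}

tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
bigrams = Counter(zip(tokens, tokens[1:]))

print(bigrams.most_common(10))        # raw counts: dominated by "of the", "in the", ...
filtered = [(bg, n) for bg, n in bigrams.most_common(1000)
            if bg[0] not in STOPWORDS and bg[1] not in STOPWORDS]
print(filtered[:10])                  # more collocation-like word pairs
```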

  20. Things that can be done with Text Corpora III: Collocations(2) • (Tables: commonest bigrams in the NYT; filtered common bigrams in the NYT)

  21. Things that can be done with Text Corpora IV: Concordances(1) • Finding concordances corresponds to finding the different contexts in which a given word occurs. • One can use a Key Word In Context (KWIC) concordancing program. • Concordances are useful both for building dictionaries for learners of foreign languages and for guiding statistical parsers.
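A minimal KWIC sketch; the file corpus.txt and the keyword "showed" are illustrative assumptions:

```python
# Key Word In Context (KWIC): print every occurrence of a keyword with a fixed
# window of surrounding words, aligned on the keyword.
def kwic(tokens, keyword, window=4):
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;!?") == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  [{tok}]  {right}")

tokens = open("corpus.txt", encoding="utf-8").read().split()   # hypothetical file
kwic(tokens, "showed")
```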

  22. Things that can be done with Text Corpora IV: Concordances(2)
