LING / C SC 439/539 Statistical Natural Language Processing, Lecture 9, 2/11/2013
Recommended reading • Nilsson Chapter 5, Neural Networks • http://ai.stanford.edu/~nilsson/mlbook.html • http://en.wikipedia.org/wiki/Connectionism • David E. Rumelhart and James L. McClelland. 1986. On learning the past tenses of English verbs. In McClelland, J. L., Rumelhart, D. E., and the PDP research group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume II. Cambridge, MA: MIT Press. • Steven Pinker and Alan Prince. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193. • Steven Pinker and Michael T. Ullman. 2002. The past and future of the past tense. Trends in Cognitive Sciences, 6, 456-463.
Outline • Cognitive modeling • Perceptron as a model of the neuron • Neural networks • Acquisition of English past tense verbs • Discuss WA #2 and WA #3
Learning in cognitive science vs. engineering • Cognitive science • Want to produce a model of the human mind • Develop learning algorithms that model how the brain (possibly) computes • Understand observed behaviors in human learning • Engineering • Solve the problem; do whatever it takes to increase performance • Issues of relevance to the brain/mind are secondary • No constraints on resources to be used
Cognitive science:model what humans do • Learn aspects of language • How children acquire Part of Speech categories • How children acquire the ability to produce the past tense forms of English verbs • Process language • Sentence comprehension • Vowel perception • Won’t see sentiment analysis of movie reviews in a cognitive science journal…
Cognitive science:model how humans learn/process • Model observed behaviors in human learning and processing • Learning: errors in time course of acquisition • Overregularization of English past tense verbs: “holded” instead of “held” • Processing: reproduce human interpretations, such as for reduced relative clauses: • The horse raced past the barn fell • Use appropriate input data • Language acquisition: corpus of child-directed speech
Cognitive science: model how brain computes • Brains and computers have different architectures • Brain: • Consists of over 100 billion neurons, which connect to other neurons through synapses • Between 100 billion and 100 trillion synapses • Parallel computation • Computer: • One very fast processor • Serial computation • (okay, you can have multi-processor systems, but you won’t have hundreds of billions of processors) • Cognitive modeling: write computer programs that simulate distributed, parallel computation • Field of Connectionism
Biological neuron • Receives input from other neurons through its dendrites • Performs a computation • Sends the result from its axon to the dendrites of other neurons • All-or-nothing firing of the neuron http://www.cse.unsw.edu.au/~billw/dictionaries/pix/bioneuron.gif
Issue of abstract representations • In modeling data, we use abstract representations • Language: linguistic concepts such as “word”, “part of speech”, “sentence”, “phonological rule”, “syllable”, “morpheme”, etc. • The brain performs all computations through its neurons. Does the brain also use higher-level abstract representations like those proposed in linguistics? • Connectionism: • Opinion: no, there is no reality to linguistic theory • Try to show that distributed, parallel computational model of the brain is sufficient to learn and process language
Outline • Cognitive modeling • Perceptron as a model of the neuron • Neural networks • Acquisition of English past tense verbs • Discuss WA #2 and WA #3
Origin of the perceptron • Originally formulated as a model of biological computation • Want to model the brain • The brain consists of a huge network of neurons • Artificial neural networks • Perceptron = model of a single neuron • Neural network: a network of perceptrons linked together
Biological neuron • Receives input from other neurons through its dendrites • Performs a computation • Sends the result from its axon to the dendrites of other neurons • All-or-nothing firing of the neuron http://www.cse.unsw.edu.au/~billw/dictionaries/pix/bioneuron.gif
McCulloch and Pitts (1943):first computational model of a neuron http://wordassociation1.net/mcculloch.jpg http://www.asc-cybernetics.org/images/pitts80x80.jpg
McCulloch-Pitts neuron • A picture of McCulloch and Pitts's mathematical model of a neuron. The inputs xi are multiplied by the weights wi, and the neuron sums their values. If this sum is greater than the threshold θ, the neuron fires; otherwise it does not.
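A minimal sketch of this computation (the function name and the example weights/threshold below are illustrative, not taken from the original paper):

```python
# Sketch of a McCulloch-Pitts neuron: weighted sum of inputs, compared to a threshold.
def mp_neuron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Example: with threshold 1.5, this neuron fires only when both inputs are on (AND-like).
print(mp_neuron([1, 1], [1.0, 1.0], threshold=1.5))  # -> 1
print(mp_neuron([1, 0], [1.0, 1.0], threshold=1.5))  # -> 0
```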
Perceptron learning algorithm:Rosenblatt (1958) http://www.enzyklopaedie-der-wirtschaftsinformatik.de/wi-enzyklopaedie/Members/wilex4/Rosen-2.jpg/image_preview
Perceptron can’t learn hyperplanefor linearly inseparable data • Marvin Minsky and Seymour Papert 1969: • Perceptron fails to learn XOR • XOR is a very simple function • There’s no hope for the perceptron • Led to a decline in research in neural computation • Wasn’t an active research topic again until 1980s
Outline • Cognitive modeling • Perceptron as a model of the neuron • Neural networks • Acquisition of English past tense verbs • Discuss WA #2 and WA #3
Have multiple output nodes, with different weights from each input • Each input vector X = ( x1, x2, …, xm ) • Set of j perceptrons, each computing an output • Weight matrix W: size m × j
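As a rough sketch (the sizes and random weights below are just for illustration), computing all j outputs from one input vector is a single matrix multiplication:

```python
import numpy as np

# Illustrative sketch: j perceptrons sharing the same m inputs.
# W has shape (m, j); column k holds the weights of output unit k.
m, j = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(m, j))          # weight matrix, m x j
x = np.array([1.0, 0.0, 1.0, 1.0])   # one input vector of length m

h = x @ W                  # weighted sums, one per output unit
y = (h > 0).astype(int)    # threshold activation for each of the j outputs
print(y)                   # vector of j outputs (0s and 1s)
```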
A multi-layer perceptron, or neural network, has one or more hidden layers • Hidden layers consist of perceptrons (neurons)
Feedforward neural network • Output from hidden layer(s) serves as input to the next layer(s) • Computation flows in one direction, from the input layer to the output layer (left to right in the diagram)
Computational capacity of neural network • Can approximate any smooth function, given enough hidden units • The perceptron (and a linear SVM) can only learn linear decision boundaries • A neural network can learn XOR
XOR function, using pre-defined weights • Input: A = 0, B = 1 (each unit also has a bias input fixed at 1) • Input to C: -0.5*1 + 1*1 = 0.5 → Output of C: 1 • Input to D: -1*1 + 1*1 = 0 → Output of D: 0 • Input to E: -0.5*1 + 1*1 + -1*0 = 0.5 → Output of E: 1
XOR function, using pre-defined weights • Input: A = 1, B = 1 • Input to C: -0.5*1 + 1*1 + 1*1 = 1.5 → Output of C: 1 • Input to D: -1*1 + 1*1 + 1*1 = 1 → Output of D: 1 • Input to E: -0.5*1 + 1*1 + -1*1 = -0.5 → Output of E: 0
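The same computation can be written out directly. This sketch hard-codes the weights read off the two slides above (hidden units C and D, output unit E, bias input fixed at 1) and uses the step activation from earlier: fire if the weighted sum is greater than 0.

```python
# XOR with pre-defined weights: C behaves like OR, D like AND, E computes "C and not D".
def step(h):
    return 1 if h > 0 else 0

def xor_net(a, b):
    c = step(-0.5 * 1 + 1 * a + 1 * b)      # hidden unit C
    d = step(-1.0 * 1 + 1 * a + 1 * b)      # hidden unit D
    e = step(-0.5 * 1 + 1 * c + (-1) * d)   # output unit E
    return e

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table: 0, 1, 1, 0
```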
Learning in a neural network • 1. Choose topology of network • 2. Define activation function • 3. Define error function • 4. Define how weights are updated
1. Choose topology of network • How many hidden layers and how many neurons? • There aren’t really any clear guidelines • Try out several configurations and pick the one that works best • In practice, due to the difficulty of training, people use networks that have one, or at most two, hidden layers
2. Activation function • Activation: when a neuron fires • Let h be the weighted sum of inputs to a neuron. • Perceptron activation function: • g(h) = 1 if h > 0 • g(h) = 0 if h <= 0
Activation function for neural network • Cannot use the threshold function • Need a smooth function for the gradient descent algorithm, which involves differentiation • Use a sigmoid function: g(h) = 1 / (1 + e^(−βh)), where the parameter β is a positive value
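A small sketch of this sigmoid; β controls how steep the curve is (β here is just an illustrative default):

```python
import numpy as np

# Sigmoid activation: smooth and differentiable, unlike the step function,
# so it can be used with gradient descent.
def sigmoid(h, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * h))

print(sigmoid(0.0))    # 0.5
print(sigmoid(5.0))    # close to 1
print(sigmoid(-5.0))   # close to 0
```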
3. Error function • Quantifies difference between targets and outputs • Error function for a single perceptron: E = t − y, where t = target and y = output • Value of error function: if t = 0 and y = 1, E = 0 − 1 = −1; if t = 1 and y = 0, E = 1 − 0 = 1 • Training: modify weights to achieve zero error
Error function for neural network (I) • First attempt: E = ∑ (t – y) • Fails because errors may cancel out • Example: suppose we make 2 errors. • First: target = 0, output = 1, error = -1 • Second: target = 1, output = 0, error = 1 • Sum of errors: -1 + 1 = zero error!
Error function for neural network (II) • Need to make errors positive • Let the error function be the sum-of-squares function: E = ½ ∑k (tk − yk)², summed over the output units
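A small sketch of the sum-of-squares error; the ½ factor is the common convention that simplifies the derivative and is assumed here:

```python
import numpy as np

# Sum-of-squares error: squaring keeps opposite-sign errors from cancelling out.
def sse(targets, outputs):
    return 0.5 * np.sum((np.asarray(targets) - np.asarray(outputs)) ** 2)

# The two errors from the previous slide no longer cancel:
print(sse([0, 1], [1, 0]))   # 0.5 * ((-1)^2 + 1^2) = 1.0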
4. Update weights • The error function is minimized by updating weights • Perceptron weight update: w = w + η · Xᵀ(t − o) • Updating weights in the perceptron was simple • Direct relationship between input and output • How do we do this in a neural network? • There may be multiple layers intervening between inputs and outputs
Backpropagation • Suppose a neural network has one hidden layer • 1st-layer weights: between input layer and hidden layer • 2nd-layer weights: between hidden layer and output layer • Backpropagation: adjust weights backwards, one layer at a time • Update 2nd-layer weights using errors at output layer • Then update 1st-layer weights using errors at hidden layer • See readings for details of algorithm (requires calculus)
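A compact, illustrative backpropagation sketch for a one-hidden-layer network trained on XOR, using the sigmoid activation and sum-of-squares error from above. The number of hidden units, learning rate, and iteration count are arbitrary choices for this sketch, not values from the readings; see the readings for the full derivation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# XOR training data, with a constant bias input of 1 appended to each input vector
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(3, 4))   # 1st-layer weights: inputs+bias -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))   # 2nd-layer weights: hidden+bias -> 1 output unit
eta = 0.5                                  # learning rate

def forward(X):
    H = sigmoid(X @ W1)                           # hidden-layer activations
    Hb = np.hstack([H, np.ones((len(H), 1))])     # append a bias unit to the hidden layer
    return H, Hb, sigmoid(Hb @ W2)                # output-layer activations

for _ in range(10000):
    H, Hb, Y = forward(X)
    # Backpropagation: error at the output layer, then pushed back to the hidden layer
    delta_out = (Y - T) * Y * (1 - Y)
    delta_hid = (delta_out @ W2[:-1].T) * H * (1 - H)   # drop the bias row of W2
    W2 -= eta * Hb.T @ delta_out                        # update 2nd-layer weights first
    W1 -= eta * X.T @ delta_hid                         # then update 1st-layer weights

print(np.round(forward(X)[2].ravel(), 2))   # should approach [0, 1, 1, 0]
```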
Gradient descent • Series of training iterations • Weights of the network are modified in each iteration • Gradient descent: • Each iteration tries to reduce the value of the error function • In the limit, tries to find a configuration of weights that gives zero error on the training examples
Problem: local minima in error function • The algorithm can get "stuck" in a local minimum • The weights may reach a stable configuration such that the update rule finds no better alternative on the next iteration • The result is determined by the random initialization of the weights • These initial values determine the final weight configuration, which may not lead to zero error
Gradient descent gets stuck in a local minimum; we want to find the global minimum • [Figure: error surface showing a local minimum and the global minimum of the error function]
Overfitting • Since neural networks can learn any continuous function, they can be trained to the point where they overfit training data • Overfitting: algorithm has learned about the specific data points in the training set, and their noise, instead of the more general input/output relationship • Consequence: although error on training data is minimized, performance on testing set is degraded
Training and testing error rates over time • During initial iterations of training: • Training error rate decreases • Testing error rate decreases • When the network is overfitting: • Training error rate continues to decrease • Testing error rate increases • A characteristic of many learning algorithms
Illustration of overfitting • [Figure: error rate vs. training iterations; black = error on training data, red = error on testing data]
Prevent overfitting with a validation set • Set aside a portion of your training data to be a validation set • Evaluate performance on the validation set over time • Stop training when the validation error rate starts to increase (sketched below)
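An illustrative early-stopping loop. Here `train_one_epoch` and `validation_error` are hypothetical callables supplied by the caller (any training procedure would do), and `patience` is just a common heuristic for how long to tolerate non-improvement:

```python
# Sketch of early stopping: quit once the validation error stops improving.
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass of weight updates on the training set
        val_error = validation_error()         # error on held-out data, never used for updates
        if val_error < best_error:
            best_error, best_epoch = val_error, epoch
        elif epoch - best_epoch >= patience:   # validation error has stopped improving: stop
            break
    return best_epoch, best_error
```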
Neural networks and model selection • Available models: range of possible parameterizations of the model (topology of the network, random initial weights) • How well the model fits the data: can separate points in the training data; can generalize to new data • Balance between simplicity of model and fit to data: the model can be rather complex if there are lots of layers and nodes; with noisy data it can overtrain and fit to the noise • Separability: any smooth function • Maximum margin: no • Computational issues: can require more training data for good performance
Neural networks in practice • Cognitive scientists and psychologists like neural networks • Biologically inspired model of computation (though vastly simplified compared to real brains) • They often don't know about other learning algorithms! • Engineers don't use (basic) neural networks • They don't work as well as other algorithms, such as SVMs • They take a long time to train • However, there is always new research in distributed learning models • Deep Belief Networks (DBNs) have recently become popular, and have done very well in classification tasks…
Outline • Cognitive modeling • Perceptron as a model of the neuron • Neural networks • Acquisition of English past tense verbs • Discuss WA #2 and WA #3
Rumelhart & McClelland 1986 • David E. Rumelhart and James L. McClelland. 1986. On learning the past tenses of English verbs. In McClelland, J. L., Rumelhart, D. E., and the PDP research group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume II. Cambridge, MA: MIT Press. • One of the most famous papers in cognitive science
Main issue: what is the source of linguistic knowledge? • Empiricism / connectionism: • Ability to use language is a result of general-purpose learning capabilities of the brain • “Linguistic structures” are emergent properties of the learning process (not hard-coded into brain) • Rationalism / Chomsky / generative linguistics: • The mind is not a “blank slate” • Human brains are equipped with representational and processing mechanisms specific to language
English past tense verbs • Regular inflection: apply the rule "add –ed" to all regulars • cook, cooked • cheat, cheated • climb, climbed • Irregular inflection: less predictable • eat, ate (suppletion) • drink, drank (lower the vowel) • swing, swung (lower the vowel) • choose, chose (shorten the vowel) • fly, flew (change the vowel to u)
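A toy sketch of the two patterns: memorized irregular forms plus a default "add -ed" rule. This is just a lookup table with a crude spelling heuristic, not the Rumelhart & McClelland network; the word list is made up for illustration.

```python
# Toy model: irregular forms are memorized; everything else gets the regular rule.
IRREGULARS = {"eat": "ate", "drink": "drank", "swing": "swung",
              "choose": "chose", "fly": "flew", "hold": "held"}

def past_tense(verb):
    if verb in IRREGULARS:       # a memorized irregular form wins if one exists
        return IRREGULARS[verb]
    if verb.endswith("e"):       # crude spelling adjustment: bake -> baked
        return verb + "d"
    return verb + "ed"           # default regular rule: "add -ed"

print(past_tense("cook"))    # cooked
print(past_tense("drink"))   # drank
print(past_tense("blick"))   # blicked: the rule generalizes to novel (wug-test) words
```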
Children's acquisition of past tense • Observed stages in children's acquisition: • 1. Initially memorize forms, including the correct irregular forms • 2. Then acquire the regular rule, but also overregularize: bited, choosed, drinked • 3. Correct usage • Wug test (Berko 1958) • Children can generate morphological forms of novel words • This is a wug. Here are two ___.
Rumelhart & McClelland's test • A neural network does not have structures specific to language • A neural network can be applied to any learning task • Use a neural network to learn the mapping from a verb's base form to its past tense form • If successful, this supports the hypothesis that the brain does not have language-specific representations either