
Basics of Neural Nets and Past-Tense model



  1. Basics of Neural Nets and Past-Tense model References to read: Chapter 10 in Copeland’s Artificial Intelligence. Chapter 2 in Aleksander and Morton, ‘Neurons and Symbols’. Chapters 1, 3 and 5 in Beale and Jackson, ‘Neural Computing: An Introduction’. Chapter 18 in Rich and Knight, ‘Artificial Intelligence’.

  2. The Human Brain Contains approximately ten thousand million basic units, called Neurons. Each neuron is connected to many others. Neuron is basic unit of the brain. A stand-alone analogue logical processing unit. Only basic details of neurons really understood. Neuron accepts many inputs, which are all added up (in some fashion). If enough active inputs received at once, neuron will be activated, and fire. If not, remains in inactive quiet state. Soma is the body of neuron.

  3. Attached to the soma are long filaments: dendrites. Dendrites act as connections through which all the inputs to the neuron arrive. Axon: electrically active. Serves as the output channel of the neuron. The axon is a non-linear threshold device. It produces a pulse, called an action potential, when the resting potential within the soma rises above some threshold level. The axon terminates in a synapse, which couples the axon with a dendrite of another cell. No direct linkage, only a temporary chemical one. The synapse releases neurotransmitters which chemically activate gates on the dendrites. These gates, when open, allow charged ions to flow.

  4. These charged ions alter the dendritic potential and provide a voltage pulse on the dendrite, which is conducted to the next neuron body/soma. A single neuron will have many synaptic inputs on its dendrites, and may have many synaptic outputs connecting it to other cells.

  5. [Diagram of a neuron, with the axon and synapse labelled.]

  6. Learning: occurs when modifications are made to the effective coupling between one cell and another at the synaptic junction. More neurotransmitters are released, which opens more gates in the dendrite, i.e. the coupling is adjusted to favourably reinforce good connections.

  7. The human brain: poorly understood, but capable of immensely impressive tasks. For example: vision, speech recognition, learning etc. Also, fault tolerant: distributed processing, many simple processing elements sharing each job. Therefore can tolerate some faults without producing nonsense. Graceful degradation: with continual damage, performance gradually falls from high level to reduced level, but without dropping catastrophically to zero. (Computers do not exhibit graceful degradation: intolerant of faults). Idea behind neural computing: by modelling major features of the brain and its operation, we can produce computers that exhibit many of the useful properties of the brain.

  8. Modelling a single neuron Important features to model • The output from a neuron is either on or off • The output depends only on the inputs. A certain number must be on at any one time in order to make the neuron fire. The efficiency of the synapses at coupling the incoming signal into the cell body can be modelled by having a multiplicative factor (i.e. a weight) on each of the inputs to the neuron. A more efficient synapse has a correspondingly larger weight.

  9. Total input = (weight on line 1 × input on line 1) + (weight on line 2 × input on line 2) + … + (weight on line n × input on line n), i.e. the weighted sum over all n input lines. Basic model: performs a weighted sum of the inputs, compares this to an internal threshold level, and turns on if this level is exceeded. This model of the neuron was proposed in 1943 by McCulloch and Pitts. It is a model of a neuron, not a copy: it does not have the complex patterns and timings of actual nervous activity in real neural systems. Because it is simplified, it can be implemented on a digital computer (logic gates and neural nets!) Remember: it is only one more metaphor of the brain!
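
A minimal sketch of this unit in Python (illustrative only; the function name, weights and threshold below are examples, not values from the slides):

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# A 2-input unit with both weights 1 and threshold 1.5 behaves as an AND gate.
print(mcculloch_pitts([1, 1], [1, 1], 1.5))  # 1
print(mcculloch_pitts([1, 0], [1, 1], 1.5))  # 0
```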

  10. Learning in simple neurons Training nets of interconnected units. Essentially, if a neuron produces an incorrect output, we want to reduce the chances of it happening again. If it gives the correct output, do nothing. For example, think of the problem of teaching a neural net to tell the difference between a set of handwritten As and a set of handwritten Bs. In other words, to output a 1 when an A is presented, and a 0 when a B is presented. Start with random weights on the input lines and present an A. The neuron performs a weighted sum of the inputs and compares this to its threshold.

  11. If it exceeds the threshold, output a 1, otherwise output a 0. If correct, do nothing. If it outputs a 0 (when an A is presented), increase the weighted sum so that next time it will output a 1. Do this by increasing the weights. If it outputs a 1 in response to a B, decrease the weights so that next time the output will be 0.

  12. Summary • Set the weights and thresholds randomly • Present an input • Calculate the actual output by thresholding the weighted sum of the inputs. • Alter the weights to reinforce correct decisions – i.e. reduce the error Learning is guided by knowing what we want it to achieve = supervised learning. The above shows the essentials of the early Perceptron learning algorithm. This early history was called Cybernetics (rather than AI).
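
A sketch of this error-correction loop in Python (the learning rate, number of passes and the A/B training data are hypothetical, not taken from the slides):

```python
import random

def train_perceptron(patterns, targets, n_inputs, eta=0.1, epochs=20):
    """Simple perceptron training: adjust weights only when the output is wrong."""
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    threshold = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in zip(patterns, targets):
            weighted_sum = sum(w * xi for w, xi in zip(weights, x))
            output = 1 if weighted_sum > threshold else 0
            if output == target:
                continue                                   # correct: do nothing
            if target == 1:                                # said 0 to an A: raise the sum
                weights = [w + eta * xi for w, xi in zip(weights, x)]
            else:                                          # said 1 to a B: lower the sum
                weights = [w - eta * xi for w, xi in zip(weights, x)]
    return weights, threshold
```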

  13. [Plot: the four input patterns 00, 01, 10, 11 with their outputs 0 and 1.] But there are limitations to Perceptron learning.

  14. Exclusive-OR truth table • Consider two propositions, either of which may be true or false • Exclusive-or is the relationship between them when JUST ONE OF THEM is true. • It EXCLUDES the case when both are true, so the exclusive-or of the two is… • False when both are true or both are false, and true in the other two cases.
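
Laid out as the truth table the slide title refers to:

A      B      A exclusive-or B
false  false  false
false  true   true
true   false  true
true   true   false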

  15. [Plot: the four input patterns 00, 01, 10, 11 plotted with their XOR outputs.] But there are limitations to Perceptron learning. Consider the Perceptron trying to find a straight line that separates the classes. In some cases, no straight line can be drawn to separate the classes. E.g. XOR (if 0 is FALSE and 1 is TRUE): input 0 1 – 1; input 1 0 – 1; input 1 1 – 0; input 0 0 – 0.

  16. Cannot separate the two pattern classes by a straight line: they are linearly inseparable. This failure to solve apparently simple problems like XOR was pointed out by Minsky and Papert in Perceptrons in 1969. It stopped research in the area for the next 20 years! During which time (non-neural) AI got under way.

  17. 1986: Rumelhart and McClelland: multi-layer perceptron. [Diagram: a feedforward net with two weight layers and three sets of units (input units, hidden units and output units), with bias connections into the hidden and output layers.]

  18. An adapted perceptron, with units arranged in layers: an input layer, an output layer, and a hidden layer. The threshold function is changed, and the learning rule altered. New learning rule: backpropagation (also a form of supervised learning). The net is shown a pattern, and the output is compared to the desired output (target). Weights in the network are adjusted by calculating the value of the error function for a particular input, and then backpropagating the error from one layer to the previous one. Output weights (weights connected to the output layer) can be adjusted so that the value of the error function is reduced.

  19. It is less obvious how to adjust the weights for the hidden units (which do not directly produce an output). The input weights are adjusted in direct proportion to the error in the units to which they are connected.
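
A compact sketch of backpropagation on a small two-weight-layer net, trained here on XOR. This is illustrative only: the use of a sigmoid in place of the hard threshold, the layer sizes, learning rate and number of passes are assumptions, not the settings used by Rumelhart and McClelland.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # targets (XOR)

W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros((1, 2))          # input -> hidden
W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros((1, 1))          # hidden -> output
eta = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    H = sigmoid(X @ W1 + b1)            # forward pass: hidden activations
    Y = sigmoid(H @ W2 + b2)            # forward pass: output activations
    dY = (Y - T) * Y * (1 - Y)          # error signal at the output layer
    dH = (dY @ W2.T) * H * (1 - H)      # error backpropagated to the hidden layer
    W2 -= eta * H.T @ dY; b2 -= eta * dY.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(axis=0, keepdims=True)

# Outputs are usually close to [0, 1, 1, 0]; training can occasionally
# stall in a local minimum, as the following slides note.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```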

  20. [Diagram: a solution to the XOR problem, with the weights and thresholds 0.5, -1, 1, 1.5, 0.5, 1, 1, 1, 1 marked on its units and connections. Inputs: 00, 01, 10, 11.] The right-hand hidden unit detects when both inputs are on, ensuring that the output unit then gets a net input of zero. With only one of the two inputs on, the right-hand threshold is never met (and that unit feeds the output through the negative weight).

  21. When only one of the inputs is on, the left-hand hidden unit is on, turning on the output unit. When both inputs are off, the hidden units are inactive, and the output unit is off. BUT the learning rule is not guaranteed to produce convergence: it can fall into a situation where it cannot learn the correct output = a local minimum. BUT training requires repeated presentations. Training multi-layer perceptrons is an inexact science: there is no guarantee that the net will converge on a solution (i.e. that it will learn to produce the required output in response to inputs). It can involve long training times.
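
A quick check of the hand-wired XOR net from the previous two slides, assuming the weights and thresholds are assigned as described (left-hand hidden unit: threshold 0.5; right-hand hidden unit: threshold 1.5; all input-to-hidden weights 1; hidden-to-output weights 1 and -1; output threshold 0.5):

```python
def step(net_input, threshold):
    return 1 if net_input > threshold else 0

def xor_net(x1, x2):
    left_hidden = step(1 * x1 + 1 * x2, 0.5)    # on if either input is on
    right_hidden = step(1 * x1 + 1 * x2, 1.5)   # on only if both inputs are on
    # the right-hand hidden unit feeds the output through the negative weight
    return step(1 * left_hidden - 1 * right_hidden, 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))   # 0, 1, 1, 0
```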

  22. Little guidance about a number of parameters, including the number of hidden units needed for a particular task. Also need to find a good input representation of a problem. Often need to search for a good preprocessing method.

  23. Generalisation Main feature of neural networks: ability to generalise and to go beyond the patterns they have been trained on. Unknown pattern will be classified with others that have same features. Therefore learning by example is possible; net trained on representative set of patterns, and through generalisation similar patterns will also be classified.

  24. Fault Tolerance Multi-layer perceptrons are fault-tolerant because each node contributes to the final output. If a node or weights are lost, there is only slight deterioration, i.e. graceful degradation.

  25. Brief History of Neural Nets Connectionism/ Neural Nets/ Parallel Distributed Processing. McCulloch and Pitts (1943) Brain-like mechanisms – showing how artificial neurons could be used to compute logical functions. Simplification 1: Neural communication thresholded – neuron is either active enough to fire, or it is not. Thus can be thought of as binary computing device (ON or OFF). Simplification 2: Synapses – equivalent to weighted connections. So can add up weighted inputs to an element, and use binary threshold as output function. 1949: Donald Hebb showed how neural nets could form a system that exhibits memory and learning.

  26. Learning – a process of altering the strengths of connections between neurons. Reinforcing active connections. Rosenblatt (1962) and the Perceptron. A single-layer net, which can learn to produce an output given an input. But connectionist research was almost killed off by Minsky and Papert with their book ‘Perceptrons’. Argument: Perceptrons are computationally weak (there are certain problems which cannot be solved by a 1-layer net, and there was no learning mechanism for a 2-layer net). But then a resurgence of interest in neural computation – the result of new neural network architectures and new learning algorithms, i.e. backpropagation and 2-layer nets.

  27. Rumelhart and McClelland and the PDP Research Group (1986): 2 books on Parallel Distributed Processing. Presented a variety of NN models – including the Past-tense model (see below). The huge impact of these volumes was partly because they contain cognitive models, i.e. models of some aspect of human cognition. Cognition: thinking, understanding language, memory. Human abilities that imply our ability to represent the world. Best contrasted to the behaviour-based approach.

  28. Example applications of Neural Nets NETtalk; Sejnowski and Rosenberg, 1987: network that learns to pronounce English text. Takes text, maps text onto speech phonemes, and then produces sounds using electronic speech generator. It is difficult to specify rules to govern translation of text into speech – many exceptions and complicated interactions between rules. For example, ‘x’ pronounced as in ‘box’ or ‘axe’, but exceptions eg ‘xylophone’. Connectionist approach: present words and their pronunciations, and see if net can discover mapping relationship between them.

  29. 203 input units, 80 hidden units, and 26 output units, the outputs corresponding to phonemes (the basic sounds in a language). A window seven letters wide is moved over the text, and the net learns to pronounce the middle letter. Each character is encoded over 29 input units: one for each of the 26 letters, and one each for blanks, periods and other punctuation (7 × 29 inputs = 203). Trained on a 1024-word text; after 50 passes NETtalk learns to perform at 95% accuracy on the training set. It is able to generalise to unseen words at a level of 78%. Note here: training vs. test sets.

  30. So, for the string ‘SHOEBOX’ • The first of the seven inputs is • 00000000000000000010000000 • because S is the 19th letter • and the output will be the zero phoneme, because the E is silent in SHOEBOX, i.e. if the zero phoneme is placed first: • 10000000000000000000000000
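
A sketch of this sliding-window encoding in Python. The ordering of the three non-letter units (blank, period, other punctuation) and the blank-padding at the word edges are assumptions for illustration; the vectors on the slide above show only the 26 letter positions.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_char(c):
    """One-hot encode a character over 29 units: 26 letters + blank + period + other."""
    units = [0] * 29
    if c.upper() in ALPHABET:
        units[ALPHABET.index(c.upper())] = 1
    elif c == " ":
        units[26] = 1
    elif c == ".":
        units[27] = 1
    else:
        units[28] = 1
    return units

def encode_window(text, centre):
    """Encode the 7-letter window centred on text[centre] as 7 x 29 = 203 inputs."""
    padded = " " * 3 + text + " " * 3
    window = padded[centre:centre + 7]   # text[centre] sits in the middle of this slice
    return [u for ch in window for u in encode_char(ch)]

inputs = encode_window("SHOEBOX", 3)     # window centred on the silent E
print(len(inputs))                       # 203
```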

  31. A particularly influential example: a tape-recorded NETtalk starting out with poor babbling speech and gradually improving until the output is intelligible. It sounds like a child learning to speak. Passed the Breakfast TV test of real AI. Cottrell et al 1987: image compression. Gorman and Sejnowski 1988: classification of sonar signals (mines versus rocks). Tesauro and Sejnowski, 1989: playing backgammon. Le Cun et al, 1989: recognising handwritten postcodes. Pomerleau, 1989: navigation of a car on a winding road.

  32. Summary: What are Neural Nets? Important characteristics: • Large number of very simple neuronlike processing elements. • Large number of weighted connections between these elements. • Highly parallel. • Graceful degradation and fault tolerant Key concepts Multi-layer perceptron. Backpropagation, and supervised learning. Generalisation: nets trained on one set of data, and then tested on a previously unseen set of data. Percentage of previously unseen set they get right shows their ability to generalise.

  33. What does ‘brain-style computing’ mean? Rough resemblance between units and weights in Artificial Neural Network (or ANNs) and neurons in brain and connections between them. • Individual units in a net are like real neurons. • Learning in brain similar to modifying connection strengths. • Nets and neurons operate in a parallel fashion. • ANNs store information in a distributed manner as do brains. • ANNs and brain degrade gracefully. • BUT these structures still model logic gates as well and are not a different kind of non-von Neumann machine

  34. BUT Artificial Neural Net account is simplified. Several aspects of ANNs don’t occur in real brains. Similarly brain contains many different kinds of neurons, different cells in different regions. e.g. not clear that backpropagation has any biological plausibility. Training with backpropagation needs enormous numbers of cycles. Often what is modelled is not the kinds of process that are likely to occur at neuron level. For example, if modelling our knowledge of kinship relationships, unlikely that we have individual neurons corresponding to ‘Aunt’ etc.

  35. Edelman, 1987 suggests that it may take units ‘in the order of several thousand neurons to encode stimulus categories of significance to animals’. Better to talk of neurally inspired or brain-style computation. Remember too that (as with Aunt) even the best systems have nodes pre-coded with artificial notions like the phonemes (corresponding to the phonetic alphabet). These cannot be precoded in the brain (as they are in Sejnowski’s NETtalk) but must themselves be learned.

  36. Getting closer to real intelligence? Idea that intelligence is adaptive behaviour. I.e. an organism that can learn about its environment is intelligent. Can contrast this with approach that assumes that something like playing chess is an example of intelligent behaviour. Connectionism still in its infancy: • still not impressive compared to ants, earthworms or cockroaches. But arguably still closer to computation that does occur in brain than is the case in standard symbolic AI. Though remember McCarthy’s definition of AI as common-sense reasoning (esp. of a prelinguistic child).

  37. And might still be a better approach than the symbolic one. Like analogy of climbing a tree to reach the moon – may be able to perform certain tasks in symbolic AI, but may never be able to achieve real intelligence. Ditto with connectionism/ANNs ---both sides use this argument.

  38. Past-tense learning model references: Chapter 18: On learning the past tenses of English verbs. In McClelland, J.L., Rumelhart, D.E. and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 2: Psychological and Biological Models, Cambridge, MA: MIT Press/Bradford Books. Chapter 6: Two simulations of higher cognitive processes. In Bechtel, W. and Abrahamsen, A. (1991) Connectionism and the Mind: An Introduction to Parallel Processing in Networks. Basil Blackwell.

  39. Past-tense model A model of human ability to learn past-tenses of verbs. Presented by Rumelhart and McClelland (1986) in their ‘PDP volumes’: Main impact of these volumes: introduced and popularised the ideas of Multi-layer Perceptron, trained by means of Backpropagation

  40. Children learning to speak: Baby: DaDa Toddler: Daddy Very young child: Daddy home!!!! Slightly older child: Daddy came home! Older child: Daddy comed home! Even older child: Daddy came home!

  41. Stages of acquisition in children: Stage 1: past tense of a few specific verbs, some regular, e.g. looked, needed. Most irregular: came, got, went, took, gave. As if learned by rote (memorised). Stage 2: evidence of a general rule for the past tense, i.e. add -ed to the stem of the verb. And often overgeneralise irregulars, e.g. camed or comed instead of came. Also (Berko, 1958) can generate the past tense for an invented word, e.g. if they use rick to describe an action, they will tend to say ricked when using the word in the past tense.

  42. Stage 3: Produce correct forms for both regular and irregular verbs. [Table: Characteristics of the 3 stages of past-tense acquisition.]

  43. U-shaped curve – correct past-tense form used for verbs in Stage 1, errors in Stage 2 (overgeneralising rule), few errors in Stage 3. Suggests Stage 2 children have acquired rule, and Stage 3 children have acquired exceptions to rule. Aim of Rumelhart and McClelland: to show that connectionist network could show many of same learning phenomena as children. • same stages and same error patterns.

  44. Overview of past-tense NN model Not a full-blown language processor that learns past-tenses from full sentences heard in everyday experience. Simplified: model presented with pairs, corresponding to root form of word, and phonological structure of correct past-tense version of that word. Can test model by presenting root form of word, and looking at past-tense form it generates.

  45. More detailed account: Input and Output Representation To capture order information, the Wickelfeature method of encoding words is used. 460 inputs: Wickelphones represent a target phoneme and its immediate context, e.g. came – #Ka, kAm, aM#. These are coarse-coded onto Wickelfeatures, where 16 Wickelfeatures correspond to each Wickelphone. The input and output of the net each consist of 460 units. Inputs are the ‘standard present’ forms of verbs, outputs are the corresponding past forms, regular or irregular, and all are in the special ‘Wickel’ format.
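
A sketch of extracting Wickelphones (context trigrams) from a phoneme string, with ‘#’ marking the word boundary and the central phoneme of each trigram written in upper case, as in the came example above. The coarse-coding of each Wickelphone onto 16 Wickelfeatures is not shown.

```python
def wickelphones(phonemes):
    """Return the boundary-padded trigrams for a phoneme string."""
    padded = "#" + phonemes.lower() + "#"
    return [
        padded[i - 1] + padded[i].upper() + padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]

print(wickelphones("kam"))   # ['#Ka', 'kAm', 'aM#']
```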

  46. This is a good example of the need to find a good way of representing the input: you can’t just present words to a net; you have to find a way of encoding those words so they can be presented as a set of inputs. Assessing output: compare the pattern of output Wickelphone activations to the pattern that the correct response would have generated. Hits: a 1 in the output when there is a 1 in the target, and a 0 in the output when there is a 0 in the target. False alarms: 1s in the output that are not in the target. Misses: 0s in the output where the target has a 1.
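
A small scoring sketch using these definitions (the example output and target vectors are hypothetical):

```python
def score(output, target):
    hits = sum(1 for o, t in zip(output, target) if o == t)
    false_alarms = sum(1 for o, t in zip(output, target) if o == 1 and t == 0)
    misses = sum(1 for o, t in zip(output, target) if o == 0 and t == 1)
    return hits, false_alarms, misses

print(score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))   # (3, 1, 1)
```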

  47. Training and Testing A verb is input and propagated across the weighted connections – this will activate Wickelfeatures in the output that correspond to the past tense of the verb. The perceptron-convergence procedure was used to train the net. (NB not a multi-layer perceptron: there is no hidden layer, and it is not trained with backpropagation; the problem must be linearly separable.) The target tells each output unit what value it should have. When the actual output matches the target, no weights are adjusted. When the computed output is 0 and the target is 1, we need to increase the probability that the unit will be active the next time that pattern is presented: all weights from all active input units are increased by a small amount eta, and the threshold is also reduced by eta.
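
A sketch of this per-unit update rule. The slide only spells out the output-0/target-1 case; the symmetric case below (output 1, target 0: decrease the weights on active inputs and raise the threshold by eta) is an assumption.

```python
def update_unit(weights, threshold, inputs, output, target, eta=0.1):
    """Perceptron-convergence style update for one output unit."""
    if output == target:
        return weights, threshold                                  # correct: no change
    if target == 1:                                                # output 0, should be 1
        weights = [w + eta * x for w, x in zip(weights, inputs)]   # only active inputs change
        threshold -= eta
    else:                                                          # output 1, should be 0 (assumed)
        weights = [w - eta * x for w, x in zip(weights, inputs)]
        threshold += eta
    return weights, threshold
```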
