Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York
Elman’s Simple Recurrent Network

Input and output units (one per word): book boy dog run see eat rock
Between the layers: "regular" trainable weight connections; from the hidden layer to the context layer: a 1-to-1 exact copy of activations.

1) Activate from input to output as usual (one input word at a time), but copy the hidden activations to the context layer.
2) Repeat 1 over and over – but activate the hidden layer from the input AND context layers on the way to the output layer.
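The two-step cycle above can be sketched in code. This is a minimal illustration, not Elman's actual implementation: the class name, layer sizes, and the 7-word vocabulary from the diagram are mine, and I use numpy with a logistic activation for concreteness.

```python
import numpy as np

class SRN:
    """Minimal sketch of one Elman SRN time step (illustrative sizes)."""

    def __init__(self, n_words, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # "regular" trainable weight connections
        self.W_ih = rng.normal(0.0, 0.1, (n_hidden, n_words))   # input -> hidden
        self.W_ch = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.W_ho = rng.normal(0.0, 0.1, (n_words, n_hidden))   # hidden -> output
        self.context = np.zeros(n_hidden)                       # context starts at rest

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(self, word_vec):
        # Activate from the input AND context layers up to the output layer...
        hidden = self._sigmoid(self.W_ih @ word_vec + self.W_ch @ self.context)
        output = self._sigmoid(self.W_ho @ hidden)
        # ...then copy the hidden activations 1-to-1 into the context layer.
        self.context = hidden.copy()
        return output

words = ["book", "boy", "dog", "run", "see", "eat", "rock"]
net = SRN(n_words=len(words), n_hidden=5)
x = np.eye(len(words))[words.index("boy")]  # localist (one-hot) input for "boy"
out1 = net.step(x)
out2 = net.step(x)  # same input word, but now the context layer differs
```

Note that feeding the same word twice yields different outputs, because the context layer carries the previous hidden state – this is exactly what lets the network condition its prediction on preceding words.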
From Elman (1990): Templates were set up and lexical items were chosen at random from "reasonable" categories.

Categories of lexical items:
NOUN-HUM      man, woman
NOUN-ANIM     cat, mouse
NOUN-INANIM   book, rock
NOUN-AGRESS   dragon, monster
NOUN-FRAG     glass, plate
NOUN-FOOD     cookie, sandwich
VERB-INTRAN   think, sleep
VERB-TRAN     see, chase
VERB-AGPAT    move, break
VERB-PERCEPT  smell, see
VERB-DESTROY  break, smash
VERB-EAT      eat

Templates for the sentence generator:
NOUN-HUM VERB-EAT NOUN-FOOD
NOUN-HUM VERB-PERCEPT NOUN-INANIM
NOUN-HUM VERB-DESTROY NOUN-FRAG
NOUN-HUM VERB-INTRAN
NOUN-HUM VERB-TRAN NOUN-HUM
NOUN-HUM VERB-AGPAT NOUN-INANIM
NOUN-HUM VERB-AGPAT
NOUN-ANIM VERB-EAT NOUN-FOOD
NOUN-ANIM VERB-TRAN NOUN-ANIM
NOUN-ANIM VERB-AGPAT NOUN-INANIM
NOUN-ANIM VERB-AGPAT
NOUN-INANIM VERB-AGPAT
NOUN-AGRESS VERB-DESTROY NOUN-FRAG
NOUN-AGRESS VERB-EAT NOUN-HUM
NOUN-AGRESS VERB-EAT NOUN-ANIM
NOUN-AGRESS VERB-EAT NOUN-FOOD
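A template-based generator of this kind is only a few lines of code. This is a sketch using the categories and a few of the templates from the slide (the function name and overall shape are mine, not Elman's):

```python
import random

# Lexical categories from the slide.
CATEGORIES = {
    "NOUN-HUM": ["man", "woman"],
    "NOUN-ANIM": ["cat", "mouse"],
    "NOUN-INANIM": ["book", "rock"],
    "NOUN-AGRESS": ["dragon", "monster"],
    "NOUN-FRAG": ["glass", "plate"],
    "NOUN-FOOD": ["cookie", "sandwich"],
    "VERB-INTRAN": ["think", "sleep"],
    "VERB-TRAN": ["see", "chase"],
    "VERB-AGPAT": ["move", "break"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-EAT": ["eat"],
}

# A few of the templates from the slide (the rest work identically).
TEMPLATES = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-INTRAN"],
    ["NOUN-AGRESS", "VERB-DESTROY", "NOUN-FRAG"],
    ["NOUN-ANIM", "VERB-TRAN", "NOUN-ANIM"],
]

def generate_sentence(rng=random):
    """Pick a template, then pick a random word from each category slot."""
    template = rng.choice(TEMPLATES)
    return [rng.choice(CATEGORIES[cat]) for cat in template]
```

For example, `generate_sentence()` might return `["dragon", "smash", "plate"]` or `["man", "sleep"]`, giving the two- and three-word "sentences" used for training.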
Training data and supervisor's answers. The resulting training and supervisor files were each 27,354 words long, made up of 10,000 two- and three-word "sentences," presented as one unbroken stream of words:

training file:   woman smash plate cat move man break car boy move girl eat bread dog ...
supervisor file: smash plate cat move man break car boy move girl eat bread dog move ...

The supervisor file is the training stream shifted left by one word: at each step the network's target is the next word.
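The pairing of the two files can be made explicit in a few lines. This sketch uses the word stream from the slide; the variable names are mine:

```python
# One unbroken stream of words, as in the training file on the slide.
stream = ("woman smash plate cat move man break car "
          "boy move girl eat bread dog move").split()

training   = stream[:-1]  # network input at time t
supervisor = stream[1:]   # target at time t = the word at t+1

pairs = list(zip(training, supervisor))
# pairs[0] == ("woman", "smash"): on seeing "woman", predict "smash".
```

So the supervised task is next-word prediction over the concatenated sentences, with no markers telling the network where one "sentence" ends and the next begins.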
Cluster (similarity) analysis. After the SRN was trained, the file was run through the network and the activations at the hidden nodes were recorded. The hidden activations for each word were then averaged together. For simplicity, assume only 3 hidden nodes (in fact there were 150). (I made up these numbers for the example.)

boy          smash        plate       ...  dragon       eat          boy         ...  boy          eat          cookie      ...
<.5 .3 .2>   <.4 .4 .2>   <.2 .3 .8>  ...  <.6 .1 .3>   <.1 .2 .4>   <.9 .9 .7>  ...  <.7 .6 .7>   <.4 .3 .6>   <.2 .3 .4>

Now the average was taken over every occurrence of each word:

boy     <.70 .60 .53>
smash   <.40 .40 .20>
plate   <.20 .30 .80>
dragon  <.60 .10 .30>
eat     <.25 .25 .50>
cookie  <.20 .30 .40>

Each of these vectors represents a point in 3-D space – some vectors are close together, some further apart.
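The averaging step can be reproduced directly from the made-up numbers on the slide: gather every recorded hidden vector for a word and take the mean per word.

```python
import numpy as np

# (word, hidden activations) for each token in the made-up example above.
recorded = [
    ("boy",    [0.5, 0.3, 0.2]), ("smash", [0.4, 0.4, 0.2]),
    ("plate",  [0.2, 0.3, 0.8]), ("dragon", [0.6, 0.1, 0.3]),
    ("eat",    [0.1, 0.2, 0.4]), ("boy",    [0.9, 0.9, 0.7]),
    ("boy",    [0.7, 0.6, 0.7]), ("eat",    [0.4, 0.3, 0.6]),
    ("cookie", [0.2, 0.3, 0.4]),
]

# Average all recorded vectors per word.
averages = {}
for word in {w for w, _ in recorded}:
    vecs = np.array([v for w, v in recorded if w == word])
    averages[word] = vecs.mean(axis=0)

# averages["boy"] is approx [0.70, 0.60, 0.53], matching the slide.
```

The three occurrences of *boy* average to <.70 .60 .53> and the two occurrences of *eat* to <.25 .25 .50>, exactly the per-word points that feed the cluster analysis.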
Each of these words represents a point in 150-dimensional space, averaged from all the activations generated by the network when processing that word. In the resulting cluster diagram, each joint (where there is a connection) represents the distance between clusters. So, for example, the distance between animals and humans is approx. 0.85, and the distance between ANIMATES and INANIMATES is approx. 1.5.
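One way to read a joint height is as the distance between the clusters it connects. As a sketch (the vectors below are invented 2-D stand-ins for the real 150-dimensional averages, and "distance between centroids" is just one of the linkage choices a clustering package might use):

```python
import numpy as np

# Invented 2-D averaged word vectors standing in for two clusters.
animals = np.array([[0.6, 0.2], [0.7, 0.3]])  # e.g. cat, mouse
humans  = np.array([[0.2, 0.7], [0.3, 0.8]])  # e.g. man, woman

def centroid_distance(a, b):
    """Euclidean distance between the mean points of two clusters."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# The height of the dendrogram joint linking the two clusters.
d = centroid_distance(animals, humans)
```

With real averaged vectors, joints low in the tree (like animals vs. humans, approx. 0.85) link similar clusters, and joints high in the tree (ANIMATES vs. INANIMATES, approx. 1.5) link dissimilar ones.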
Seems to correctly discover nouns vs. verbs, verb subcategorization, animates/inanimates, etc. Cool, eh?

Remarks:
• No information is represented in the input (localist, orthogonal vectors).
• There are no "rules" in the traditional sense. The categories are learned from statistical regularities in the sentences – there is no structure being provided to the network (more on this in a bit).
• There are no "symbols" in the traditional sense. Classic symbol-manipulating systems use names for well-defined classes of entities (N, V, Adj, etc.). In an SRN the representation of the concept of, say, boy, is: 1. distributed (as a vector of activations), and 2. represented over context with respect to the words that come before. (E.g., boy is represented one way when used as an object and another when used in subject position.)
• Note, though, that when a cluster analysis is performed on specific occurrences of a word, the cluster is very tight; there is some variation based on a word's context, but not much.
From Elman (1991) – Constituency, long-distance relations, optionality. A simple context-free grammar was used:

S     -> NP VP .
NP    -> PropN | N | N RC
VP    -> V (NP)
RC    -> who NP VP | who VP (NP)
N     -> boy | girl | cat | dog | boys | girls | cats | dogs
V     -> chase | feed | see | hear | walk | live | chases | feeds | sees | hears | walks | lives
PropN -> John | Mary

Plus constraints on number agreement and verb argument subcategorization.
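Generating from this grammar is a small recursive expansion. A sketch, with two caveats: the depth cap is my addition (the relative-clause rule makes the grammar recursive, so unbounded expansion is possible), and the agreement and subcategorization constraints are not implemented here – in the paper they were imposed on top of the CFG.

```python
import random

# Elman (1991) grammar from the slide; optional constituents spelled out.
GRAMMAR = {
    "S":     [["NP", "VP", "."]],
    "NP":    [["PropN"], ["N"], ["N", "RC"]],
    "VP":    [["V"], ["V", "NP"]],
    "RC":    [["who", "NP", "VP"], ["who", "VP"], ["who", "VP", "NP"]],
    "N":     [[w] for w in "boy girl cat dog boys girls cats dogs".split()],
    "V":     [[w] for w in ("chase feed see hear walk live "
                            "chases feeds sees hears walks lives").split()],
    "PropN": [["John"], ["Mary"]],
}

def expand(symbol, rng=random, depth=0):
    """Recursively rewrite a symbol; terminals come back as themselves."""
    if symbol not in GRAMMAR:
        return [symbol]                   # terminal: a word, "who", or "."
    rules = GRAMMAR[symbol]
    if depth > 4:                         # my cap: curb recursion through RC
        rules = rules[:1]
    out = []
    for sym in rng.choice(rules):
        out.extend(expand(sym, rng, depth + 1))
    return out
```

A call like `expand("S")` yields sentences such as `boys who Mary chases feed cats .` – including, without the extra constraints, agreement violations that the real training data excluded.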
This allows a variety of interesting sentences that were used for training. (Note: *'d items were not used for training. For you CS people out there, * frequently means ungrammatical.)

Intransitive:                    Dogs live. *Dogs live cats.
Optionally transitive:           Boys see. Boys see dogs. Boys see dog.
Transitive:                      Boys hit dogs. *Boys hit.
Long-distance number agreement:  Dog who chases cat sees girl. *Dog who chase cat sees girl. Dog who cat chases sees girl. Boys who girls who dogs chase see hear.
Ambiguous sentence boundaries:   Boys see dogs who see girls who hear. Boys see dogs who see girls. Boys see dogs. Boys see.
Boys who Mary chases feed cats.
• This is much, much more difficult input than Elman 1990.
• Long-distance agreement:
• feed agrees with Boys, but who Mary chases is in the way.
• Subcategorization:
• chases is obligatorily transitive, but inside the relative clause its object is the relativized Boys, so the network has to NOT mistake Mary chases for an independent sentence.
Analysis of results – Principal Component Analysis. Suppose you have 3 hidden nodes and four vectors of activation that correspond to: boy-subj, boy-obj, girl-subj, girl-obj. Hierarchical clustering gives you this: (boysubj boyobj) (girlobj girlsubj). But PCA gives you more info: plotting the four vectors as points against the activations at hidden nodes 1, 2, and 3 shows not just which points cluster together but along which dimensions they differ (boy vs. girl along one direction, subject vs. object along another). [Figure: boysubj, boyobj, girlsubj, girlobj plotted in the space of hidden-node activations 1–3.] Adapted from Crocker (2001).
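The PCA step itself can be sketched with numpy's SVD. The four 3-node activation vectors below are invented for illustration (as are the labels); the point is only the mechanics: center the data, take the SVD, and project onto the first two principal components.

```python
import numpy as np

labels = ["boy_subj", "boy_obj", "girl_subj", "girl_obj"]
H = np.array([
    [0.8, 0.2, 0.5],  # boy as subject   (invented activations)
    [0.7, 0.3, 0.1],  # boy as object
    [0.2, 0.8, 0.5],  # girl as subject
    [0.1, 0.9, 0.1],  # girl as object
])

# Center the data, then SVD: rows of Vt are the principal components.
centered = H - H.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of the four points on the first two PCs.
projected = centered @ Vt[:2].T
```

With these invented numbers, PC1 separates boy from girl and PC2 separates subject from object – the kind of interpretable dimensions a flat cluster diagram collapses away.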