"Ideal" learning of language and categories
Nick Chater, Department of Psychology, University of Warwick
Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam
OVERVIEW
I. Learning from experience: The problem
II. Learning to predict
III. Learning to identify
IV. A methodology for assessing learnability
V. Where next?
Can a more minimal model of learning still work?
• Model fitting: assume M(x), optimize x. Easy, but needs prior knowledge
• No assumptions: learning is impossible ("no free lunch")
• Learning: how few assumptions will work?
Learning from +/− vs. + data
[Figure: target language/category vs. guess; + data falls in the overlap, − data outside; mismatches correspond to under-general and over-general guesses]
But how about learning from + data only?
• Categorization
• Language acquisition
Learning from positive data seems to raise in-principle problems
• In categorization, it rules out: almost all learning experiments in psychology; exemplar models; prototype models; NNs, SVMs…
• In language acquisition, it is assumed that children only have access to positive evidence
• Sometimes viewed as ruling out learning models entirely: the "logical" problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)
Must be solvable: A parallel with science • Science only has access to positive data • Yet it seems to be possible • So overgeneral theories must be eliminated, somehow • e.g., “Anything goes” seems a bad theory • Theories must capture regularities, not just fit data
Absence as implicit negative evidence? • Overgeneral grammars may predict lots of missing sentences • And their absence is a systematic clue that the theory is probably wrong. This idea only seems convincing if it can be proved that convergence works well, statistically... So what do we need to assume?
Modest assumption: Computability constraint
• Assume that data is generated by random factors plus computable factors, i.e., nothing uncomputable
• "Monkeys typing into a programming language"
• A modest assumption!
[Figure: chance (…HHTTTHTTHTTHT…) fed through a computable process (e.g., a grammar: S → NP V, V NP) yields output such as "…The cat sat on the mat. The dog…"]
Learning by simplicity • Find the explanation of the "input" that is as simple as possible • An 'explanation' reconstructs the input • Simplicity measured in code length • Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm • Mimicry theorem with Bayesian analysis, e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.) • Relation to Bayesian inference • Widely used in statistics and machine learning
Consider "ideal" learning
• Given the data, what is the shortest code?
• How well does the shortest code work? For prediction; for identification
• Ignore the question of search: this makes general results feasible
• But search won't go away…! Fundamental question: when is learning data-limited or search-limited?
Three kinds of induction • Prediction: • converge on correct predictions • Identification: • identify generating category/distribution in the limit • Learning causal mechanisms?? • Inferring counterfactuals---effects of intervention • (cf Pearl: from probability to causes)
Prediction by simplicity • Find the shortest 'program/explanation' for the current data • Predict using that program • Strictly, use a 'weighted sum' of explanations, weighted by brevity… • Equivalent to Bayes with (roughly) a 2^(−K(x)) prior, where K(x) is the length of the shortest program generating x
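K(x) is uncomputable, so any concrete illustration has to substitute a computable proxy. The sketch below is a minimal, hypothetical stand-in (not the authors' method): it uses zlib compressed length in place of shortest-program length and weights candidate continuations by 2^(−code length).

```python
# A minimal sketch of "prediction by simplicity", NOT the authors' implementation.
# K(x) is uncomputable, so compressed length (zlib) stands in for shortest-program
# length; weights 2^(-codelength) mimic the simplicity prior.
import zlib

def code_length(s: str) -> int:
    """Approximate description length of s, in bits (compressed size * 8)."""
    return 8 * len(zlib.compress(s.encode("utf-8"), 9))

def predict_next(history: str, candidates: list[str]) -> dict[str, float]:
    """Weight each candidate continuation by 2^(-codelength(history + candidate))."""
    weights = {c: 2.0 ** (-code_length(history + c)) for c in candidates}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

if __name__ == "__main__":
    history = "the cat sat on the mat. the dog sat on the "
    print(predict_next(history, ["mat.", "xylophone."]))
    # Ideally the continuation that compresses better jointly with the history
    # receives the larger share of the predictive weight.
```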
• Summed error has a finite bound (Solomonoff, 1978)
• So prediction converges [faster than 1/(n log n), for corpus size n]
• Inductive inference is possible!
• No independence or stationarity assumptions; just computability of the generating mechanism
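For reference, a standard textbook form of Solomonoff's bound (following Solomonoff, 1978; Li & Vitányi); this is a reconstruction rather than the slide's own formula, with μ the computable generating measure and M the simplicity-based predictor:

```latex
\sum_{j=1}^{\infty} \mathbb{E}_{\mu}\!\left[\big(M(0 \mid x_{1:j-1}) - \mu(0 \mid x_{1:j-1})\big)^{2}\right]
\;\le\; \frac{\ln 2}{2}\, K(\mu)
```

Because the summed (squared) errors are finite, the per-item error must eventually fall faster than 1/(n log n).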
Applications
• Language: A. Grammaticality judgements; B. Language production; C. Form-meaning mappings
• Categorization: learning from positive examples
A: Grammaticality judgments • We want a grammar that doesn't over- or under-generalize (much) w.r.t. the 'true' grammar, on sentences that are statistically likely to occur • NB. No guarantees for… • Colorless green ideas sleep furiously (Chomsky) • Bulldogs bulldogs bulldogs fight fight fight (Fodor)
Converging on a grammar • Fixing undergeneralization is easy (such grammars get ‘falsified’) • Overgeneralization is the hard problem • Need to use absence as evidence • But the language is infinite; any corpus finite • So almost all grammatical sentences are also absent • Logical problem of language acquisition; • Baker’s paradox • Impossibility of ‘mere’ learning from positive evidence
Overgeneralization Theorem
• Suppose the learner has probability εj of erroneously guessing an ungrammatical jth word; then the summed errors Σj εj have a finite bound
• Intuitive explanation: overgeneralization assigns smaller-than-needed probabilities to grammatical sentences, and hence excessive code lengths
B: Language production
• Simplicity allows 'mimicry' of any computable statistical method of generating a corpus
• Arbitrary computable probability vs. simplicity-based probability
• Li & Vitányi, 1997
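The 'mimicry' claim is usually cashed out via the dominance property of the universal (simplicity-based) distribution m over any computable distribution μ (Li & Vitányi, 1997); stated in its standard textbook form (a reconstruction, not the slide's own notation):

```latex
m(x) \;\ge\; 2^{-K(\mu) - O(1)}\,\mu(x) \quad \text{for all } x
```

So the simplicity-based code is at worst an additive constant of roughly K(μ) bits longer than the code based on the true generating distribution.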
C: Learning form-meaning mappings • So far we have ignored semantics • Suppose the language input consists of form-meaning pairs (cf. Pinker) • Assume only that the form→meaning and meaning→form mappings are computable (they don't have to be deterministic)…
A theorem • It follows that: • Total errors in mapping forms to (sets of) meanings (with probs) and • Total errors in mapping meanings to (sets of) forms (with probs) • …have a finite bound (and hence average errors/sentence tend to 0)
Categorization
• Sample n items from category C (assume all items equally likely)
• Guess, by choosing the D that provides the shortest code for the data
General proof method:
1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn't prefer it)
2. Undergeneralization: typical data from category C will have no code shorter than n log|C|
1. Fighting overgeneralization
• D can't be much bigger than C, or it'll have a longer code length:
• K(D) + n log|D| ≤ K(C) + n log|C|
• as n → ∞, the constraint is that |D|/|C| ≤ 1 + O(1/n)
2. Fighting undergeneralization
• But the guess must cover most of the correct category---or it'd provide a "suspiciously" short code for the data
• Typicality: K(D|C) + n log|C∩D| ≥ n log|C|
• as n → ∞, the constraint is that |C∩D|/|C| ≥ 1 − O(1/n)
(A toy numeric sketch of the code-length comparison follows below.)
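To make the code-length comparison in the last two slides concrete, here is a toy numeric sketch. All numbers are made up: the K() values stand in for hypothetical "bits to describe the category", and categories are treated as finite sets of equiprobable items.

```python
# A toy numeric illustration of the two-part code-length comparison,
# NOT the authors' proof. Category sizes, K() values and n are hypothetical.
from math import log2

def two_part_code(K_hyp: float, size_hyp: int, n: int) -> float:
    """Bits to describe the hypothesis + bits to code n equiprobable items from it."""
    return K_hyp + n * log2(size_hyp)

# True category C and two rival guesses: an overgeneral D_big and a near-match D_ok.
C     = {"K": 100.0, "size": 1000}
D_big = {"K":  40.0, "size": 4000}   # simpler to state, but 4x too large
D_ok  = {"K": 110.0, "size": 1050}   # about the right size

for n in (1, 10, 100, 1000):
    costs = {name: two_part_code(h["K"], h["size"], n)
             for name, h in (("C", C), ("D_big", D_big), ("D_ok", D_ok))}
    best = min(costs, key=costs.get)
    print(f"n={n:5d}  best hypothesis by code length: {best}  {costs}")
# For small n the overgeneral D_big wins (its K is cheap); once n*log2(|D|/|C|)
# outweighs the K difference it loses -- the |D|/|C| <= 1 + O(1/n) bound in action.
```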
Implication
• |D| converges to near |C|: accuracy bounded by O(1/n), with n samples (i.i.d. assumptions)
• The actual rate depends crucially on the structure of the category
• Language: need lots of examples (but how many?)
• Some categories may only need a few (one?) examples (Tenenbaum, Feldman)
Hypothesis identification • Induction of ‘true’ hypothesis, or category, or language • In philosophy of science, typically viewed as hard problem… • Needs stronger assumptions than prediction
Identification in the limit: The problem
• Assume endless data
• Goal: specify an algorithm that, at each point, picks a hypothesis and eventually locks in on the correct hypothesis---though it can never announce it, as there may always be an additional low-frequency item that's yet to be encountered
• Gold, Osherson et al. have studied this extensively
• Sometimes viewed as showing identification is not possible (but really a mix of positive and negative results)
• But i.i.d. and computability allow a general positive result
Algorithm
• Hypotheses have two parts: a program which specifies Pr; sampling from Pr, using average code length H(Pr) per data point
• Pick a specific set of data (which needs to be 'long enough'); we won't necessarily know what is long enough---an extra assumption
• Specify an enumeration of programs for Pr, e.g., in order of length
• Run, dovetailing
• Initialize with any Pr; flip to the Pr that corresponds to the shortest program so far that has generated the data
Dovetailing
[Figure: dovetailing schedule---prog1 gets steps 1, 2, 4, 7, …; prog2 gets steps 3, 5, 8, …; prog3 gets steps 6, 9, …; prog4 gets step 10, …; some programs run for ever]
• Run the programs in order, dovetailing, where each program gets a 2^(−length) share of the steps
• This process runs for ever (looping programs)
• The shortest program so far that has generated the data is "pocketed"…
• This will always finish on the "true" program
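A toy sketch of the dovetailing scheduler, under strong simplifying assumptions: instead of enumerating all programs, it interleaves a few hand-picked hypothetical "programs" (Python generators with made-up code lengths) and pockets the shortest one whose output matches the observed data.

```python
# A toy sketch of dovetailing, NOT the authors' algorithm. Real identification
# enumerates *all* programs; here a few hypothetical generators stand in,
# each given a share of steps proportional to 2^(-length).
from itertools import islice
from typing import Optional

def prog_ab():    # hypothetical code length: 3 bits
    while True:
        yield "a"; yield "b"

def prog_abc():   # hypothetical code length: 5 bits
    while True:
        yield "a"; yield "b"; yield "c"

def prog_loop():  # hypothetical code length: 2 bits; "runs for ever" unhelpfully
    while True:
        yield ""

CANDIDATES = [("prog_loop", 2, prog_loop), ("prog_ab", 3, prog_ab), ("prog_abc", 5, prog_abc)]

def dovetail(observed: str, rounds: int = 50) -> Optional[str]:
    """Interleave candidate programs; pocket the shortest whose output matches `observed`."""
    state = {name: (length, prog(), []) for name, length, prog in CANDIDATES}
    pocketed = None
    for _ in range(rounds):
        for name, (length, gen, buf) in state.items():
            steps = max(1, int(rounds * 2.0 ** (-length)))   # 2^(-length) share of effort
            buf.extend(islice(gen, steps))
            if "".join(buf).startswith(observed):
                if pocketed is None or length < state[pocketed][0]:
                    pocketed = name                            # shortest matching program so far
    return pocketed

if __name__ == "__main__":
    print(dovetail("ababab"))   # expected: "prog_ab", the shortest program generating the data
```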
• For a large enough stream of n typical data points, no alternative model does better
• Coding data generated by Pr with Pr′ rather than Pr wastes an expected n·D(Pr‖Pr′) bits
• D(Pr‖Pr′) > 0, so this swamps the initial code length, for large enough n
• Overwhelmingly likely to work... (as n → ∞, Prob(correct identification) → 1)
[Figure: initial code lengths K(Pr), K(Pr′) plus per-item waste; Pr wins by around n = 8 in the illustration]
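A hedged numeric illustration of the break-even point, with made-up distributions and made-up program lengths standing in for K(Pr) and K(Pr′):

```python
# Toy numbers only: two hypothetical next-symbol distributions and hypothetical
# program lengths; shows when the per-item KL waste swamps the initial code lengths.
from math import log2

def kl(p: dict[str, float], q: dict[str, float]) -> float:
    """D(p || q) in bits."""
    return sum(p[s] * log2(p[s] / q[s]) for s in p if p[s] > 0)

Pr     = {"a": 0.7, "b": 0.3}   # true generating distribution (hypothetical)
Pr_alt = {"a": 0.5, "b": 0.5}   # rival model (hypothetical)
K_Pr, K_Pr_alt = 120.0, 60.0    # hypothetical program lengths in bits

waste_per_item = kl(Pr, Pr_alt)              # expected extra bits per data point
break_even = (K_Pr - K_Pr_alt) / waste_per_item
print(f"waste per item: {waste_per_item:.3f} bits; "
      f"Pr overtakes Pr_alt after ~{break_even:.0f} items")
```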
Assessing learnability in cognition? Cf. Tenenbaum, Feldman…
• Constraint c is learnable if a code which 1. "invests" l(c) bits to encode c can… 2. recoup its investment: save more than l(c) bits in encoding the data
• Nativism? c is acquired, but there is not enough data to recoup the investment (e.g., little/no relevant data)
• Viability of empiricism? Ample supply of data to recoup l(c)
Language acquisition: Poverty of the stimulus, quantified
• Consider a linguistic constraint (e.g., noun-verb agreement; subjacency; phonological constraints)
• Cost: assessed by length of formulation (length of linguistic rules)
• Saving: reduction in cost of coding the data (perceptual, linguistic)
Easy example: learning singular-plural agreement
• With the agreement constraint: "John loves tennis" costs x bits; "They love_ tennis" costs y bits
• Without the constraint, both "John loves tennis" / "*John love_ tennis" and "They love_ tennis" / "*They loves tennis" are possible, so the sentences cost x+1 and y+1 bits
• If the constraint applies to a proportion p of n sentences, the constraint saves pn bits (1 bit per sentence it applies to; see the sketch below)
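A minimal arithmetic sketch of the investment/recoup accounting, with made-up numbers for the constraint's statement length and the proportion of sentences it applies to:

```python
# Hypothetical numbers: does a 1-bit-per-sentence agreement constraint pay for itself?
def net_saving(constraint_length_bits: float, n_sentences: int, p_applies: float) -> float:
    """Saving from the constraint (p*n bits) minus the cost of stating it."""
    return p_applies * n_sentences - constraint_length_bits

# e.g., a constraint costing 200 bits to state, applying to 80% of sentences:
for n in (100, 1000, 10000):
    print(n, net_saving(200.0, n, 0.8))
# Positive values mean the investment is recouped; it breaks even at n = 250 here.
```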
Visual structure―ample data?
• Depth from stereo: invest in an algorithm for correspondence; recoup almost a whole image (that's a lot!). Perhaps stereo could be inferred from a single stereo image?
• Object/texture models (Yuille): investment in building the model, but recouped in compression over the "raw" image description. Presumably few images needed?
A harder linguistic case: Baker's paradox (with Luca Onnis and Matthew Roberts)
Quasi-regular structures are ubiquitous in language: e.g., alternations (Baker, 1979; see also Culicover)
• It is likely that John will come / It is possible that John will come / John is likely to come / *John is possible to come
• Strong winds / High winds / Strong currents / *High currents
• I love going to Italy! / I enjoy going to Italy! / I love to go to Italy! / *I enjoy to go to Italy!
Baker’s paradox (Baker, 1979) • Selectional restrictions: “holes” in the space of possible sentences allowed by a given grammar… • How does the learner avoid falling into the holes?? • i.e., how does the learner distinguish genuine ‘holes’ from the infinite number of unheard grammatical constructions?
Our abstract theory tells us something • The theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote, and with no computational restrictions • But can this be scaled down… • to learning specific 'alternation' patterns • with the corpus the child actually hears?
Argument by information investment
• To encode an exception, which appears to have probability x, requires log2(1/x) bits
• But eliminating this probability mass x makes all other sentences 1/(1−x) times more likely, saving n·log2(1/(1−x)) bits over n sentences
• Does the saving outweigh the investment? (See the break-even sketch below.)
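The same accounting for an exception, again with made-up numbers (the probability x of the overgeneral construction and the corpus sizes are hypothetical):

```python
# Hypothetical illustration: when does ruling out an exception pay for itself?
from math import log2

def investment_bits(x: float) -> float:
    """Bits to encode an exception that appears to have probability x."""
    return log2(1.0 / x)

def saving_bits(x: float, n: int) -> float:
    """Bits saved over n sentences once the mass x is redistributed
    (each remaining sentence becomes 1/(1-x) times more likely)."""
    return n * log2(1.0 / (1.0 - x))

x = 0.001          # made-up probability of the overgeneral construction
for n in (1_000, 10_000, 100_000):
    print(n, saving_bits(x, n) - investment_bits(x))
# The saving grows linearly in n, so with a large enough corpus the investment is recouped.
```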
An example: Recovery from overgeneralisations
• The rabbit hid / You hid the rabbit!
• The rabbit disappeared / *You disappeared the rabbit!
• Return on 'investment' over 5M words from the CHILDES database is easily sufficient
• But this methodology can be applied much more widely (e.g., aimed at fitting the time-course of U-shaped generalization, and at when overgeneralizations do or do not arise)
Can we learn causal structure from observation? What happens if we move the left-hand stick?
The output of perception provides a description in terms of causality • Liftability • Breakability • Edibility • What is attached to what • What is resting on what. Without this, perception is fairly useless as an input for action
Inferring causality from observation: The hard problem of induction
• Formal question: suppose a modular computer program generates a stream of data of indefinite length… Under what conditions can the modularity be recovered?
• How might "interventions"/experiments help?
[Figure: generative process → sensory input]
• (Key technical idea: Kolmogorov sufficient statistic)
Fairly uncharted territory
• If the data is generated by independent processes, then one model of the data will involve a recapitulation of those processes
• But will there be other, alternative modular programs? Which might be shorter? Hopefully not!
• A completely open field…