“Ideal” learning of language and categories

Presentation Transcript

  1. “Ideal” learning of language and categories • Nick Chater, Department of Psychology, University of Warwick • Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam

  2. OVERVIEW • I. Learning from experience: The problem • II. Learning to predict • III. Learning to identify • IV. A methodology for assessing learnability • V. Where next?

  3. Learning from experience: The problem

  4. Learning: How few assumptions will work? • Model fitting: assume M(x), optimize x. Easy, but needs prior knowledge • No assumptions: learning is impossible (“no free lunch”) • Can a more minimal model of learning still work?

  5. Learning from +/- vs. + data • [Figure: target language/category and the learner’s guess, showing + data, - data and their overlap; two cases: under-general and over-general guess]

  6. But how about learning from + data only? • Categorization • Language acquisition

  7. Learning from +ive data seems to raise in-principle problems • In categorization, rules out: almost all learning experiments in psychology; exemplar models; prototype models; NNs, SVMs… • In language acquisition: it is assumed that children need access only to positive evidence; sometimes viewed as ruling out learning models entirely; the “logical” problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)

  8. Must be solvable: A parallel with science • Science only has access to positive data • Yet it seems to be possible • So overgeneral theories must be eliminated, somehow • e.g., “Anything goes” seems a bad theory • Theories must capture regularities, not just fit data

  9. Absence as implicit negative evidence? • Overgeneral grammars may predict lots of missing sentences • And their absence is a systematic clue that the theory is probably wrong • This idea only seems convincing if it can be proved that convergence works well, statistically... • So what do we need to assume?

  10. Modest assumption: Computability constraint • Assume that data is generated by random factors and computable factors, i.e., nothing uncomputable • “Monkeys typing into a programming language” • A modest assumption! • Chance: …HHTTTHTTHTTHT… • Computable process: …The cat sat on the mat. The dog… • [Figure: parse tree (S, NP, V, …)]

  11. Learning by simplicity • Find explanation of “input” that is as simple as possible • An ‘explanation’ reconstructs the input • Simplicity measured in code length • Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm • Mimicry theorem with Bayesian analysis, e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.) • Relation to Bayesian inference • Widely used in statistics and machine learning

  12. Consider “ideal” learning • Given the data, what is the shortest code? • How well does the shortest code work? Prediction; identification • Ignore the question of search: makes general results feasible • But search won’t go away…! • Fundamental question: when is learning data-limited or search-limited?

  13. Three kinds of induction • Prediction: • converge on correct predictions • Identification: • identify generating category/distribution in the limit • Learning causal mechanisms?? • Inferring counterfactuals---effects of intervention • (cf Pearl: from probability to causes)

  14. II. Learning to predict

  15. Prediction by simplicity • Find shortest ‘program/explanation’ for current data • Predict using that program • Strictly, use ‘weighted sum’ of explanations, weighted by brevity… • Equivalent to Bayes with (roughly) a 2^(-K(x)) prior, where K(x) is the length of the shortest program generating x
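
The ‘weighted sum’ of explanations can be sketched on a toy scale. This is a minimal illustration under loud assumptions, not the authors’ method: K(x) is uncomputable, so the three hypotheses and their code lengths (in bits) are hand-picked, hypothetical stand-ins, and the names `likelihood`, `mixture_prob_one` and `HYPOTHESES` are introduced here only for illustration. Each hypothesis gets prior weight 2^(-length), is reweighted by how well it explains the bits seen so far, and the prediction is the weighted average.

```python
from fractions import Fraction

# Each hypothesis: (assumed code length in bits, function mapping the bits
# seen so far to the probability that the next bit is 1).
# The lengths are hypothetical; true K(x) cannot be computed.
HYPOTHESES = [
    (1, lambda past: Fraction(1, 2)),           # fair coin
    (2, lambda past: Fraction(1)),              # always 1
    (4, lambda past: Fraction(len(past) % 2)),  # alternate 0, 1, 0, 1, ...
]

def likelihood(predict_one, bits):
    """Probability that a hypothesis assigns to the whole bit sequence."""
    prob = Fraction(1)
    for i, bit in enumerate(bits):
        p_one = predict_one(bits[:i])
        prob *= p_one if bit == 1 else 1 - p_one
    return prob

def mixture_prob_one(hypotheses, bits):
    """Brevity-weighted (Bayesian mixture) probability that the next bit is 1."""
    weights = [
        Fraction(1, 2 ** length) * likelihood(predict, bits)
        for length, predict in hypotheses
    ]
    total = sum(weights)
    weighted = sum(w * predict(bits) for w, (_, predict) in zip(weights, hypotheses))
    return weighted / total

if __name__ == "__main__":
    print(mixture_prob_one(HYPOTHESES, [0, 1, 0, 1, 0, 1]))
```

After six alternating bits the longer ‘alternate’ hypothesis has overtaken the shorter fair-coin hypothesis, so the mixture puts most of its probability on the next bit being 0.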

  16. Inductive inference is possible! • Summed error has finite bound (Solomonoff, 1978) • So prediction converges [faster than 1/(n log n), for corpus size n] • No independence or stationarity assumptions; just computability of generating mechanism

  17. Applications • Language: A. Grammaticality judgements; B. Language production; C. Form-meaning mappings • Categorization: learning from positive examples

  18. A: Grammaticality judgments • We want a grammar that doesn’t over- or under-generalize (much) w.r.t. the ‘true’ grammar, on sentences that are statistically likely to occur • NB. No guarantees for… • Colorless green ideas sleep furiously (Chomsky) • Bulldogs bulldogs bulldogs fight fight fight (Fodor)

  19. Converging on a grammar • Fixing undergeneralization is easy (such grammars get ‘falsified’) • Overgeneralization is the hard problem • Need to use absence as evidence • But the language is infinite; any corpus finite • So almost all grammatical sentences are also absent • Logical problem of language acquisition; • Baker’s paradox • Impossibility of ‘mere’ learning from positive evidence

  20. Overgeneralization Theorem • Suppose learner has probability j of erroneously guessing an ungrammatical jth word • Intuitive explanation: • overgeneralization implies assigning smaller-than-needed probabilities to grammatical sentences; • and hence excessive code lengths

  21. B: Language production • Simplicity allows ‘mimicry’ of any computable statistical method of generating a corpus • Arbitrary probability vs. simplicity probability • Li & Vitányi, 1997

  22. C: Learning form-meaning mappings • So far we have ignored semantics • Suppose language input consists of form-meaning pairs (cf. Pinker) • Assume only that the form → meaning and meaning → form mappings are computable (they don’t have to be deterministic)…

  23. A theorem • It follows that: • Total errors in mapping forms to (sets of) meanings (with probs) and • Total errors in mapping meanings to (sets of) forms (with probs) • …have a finite bound (and hence average errors/sentence tend to 0)

  24. Categorization • Sample n items from category C (assume all items equally likely) • Guess by choosing the D that provides the shortest code for the data • General proof method: 1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn’t prefer it) 2. Undergeneralization: typical data from category C will have no code shorter than n log|C|
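
The guessing rule on this slide (pick the D giving the shortest code for the sample) can be sketched as follows. This is a toy illustration, not the authors’ procedure: the candidate categories over the numbers 1-100 and their description lengths in bits are hypothetical stand-ins for K(D), which is uncomputable, and `CANDIDATES`, `code_length` and `best_category` are names introduced here. The code length of a sample under a category D is taken to be K(D) + n log2|D|, and a category that misses any sampled item cannot encode the data at all.

```python
import math

# name -> (assumed description length in bits, members of the category).
# The bit costs are hypothetical; they only need to grow with "complexity".
CANDIDATES = {
    "all numbers 1-100": (2.0, set(range(1, 101))),
    "even numbers": (4.0, set(range(2, 101, 2))),
    "multiples of 10": (5.0, set(range(10, 101, 10))),
    "powers of two": (6.0, {2 ** k for k in range(7)}),
}

def code_length(name, sample):
    """K(D) + n*log2|D| bits, or infinity if D does not cover the sample."""
    k_bits, members = CANDIDATES[name]
    if not set(sample) <= members:
        return math.inf
    return k_bits + len(sample) * math.log2(len(members))

def best_category(sample):
    """Category giving the shortest two-part code for the sample."""
    return min(CANDIDATES, key=lambda name: code_length(name, sample))

if __name__ == "__main__":
    print(best_category([16]))            # one example
    print(best_category([16, 8, 2, 64]))  # four examples
```

With a single example the cheap, over-general category still gives the shortest code; with four examples the n log2|D| term dominates and the smaller category wins, which is the pressure against overgeneralization in miniature.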

  25. 1. Fighting overgeneralization • D can’t be much bigger than C, or it’ll have a longer code length: • K(D) + n log|D| ≤ K(C) + n log|C| • as n → ∞, the constraint is that • |D|/|C| ≤ 1 + O(1/n)
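
The step from the code-length inequality to the O(1/n) bound can be spelled out as below. This is a sketch of the intermediate algebra only, assuming logarithms to base 2, a fixed target category C, and K(D) ≥ 0; these assumptions are not stated on the slide.

```latex
\begin{align*}
K(D) + n\log|D| &\le K(C) + n\log|C| \\
\log\frac{|D|}{|C|} &\le \frac{K(C) - K(D)}{n} \le \frac{K(C)}{n} \\
\frac{|D|}{|C|} &\le 2^{K(C)/n} = 1 + O\!\left(\frac{1}{n}\right)
  \quad \text{as } n \to \infty .
\end{align*}
```

The last line uses 2^{a/n} = 1 + (a ln 2)/n + O(1/n^2) for fixed a = K(C).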

  26. 2. Fighting undergeneralization • But guess must cover most of the correct category, or it’d provide a “suspiciously” short code for the data • Typicality: K(D|C) + n log|C∩D| ≥ n log|C| • as n → ∞, the constraint is that • |C∩D|/|C| ≥ 1 - O(1/n) • [Figure: overlapping regions C and D]

  27. Implication • |D| converges to near |C| • Accuracy bounded by O(1/n), with n samples, under i.i.d. assumptions • Actual rate depends crucially on the structure of the category • Language: need lots of examples (but how many?) • Some categories may only need a few (one?) examples (Tenenbaum, Feldman)

  28. III. Learning to identify

  29. Hypothesis identification • Induction of ‘true’ hypothesis, or category, or language • In philosophy of science, typically viewed as hard problem… • Needs stronger assumptions than prediction

  30. Identification in the limit: The problem • Assume endless data • Goal: specify an algorithm that, at each point, picks a hypothesis • And eventually locks in on the correct hypothesis, though it can never announce it, as there may always be an additional low-frequency item that’s yet to be encountered • Gold, Osherson et al. have studied this extensively • Sometimes viewed as showing identification is not possible (but really a mix of positive and negative results) • But i.i.d. and computability allow a general positive result

  31. Algorithm • Algorithms have two parts: • Program which specifies set Pr • Sample from Pr, using average code length H(Pr) per data point • Pick a specific set of data (which needs to be ‘long enough’) • Won’t necessarily know what is long enough: an extra assumption • Specify enumeration of programs for Pr, e.g., in order of length • Run, dovetailing • Initialize with any Pr • Flip to the Pr that corresponds to the shortest program so far that has generated the data

  32. Dovetailing • Step order: prog1: 1, 2, 4, 7; prog2: 3, 5, 8; prog3: 6, 9; prog4: 10; … • Run these in order, dovetailing, where each program gets 2^(-length) steps • This process runs for ever (looping programs) • Shortest prog so far “pocketed”… • This will always finish on the “true” program
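
The step numbering in the dovetailing table (prog1 gets steps 1, 2, 4, 7; prog2 gets 3, 5, 8; and so on) corresponds to a simple triangular schedule: round r gives one step to each of the first r programs. The sketch below reproduces only that numbering; it does not implement the 2^(-length) weighting of steps or the actual running and “pocketing” of programs, and the name `dovetail_schedule` is introduced here for illustration.

```python
def dovetail_schedule(rounds):
    """Map each program index to the global step numbers it receives.

    Round r runs programs 1..r for one step each, so no single
    (possibly looping) program can block the others.
    """
    schedule = {}
    step = 0
    for r in range(1, rounds + 1):
        for prog in range(1, r + 1):
            step += 1
            schedule.setdefault(prog, []).append(step)
    return schedule

if __name__ == "__main__":
    for prog, steps in dovetail_schedule(4).items():
        print(f"prog{prog}", *steps)
```

Because every program is started in some round and then keeps receiving steps, any program that halts after finitely many steps will eventually be seen to halt, even if shorter programs loop for ever.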

  33. Pr wins • For a large enough stream of n typical data, no alternative model does better • Coding data generated by Pr using Pr’ rather than Pr wastes, in expected code length, n·D(Pr’||Pr) • D(Pr’||Pr) > 0; so this swamps the initial code length, for large enough n • Overwhelmingly likely to work... (as n → ∞, Prob(correct identification) → 1) • [Figure labels: initial code; n=8; K(Pr); K(Pr’)]

  34. IV. A methodology for assessing learnability

  35. Assessing learnability in cognition? • Constraint c is learnable if a code which 1. “invests” l(c) bits to encode c can… 2. recoup its investment: save more than l(c) bits in encoding the data • Nativism? c is acquired, but there is not enough data to recoup the investment (e.g., little/no relevant data) • Viability of empiricism? Ample supply of data to recoup l(c) • Cf. Tenenbaum, Feldman…

  36. Language acquisition: Poverty of the stimulus, quantified • Consider a linguistic constraint (e.g., noun-verb agreement; subjacency; phonological constraints) • Cost: assessed by length of formulation (length of linguistic rules) • Saving: reduction in cost of coding data (perceptual, linguistic)

  37. Easy example: learning singular-plural • With the constraint: John loves tennis (x bits); They love_ tennis (y bits) • Without the constraint: John loves tennis / *John love_ tennis (x+1 bits); They love_ tennis / *They loves tennis (y+1 bits) • If the constraint applies to proportion p of n sentences, the constraint saves pn bits.
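
The pn-bit saving can be set against the investment criterion for learnability (a constraint costing l(c) bits is learnable once it saves more than l(c) bits). The sketch below assumes, as the example suggests, a saving of exactly one bit per sentence to which the constraint applies; the function names and the numbers in the usage lines are hypothetical.

```python
import math

def agreement_saving_bits(p, n):
    """Bits saved over n sentences if the constraint applies to a proportion p
    of them and saves one bit each time (assumption taken from the example)."""
    return p * n

def is_learnable(l_c, p, n):
    """True once the saving exceeds the l_c bits invested in the constraint."""
    return agreement_saving_bits(p, n) > l_c

def sentences_to_recoup(l_c, p):
    """Smallest n for which the saving p*n strictly exceeds l_c bits."""
    return math.floor(l_c / p) + 1

if __name__ == "__main__":
    # Hypothetical numbers: a 100-bit rule that applies to half of all sentences.
    print(sentences_to_recoup(100, 0.5))
    print(is_learnable(100, 0.5, 1000))
```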

  38. Visual structure: ample data? • Depth from stereo: invest: algorithm for correspondence; recoup: almost a whole image (that’s a lot!). Perhaps could infer stereo from a single stereo image? • Object/texture models (Yuille): investment in building the model, but recoup in compression over “raw” image description. Presumably few images needed?

  39. A harder linguistic case: Baker’s paradox (with Luca Onnis and Matthew Roberts) • Quasi-regular structures are ubiquitous in language, e.g., alternations • It is likely that John will come / It is possible that John will come / John is likely to come / *John is possible to come (Baker, 1979; see also Culicover) • Strong winds / High winds / Strong currents / *High currents • I love going to Italy! / I enjoy going to Italy! / I love to go to Italy! / *I enjoy to go to Italy!

  40. Baker’s paradox (Baker, 1979) • Selectional restrictions: “holes” in the space of possible sentences allowed by a given grammar… • How does the learner avoid falling into the holes?? • i.e., how does the learner distinguish genuine ‘holes’ from the infinite number of unheard grammatical constructions?

  41. Our abstract theory tells us something • Theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote, and with no computational restrictions • But can this be scaled down… • To learn specific ‘alternation’ patterns • With the corpus the child hears?

  42. Argument by information investment • To encode an exception, which appears to have probability x, requires log2(1/x) bits • But this elimination of x makes all other sentences 1/(1-x) times more likely, saving n·log2(1/(1-x)) bits over n sentences • Does the saving outweigh the investment?
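
The investment argument can be made concrete with a small calculation. This sketch rests on an assumed reading of the slide: ruling out a construction of apparent probability x costs log2(1/x) bits, and renormalizing the remaining probability mass saves log2(1/(1-x)) bits on each of n subsequent sentences. The function names and the example values are hypothetical.

```python
import math

def exception_investment_bits(x):
    """Bits needed to encode an exception of apparent probability x."""
    return math.log2(1 / x)

def exception_saving_bits(x, n):
    """Bits saved over n sentences once the excluded mass x is redistributed,
    making every remaining sentence 1/(1-x) times more likely."""
    return n * math.log2(1 / (1 - x))

def worth_encoding(x, n):
    """True when the saving strictly outweighs the investment."""
    return exception_saving_bits(x, n) > exception_investment_bits(x)

if __name__ == "__main__":
    # Hypothetical numbers: an exception with apparent probability 1/4.
    for n in (4, 5):
        print(n, worth_encoding(0.25, n))
```

For small x the per-sentence saving is roughly x / ln 2 bits, so rare exceptions need a large corpus before encoding them pays off.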

  43. An example: Recovery from overgeneralisations • The rabbit hid / You hid the rabbit! • The rabbit disappeared / *You disappeared the rabbit! • Return on ‘investment’ over 5M words from the CHILDES database is easily sufficient • But this methodology can be applied much more widely (and aimed at fitting the time-course of U-shaped generalization, and when overgeneralizations do or do not arise).

  44. V. Where next?

  45. Can we learn causal structure from observation? What happens if we move the left hand stick?

  46. The output of perception provides a description in terms of causality • Liftability • Breakability • Edibility • What is attached to what • What is resting on what • Without this, perception is fairly useless as an input for action

  47. Inferring causality from observation: The hard problem of induction • Formal question: suppose a modular computer program generates a stream of data of indefinite length… • Under what conditions can modularity be recovered? • How might “interventions”/experiments help? • (Key technical idea: Kolmogorov sufficient statistic) • [Figure: generative process producing sensory input]

  48. Fairly uncharted territory • If data is generated by independent processes, then one model of the data will involve recapitulation of those processes • But will there be other, alternative modular programs, which might be shorter? Hopefully not! • Completely open field…