
  1. ‘Ideal’ Language Learning and the Psychological Resource Problem. MIT Workshop: Where does syntax come from? Have we all been wrong? October 19, 2007. William Gregory Sakas & Janet Dean Fodor, City University of New York (CUNY)

  2. CUNY-CoLAG: CUNY Computational Language Acquisition Group http://www.colag.cs.hunter.cuny.edu CUNY-CoLAG graduate students: David Brizan, Carrie Crowther, Arthur Hoskey, Xuân-Nga Kam, Iglika Stoyneshka, Lidiya Tornyova

  3. Agenda • What we do • What’s troubling us • An invitation to discussion

  4. CoLAG research • Language domain: Create a large domain of parameterized languages for evaluating learning models (Fodor & Sakas 04) • Poverty of stimulus: Watch-dog role on ‘richness of stimulus’ claims (Kam et al. 05, i.p.) • Learnability problems: Solving modeling problems: noise; overgeneration (Fodor & Sakas 06; Crowther et al. 04) • Testing models: Compare efficiency of 12 parameter-setting models (Fodor & Sakas 04)

  5. A conceptual history of modeling P-setting • Psychologically feasible learning (Pinker 1979) • Triggering – the ideal (Chomsky 1981) • But too many interactions (Clark 1989) • Domain search by parse-test (Gibson & Wexler 1994) • Better domain search (Yang 2000) • Scanning for I-triggers (Lightfoot 1999) • Parse-test with unambiguous triggers (Fodor 1998) • Back to innate triggers? (Sakas & Fodor, in prep.)

  6. A driving force - limiting resources • All models were driven by an attempt to limit the resources to what can reasonably be attributed to young children. (What is reasonable? Needs sharpening, but not urgently.) • Limit the complexity of innate knowledge (of the possible grammars, of subset relations among them, triggers, etc.) • Limit the storage of input sentences, storage of statistics over the input, storage of grammars tested & rejected, etc. • Limit the amount of input needed in order to attain the target grammar. • Limit the amount of processing of each input sentence for the purpose of extracting the information it contains.

  7. Learnability vs. Feasibility • Learnability = study of what can be learned in principle • Feasibility = study of how efficiently learning takes place • Learnability is a non-issue for (finite) parametric domains • We ask the feasibility question: Given reasonable psychocomputational resources, does a learner converge after encountering a reasonable number of input sentences?
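
To make the feasibility question concrete, here is a minimal simulation harness of the kind one might use to count inputs to convergence. It assumes a hypothetical learner object with consume() and hypothesis() methods and a target language represented as a list of sentences; these names are illustrative, not taken from the slides or from any of the models' actual code.

    import random

    def inputs_to_converge(learner, target_language, target_grammar, max_inputs=100000):
        """Feed randomly drawn sentences from the target language to the learner
        and return how many it consumed before its hypothesis matched the target
        grammar (None if it never converged within the limit)."""
        for n in range(1, max_inputs + 1):
            sentence = random.choice(target_language)   # uniform random input stream
            learner.consume(sentence)                   # learner updates its hypothesis
            if learner.hypothesis() == target_grammar:
                return n
        return None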

  8. Reasonable psychocomputational resources • Full parallel parse, retrieving all structural analyses of an input string. • Multiple parses of the same input string, with different grammars. • Memory for all (most, many) previous input sentences. • On-line computation of language-inclusion relationships (needed to apply the Subset Principle for conservative learning). • A mental list of disconfirmed grammars. • An online-computed probabilistic metric for all grammars. A finite number of parameters still implies an exponential search space of grammar hypotheses.

  9. Reasonable psychocomputational resources • At most two parses per input sentence (inherited from Gibson & Wexler) • Memory for a small number of ‘informative’ input sentences (e.g., sentences that had caused an hypothesis change in the past, i.e., potential ‘trigger’ sentences) • An online-computed probabilistic metric for each parameter (inherited from Yang)

  10. Triggering as switch-flipping – the ideal • Rule-based acquisition was never feasibly implemented. Data combing, hypothesis formation, little innate guidance. • Shift to parameter theory: languages differ only in the lexicon and in the values of a small number of parameters. • Tight UG guidance: Finite set of candidate grammars. • Input sentences ‘trigger’ p-values. That is: the learning mechanism knows which parameter values to adopt to license each sentence, without linguistic computation. • Memoryless (‘incremental’) learning. Choose the next grammar hypothesis on the basis of the current input sentence only. • Maybe deterministic: Parameters set just once, correctly.

  11. But: interactions and ambiguity (Clark) • What can Pat sing? Why is the object in initial position? • +WH-movement or +Scrambling • Pat expects Sue to win. How is case licensed on “Sue”? • ECM: matrix verb governs lower subject. Or SCM: non-finite Infl assigns case to its subject. • An error may then cause an error re long-distance anaphora. • Over-optimistic: For each parameter, an unambiguous trigger, innately specified. Realistically, a potential trigger may be masked by other derivational phenomena. • The null subject parameter is not typical!

  12. Recognition problem for triggers • How does the switch-setting mechanism work? Given sentence s, how does LM know which switches to flip? • This problem arises even for unambiguous triggers. • G&W’s 3-P domain had an unambiguous trigger for each P in each grammar. But 5 of 8 languages were not acquirable, because LM couldn’t access the trigger info. • Two alternatives rejected by Gibson & Wexler: (i) Innate trigger description. (Needs global triggers or huge list.) (ii) On-line calculation of effects of Ps. (Excessive computation.) • Instead, the TLA: Trial-and-error. Pick a P-grammar (differing from the current one in at most one P-value). Try parsing the sentence. If it works, adopt the grammar. Wastes input; not ‘automatic’, not error-free, not deterministic. NOT TRIGGERING! (But the parse test is good.)
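
For concreteness, here is a sketch of one update step of the TLA as it is usually described (error-driven, with Greediness and the Single Value Constraint). The function parses(grammar, sentence) is a stand-in for the parse test, not an implementation from Gibson & Wexler.

    import random

    def tla_step(grammar, sentence, parses):
        """One trial-and-error update of the Triggering Learning Algorithm.
        `grammar` is a tuple of binary parameter values."""
        if parses(grammar, sentence):
            return grammar                            # current grammar works: no change
        i = random.randrange(len(grammar))            # flip exactly one parameter (Single Value Constraint)
        candidate = grammar[:i] + (1 - grammar[i],) + grammar[i + 1:]
        if parses(candidate, sentence):
            return candidate                          # adopt only if the new grammar parses (Greediness)
        return grammar                                # otherwise keep the old grammar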

  13. Improving the parse-test approach • By design, the TLA was a very impoverished mechanism. • Minimal resources → little info extracted from input → slow. It could only follow its nose through the language domain, led on by an occasional lucky successful parse. • Yang (2000) gave it a memory for the fate of past hypotheses, so that it could accumulate knowledge. • To record parse-test results grammar by grammar isn’t practical. Record them P by P. A weight represents the success of each value. Approximate (a good P-value may be penalized in a bad grammar, and vice versa). • Mechanism: Select a grammar to test next, with probability based on the weights of its P-values. If it parses the sentence, upgrade the weights of its P-values; if it fails, downgrade them.
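
A sketch of the kind of update Yang's variational learner performs, with one weight per parameter rather than per grammar. The linear reward-penalty rule and the learning rate used here are illustrative stand-ins; parses() is again a placeholder for the parse test.

    import random

    def sample_grammar(weights):
        """weights[i] = current probability of value 1 for parameter i."""
        return tuple(1 if random.random() < w else 0 for w in weights)

    def variational_step(weights, sentence, parses, rate=0.02):
        """Reward the P-values of a sampled grammar if it parses the input,
        penalize them if it does not (linear reward-penalty)."""
        g = sample_grammar(weights)
        success = parses(g, sentence)
        updated = []
        for w, v in zip(weights, g):
            p = w if v == 1 else 1 - w                # probability assigned to the sampled value
            p = p + rate * (1 - p) if success else p * (1 - rate)
            updated.append(p if v == 1 else 1 - p)    # convert back to P(value 1)
        return updated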

  14. Pro and con Yang’s ‘variational learner’ • This mechanism makes more use of the input: It gains knowledge from every parse-test, not just from successes. • Also, it avoids the learning failures (local maxima) of TLA, by occasionally sampling unlikely grammars. This breaks it free of a false direction without losing accumulated knowledge. • It does all this while still using the parser to implicitly identify triggers. (No innate specification of triggers is needed.) • But still inefficient trial-&-error. Yang’s simulations and ours agree: an order of magnitude more input consumed than other models. • Still non-deterministic. After sufficient weight of evidence, it may lock in a P-value; but mostly it doesn’t know which value is right. So each parameter may swing back-and-forth repeatedly.   

  15. I-triggers and E-triggers • Lightfoot (1999) aimed for deterministic P-setting, by means that G&W had rejected as infeasible. He claimed that simple global triggers can be defined. They are I-language properties. • E.g. Trigger for V2 is [SpecCP Nonsubject ] [C Verb+fin ] • Mechanism: Learner “scans the input” for I-triggers. • It’s not switch-setting, but splendid if it works. However, the input is surface word strings. Would need translation from deep global I-triggers to surface language-specific E-triggers. Linguists do try, but no systematic compilation to date. • Can it be done?? E.g., for every V2 language, regardless of all other properties, what word-string property reveals +V2? • This is what the G&W parse-test did for free. Human parser is designed to fit trees to strings. No need to specify the E-triggers.

  16. A mechanism for detecting I-triggers • Fodor (1998) also recommended I-triggers, & proposed a mechanism for recognizing them in word strings. Don’t look for them. Donate them to the parser when it hangs up for lack of them. • This gets more goodness out of the parse-test. Not just yes/no, but which P-value did the work – adopt that one. • Closer to Chomsky’s original concept of triggering: An input sentence tells the learning mechanism which Ps can license it. (Not: First pick a grammar and then see if it works.) We call this parametric decoding. Next best thing to switches! • Not domain search, but I-to-E trigger conversion by the parser. Uses existing resources. Also, an unambiguous trigger could set a P indelibly, halving the search space for later Ps.

  17. Unambig triggers → deterministic learning? • Mechanism for deterministic learning based on unambiguous triggers only: • If only one parse, adopt the P-values that contributed to it. • If more than one parse, the input is ambiguous, so: (a) Discard it entirely. Problem: this leaves too few unambiguous triggers; confirmed by our simulation data: learning fails often. (b) Adopt just the P-values that are in all of the parses. Problem: Requires a full parallel parse of the sentence, but adult parsing data indicate that parallel parsing is limited at best – insufficient for learners to do full parametric decoding of an input word string. • Over the resource limits again! Non-determinism unavoidable?
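
The two options on this slide can be stated as a small decision procedure. Here all_parses(sentence) is a hypothetical helper that would have to return, for each parse, the set of (parameter, value) commitments it relies on; needing that helper is exactly the full-parallel-parse requirement the slide objects to.

    def decode_step(known_values, sentence, all_parses):
        """known_values: dict of already-set parameters.
        all_parses(s): list of sets of (parameter, value) pairs, one set per parse."""
        parses = all_parses(sentence)
        if not parses:
            return known_values                       # unparsable: no decision
        if len(parses) == 1:
            committed = parses[0]                     # unambiguous input: adopt every value used
        else:
            committed = set.intersection(*parses)     # option (b): only values shared by all parses
        return {**known_values, **dict(committed)}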

  18. Taking stock so far • Parameters may be psychologically real. Triggers for P-values may exist. But an account of how human learners know and use the triggers to set the P’s has been flummoxed by: • how to detect the triggers in the input… • without overstepping reasonable limits on innate knowledge or processing capacity… • or resorting to linguistically-undirected search through the field of all possible grammars. • Attractiveness of Ps for capturing linguistic diversity is enhanced by the promise of P-setting as an efficient learning procedure, to explain the speed & uniformity of human language acquisition. • So psycho-computational research needs to deliver a good implementation of P-setting! We’ll try. But first: another problem.

  19. The Subset Principle wants unambiguity • A final blow! The Subset Principle runs amok in incremental learning if it’s not based on unambiguous triggers (which we’ve just seen aren’t realistic for a parse-test mechanism). • Summary of the SP problem (see more below): • Insufficient negative evidence demands conservative learning. • Adopt the smallest language compatible with the available data. (SP) • Without memory for past inputs, available data = the current sentence. • SP demands an absurdly small language, losing facts acquired earlier! • E.g. “It’s bedtime”. Give up wh-movt, topicalization, passive, aux-inversion, prep-stranding, etc. • SP says: Retrench on existing P-values when setting a new one.

  20. SP-retrenchment is due to trigger ambiguity • LM cannot trust any parameter values adopted on the basis of input that was (or even might have been) parametrically ambiguous. • The sentences introduced by those P-values may not be in the target language. • So LM cannot hold onto those sentences (those P-values) when setting another parameter later. • Give up all sentences / all marked parameter values that are not entailed by the current input sentence! • Unless you know that you set the parameter on the basis of a globally unambiguous trigger.
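
The retrenchment rule on this slide, including the exception for indelibly set parameters, can be written as a toy rule. The helper entailed_by(sentence) is a placeholder for the smallest-language analysis of the current input; the slides do not define it, and it is only assumed here.

    def sp_retrench(values, indelible, sentence, entailed_by):
        """Keep a marked P-value only if the current sentence entails it,
        or if it was set by a globally unambiguous trigger (`indelible`)."""
        required = entailed_by(sentence)              # (parameter, value) pairs this sentence needs
        return {p: v for p, v in values.items()
                if p in indelible or (p, v) in required}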

  21. So: Back to unambiguous triggers • We are exploring three solutions to excessive retrenchment: • Add memory for past input. (Non-incremental; Fodor & Sakas 2005) • Add memory for disconfirmed grammars. (Fodor, Sakas & Hoskey 2007) • Don’t retrench on Ps set by unambiguous evidence. (Here, today) • The first two are computational solutions: add resources. The third is where linguistics could contribute. • Even if only a few Ps have unambiguous triggers & the learner knows them, setting them might disambiguate triggers for others. • But the learning system has to know when a trigger is unambig. • Innate unambig I-triggers, translated to E-triggers, could do it. So let’s go back to this idea. Can it work, for at least some Ps? And can one P provide an unambig trigger for the next one?

  22. Unambiguous triggers in CoLAG domain • We ask: For each of the 13 Ps in our domain, does it have at least one unambiguous E-trigger for at least one of its values, in every language with that value? • If only for one value, the other one could be taken as default. So we ask: Is this default linguistically plausible? • For each P-value that has unambiguous E-triggers, what do those triggers have in common? Do they embody a single, global I-trigger? • If so, are the E-triggers transparently related to the I-trigger, so that a learner could recognize it without excessive linguistic computation? • Seeking unambiguous, global, transparent triggers.
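
The first question on this slide amounts to a check over the whole domain, sketched below for a hypothetical representation in which languages maps each grammar to its set of sentence strings and value_of(g, p) returns grammar g's value for parameter p (our representation, not CoLAG's actual data structures).

    def value_has_unambiguous_trigger_everywhere(languages, value_of, p, v):
        """Does every language with value v for parameter p contain at least one
        sentence that occurs only in languages sharing that value?"""
        def is_unambiguous(sentence):
            return all(value_of(g, p) == v
                       for g in languages if sentence in languages[g])
        return all(any(is_unambiguous(s) for s in languages[g])
                   for g in languages if value_of(g, p) == v)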

  23. Facts about CoLAG languages • Sentences: S Aux V Adv, Wh-O1 V S PP, O3-wa V P ka • Universal lexicon: S, O1, P, Aux, -wa, etc. (Inherited and expanded from Gibson & Wexler’s TLA simulations.) • Input stream = random sequence of all sentences in the lg. Input is word strings only; tree structure isn’t ‘audible’. • Except: Learner is given lexical categories & grammatical roles! (Some semantic bootstrapping, no prosodic bootstrapping.) • All sentences are degree-0 (no embedded clauses). A language has 545 sentences on average. • The domain contains 28,924 distinct sentences (as strings), 60,272 distinct syntactic trees, 3,072 distinct sets of P-values (grammars). • Average ambiguity of sentences = 53 languages per sentence.
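
The ambiguity figure quoted above (about 53 languages per sentence) is just the average membership count of each distinct sentence string across the 3,072 languages. A sketch of that computation, over the same hypothetical languages dict assumed in the previous sketch:

    from collections import defaultdict

    def average_ambiguity(languages):
        """Average number of languages each distinct sentence string belongs to."""
        membership = defaultdict(int)
        for grammar, sentences in languages.items():
            for s in sentences:
                membership[s] += 1
        return sum(membership.values()) / len(membership)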

  24. Parameters in the domain. 13 standard Ps, simplified. Necessarily old-fashioned, but they illustrate interesting learning problems.

  25. Some lgs lack unambig triggers for a P-value. ‘Missing’ = there are languages that need the value but have no unambiguous trigger for it.

  26. E.g., what are the triggers for headedness? • An easy example: Headedness (in IP, VP, PP). • I-triggers: H-Initial: I before VP, V before complements, P before NP. H-Final: I after VP, V after complements, P after NP. • Many E-triggers: H-Initial: P O3. Non-initial Aux V. Non-initial O1 O2. H-Final: Non-initial O3 P. V NOT Aux. • Linking facts: • Inside VP, only V O1 O2 PP Adv order, or the reverse. • XPs can move (only) leftward into Spec,CP. Head-movement: Only Aux or finite Verb move, only to I or C (leftward or rightward). C-direction may differ from IP, VP etc. • A general strategy for P-setting: Transparent. To set underlying order, do not rely on any movable item in a possible landing-site position.
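
To illustrate how transparent these E-triggers are, here is a toy string-level detector for two of them (P O3 for head-initial, non-initial O3 P for head-final). The remaining triggers listed on the slide would be added the same way; the category labels are CoLAG's, and the function itself is only an illustration.

    def headedness_trigger(tokens):
        """tokens: a sentence as a list of CoLAG category labels, e.g. ['S', 'V', 'P', 'O3']."""
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair == ("P", "O3"):
                return "head-initial"
            if pair == ("O3", "P") and i > 0:         # 'non-initial O3 P'
                return "head-final"
        return None                                   # no headedness E-trigger in this string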

  27. [A more difficult triggering example to be inserted here.]

  28. Summary Despite Clark’s pessimism & Lightfoot’s over-optimism: Is it conceivable that infants acquire language rapidly because they have innate knowledge of unambiguous P-triggers? • We’ve seen it’s not guaranteed that where there’s info in the input, there’s a trigger. • Also, even if there’s a trigger, it may not be simple & transparent & global enough to be psychologically plausible. • Nevertheless, we found unambiguous global triggers for 9 of 13 Ps. And we’ve done pretty well at rescuing the other 4 – though there’s more work to be done. • We’ve used P-ordering to simplify triggers for setting later Ps, so their triggers don’t need to fold in triggers for more basic Ps. • We’ve found conditional triggers, where setting earlier Ps indelibly creates new unambiguous triggers for later Ps.

  29. Prospects • No general success yet. E.g. only 30% of +V2 languages have an unambig trigger. But maybe because V2 needs breaking into several Ps; also, we haven’t yet fully explored ordering/conditioning for V2. • Additional methods for investigating: • Run new simulations in which Ps with unambiguous triggers are set permanently. Does it speed convergence? Do any errors emerge? • In existing simulation data, look for Ps that got set early, & led to rapid convergence. Can we detect an optimal ordering? • If we can’t find triggers for a P in some context, can we show that just that P is irrelevant? Ideal defense of a trigger model! • If not, then we’re looking for a computational solution (far from the original triggering concept), rather than a linguistic solution. Will we find legitimate (ordered) triggers to set all the Ps?

  30. Your advice, please • So - should we keep trying? • How much does it matter whether P-triggering can be shown to work, within the resource limits of normal children? • Would the linguistic value of Ps still stand even if acquisition consisted in massive statistical analysis of the learner’s corpus? • How does MP change things? Generally, the more abstract and explanatory the linguistic analysis, the greater the divergence between E-triggers and I-triggers. Hard on learners. • Does your systematization of Ps change things for acquisition? Can it help us with the E-trigger / I-trigger translation? • An alliance of linguistics and computation may be what it takes.

  31. Thank you
