Modeling infant word segmentation: Another example of discovery fueled by CHILDES

Alejandrina Cristia, Laboratoire de Sciences Cognitives et Psycholinguistique @Language Emergence: Competition, Usage, and Analyses, 2019-06-06

No overt & unambiguous word/morpheme boundaries in the input… “no silences” Kuhl 2004

“no silences” (Kuhl 2004) …yet by the end of the first year, infants know some words/morphemes: ‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’ (Tincoff & Jusczyk 2012; Bergelson & Swingley 2012; Ngon et al. 2014)

How to study segmentability? mommy talking …cute …something shiny go by? Let’s just get to the facts.

Today’s menu • A methodology for studying word form segmentation using models • Segmentability differences for child-directed versus adult-directed register (in French) • …bilingual versus monolingual settings (English, Spanish, & Catalan) • Implications for infant studies

Input representation
Symbolic (‘Phonological text’)
+ lots of corpora can be used
+ lots of algorithms proposed
+ algorithms represent a wide range of strategies
• assumes babies represent the input abstractly, with zero errors
Acoustic
+ realistic…
• …provided representations match babies’
• few appropriate corpora (natural discourse & good quality audio)
• only one (reproducible) algorithm

Example
*MOT: look at the doggie
Phonologize: lUk At D2 dOgi
Remove word boundaries & unitize: l U k A t D 2 d O g i
Segment with some algorithm: lUkAtD2dOgi
Evaluate:
Precision = 1 of the 5 words found were words in the input = .2
Recall = 1 of the 4 words in the input was recovered = .25
Token F-score = 2 * (Precision * Recall) / (Precision + Recall)
Note: one can also unitize at the syllable level: lUk At D2 dOgi (input) lUk At D2dOgi (output)
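The evaluation step above can be sketched in a few lines of Python. This is a minimal illustration, not the WordSeg implementation: the function names are hypothetical, each character is treated as one unit, and the hypothesized segmentation `lU kAt D2 dO gi` is an assumption chosen only because it yields the counts given in the example (1 correct token out of 5 proposed, with 4 words in the input).

```python
def token_spans(segmented):
    """Return the set of (start, end) unit offsets of each word in a
    space-separated segmentation (spaces are not counted as units)."""
    spans = set()
    start = 0
    for word in segmented.split():
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_scores(gold, hypothesis):
    """Token precision, recall and F-score: a hypothesized word counts as
    correct only if both of its edges match a word in the gold input."""
    gold_spans = token_spans(gold)
    hyp_spans = token_spans(hypothesis)
    correct = len(gold_spans & hyp_spans)
    precision = correct / len(hyp_spans) if hyp_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * (precision * recall) / (precision + recall)
    return precision, recall, fscore


gold = "lUk At D2 dOgi"           # phonologized input from the example
hypothesis = "lU kAt D2 dO gi"    # assumed output, for illustration only
precision, recall, fscore = token_scores(gold, hypothesis)
print(precision, recall, round(fscore, 3))
```

Comparing spans rather than word strings means a word is only credited when it is found at the right position, which is what makes this a token-level (as opposed to type-level) score.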

Example algorithms
1. Baseline: simplest strategies (Lignos 2012)
• Every sentence is a word (SentBase)
• Every syllable is a word (SyllBase)
2. Sub-lexical: goal is to “cut” using local cues (Daland + 2009; Saksida + 2016)
• Transitional Probabilities (TP) x Absolute/Relative threshold (TP_abs, TP_rel)
• Diphone-Based Segmentation (DiBS)
3. Lexical: goal is to learn a set of “minimal recombinable units” (Johnson + 2007; Monaghan + 2010)
• Adaptor Grammar (AG)
• Phonotactics from Utterances Determine Distributional Lexical Elements (Puddle)
Package: wordseg.readthedocs.io Preprint: https://osf.io/nx49h/ Bernard et al. 2019 BehResMeth
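To make the baseline and transitional-probability strategies concrete, here is a simplified Python sketch. It is not the WordSeg code, and all function names are hypothetical. It assumes forward TPs estimated within utterances, an absolute threshold equal to the mean TP over pair types, and a relative threshold that cuts at a local dip only when the dip has a neighbor on each side. The toy syllable corpus is invented.

```python
from collections import Counter


def sent_base(units):
    """SentBase: treat the whole utterance as a single word."""
    return [list(units)]


def syll_base(syllables):
    """SyllBase: treat every syllable as its own word."""
    return [[s] for s in syllables]


def transitional_probabilities(corpus):
    """Forward TP(a -> b) = count(a b) / count(a as left member of a pair)."""
    pair_counts, left_counts = Counter(), Counter()
    for utt in corpus:
        for a, b in zip(utt, utt[1:]):
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
    return {pair: n / left_counts[pair[0]] for pair, n in pair_counts.items()}


def tp_segment(corpus, threshold="relative"):
    """Cut each utterance where the TP between adjacent units is low.

    "absolute": cut where the TP is below the mean TP (assumed threshold).
    "relative": cut where the TP is lower than both neighboring TPs.
    """
    tps = transitional_probabilities(corpus)
    mean_tp = sum(tps.values()) / len(tps) if tps else 0.0
    segmented = []
    for utt in corpus:
        if not utt:
            segmented.append([])
            continue
        local = [tps[(a, b)] for a, b in zip(utt, utt[1:])]
        words, current = [], [utt[0]]
        for i, unit in enumerate(utt[1:]):
            if threshold == "absolute":
                cut = local[i] < mean_tp
            else:
                has_both = 0 < i < len(local) - 1
                cut = (has_both and local[i] < local[i - 1]
                       and local[i] < local[i + 1])
            if cut:
                words.append(current)
                current = []
            current.append(unit)
        words.append(current)
        segmented.append(words)
    return segmented


# Invented toy corpus, unitized at the syllable level.
corpus = [u.split() for u in [
    "pre tty ba by", "pre tty do ggie", "ba by do ggie",
    "do ggie pre tty", "ba by pre tty", "do ggie ba by",
]]
print(tp_segment(corpus, "relative")[0])
print(sent_base(corpus[0]), syll_base(corpus[0]))
```

In this toy corpus every within-word transition has a TP of 1 and every between-word transition a TP of .5, so both thresholds recover the two-syllable words. Real corpora are far less clean, which is why the two thresholds can behave differently.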

The process in WordSeg (Bernard et al. 2019 BehResMeth)

Sample results: precision, recall, & F-score are correlated. Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES

Sample results: Effects of algorithm and input representation. Naima, in Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES

Why look at register? In child-directed speech, probably…
• More utterances consist of a single word (+ all models)
• Utterances are overall shorter in length (+ all models)
*MOT: Attends! [‘Wait!’]
*MOT: Ouaistuvastemettreausoleilpourtesecherlescheveux! [‘Yeah, you are going to sit in the sun to dry your hair!’]

• Utterances are more repetitious (+? lexical models)
*MOT: coucoucoucousitufaisaisdespetitssourirestoi. [‘Hello, hello, how about you give some little smiles.’]
*MOT: tumefaisdespetitssouriresXXXcoucoumongrand. [‘You are giving me little smiles XXX hello, my big boy.’]
*MOT: coucoutumefaisdessouriresoupas. [‘Hello, are you giving me smiles or not.’]
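The three properties listed for child-directed speech can be quantified on any orthographically segmented corpus. The sketch below is one possible operationalization, not the measures used in the studies cited here: the function name is hypothetical, and using the type/token ratio as an index of repetitiousness (lower means more repetition) is an assumption. The toy utterances are invented.

```python
def register_stats(utterances):
    """Summarize a list of space-separated utterances.

    Returns the proportion of single-word utterances, the mean utterance
    length in words, and the type/token ratio (assumed repetition index).
    """
    tokenized = [u.split() for u in utterances if u.split()]
    if not tokenized:
        raise ValueError("no non-empty utterances")
    n_utts = len(tokenized)
    n_tokens = sum(len(u) for u in tokenized)
    types = {word for u in tokenized for word in u}
    return {
        "prop_single_word": sum(len(u) == 1 for u in tokenized) / n_utts,
        "mean_length": n_tokens / n_utts,
        "type_token_ratio": len(types) / n_tokens,
    }


# Invented toy sample, for illustration only.
sample = ["wait", "look at the doggie", "look at the baby"]
print(register_stats(sample))
```

Running the same function over a child-directed and an adult-directed sample of equal size would give a first, rough check of whether the two registers differ on these dimensions.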

(Ask me about crosslinguistic extensions if curious!)
Japanese
• Riken corpus
• Collected in the lab → adult-directed speech is with experimenter
English
• Winnipeg corpus
• Collected with child-worn device worn whole day → adult-directed speech is among caregivers
French
• LENA-Lyon corpus (LeNormand et al. HomeBank)
• Collected with child-worn device worn whole day → adult-directed speech is among caregivers
Bogdan Ludusan, Georgia Loukatou

French “wild” ADS on Le Normand, Canault, & Van Thai’s LENA-Lyon corpus. Loukatou + 2019 Proc Cog Sci

CDS-ADS: Conclusions • Overall trend for better performance for child- than adult-directed speech • But: • reversed for some algorithms • effect of register < 15% • (in the best controlled cases, 2%)

Why study word segmentation in a bilingual setting? Bilinguals need to: learn words, like monolinguals do, but in two languages (‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’; ‘pié’ ‘mamá’ ‘bebé’ …), with overall less input in each language. Hoff + 2012; Fibla & Cristia (submitted very soon, I hope)

Questions & predictions
• Are segmentation strategies equally successful when applied to bilingual and monolingual corpora? → Measure the performance of previously studied segmentation algorithms in a controlled monolingual versus bilingual corpus.
• Possible outcomes:
• The confusion hypothesis: variable and inconsistent input → Poorer performance for the bilingual than for the monolingual
• The resistant hypothesis: (if switching only at utterance edges) local statistical and lexical cues are still reliable → Similar performance for the bilingual and the monolingual
Fibla & Cristia (submitted very soon, I hope)

Creating bilingual corpora
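One way such a controlled bilingual corpus could be assembled, assuming language switches happen only at utterance edges, is to alternate utterances (or blocks of utterances) from two monolingual corpora while keeping the total size equal to that of a monolingual corpus. This is a sketch under those assumptions and not necessarily the procedure of Fibla & Cristia. The function name `mix_corpora`, the `block_size` parameter, and the toy utterances are hypothetical.

```python
def mix_corpora(corpus_a, corpus_b, block_size=1):
    """Alternate blocks of utterances from two monolingual corpora.

    The result has as many utterances as the shorter input corpus, so the
    bilingual corpus matches a monolingual one in size while containing
    roughly half as much input in each language.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    total = min(len(corpus_a), len(corpus_b))
    mixed = []
    index_a = index_b = 0
    take_a = True
    while len(mixed) < total:
        n = min(block_size, total - len(mixed))
        if take_a:
            mixed.extend(corpus_a[index_a:index_a + n])
            index_a += n
        else:
            mixed.extend(corpus_b[index_b:index_b + n])
            index_b += n
        take_a = not take_a  # switch language only between utterances
    return mixed


# Invented toy utterances, for illustration only.
english = ["look at the doggie", "all done", "where is mommy", "to bed"]
spanish = ["mira el bebé", "hola mamá", "a dormir", "dónde está"]
print(mix_corpora(english, spanish, block_size=1))
print(mix_corpora(english, spanish, block_size=2))
```

Varying `block_size` changes how often the language switches, which is one way to probe whether frequent switching matters for a given algorithm.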


Three cases of bilingual < monolingual; 11 cases of bilingual ‘in between’ monolingual

Effects of algorithm and input representation: size of algorithm x level effect = 40-60%? Cristia + 2019 Open Mind

Effect of register Size of register effect < 10%? on LENA-Lyon corpus Loukatou + 2019 Proc Cog Sci

Effect of bilingualism Size of bilingualism effect ~ 0%? Fibla & Cristia (submitted very soon, I hope)

Today’s menu • A methodology for studying word form segmentation using models • Segmentability differences as a function of language properties • …child-directed versus adult-directed register (in Japanese, English, & French) • …bilingual versus monolingual settings (English, Spanish, & Catalan) • Implications for infant studies


What may babies be doing? Using CDI results & frequency effects. Coefficient of determination R² = .1. Larsen + 2017 Interspeech & in prep


syllable-based models phoneme-based models

Cut only at utterance edges → frequency of words in isolation
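The link between the two halves of this statement can be shown directly: a learner that cuts only at utterance edges stores whole utterances, so the only true words it ever acquires are those that occur as single-word utterances, and their counts are their frequencies in isolation. The sketch below assumes an orthographically segmented corpus; the function name and toy utterances are hypothetical.

```python
from collections import Counter


def isolation_frequencies(utterances):
    """Count how often each word occurs as a complete utterance on its own."""
    counts = Counter()
    for utterance in utterances:
        words = utterance.split()
        if len(words) == 1:
            counts[words[0]] += 1
    return counts


# Invented toy utterances, for illustration only.
sample = ["doggie", "look at the doggie", "doggie", "baby"]
print(isolation_frequencies(sample).most_common())
```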

To be continued…

Thanks to... Families who agree to be recorded & for their data to be shared. Researchers who record them and share on TalkBank. TalkBank ~ Brian MacWhinney & you!

Japanese “lab” ADS on Reiko Mazuka’s RIKEN corpus. Much of this is in Ludusan et al. 2017 ACL (now working on journal paper with more material)

English “wild” ADS on Melanie Soderstrom’s Winnipeg corpus. Cristia + 2019 Open Mind
