
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra


Presentation Transcript


  1. Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra

  2. Outline • Why corpora, why interpreted corpora • Many types of annotation - linguistic annotation - non-linguistic annotation • New developments

  3. Why corpora? • Linguistics: linguistic theory • Engineering: language technology applications • Cognition: models of human language processing

  4. Empirical linguistics (diagram labels): introspective data, experimental psycholinguistic data, corpus data; DB of relevant data; research

  5. Engineering motivation • information extraction • question-answering • statistical machine translation • parser training and evaluation => increased need for deeply annotated corpora

  6. Cognitive motivation • experience-oriented frequency-based models • models of gradient grammaticality • metrics of complexity

  7. Resource description metadata
     language: Spanish, English, German
     sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
     text sort(s): newspaper articles, wire news, political speech, control commands
     subject domain: stock rates, flight reservations
     type of producers: professional journalist, student, radiologist
     mode of production: spoken, written, signed, Morse-coded
     medium of production: pencil, PC with MS Word, dictaphone
     conditions of production: spontaneous, carefully composed, produced under time pressure
     transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
     medium of transmission: telephone, WWW, CB radio
     storage encoding: raw ASCII code, HTML, AIFF
     medium of storage: DAT tape, CD-ROM, hard disk
     mode of presentation: spoken, written, signed
     medium of presentation: newspaper, radio, book, TV show, theater performance
     type of intended recipients: newspaper reader, booking agent, theater audience
     number of intended recipients: point-to-point, multicast, broadcast
     synchronicity of discourse: synchronous dialogue, asynchronous
     direction: one-way, two-way

  8. Linguistic annotation • part-of-speech tags, • word sense information, • morphosyntactic features of words, • constituent structures for phrases or sentences, • coreference markers, • dependency structures, • predicate-argument structures, • reference identifications for term phrases, • information structures within sentences, • intonation contours, • speech acts, • discourse relations - discourse structures.

  9. Other annotations • judgements of native speakers on the acceptability or appropriateness of the utterance, • information on speaker(s), • information on hearer(s) or intended audience, • information on the utterance situation (time, place, circumstances) • information on the published source, • typographic information, • layout and document structure, • textual transcriptions of spoken utterances, • transcription of pauses, • error tagging.

  10. Raw vs. linguistically interpreted corpora
      raw corpus:
        search term: word=form
          ...play a significant part in determining growth and form.
          ...each molecule can form four hydrogen bonds...
      vs. linguistically interpreted corpus:
        search term: word=form & pos=N
          ...play a significant part in determining growth and form.
        search term: word=form & pos=V
          ...each molecule can form four hydrogen bonds...
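The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the `Token` type, the `search` helper, and the coarse tags other than N and V are assumptions made for the example, not part of any corpus tool named in the slides.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: str  # coarse tag; the tagset here is assumed for illustration


# The two hits from the slide, tagged by hand (assumed tags).
CORPUS = [
    [Token("growth", "N"), Token("and", "CONJ"), Token("form", "N")],
    [Token("each", "DET"), Token("molecule", "N"), Token("can", "MODAL"),
     Token("form", "V"), Token("four", "NUM"), Token("hydrogen", "N"),
     Token("bonds", "N")],
]


def search(corpus, word: str, pos: Optional[str] = None) -> list:
    """Return sentences containing `word`, optionally restricted to one POS tag."""
    hits = []
    for sentence in corpus:
        if any(t.word == word and (pos is None or t.pos == pos) for t in sentence):
            hits.append(" ".join(t.word for t in sentence))
    return hits


print(search(CORPUS, "form"))        # raw search: both sentences
print(search(CORPUS, "form", "N"))   # noun reading only
print(search(CORPUS, "form", "V"))   # verb reading only
```

Without the tag, the query cannot separate the noun from the verb; with it, each reading can be retrieved on its own.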

  11. Raw vs. linguistically interpreted corpora
      raw corpus:
        search term: is *ed
          Alpha interferon is produced by white blood cells...
        search term: were *ed
          In the late 1970s interferons were hailed as "wonder drugs"...
      vs. linguistically interpreted corpus:
        search term: pos=VB {0,1} pos=VVN
          Gamma is not induced by viruses at all...
          So interferons could be described as the antibiotics of the virus...
          Only two of these have yet been identified...
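A small sketch of why the tag-based query finds more passives than the surface pattern. Two things here are assumptions: the CLAWS-style tags were assigned by hand for this example, and `{0,1}` is read as "at most one intervening token". The function name `find_verb_groups` is hypothetical.

```python
import re

# Surface pattern corresponding to "is *ed": "is" directly followed by a word in -ed.
SURFACE = re.compile(r"\bis \w+ed\b")

# Hand-tagged sentences (assumed CLAWS-style tags, for illustration only).
S1 = [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN"),
      ("by", "PRP"), ("viruses", "NN2")]
S2 = [("Only", "AV0"), ("two", "CRD"), ("of", "PRF"), ("these", "DT0"),
      ("have", "VHB"), ("yet", "AV0"), ("been", "VBN"), ("identified", "VVN")]


def find_verb_groups(tagged, max_gap=1):
    """Find a form of 'be' (tag VB*) followed by a past participle (VVN),
    allowing up to `max_gap` tokens in between."""
    hits = []
    for i, (_, pos) in enumerate(tagged):
        if not pos.startswith("VB"):
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tagged) and tagged[j][1] == "VVN":
                hits.append(" ".join(word for word, _ in tagged[i:j + 1]))
                break
    return hits


print(SURFACE.search("Gamma is not induced by viruses at all"))  # None: "not" intervenes
print(find_verb_groups(S1))
print(find_verb_groups(S2))
```

The surface pattern depends on one auxiliary form and on adjacency; the tag-based pattern covers any form of "be" and tolerates an intervening word such as "not".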

  12. Syntactically annotated corpora: treebanks • German treebank project: TiGer Treebank • English reference treebank: Penn Treebank • Treebank + semantic information: Prague Dependency Bank

  13. TiGer Treebank (tree diagram for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen.")
      node and edge labels in the diagram: S HD SB OC VP MO OA HD PP NP NP AC NK NK NK NK NK NK
      tokens (word, part-of-speech, morphology, lemma):
        Im           APPRART  Dat              in
        nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
        Jahr         NN       Dat.Pl.Neut      Jahr
        will         VMFIN    3.Sg.Pres.Ind    wollen
        die          ART      Nom.Sg.Fem       die
        Regierung    NN       Nom.Sg.Fem       Regierung
        ihre         PPOSAT   Acc.Pl.Masc      ihr
        Reformpläne  NN       Acc.Pl.Masc      Plan
        umsetzen     VVINF    Inf              umsetzen
        .            $.

  14. TiGer Treebank (same tree) annotation on word level: part-of-speech, morphology, lemmata

  15. TiGer Treebank (same tree) node labels: phrase categories

  16. TiGer Treebank (same tree) edge labels: syntactic functions

  17. TiGer Treebank (same tree) crossing branches for discontinuous constituency types
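The word-level layer of the TiGer example (part-of-speech, morphology, lemma) can be held in simple records. The sketch below uses the token annotations as they appear on the slide; the `Token` class and the whitespace-separated row format are assumptions for illustration and do not reflect the treebank's actual file format. The sentence-final punctuation token is left out because the slide gives it only a tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    pos: str
    morph: str
    lemma: str


# One token per row: word, POS tag, morphology, lemma (values from the slide;
# morphology fields are joined with dots so each row has four columns).
ROWS = """\
Im APPRART Dat in
nächsten ADJA Sup.Dat.Sg.Neut nahe
Jahr NN Dat.Pl.Neut Jahr
will VMFIN 3.Sg.Pres.Ind wollen
die ART Nom.Sg.Fem die
Regierung NN Nom.Sg.Fem Regierung
ihre PPOSAT Acc.Pl.Masc ihr
Reformpläne NN Acc.Pl.Masc Plan
umsetzen VVINF Inf umsetzen
"""

tokens = [Token(*row.split()) for row in ROWS.splitlines()]

# Query the annotation layers: lemmas of all common nouns, and the finite verb.
noun_lemmas = [t.lemma for t in tokens if t.pos == "NN"]
finite_verbs = [t.word for t in tokens if t.pos.endswith("FIN")]

print(noun_lemmas)
print(finite_verbs)
```

Because lemma and tag are stored next to the surface form, a query can retrieve "Reformpläne" by its lemma "Plan", which a plain string search would miss.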

  18. Penn Treebank
      ( (S
          (NP-SBJ
            (NP (NNP Pierre) (NNP Vinken) )
            (, ,)
            (ADJP (NP (CD 61) (NNS years) ) (JJ old) )
            (, ,) )
          (VP (MD will)
            (VP (VB join)
              (NP (DT the) (NN board) )
              (PP-CLR (IN as)
                (NP (DT a) (JJ nonexecutive) (NN director) ))
              (NP-TMP (NNP Nov.) (CD 29) )))
          (. .) ))

  19. Penn Treebank (same tree) annotation on word level: part-of-speech

  20. Penn Treebank (same tree) phrase categories

  21. Penn Treebank (same tree) syntactic functions
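The three annotation layers highlighted on the Penn Treebank slides (part-of-speech tags, phrase categories, syntactic function tags) can be read off the bracketed notation with a short recursive parser. This is a minimal sketch written for this one example; the `parse` and `walk` helpers are hypothetical and not part of any treebank toolkit.

```python
import re

TREE = ("( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
        "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
        "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
        "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
        "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))")


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    toks = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if toks[pos] not in ("(", ")"):
            label = toks[pos]
            pos += 1
        children = []
        while toks[pos] != ")":
            if toks[pos] == "(":
                children.append(node())
            else:
                children.append(toks[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def walk(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from walk(child)


def is_preterminal(children):
    return len(children) == 1 and isinstance(children[0], str)


nodes = list(walk(parse(TREE)))
# Word level: (word, POS tag) pairs.
tagged = [(c[0], label) for label, c in nodes if is_preterminal(c)]
# Phrase level: category labels, possibly carrying a function suffix.
phrases = [label for label, c in nodes if label and not is_preterminal(c)]
# Function tags are the part after the hyphen, e.g. SBJ in NP-SBJ.
functions = [label.split("-", 1)[1] for label in phrases if "-" in label]

print(tagged[:3])
print(phrases)
print(functions)
```

In this notation the phrase category and the syntactic function share one label (NP-SBJ), whereas the TiGer format puts functions on the edges.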

  22. Prague Dependency Bank (dependency tree diagram; node text as extracted: Czech word, English gloss, labels) chce wants Sb investovat to-invest Obj ACT.VOL.T Kdo who Sb ACT.T ste hundred Obj RESTR.F do to AuxP automobilu car Adv DIR.F korun crowns Atr PAT.F

  23. Prague Dependency Bank (same tree) annotation on word level: lemmata, morphology

  24. Prague Dependency Bank (same tree) syntactic functions

  25. Prague Dependency Bank (same tree) dependency structure

  26. Prague Dependency Bank (same tree) semantic information on constituent roles, theme/rheme, etc.

  27. New developments
      • historical dimension (e.g., Corpus of the History of German Language)
      • multilayer stand-off linguistic markup
      • multimodal markup/interpretation
      • new types of treebanks:
        - CS treebanks with dependency links (NEGRA, TIGER)
        - machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
        - Dependency (Tree)Banks (Prague, PARC)
        - Grammatical Relation (Tree)Banks (Briscoe & Carroll)
