Exploring the tiers of Japanese vocabulary: Academic, literary and beyond. Tatsuhiko Matsushita, LALS, Victoria University of Wellington, tatsuhiko.matsushita@vuw.ac.nz
Main findings • VDRJ is useful for designing curriculum (materials, tests etc.) • The more domains share a word as an academic word (AW) or limited-academic-domain word (LAD), the more abstract the meaning of the word is • Conversation and non-academic texts contain more general words and literary words (LW) • Academic texts: more AW and LAD but fewer LW in any academic domain • Wikipedia: more proper nouns and low-frequency words • Newspapers and academic items of Wikipedia can be a good resource for learning AW and LAD • Natural science texts contain more academic domain words at lower frequency levels than arts and social science texts • The origins of academic and literary words are quite clearly separated: 3/4 of LW originate in Japanese while 3/4 of AW and LAD originate in Chinese • LAD contains more Western-origin words (Gairaigo)
Contents • Motive for this research • Goals of this presentation • Vocabulary Database for Reading Japanese • Tiers of Japanese vocabulary (Basic words, academic words, limited-academic domain words, literary words) • Text coverage by word tier • Proportions of word origin types by word tiers • Number of characters required to cover the word tiers • Implications from the findings • Conclusion
1. Motive for this research How efficiently can we learn vocabulary? • Learning burden is big! • More effective choice of target words • More efficient order for learning the words • Effective choice and efficient order: to maximize the coverage of text which the learner would encounter in his/her domain = Reading comprehension and lexical density (Hu & Nation, 2000; Komori et al., 2004) Q. What words should learners learn first? And second and next?
Studies on EAP vocabulary • Basic: General Service List (West, 1953) • Academic: AWL (Coxhead, 2000) UWL (Xue & Nation, 1984) • EGAP-A/S, EGAP-HM/SS etc. (Tajino, Dalsky, & Sasao, 2009) • Science-specific Word List (Coxhead & Hirsh, 2007) • Technical: e.g. Chung (2003) • Literary vocabulary?
Studies on JAP vocabulary • Basic: The former JLPT list, Tamamura (1987) etc. • Academic: Butler (2010), Matsushita (2011) • ? • Technical: Komiya (1995), Oka (1992) etc. • Others • No list for words between academic and technical words • Literary vocabulary?
2. Goals of this presentation To introduce • the Vocabulary Database for Reading Japanese • extracted domain-specific words such as Academic Words (AW), Limited-Academic-Domain Words (LAD), Literary Words (LW) To discuss • how the word tiers work in different types of text (register variation) • how learners' language backgrounds may affect their understanding of texts in different genres
3. Vocabulary Database for Reading Japanese • Vocabulary Database for Reading Japanese (VDRJ) (Matsushita, 2010; 2011) • Created from the Balanced Corpus of Contemporary Written Japanese (BCCWJ), 2009 monitor version (NINJAL, 2009) • 33 million tokens (28 million from books and 5 million from Internet forum sites (Yahoo Chiebukuro)) • 19 million content words and 14 million function words • Unit of counting: lexeme, which is considerably inclusive but less inclusive than the word family (Level 6 in Bauer & Nation, 1993) in English • Lexemes in "short units" are ranked by U (usage coefficient) (Juilland & Chang-Rodriguez, 1964) • Short-unit lexeme: more inclusive than "lemma", less inclusive than "word family"
Some problems of existing Japanese word frequency lists • Lack of representativeness • Too old • The corpus size is not large enough: low reliability for low-frequency words • No good sub-corpus frequency data from which dispersion can be calculated in order to downgrade unevenly distributed words
Advantages of word lists * Various types of word lists can be created from the vocabulary database (VDRJ) • Reference for developing vocabulary tests = Checking learners’ vocabulary levels • Reference for checking vocabulary level of material = Checking vocabulary levels of materials • Specify vocabulary for learners to learn and for teachers to teach For better choice of material, modification of text Cf. Nation (2011), Word profiler
How to make VDRJ • Method • Classify all the texts into sub-corpora to see the range and dispersion, cf. Nippon Decimal Classification, BCCWJ (NINJAL, 2009) • Parse (segment into words) all the texts with a morphological analyzer and a dictionary (if the text is not segmented by spaces between words), cf. MeCab, UniDic • Make word lists with AntConc and/or AntWordProfiler
Content and construct of VDRJ • Vocabulary Database for Reading Japanese • The list is for reading, as it is made from a written corpus of books and Internet forum sites • Written and spoken language differ in word frequency, domain and required language processing skills ⇒ A good corpus of spoken language is necessary to develop a good word list for spoken language (but there is no very good corpus of spoken Japanese)
Different word rankings • The word ranking problem mainly exists in Basic Words • This is mainly due to lack of good spoken corpora • Compromise: frequency weighted to limited domains which seem to reflect basic daily needs • For International Students • For General Learners • Non-weighted (ranking for overall written Japanese)
Multidimensional scaling (MDS) [Figure, two panels: 10 domains; 10 domains + word familiarity]
4. Tiers of Japanese vocabulary (1) The concept of “word tiers” • Domain / Level • Level = general importance = frequency × dispersion Some words are frequent only in a particular domain e.g. 発送 (shipping) 振り込み (paying by bank transfer) 古墳 (tumulus / burial mound)
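The idea of "level = frequency × dispersion" can be illustrated with a small sketch. It assumes Juilland's dispersion measure D and the usage coefficient U = D × F over equally sized sub-corpora; the exact computation used for VDRJ (for example, how sub-corpora of unequal size are normalized) is not given in the slides, and the function names are hypothetical.

```python
import math

def juilland_d(sub_freqs):
    """Juilland's D for a word's frequencies in n equally sized sub-corpora.

    Assumption: sub-corpora are the same size; unequal sizes would need
    normalized frequencies first.
    """
    n = len(sub_freqs)
    mean = sum(sub_freqs) / n if n else 0.0
    if n < 2 or mean == 0:
        return 0.0
    # Population standard deviation of the sub-corpus frequencies
    sd = math.sqrt(sum((f - mean) ** 2 for f in sub_freqs) / n)
    v = sd / mean  # coefficient of variation
    return 1 - v / math.sqrt(n - 1)

def usage_coefficient(sub_freqs):
    """U = D x F: total frequency discounted by uneven distribution."""
    return sum(sub_freqs) * juilland_d(sub_freqs)

# Two hypothetical words with the same total frequency (40):
even = [10, 10, 10, 10]       # spread evenly over four domains
concentrated = [40, 0, 0, 0]  # frequent in one domain only
print(usage_coefficient(even), usage_coefficient(concentrated))
```

A word like 発送 or 古墳, frequent in only one domain, behaves like the second case: its raw frequency may be high, but its usage coefficient, and hence its general level, drops.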
Assumed word tiers for students Level • Basic: Top 1288 = Former JLPT Levels 4 & 3 • Intermediate: Ranked 1289-5000 • Advanced 1: 6K-10K • Advanced 2: 11K-15K • Super-Advanced: 16K-20K • 21K+ • Assumed Known Words (AKW) Domain * General / Academic / Literary
4. Tiers of Japanese vocabulary (2) Basic words (BW) • Feature of the corpus: formal written language similar to BNC (Nation, 2004) • No good spoken corpus for vocabulary studies • Compromise • For the learners' and teachers' lists, the former JLPT Level 4 & 3 vocabulary is put at the top of the list as basic words To order the basic words • Identify the domains closest to word familiarity (basic needs) by Multidimensional Scaling (MDS) • Frequency in literary works and the Internet forum sites (Yahoo Chiebukuro) is weighted
4. Tiers of Japanese vocabulary (3) Academic domain words Extracting academic domain words • Log-likelihood ratio (LLR) (Dunning, 1993) • Target texts: Technical texts • Classified into four large academic domains • Total number of tokens: approx. 2.9 million • Reference texts: General texts in BCCWJ 2009 • Total number of tokens: approx. 29.9 million • Extract keywords shared by 4 down to 1 domains • Cut-off point: higher for more narrowly distributed words
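As a rough illustration of the keyword extraction step, the sketch below computes a signed log-likelihood score for one word, comparing a target corpus with a reference corpus. It uses the two-term form of the log-likelihood statistic that is common in corpus keyness work, which is a simplification of Dunning's full contingency-table formulation; whether VDRJ used exactly this form is an assumption. The sign convention (negative when the word is under-represented in the target) is also an assumption, adopted because the slides use cut-offs such as LLR > 0. The word frequencies in the example are invented.

```python
import math

def signed_llr(freq_target, size_target, freq_ref, size_ref):
    """Signed log-likelihood keyness of a word (two-term form).

    Positive: relatively more frequent in the target corpus.
    Negative: relatively more frequent in the reference corpus.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were equally likely in both corpora
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    g2 *= 2
    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2

# Corpus sizes follow the slide (approx. 2.9M target, 29.9M reference tokens);
# the word frequencies are hypothetical.
print(signed_llr(300, 2_900_000, 500, 29_900_000))
```

In this reading, a word is a keyword for a domain when its score for that domain's technical texts exceeds the chosen cut-off.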
4. (3) Academic domain words • Academic words (AW): high specificity in 3+ academic domains • 4-domain words (cut-off point: LLR > 0) • 3-domain words (cut-off point: LLR > 0) • Limited-academic-domain words (LAD) • 2-domain words (cut-off point: LLR > 1) • 1-domain words (cut-off point: LLR > average value) • Eliminate the former JLPT Level 4 vocabulary (Top 700 words) • Eliminate the words ranked at 20001 or lower • Classify all the AW and LAD by word ranking levels for International Students (U = Usage Coefficient): • 5 levels: Basic / Inter. / Adv. 1 / Adv. 2 / Super-adv.
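The classification rules above can be sketched as a small function. This is one possible reading, not the author's procedure: it assumes a word's domain count is the number of domains where its signed LLR is above 0, and that the stricter cut-offs are then applied to 2-domain and 1-domain words. The function name, the `one_domain_cutoff` parameter (standing in for the "average value") and the example scores are all hypothetical; the domain labels Ah, Ss, Tn, Bn follow the abbreviations used later in the slides.

```python
TOP_EXCLUDED = 700   # former JLPT Level 4 vocabulary (top 700 words)
MAX_RANK = 20000     # words ranked 20001 or lower are eliminated

def classify(llr_by_domain, rank, one_domain_cutoff):
    """Return "AW", "LAD" or None for one word (assumed reading of the rules)."""
    if rank <= TOP_EXCLUDED or rank > MAX_RANK:
        return None
    # Domains in which the word is over-represented at all
    key = {d: s for d, s in llr_by_domain.items() if s > 0}
    n = len(key)
    if n >= 3:
        return "AW"   # 4-domain and 3-domain words: LLR > 0
    if n == 2 and all(s > 1 for s in key.values()):
        return "LAD"  # 2-domain words: LLR > 1
    if n == 1 and all(s > one_domain_cutoff for s in key.values()):
        return "LAD"  # 1-domain words: LLR > average value
    return None

# Hypothetical scores for one word ranked 3000th
print(classify({"Ah": 5.0, "Ss": 3.0, "Tn": 2.0, "Bn": -1.0}, 3000, 10.0))
```

The rising cut-off reflects the slide's principle that narrowly distributed words need stronger evidence before they are kept.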
4. Tiers of Japanese vocabulary (3) -1 Academic words (AW) • JAWL = Japanese Academic Word List • High specificity in 3 or 4 academic domains • 4-domain words (cut-off point: LLR > 0) • 3-domain words (cut-off point: LLR > 0) • Levels 0-VIII: 9 levels, 2590 words in total • JAWL I (Intermediate): most essential for learning • The basic words contain far fewer academic words • JAWL I: 559 words, close to AWL in number and text coverage • Coverage in the academic corpus used for extracting AW: AWL 10.0%, JAWL I 11.1%
4. (3) -1 Academic words (AW) Semantic features of AW (1) • Highly abstract, essential for logical operations, e.g. • Range: 占める (occupy, account for), 特殊 (special, particular) • Relation: 属する (belong to), 依存 (rely/reliance) • Comparison/Evaluation: 後者 (the latter), 優れる (superior) • Quantitative change: 減少 (decrease), 強化 (reinforce) • Stage: 当初 (beginning), 現状 (present condition) • Development of enunciation: 取り上げる (take up [an issue]), まとめる (summarize) • Cause-effect, degree, agent, action, object, direction, goal, instrument, time etc.
4. Tiers of Japanese vocabulary (3) -1 Academic words (AW) Semantic features of AW (2) The most frequent Kanji used for AW: 合 (combine, together), 定 (fix, certain), 分 (divide, minute), 一 (one), 同 (same), 数 (number), 上 (up), 体 (body), 出 (out), 大 (large) • 3-domain words: Some words have concrete meanings e.g. 署名 (signature), 保健 (health, hygiene) • 4-domain words: Few words have concrete meanings • The nature of the words is the same at all levels
POS of Japanese AW (1) • Common noun: 1072 words (41.4 %) e.g. 背景 (background) • Verbal noun: 882 words (34.0 %) e.g. 連続 (continue/continuation) Adding other types of nouns together, 2104 words (81.2 %) can be a noun • Verb (excluding verbal nouns): 225 words (8.7 %) e.g. 認める (recognize/approve) 述べる (describe/mention) Adding other types of verbs together, 1107 words (42.7 %) can be a verb • Adjectival noun: 95 words (3.7 %) e.g. 詳細 (detail/-ed), 平等 (equal/-ity) • Adjective: Only 9 words (0.3 %) e.g. 著しい (remarkable)
POS of Japanese AW (2) • Affix: 106 words (4.1 %) e.g. -期 (period), -種 (type): substantial in Japanese academic words • Adverb: 34 words (1.3 %) e.g. しばしば (frequently) • Other (particle, auxiliary verb etc.): 22 words (0.8 %) • Remarkably many archaic words e.g. のみ (only), つつ (while doing), べし (ought to), あらゆる (every), いかなる (any), 我が (my), 漠然 (vague) • れる/られる (Passive/Potential/Spontaneous) is specific to academic texts
4. (3) -2 Limited-academic-domain words (LAD) • Limited-academic-domain words (LAD) • High specificity in 2 or 1 domain(s) • 2-domain words (cut-off point: LLR > 1) • 1-domain words (cut-off point: LLR > average value) • Something between "academic" and "technical" • The "leftovers" from extracting AW? • Tiers of curriculum cf. Tajino et al. (2007) • Words corresponding to the curriculum • Basic: all the learners • Academic words: prep. to first year • Limited-academic-domain words (?): prep. to major • Technical words: major to postgrad.
4. (3) -2 Limited-academic-domain words (LAD) 2 domain words
Examples of 2 domain words: Words which are shared by only 2 main academic domains
4. (3) -2 Limited-academic-domain words (LAD) 2 domain words • Semantic features • More concrete and specific than academic words • Ah & Ss: Social, overlap in history and ethnology • Ss & Tn: Industrial • Ss & Bn: Social security, medical and nursing service • Tn & Bn: Scientific • Ah & Tn, Ah & Bn: not clear
4. (3) -2 Limited-academic-domain words (LAD) 1 domain words • It is merely a trial • The corpus is not the best for academic purposes, especially for natural sciences • Extracting what is common across domains is much easier, while extracting words from only one target corpus requires a more complete target corpus • Therefore, AW (4-domain and 3-domain words) will be more reliable than LAD (2-domain and 1-domain words)
4. (3) -2 Limited-academic-domain words (LAD) 1 domain words • Semantic features are much clearer than 2 domain words
POS of Japanese LAD (1) • Common noun: 1605 words (63.1 %) – more than AW (41.4 %) • Verbal noun: 633 words (24.9 %) e.g. 融資 (finance) cf. AW (34.0 %) Adding other types of nouns together, 2104 words (87.9 %) can be a noun – more than AW (81.2 %) • Verb (excl. verbal nouns): 81 words (3.2 %) cf. AW (8.7 %) e.g. 訳す (translate), 向き合う (face (v.)) Adding other types of verbs together, 714 words (28.1 %) can be a verb – fewer than AW (42.7 %) • Adjectival noun: 88 words (3.5 %) cf. AW (3.7 %) e.g. フル (full), 偉大 (great) • Adjective: Only 3 words (0.1 %) cf. AW (0.3 %) e.g. 硬い (stiff)
POS of Japanese LAD (2) • Affix: 109 words (4.3%) cf. AW (4.1%) e.g. –犯 (offense) substantial in Japanese academic domain words • Adverb: 15 words (0.6 %) cf. AW (1.3%) e.g. 現に (surely) • Other (particle, auxiliary verb etc.): 9 words (0.8 %) cf. AW (0.8%) • Remarkably many archaic words – similar to AW e.g. なり [affirmative aux.], とも (even though), たり [affirmative aux.], ごとし (as/like), 単なる (mere), しめる(=しむ) [causative aux.], かかる (such)
4. Tiers of Japanese vocabulary (4) Literary words (LW) Extracting literary words: Words for reading literary works • Log-likelihood ratio (Keyness in AntConc) • Target corpus: literary works (identified by NDC and C-code) in BCCWJ 2009 (NINJAL, 2009): over 8 million tokens • 4 different reference corpora: technical texts, general texts in arts and humanities, general texts in the other 3 academic domains, Internet forum texts (Yahoo Chiebukuro) • Extract keywords shared by the four results (cut-off point: average value) • Eliminate the former JLPT Level 4 vocabulary (Top 700 words) • Eliminate the words ranked at 20001 or lower • Classify all the LW by word ranking levels for International Students (U = Usage Coefficient)
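The step "extract keywords shared by the four results" can be sketched as a set intersection. The sketch assumes that each comparison (literary target against one reference corpus) yields a word-to-LLR mapping, and that the "average value" cut-off means the mean LLR of the positively keyed words in that result; both points are assumptions, and the scores below are invented for illustration.

```python
def keywords_above_mean(llr_by_word):
    """Keep words whose keyness exceeds the mean of the positive scores."""
    positive = {w: s for w, s in llr_by_word.items() if s > 0}
    if not positive:
        return set()
    mean = sum(positive.values()) / len(positive)
    return {w for w, s in positive.items() if s > mean}

def shared_keywords(results):
    """Words that pass the cut-off against every reference corpus."""
    keyword_sets = [keywords_above_mean(r) for r in results]
    return set.intersection(*keyword_sets) if keyword_sets else set()

# One hypothetical result per reference corpus (technical, arts and
# humanities, other academic domains, Internet forum)
results = [
    {"こぶし": 90.0, "血": 80.0, "減少": -40.0, "棚": 5.0},
    {"こぶし": 70.0, "血": 10.0, "棚": 4.0},
    {"こぶし": 60.0, "血": 55.0, "棚": 2.0},
    {"こぶし": 85.0, "血": 3.0, "棚": 1.0},
]
print(shared_keywords(results))
```

Requiring a word to be key against all four reference corpora filters out words that are merely non-technical or merely non-colloquial rather than specifically literary.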
4. (4) Literary words (LW) POS of LW • More verbs, adverbs and interjections than AW and LAD • Fewer verbal nouns and adjectival nouns • This inevitably means LW have fewer loanwords but more Japanese-origin words.
4. (4) Literary words (LW) Q. How many LW overlap with AW and LAD? • Only 27 words (0.5% of academic domain words, 1.7% of LW) overlap • Most of the overlapping words (24/27) overlap with 1-domain words (17 words overlap with words in biological natural science) • Many physical words such as words for body parts e.g. 左手 (left hand), こぶし (fist), 血 (blood), 頭上 (overhead) • No LW overlap with 4-domain words • Overlapping words are mainly at the intermediate level • No overlapping words at 11K or above • Some examples of overlapping words: 音 (sound), 光 (light), 棚 (shelf), 組 (class), 岩 (rock), ひざ (knee), 興奮 (excite/-ment), 全身 (whole body), 帝 (emperor), ネズミ (mouse), 帆 (sail)
Word tiers: In what order should students learn them? • Basic: General, AW/LAD, LW • Intermediate: General, AW/LAD, LW • Advanced: General, AW/LAD, LW • Highly Advanced: General, AW/LAD, LW • Super-Advanced: General, AW/LAD, LW • Assumed known words: Proper names; Fillers, Signs; (Transparent compounds *) • Others
5. Text coverage by word tier • The word tier analyser: An Excel sheet in which the word profile of a text is checked automatically by pasting in the result of AntWordProfiler run with the word tier base word list • Text covering efficiency: High efficiency in vocabulary learning = Fewer unique lexemes cover more text (Reciprocal Type/Token Ratio = Token/Type Ratio?) * Comparison should be made between equally sized texts
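The two quantities on this slide, text coverage by tier and the token/type ratio, can be sketched in a few lines. This is not the Excel word tier analyser itself: it assumes the text is already segmented into lexemes and that a word-to-tier mapping is available. The function names, the toy tier list and the toy text are all hypothetical.

```python
from collections import Counter

def tier_coverage(tokens, tier_of):
    """Share of running words (tokens) covered by each word tier."""
    if not tokens:
        return {}
    counts = Counter(tier_of.get(t, "Other") for t in tokens)
    return {tier: n / len(tokens) for tier, n in counts.items()}

def token_type_ratio(tokens):
    """Tokens per unique lexeme: higher means fewer lexemes cover more text."""
    return len(tokens) / len(set(tokens)) if tokens else 0.0

# Toy segmented text and tier list (invented for illustration)
tier_of = {"する": "Basic", "減少": "AW", "こぶし": "LW"}
tokens = ["する", "減少", "する", "こぶし"]
print(tier_coverage(tokens, tier_of))
print(token_type_ratio(tokens))
```

As the slide notes, the token/type ratio grows with text length, so it is only comparable between texts of equal size.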