How do we evaluate a group of words in gaining text coverage? Tatsuhiko Matsushita (University of Tokyo) Vocab@Vic 2013 Victoria University of Wellington. Text Covering Efficiency of the Grouped Words by Genre (Not Graded by Level) *Domain-unspecified.
How do we evaluate a group of words in gaining text coverage? Tatsuhiko Matsushita (University of Tokyo) Vocab@Vic 2013 Victoria University of Wellington
Text Covering Efficiency of the Grouped Words by Genre (Not Graded by Level) *Domain-unspecified (This slide will appear again later in 5. Results and Discussion.)
Contents • Motives for the Study • Research Questions and Goals • Proposal of a New Index: Text Covering Efficiency (TCE) • Method of Validating TCE • Results and Discussion • Conclusion
1. Motives for the Study (1) • How efficiently can we learn vocabulary? What words should learners learn first, second and next? • Domain-specific words such as academic words (Coxhead, 2000) are often extracted for efficient vocabulary learning in a genre. • Text coverage has been used for evaluating these groups of words (Coxhead, 2000; Hyland & Tse, 2007)
1. Motives for the Study (2) • However, text coverage is not appropriate for comparing the efficiency of word groups when the groups contain different numbers of words. • How can we compare the efficiency of a group of domain-specific words with that of other words? e.g. 1 How can we compare the efficiency of learning the AWL (Coxhead, 2000) with that of learning the UWL (Xue & Nation, 1984)? e.g. 2 How can we compare the efficiency of learning technical term lists in different genres? e.g. 3 How can we compare lists at different frequency levels in a genre, e.g. the sublists of the AWL? How many times more efficient is learning sublist 1 than learning sublist 2 for gaining text coverage in different genres? e.g. 4 To gain higher text coverage, at which stage should learners switch from learning general words to learning domain-specific words?
1. Motives for the Study (3) • For example, the table below (Hyland & Tse, 2007) does not show the difference in efficiency in gaining the text coverage because the numbers of words in AWL and GSL are different.
2. Research Questions and Goals (1) Research Questions • What index is appropriate for comparing the efficiency of word groups in gaining text coverage when the groups contain different numbers of words? • Are there any advantages to comparing the efficiency of word groups in gaining text coverage other than deciding the most efficient learning order of words?
2. Research Questions and Goals (2) Goals • To propose an index: Text Covering Efficiency (TCE) • To show the validity and usefulness of TCE for • deciding the most efficient order of words to learn • analyzing lexical features of text genres by applying TCE to some groups of Japanese domain-specific words and other types of grouped words.
3. Proposal of a New Index: Text Covering Efficiency (TCE) (1) • Problem: the groups to be compared contain different numbers of words • Solution: Standardization • Divide the text coverage (tokens) of a group of words by the number of words in the group • Divide the quotient by the total number of tokens in the target text (domain) to adjust for differences in text size and make the figures from differently sized texts comparable.
3. Proposal of a New Index: Text Covering Efficiency (TCE) (2) • For the user’s convenience, the figure is multiplied by 1,000,000. • The result is the expected number of tokens of a word from the grouped words in a one-million-token text in the target domain. • Therefore, it is comparable with the standardized frequency per million. In other words, TCE is the expected standardized frequency of a grouped word. • Text Covering Efficiency (TCE) = the mean text coverage per one million tokens of the target text by a word from the grouped words.
3. Proposal of a New Index: Text Covering Efficiency (TCE) (3) The formula for TCE: $E = \frac{T_g / N_g}{T_t} \times 1{,}000{,}000 = \frac{T_g \times 10^{6}}{N_g \times T_t}$ • $E$: Text covering efficiency = Expected number of tokens (= text coverage) of a word in the tested group in a one-million-token text in the target domain • $T_g$: Number of tokens (= text coverage) of the tested group of words in the target text • $N_g$: Number of lexemes of the tested group of words • $T_t$: Number of tokens in the target text (text length)
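The calculation described on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the author's own implementation: the function name, the parameter names, and all counts in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def text_covering_efficiency(group_tokens: int, group_lexemes: int,
                             text_tokens: int, per: int = 1_000_000) -> float:
    """Expected tokens per `per` tokens of text for one word of the group.

    group_tokens:  tokens in the target text covered by the word group
    group_lexemes: number of lexemes in the word group (the list size)
    text_tokens:   total tokens in the target text
    """
    if group_lexemes <= 0 or text_tokens <= 0:
        raise ValueError("group_lexemes and text_tokens must be positive")
    # (group_tokens / group_lexemes / text_tokens) * per, written as one division
    return group_tokens * per / (group_lexemes * text_tokens)


# HYPOTHETICAL counts: a 500-lexeme list covering 50,000 tokens
# of a 1,000,000-token text.
tce_small_text = text_covering_efficiency(50_000, 500, 1_000_000)

# The same list in a text twice as long with twice as many covered tokens:
# dividing by text length keeps the two figures comparable.
tce_large_text = text_covering_efficiency(100_000, 500, 2_000_000)

print(tce_small_text, tce_large_text)  # 100.0 100.0
```

Under these assumed counts, each word of the list is expected to occur 100 times per million tokens, whichever of the two text sizes is used.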
4. Method of Validating TCE(1) How can we validate an index: TCE? • How can we validate an index? • By applying the index to the actual data to check if: • the results do not conflict with the findings from previous studies • the results show something which will not be clearly shown without the index • TCE was applied to some grouped Japanese words in different text genres
4. Method of Validating TCE(2) Domain-Specific Words to Be Tested • (Japanese) Common Academic Words (CAW) (Matsushita, 2011) • (Japanese) Limited-Academic-Domain Words (LAD) • (Japanese) Literary Words (LW) (Matsushita, 2012) These word lists can be downloaded from “Matsushita Laboratory for Language Learning” http://www17408ui.sakura.ne.jp/tatsum/English_top_Tatsu.html
4. Method of Validating TCE (3) Method of Extraction of the Domain-specific Words • Target Corpora: Technical texts in the four genres of Humanities, Social sciences, Technological natural sciences and Biological natural sciences • Reference Corpus: Balanced Contemporary Corpus of Written Japanese (BCCWJ), 2009 monitor version, excluding the target corpora part • Index: Log-likelihood Ratio (LLR) • Criteria for extraction: • 4-domain words and 3-domain words: CAW • 2-domain words and 1-domain words: LAD
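As a rough illustration of this extraction step, the sketch below computes a log-likelihood ratio for one word in a target corpus against a reference corpus and then labels the word by the number of domains in which it is key. The slides do not say which LLR variant or cut-off was used, so the two-term formula common in keyword analysis, the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom), and all function names are assumptions. Only the mapping from domain counts to CAW and LAD comes from the slide.

```python
import math


def log_likelihood_ratio(freq_target: int, size_target: int,
                         freq_ref: int, size_ref: int) -> float:
    """Two-term log-likelihood ratio for one word (assumed variant)."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were equally common in both corpora
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    llr = 0.0
    if freq_target > 0:
        llr += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        llr += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * llr


def is_key_in_domain(freq_target: int, size_target: int,
                     freq_ref: int, size_ref: int,
                     threshold: float = 3.84) -> bool:
    """Assumed criterion: overused in the target corpus and LLR above threshold."""
    overused = freq_target / size_target > freq_ref / size_ref
    llr = log_likelihood_ratio(freq_target, size_target, freq_ref, size_ref)
    return overused and llr > threshold


def classify_by_domains(n_key_domains: int):
    """Map the number of domains (0-4) where a word is key to a list label."""
    if n_key_domains >= 3:
        return "CAW"   # 4-domain and 3-domain words
    if n_key_domains >= 1:
        return "LAD"   # 2-domain and 1-domain words
    return None        # not extracted as an academic word


# HYPOTHETICAL counts: a word occurring 50 times in a 1,000-token target
# corpus and 100 times in a 10,000-token reference corpus.
print(is_key_in_domain(50, 1_000, 100, 10_000))  # True
print(classify_by_domains(3), classify_by_domains(1))  # CAW LAD
```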
4. Method of Validating TCE (4) Test corpora: Text Genres Used for the Validation • JS-Bn: Journal articles on biological natural sciences. 0.72 million tokens. • MTT-Bn: Technical texts in biological natural sciences. 0.01 million tokens. • JS-Tn: Journal articles on technological natural sciences. 2.71 million tokens. • MTT-Tn: Technical texts in technological natural sciences. 0.07 million tokens. • MTT-Ss: Technical texts in social sciences. 0.05 million tokens. • TB: Texts in social sciences for intermediate and advanced learners of Japanese. 0.19 million tokens. • TIS: Texts in a textbook in international studies. Mainly social science texts. 0.04 million tokens. • UYN: Newspaper texts. 5.68 million tokens. • BSB: Texts from best-seller books. Mainly composed of literary works. 2.10 million tokens. • UPC: Literary texts. 2.30 million tokens. • MC: Conversation texts. 1.13 million tokens.
5. Results and Discussion (1) TCE of the Grouped Words by Genre (Not Graded by Level) *Domain-unspecified
5. Results and Discussion(2) Ranking for TCE of the Grouped Words in Each Genre (Not Graded by Level) *Domain-unspecified
5. Results and Discussion(3)TCE of the Grouped Words by Level and Genre *Domain-unspecified
5. Results and Discussion(4) TCE of the Grouped Words in Each Genre *Domain-unspecified
Text Covering Efficiency (TCE) and its Rankings of the Grouped Words by Level and Genre (Detailed) *Domain-unspecified
TCE in Biological Natural Science Journal Articles by Type and Level of Grouped Words. TCE: Text Covering Efficiency = Expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain
5. Results and Discussion (5) TCE of the Grouped Words by Level and Genre (Detailed) *Domain-unspecified
5. Results and Discussion (6) TCE of the Grouped Words by Level and Genre (Detailed) *Domain-unspecified
5. Results and Discussion (7) • The result shows that TCE clearly indicates the efficiency in gaining text coverage, and thus it is useful for deciding a more efficient learning/teaching order of words. • These findings do not seem to conflict with previous studies. • Lexical features of texts in different genres can also be examined by checking the TCE figures. E.g. Japanese newspaper texts have similar lexical features to academic texts in social sciences. • You can find things you cannot see without the index. For example, such an analysis allows you to say things like, “Learning the intermediate Japanese Common Academic Words is 6.2 times more efficient in covering Japanese social science texts than learning other words at the same level, and 8.3 times more efficient than learning the advanced common academic words”.
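Statements of the form "X times more efficient" are ratios of two TCE values. The sketch below ranks three word groups by TCE and compares the TCE ratio with the ratio of raw coverage. Every count in it is hypothetical and was picked for round numbers; none of it reproduces the figures reported in this study.

```python
def tce(group_tokens: int, group_lexemes: int, text_tokens: int) -> float:
    """Expected tokens per million tokens of text for one word of the group."""
    return group_tokens * 1_000_000 / (group_lexemes * text_tokens)


TEXT_TOKENS = 1_000_000  # hypothetical size of the target text

# HYPOTHETICAL groups: name -> (tokens covered in the text, lexemes in the list)
groups = {
    "intermediate CAW": (60_000, 500),
    "intermediate other words": (40_000, 2_000),
    "advanced CAW": (9_000, 600),
}

scores = {name: tce(tokens, lexemes, TEXT_TOKENS)
          for name, (tokens, lexemes) in groups.items()}

# Suggested learning order: highest TCE first
ranking = sorted(scores, key=scores.get, reverse=True)

# Raw coverage hides the difference in list size; TCE does not.
coverage_ratio = groups["intermediate CAW"][0] / groups["intermediate other words"][0]
tce_ratio = scores["intermediate CAW"] / scores["intermediate other words"]

print(ranking)
print(coverage_ratio, tce_ratio)  # 1.5 6.0
```

With these assumed counts, the first list covers only 1.5 times as many tokens as the second, yet each of its words is six times as productive because the list is a quarter of the size.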
6. Conclusion • TCE: Text Covering Efficiency = the mean text coverage per one million tokens of the target text by a word from the grouped words • TCE enables us to compare many different types of grouped words in many different genres. Therefore, it makes it easier to decide which words should be learned first in order to read texts in a genre. • TCE enables us to examine the lexical features of texts in different genres.
References Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. Hyland, K., & Tse, P. (2007). Is there an “Academic Vocabulary”? TESOL Quarterly, 41(2), 235–253. Matsushita, T. (松下達彦). (2011). 日本語の学術共通語彙(アカデミック・ワード)の抽出と妥当性の検証 [Extracting and validating the Japanese Academic Word List]. 2011年度 日本語教育学会春季大会 予稿集 [Proceedings of the Conference for Teaching Japanese as a Foreign Language, Spring 2011] (pp. 244–249). Matsushita, T. (松下達彦). (2012). 日本語文芸語彙の抽出と検証 ―コーパスに基づくアプローチ― [Extracting and validating the Japanese Literary Word List: A corpus-based approach]. 第九回国際日本語教育・日本研究シンポジウム (The Ninth Symposium for Japanese Language Education and Japanese Studies), City University of Hong Kong, November 24, 2012. Richards, B. J., & Malvern, D. D. (1997). Quantifying lexical diversity in the study of language development. Reading: University of Reading. Xue, G., & Nation, I. S. P. (1984). A university word list. Language Learning and Communication, 3(2), 215–229.
Added note: Robustness of TCE • In addition, TCE is a robust index that can also clarify the different lexical features of different genres. • As has been argued for the type-token ratio (TTR) (Richards & Malvern, 1997), the relationship between the number of tokens and the number of lexemes in a text varies with text size. This is not a problem for TCE, because the formula uses the number of lexemes in the target group of words rather than the number of lexemes occurring in the text. This is reasonable because learners generally do not know which words will occur in a particular text. For example, to evaluate the intermediate literary words as a source of text coverage, it is reasonable to divide the tokens by the number of lexemes in the intermediate literary word list, which a learner will learn before s/he reads the text.
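The contrast with TTR can be seen in a deliberately artificial toy case: repeating a text doubles its tokens without adding new lexemes, so TTR halves, while TCE is unchanged because its denominator is the size of the word list, not the number of lexemes found in the text. The sentence, the word list, and the function names below are hypothetical, and real texts of different sizes would not behave this neatly.

```python
def type_token_ratio(tokens: list) -> float:
    """Distinct word forms divided by total tokens in the text."""
    return len(set(tokens)) / len(tokens)


def tce_from_text(tokens: list, word_list: set) -> float:
    """TCE with the list size, not the lexemes found in the text, as divisor."""
    covered = sum(1 for token in tokens if token in word_list)
    return covered * 1_000_000 / (len(word_list) * len(tokens))


# HYPOTHETICAL three-word list; "theory" never occurs in the text but still
# counts in the divisor, since a learner would have to learn it anyway.
word_list = {"analysis", "data", "theory"}

short_text = "the data support the analysis of the data".split()
long_text = short_text * 2  # same words, twice the length

print(type_token_ratio(short_text), type_token_ratio(long_text))  # 0.625 0.3125
print(tce_from_text(short_text, word_list),
      tce_from_text(long_text, word_list))  # 125000.0 125000.0
```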