
Chinese WordSketch Online, corpus-based summaries of word usage

Chinese WordSketch Online, corpus-based summaries of word usage. Participants. Adam Kilgarriff, Lexical Computing, UK David Tugwell, Tech University Budapest Pavel Rychly, Brno University Simon Smith, 銘傳大學 ( 中研院 ) 黃居仁 , 中研院 巫宜靜 , 清華大學 ( 中研院 ). Facing the problem: lexical choice.

Presentation Transcript


  1. Chinese WordSketch Online, corpus-based summaries of word usage

  2. Participants • Adam Kilgarriff, Lexical Computing, UK • David Tugwell, Tech University Budapest • Pavel Rychly, Brno University • Simon Smith, Ming Chuan University 銘傳大學 (Academia Sinica 中研院) • 黃居仁, Academia Sinica 中研院 • 巫宜靜, Tsing Hua University 清華大學 (Academia Sinica 中研院)

  3. Facing the problem: lexical choice • “You shall know a word by the company it keeps” (Firth, 1957) • The meaning of face depends on the collocation (詞語搭配) • 學漢語的外國人要面對詞語選擇的問題 (Foreigners learning Chinese have to face the problem of word choice) • 許多種動物正在面臨絕種 (Many animal species are facing extinction) • Similarly with save • Save money • Save life • Save a seat for me

  4. Look in a dictionary? A corpus? • Some modern English dictionaries give some collocation (詞語搭配) information • Chinese dictionaries give very limited help • Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available

  5. Pre-computer corpus! • Oxford English Dictionary: 20 million index cards

  6. KWIC Concordance
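A KWIC concordance lines up every occurrence of a keyword with a few words of context on either side. The following is a minimal sketch of the idea, not the tool used in the presentation: the function name `kwic`, the `width` parameter and the toy sentences (built from the save examples on slide 3) are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return one aligned concordance line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != keyword.lower():
            continue
        # Up to `width` tokens of context on each side of the keyword.
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines


# Toy text, hand-written for this example (not corpus data).
text = ("We must save money this year . Doctors fought to save his life . "
        "Please save a seat for me .")
tokens = text.split()

for line in kwic(tokens, "save"):
    print(line)
```

Sorting the lines by the first word of the right context groups similar uses together, which is what makes a concordance readable by eye for a few hundred hits.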

  7. The coloured pens method • 1 political association • 2 social event • 3 group of people • 4 person in an agreement/dispute • 5 to be party to something...

  8. Limitation of KWIC analysis • As corpora get bigger: too much data • 50 lines for a word: read all • 500 lines: could read all, takes a long time • 5000 lines: no • Instead, create a statistical summary of word usage • Show the most salient (最有顯著性) collocates (Mutual Information)

  9. Mutual Information • Church and Hanks 1989 • MI: how much more often a word pair occurs than one would expect by chance: I(x, y) = log2( P(x, y) / (P(x) P(y)) )
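The ratio of observed to chance co-occurrence can be computed directly from corpus counts. The sketch below ranks the right-hand neighbours of a keyword by that score; it assumes the usual pointwise definition with probabilities estimated as raw counts divided by corpus size, and the function name `right_collocate_mi`, the `min_hits` parameter and the toy token list are illustrative, not taken from the presentation.

```python
import math
from collections import Counter


def right_collocate_mi(tokens, keyword, min_hits=1):
    """Rank words that immediately follow keyword by mutual information."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        if w1 != keyword or pair_count < min_hits:
            continue
        # I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with each P estimated
        # as count / n, which simplifies to the expression below.
        scores[w2] = math.log2(pair_count * n / (unigrams[w1] * unigrams[w2]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy token list, hand-made for this example (not corpus data).
tokens = ["save", "money", "save", "money", "save",
          "the", "the", "money", "the", "time"]

for word, score in right_collocate_mi(tokens, "save"):
    print(f"{word}\t{score:.3f}")
```

A frequency threshold such as `min_hits` matters in practice because the score is unreliable for very rare pairs, which is presumably why the listing on the next slide is restricted to collocates with more than 5 hits.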

  10. Collocation listing • For right collocates of save (>5 hits)

  11. Limitations of collocation listing • Some items are not genuine collocates • yours appears only because it is adjacent to save • The collocates can belong to any part of speech • It would be better if they were classified by POS • and by the role they play in the sentence • Thus, • for arrest in “The police were quick to arrest a number of suspects on the spot” • We would like to see • Keyword: arrest • Subject: police • Object: suspect(s) • Modifier: on the spot

  12. Wordsketch • Attempts to meet these requirements • A corpus-derived one-page summary of a word’s grammatical and collocational behaviour • Implemented for English and Czech • Chinese and Irish implementations in progress
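One way to picture a word sketch is as collocates grouped by the grammatical relation they hold to the keyword. The sketch below assumes the corpus has already been reduced to (keyword, relation, collocate) triples; the function name `word_sketch`, the relation labels and the hand-made triples are hypothetical, and the ranking here uses raw frequency for brevity, whereas the slides describe ranking by salience.

```python
from collections import Counter, defaultdict


def word_sketch(triples, keyword, top=3):
    """Group the collocates of keyword by grammatical relation."""
    by_relation = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == keyword:
            by_relation[relation][collocate] += 1
    # Keep only the `top` most frequent collocates per relation.
    return {rel: counts.most_common(top) for rel, counts in by_relation.items()}


# Hand-made triples for illustration (not output of a real parser).
triples = [
    ("arrest", "subject", "police"),
    ("arrest", "object", "suspect"),
    ("arrest", "modifier", "on the spot"),
    ("arrest", "subject", "police"),
    ("arrest", "object", "protester"),
    ("save", "object", "money"),
]

sketch = word_sketch(triples, "arrest")
for relation, collocates in sketch.items():
    print(relation, collocates)
```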

  13. The corpus: Chinese Gigaword • A Linguistic Data Consortium corpus • Very large: over 1 billion characters • Compiled by David Graff & Ke Chen in 2003 • Minimally tagged • 286 newswire stories, half from each of: • CNA Taiwan (740 million traditional characters) • Xinhua PRC (380 million simplified characters) • Corpus was segmented and tagged using Academia Sinica tools

  14. http://corpora.fi.muni.cz/chinese/ • Example words: 逮捕 (arrest) • 教 (teach) • 學習 (learn) • 銀行 (bank) • 捉 (catch)

  15. Functions • KWIC concordance • Sorting, filtering etc. • Word sketch • Automatic thesaurus • Sketch difference: discriminates near-synonyms • In development: key words in a subcorpus / text type; how a word varies with text type

  16. Grammar writing • Uses CQL (Corpus Query Language) • Christ and Schulze, U. Stuttgart, 1994 • Defining an object: v (adj|n|det|num|adv)* n • Rewriting in CQL with BNC/CLAWS-5 tags: [tag="VV.*"] [tag="(A[JTV]|D|O).*"]* [tag="NN.*"]
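The object pattern can be read as: a lexical verb, then any number of adjective, article, adverb, determiner or ordinal tokens, then a noun. The following sketch applies the same three tag expressions to a list of (word, tag) pairs in plain Python. It is not a CQL engine: the function name `find_objects` is hypothetical, the CLAWS-5 tags in the example sentence were assigned by hand, and whole-tag matching via `fullmatch` is an assumption about how the query is meant to apply.

```python
import re

# The three tag expressions from the CQL query on the slide.
VERB = re.compile(r"VV.*")
MIDDLE = re.compile(r"(A[JTV]|D|O).*")
NOUN = re.compile(r"NN.*")


def find_objects(tagged):
    """Return (verb, noun) pairs matching VERB MIDDLE* NOUN."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if not VERB.fullmatch(tag):
            continue
        # Skip over any run of tokens allowed between verb and noun.
        j = i + 1
        while j < len(tagged) and MIDDLE.fullmatch(tagged[j][1]):
            j += 1
        if j < len(tagged) and NOUN.fullmatch(tagged[j][1]):
            pairs.append((word, tagged[j][0]))
    return pairs


# Hand-tagged example sentence (tags are an assumption, not tagger output).
sentence = [("We", "PNP"), ("saved", "VVD"), ("some", "DT0"),
            ("extra", "AJ0"), ("money", "NN1")]

print(find_objects(sentence))
```

Because the pattern stops at the first noun after the verb, a phrase such as "arrest a number of suspects" would yield number rather than suspects, which illustrates why the grammar rules need refinement.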

  17. Further work • Improve grammatical relations, especially sentence objects, to account for • topicalization: 啤酒,葡萄酒,他都愛喝 (Beer, wine, he loves to drink them all) • 把 fronting: 請把啤酒喝完 (Please finish the beer) • Create a “Dr Eye” style interface, to show common collocations online, in a text

  18. English version available • For personal use • www.sketchengine.co.uk • You are welcome to register and make good use of it!
