
Learning a token classification from a large corpus (A case study in abbreviations)


Presentation Transcript


  1. Learning a token classification from a large corpus (A case study in abbreviations) Petya Osenova & Kiril Simov BulTreeBank Project (www.BulTreeBank.org) Linguistic Modeling Laboratory, Bulgarian Academy of Sciences petya@bultreebank.org, kivs@bultreebank.org ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics August 5 - 9, 2002

  2. Plan of the talk • BulTreeBank Project • Text Archive • Token Processing Problem • Global Token Classification • Application to Abbreviations

  3. BulTreeBank project • It is a joint project between the Linguistic Modeling Laboratory (LML), Bulgarian Academy of Sciences, and the Seminar fuer Sprachwissenschaft (SfS), Tuebingen, funded by the Volkswagen Foundation, Germany. • Its main goal is the creation of a high-quality, HPSG-oriented syntactic treebank of Bulgarian. • It also aims at producing a parser and a partial grammar of Bulgarian. • Within the project, an XML-based system for corpus development is being created.

  4. BulTreeBank team Principal researcher: Kiril Simov Researchers: Petya Osenova, Milena Slavcheva, Sia Kolkovska PhD student: Elisaveta Balabanova Students: Alexander Simov, Milen Kouylekov, Krasimira Ivanova, Dimitar Dojkov

  5. BulTreeBank text archive • A collection of linguistically interpreted texts from different genres (target size: 100 million words) • A linguistically interpreted text is a text in which all meaningful tokens (including numbers, special signs and others) are marked up with linguistic descriptions

  6. The current state of the text archive • Nearly 90 000 000 running words: 15% fiction, 78% newspapers and 7% legal texts, government bulletins and other genres • About 70 million running words have been converted into XML format in accordance with the TEI guidelines • 10 million running words are morphologically tagged • 500 000 running words are manually disambiguated

  7. Pre-processing steps (1) • Morphosyntactic tagger: assigning all appropriate morpho-syntactic features to each potential word • Part-of-speech disambiguator: choosing the right morpho-syntactic features for each potential word in context • Partial grammar for non-word tokens

  8. Pre-processing steps (2) Partial grammars • Sentence boundaries grammar • Named Entity Recognition • Names of people, places, organizations etc. • Dates, currencies, numerical expressions • Abbreviations • Foreign tokens • Chunk grammar (Abney 1991, 1996) • Non-recursive constituents

  9. Token processing problem A token in a text receives its linguistic interpretation on the basis of two sources of information: (1) the language and (2) the context of use. Two problems: • For less studied languages there are not enough language resources (low level of linguistic interpretation) • Erroneous use in the context (wrong predictions)

  10. Token classification • Symbol-based classification: the tokens are defined by their immanent graphical characteristics • General token classification: the tokens fall into several categories: common word, proper name, abbreviation, symbol, punctuation, error • Grammatical and semantic classification: the tokens are presented in several lexicons in which their grammatical and semantic features are listed

  11. General token classification Our goal is to learn a corpus-based classification of tokens with respect to the general token classification. We use this classification in two ways: • For an initial classification of the tokens in the texts before consulting the dictionary, and • For the linguistic processing of the tokens from the different classes

  12. Learning general token categories (1) Token classes: • Common words: typical - lowercased, or with a first capital letter in sentence-initial position; non-typical - all caps • Proper names: typical - first capital letter; non-typical - all caps; wrong - lowercased • Abbreviations: typical - all caps, mixed, or lowercased (with a period, a hyphen, or as a single letter); see the sketch below
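As an illustration of these graphical patterns, here is a minimal Python sketch (not the project's code; the class names and the Cyrillic letter ranges are our assumptions):

    import re

    def potential_classes(token, sentence_initial=False):
        # A token may receive several candidate classes here;
        # the statistical ranking described later decides among them.
        classes = set()
        if re.fullmatch(r"[а-я]+", token):        # all lowercase: typical common word
            classes.add("common-word")
        if re.fullmatch(r"[А-Я][а-я]+", token):   # first capital: typical proper name
            classes.add("proper-name")
            if sentence_initial:                  # ...or a common word at sentence start
                classes.add("common-word")
        if re.fullmatch(r"[А-Я]{2,}", token):     # all caps: typical abbreviation,
            classes.update({"abbreviation",       # non-typical common word / proper name
                            "common-word", "proper-name"})
        if re.fullmatch(r"[А-Яа-я]+\.|[А-Яа-я]+-[А-Яа-я]+|[А-Яа-я]", token):
            classes.add("abbreviation")           # period, hyphen, or a single letter
        return classes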

  13. Learning general token categories (2) Some problems: • Some tokens can belong to more than one class according to their graphical properties. • Spelling errors in a large set of texts could cause misclassification.

  14. Learning general token categories (3) Our classification is not Boolean but gradual: a ranking of tokens with respect to each of the above categories. Our initial procedure included the following steps: • We used graphical criteria for assigning potential categories to the unknown tokens. • We used statistical methods to distinguish, within each category, the most frequent tokens of the category from rare tokens or tokens not in the category.

  15. Learning general token categories (4) Graphical criterion It takes into account the graphical specificity of the tokens. For each category, a list of tokens potentially belonging to it was constructed. Well-known problems arise, such as: • Common words written in capital letters • Abbreviations written in a wrong way Hence the graphical criterion alone is not sufficient.

  16. Learning general token categories (5) Statistical criterion For each category, every candidate token is ranked in order to get the maximal number of right predictions. In fact, we classify normalized tokens. A normalized token is an abstraction over tokens that share the same sequence of letters from a given alphabet.
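Normalization can be thought of as in the following sketch (a hypothetical helper, assuming the Bulgarian alphabet; the project's actual normalization may differ):

    BG_ALPHABET = set("абвгдежзийклмнопрстуфхцчшщъьюя")

    def normalize(token):
        # Reduce a token to its letter sequence, ignoring case, so that
        # e.g. 'БНТ', 'Бнт' and 'бнт' map to the same normalized token.
        return "".join(ch for ch in token.lower() if ch in BG_ALPHABET)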

  17. Learning general token categories (6) Ranking within a category (1) The ranking formula is Rank = TokPar * DocPar, where the two parameters are: TokPar = True/All - the number of true appearances of the token divided by the number of all appearances of the token DocPar - the number of documents in which the correctly written token was found, if this number is less than 50; otherwise the value is 50

  18. Learning general token categories (7) Ranking within a category (2) The first parameter does not distinguish between one occurrence and a hundred; thus the real scope of the distribution is lost. The impact that the token has on the text archive is represented by the second parameter. The upper bound of 50 is used as a normalization parameter. Thus the tokens that are rare or do not belong to the category receive a very small rank. A worked sketch follows below.
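Putting the two parameters together, the ranking can be computed as in this sketch (the counts are assumed inputs; the cap of 50 follows the slides above):

    def rank(true_occurrences, all_occurrences, documents_with_true_use, cap=50):
        # Rank = TokPar * DocPar
        tok_par = true_occurrences / all_occurrences   # share of true appearances
        doc_par = min(documents_with_true_use, cap)    # document spread, capped at 50
        return tok_par * doc_par

    # A token seen correctly 120 times out of 130, across 80 documents:
    # rank(120, 130, 80) = (120/130) * 50 ≈ 46.2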

  19. Learning general token categories (8) Usefulness • The method favours the tokens with greater impact on the whole corpus • The tokens appearing in a small number of documents are processed by local-based (document-centered) approaches

  20. General token categories and local-based approaches • The local-based approaches can filter the general classification with respect to ambiguous or unusual usages of tokens • When a local-based approach is inapplicable, the information is taken from the general token classification • The result of such a ranking is very useful for the other task mentioned above - the linguistic treatment of unknown tokens

  21. Abbreviations in the pre-processing • Abbreviations are special tokens in the text • They contribute to robust: • tagging • disambiguation • shallow parsing

  22. Extraction criteria Three criteria: • Graphical criterion (as above) • Statistical criterion (as above) • Context criterion - we tried to extract abbreviations together with their expansions, which are usually written in brackets; this reduces the ambiguity (see the sketch below)
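The context criterion could be approximated with a pattern like the following sketch (the regular expression is illustrative, assuming the common 'expansion (ABBREVIATION)' layout, and is not the grammar actually used in the project):

    import re

    # A capitalized phrase followed by an all-caps token in brackets,
    # e.g. "Агенция за чуждестранна помощ (АЧП)"
    PAIR = re.compile(r"([А-Я][а-я]+(?:\s+[а-я]+)*)\s*\(([А-Я]{2,})\)")

    def extract_pairs(text):
        # Returns (expansion, abbreviation) pairs found in brackets.
        return PAIR.findall(text)

    extract_pairs("Агенция за чуждестранна помощ (АЧП) обяви ...")
    # -> [('Агенция за чуждестранна помощ', 'АЧП')]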

  23. Dealing with abbreviations Our approach includes three steps: • Typological classification - the existing classifications were refined with respect to the electronic treatment of abbreviations • Extraction - different criteria were proposed for the extraction of the most frequent abbreviations in the archive • Linguistic treatment - the abbreviations were extended and the relevant grammatical information was added

  24. Typological classification

  25. Linguistic treatment (1) • Encoding the linguistic information shared by all abbreviations: • the head element presents the abbreviation itself • every abbreviation has a generalized type: acronym or word • every abbreviation has at least one extension • every extension element consists of a phrase

  26. Linguistic treatment (2) • Encoding the linguistic information shared by some types of abbreviations: • the non-lexicalized abbreviations were assigned grammatical information according to their syntactic head; thus the element 'class' was introduced • the partly lexicalized abbreviations were additionally assigned grammatical information according to their inflection; thus the element 'flex' was introduced • the abbreviations of foreign origin usually have an additional head element, called headforeign (headf)

  27. Examples (1) type ACRONYM
<abbr><head>АЧП</head><acronym/><expan><phrase>Агенция за чуждестранна помощ</phrase><class>Сжед</class></expan></abbr>
<abbr><head>ДП</head><acronym/><expan><phrase>Държавно предприятие</phrase><class>Ссред</class></expan><expan><phrase>Демократическа партия</phrase><class>Сжед</class></expan></abbr>
<abbr><head>ЗУНК</head><acronym/><expan><phrase>Закон за уреждане на необслужваните кредити</phrase><class>Смед</class><flex>ЗУНК-а,ЗУНК-ът,ЗУНК-ове</flex></expan></abbr>
<abbr><head>ФБР</head><headf>FBI</headf><acronym/><expan><phrase>Федерално бюро за разследване</phrase><class>Ссред</class></expan></abbr>

  28. Examples (2) type WORD
<abbr><head>г-ца</head><word/><expan><phrase>госпожица</phrase></expan></abbr>
<abbr><head>гр.</head><word/><expan><phrase>град</phrase></expan></abbr>
<abbr><head>в.</head><head>в-к</head><word/><expan><phrase>вестник</phrase></expan></abbr>
<abbr><head>ез.</head><word/><expan><phrase>езеро</phrase></expan><expan><phrase>език</phrase></expan></abbr>

  29. Evaluation The method is hard to evaluate absolutely with respect to only one class of tokens. We therefore apply only a relative evaluation with respect to a given rank. Only the precision measure is really applicable; the recall is practically equal to 100%. Precision = 98.7% for the first 557 candidates (Rank >= 25)
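The relative evaluation above amounts to computing precision over the candidates at or above a rank threshold, as in this sketch (the data layout is our assumption):

    def precision_at_rank(candidates, threshold):
        # candidates: list of (rank, is_correct) pairs, is_correct a bool.
        # Recall is taken to be ~100%, so precision is the measure that matters.
        selected = [ok for r, ok in candidates if r >= threshold]
        return sum(selected) / len(selected) if selected else 0.0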

  30. Other applications • Classification and linguistic treatment of other classes of tokens: names, sentence boundary markers (similar to abbreviations) • Determination of the vocabulary of a dictionary for human use: the lexemes with the greatest impact on present-day texts will be chosen • Similar treatment of new words

  31. Future work • Dealing with different ambiguities • Combination with other methods, such as document-centered approaches and morphological guessers • Using other stochastic methods
