Sentence Classification and Clause Detection for Croatian

Sentence Classification and Clause Detection for Croatian. Kristina Vučković, Željko Agić, Marko Tadić. Department of Information Sciences, Department of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb. {kvuckovi, zagic, marko.tadic}@ffzg.hr. FASSBL 7 Conference.

Presentation Transcript


  1. Sentence Classification and Clause Detection for Croatian Kristina Vučković, Željko Agić, Marko Tadić Department of Information Sciences, Department of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb {kvuckovi, zagic, marko.tadic}@ffzg.hr FASSBL 7 Conference, Dubrovnik, Croatia, 2010-10-05

  2. Overview • What? • classifying Croatian sentences by structure • detecting independent and dependent clauses • How? • implemented a prototype system in NooJ • linked it with a morphosyntactic tagger • evaluated on a sample from Croatian corpora • Why? • rule-based chunking and shallow parsing

  3. Classification and detection • sentence segmentation is easy when considering sentence boundaries only • here, we: • detect boundaries of clauses in complex sentences • assign type to sentences • sentence classification • purpose: declarative, interrogative, etc. • structure: simple and complex • complex sentences • independent complex, i.e. compound sentences • dependent complex sentences

  4. Classification and detection • independent complex sentences • independent clause connected to the main clause by using a conjunction • type defined by the choice of conjunction • e.g. constituent clause, conjunctions {i, pa, te, ni, niti} • disjunctive, opposite, exclusive, conclusive and explanatory clause • Svi su spavali, jedino sam ja bio budan. 'Everyone was asleep, only I was awake.' (exclusive) • dependent complex sentences • main clause is independent, all the others depend on it and cannot stand alone in a sentence • predicative, subjective, objective, attributive, appositional and adverbial clause • Ispričat ću ti što mi se dogodilo. 'I will tell you what happened to me.' (objective)
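
The idea that the conjunction determines the type of an independent clause can be sketched in a few lines of Python. This is only an illustration, not the NooJ grammar used in the prototype: the function names are hypothetical, the comma-based clause split is a deliberate simplification, and the "exclusive" entry is an assumption inferred from the example sentence above, whose second clause opens with "jedino". Only the "constituent" conjunction set is taken from the slide.

```python
import re
from typing import List, Optional, Tuple

# "constituent" is the set listed on the slide; "exclusive" is an assumption
# inferred from the slide's example sentence (second clause opens with "jedino").
CONJUNCTIONS = {
    "constituent": {"i", "pa", "te", "ni", "niti"},
    "exclusive": {"jedino"},
}


def clause_type(clause: str) -> Optional[str]:
    """Return the type signalled by the clause-initial word, or None."""
    words = re.findall(r"\w+", clause.lower())
    if not words:
        return None
    for label, conjunctions in CONJUNCTIONS.items():
        if words[0] in conjunctions:
            return label
    return None


def classify_compound(sentence: str) -> List[Tuple[str, Optional[str]]]:
    """Split naively on commas and label each clause by its first word."""
    clauses = [part.strip() for part in sentence.split(",") if part.strip()]
    return [(clause, clause_type(clause)) for clause in clauses]


if __name__ == "__main__":
    for clause, label in classify_compound("Svi su spavali, jedino sam ja bio budan."):
        print(f"{label or 'unmarked'}: {clause}")
```

A clause without a listed conjunction comes back as `None`, which a fuller system would resolve with further rules (for instance, treating it as the main clause).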

  5. The system • prototype implemented in NooJ • finite state transducer cascades (local grammars) • Croatian lexical resources • each cascade detects and annotates a different type of clause • built on top of a chunker for Croatian • the top-level grammar • two types of subgraphs: main clauses and independent clauses

  6. The system • Main clause grammar • presence of a VP and possibly any other phrase • independent clauses recognized just by using the conjunctions • implementation of dependent clause detection varies across clause types
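
The two rules on this slide (a main clause needs a VP, and independent clauses are found through their conjunctions) can be mimicked over chunker output as follows. This is a rough sketch under stated assumptions, not the NooJ implementation: the chunk labels `NP`, `VP` and `CONJ`, the helper names, and the hand-made chunking of the example sentence are all hypothetical.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (label, text); labels are hypothetical, not NooJ's


def split_at_conjunctions(chunks: List[Chunk]) -> List[List[Chunk]]:
    """Open a new segment whenever a conjunction chunk is encountered."""
    segments: List[List[Chunk]] = [[]]
    for chunk in chunks:
        if chunk[0] == "CONJ" and segments[-1]:
            segments.append([])
        segments[-1].append(chunk)
    return segments


def is_clause(segment: List[Chunk]) -> bool:
    """A segment counts as a clause only if it contains a verb phrase."""
    return any(label == "VP" for label, _ in segment)


if __name__ == "__main__":
    # Hand-made chunking for illustration only, not actual chunker output.
    chunks = [
        ("NP", "Svi"),
        ("VP", "su spavali"),
        ("CONJ", "jedino"),
        ("VP", "sam ja bio budan"),
    ]
    for segment in split_at_conjunctions(chunks):
        text = " ".join(token for _, token in segment)
        print(is_clause(segment), text)
```

Segments that lack a VP are rejected rather than reported as clauses, which mirrors the requirement that a VP be present.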

  7. Experiment setup • used the CW100 corpus • XCES-encoded to word level • sentence delimited, tokenized, manually lemmatized and MSD-annotated • 200 randomly selected sentences • 100 for the development and 100 for testing • utilized the CroTag tagger • NooJ input format allows external annotation • created three systems • no preprocessing • tagging input sentences with CroTag (~85% accuracy) • using the manually assigned tags from CW100 • recall, precision, F1-measure
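
For reference, the three evaluation measures named on this slide can be computed as below. The sketch assumes that each clause annotation is represented as a hashable tuple and that a prediction counts as correct only on an exact match with a gold annotation; the slides do not state the matching criterion, so this is an assumption. The sample data is invented purely to exercise the function and is not taken from the reported results.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    gold: Set[Hashable], predicted: Set[Hashable]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over two sets of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented annotations: (sentence id, clause type). Not real results.
    gold = {(1, "main"), (1, "exclusive"), (2, "main"), (2, "objective")}
    predicted = {(1, "main"), (1, "exclusive"), (2, "main"), (2, "attributive")}
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```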

  8. Results • scores for the three systems • “perfect” tagging system is the top-performer • benefits of automatic tagging? • distribution of assigned types • main, objective, opposite, adverbial, attribute, ... • misclassifications • attributive and objective most commonly misclassified • data sparseness

  9. Conclusions and future work • the system scores well in terms of F1-measure • open issues • verb coordination • dislocated nominal predicates • attribute clauses starting with a PP • complex insertion of dependent clauses • no real benefit from automatic MSD-tagging • future work • resolving the issues • re-evaluation on a larger test set? • integration with a rule-based shallow parser

  10. Thank you for your attention. The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347. www.accurat-project.eu
