This work explores the complexities of PP-attachment in Bulgarian through a customized extension of the Earley-Stolcke algorithm. As part of my bachelor's thesis at the University of Plovdiv, I present a detailed examination of PP-attachment ambiguity, the grammar rules involved, and the effectiveness of the implemented algorithm in resolving it. Despite careful design, the results show that common ambiguities remain unresolved, highlighting the limitations of stochastic context-free grammars in handling this phenomenon. The study also outlines future work, such as improving the grammar and enhancing parsing accuracy.
Atanas Georgiev Chanev, PhD student in Cognitive Sciences and Education, University of Trento. Bachelor's: FMI, University of Plovdiv, Bulgaria
A PP-Attachment Conundrum for Bulgarian: based on the parser I have implemented (an extension of the Earley-Stolcke algorithm) and the results I have obtained, i.e. the diploma work for my bachelor's degree
Note: I won't discuss algorithms designed specifically for PP attachment. I'll show that my approach fails to resolve PP-attachment ambiguities in most cases.
Contents:
  The problem
  The prerequisites
  The algorithm
  The grammar
  The results
  The PP-attachment problem
  Future work
  Acknowledgements
Slides: 28
The Problem: Parsing natural languages (Bulgarian). Shallow parsing vs. full parsing
What Is Syntax? POS tagging? Phrase structures? Grammatical relations? Grammatical functions?
Constituent Structures: Rules like S -> NP VP; NP -> NP PP; ... Problem: ambiguity. An approach to resolving ambiguity in Bulgarian: (Tanev 2001)
The Prerequisites: a morphological processor for Bulgarian (Krushkov 1997), a POS tagger (Tachev 2001), ... I describe the grammar in a separate section.
The Algorithm: the Earley algorithm (Earley 1970), with three steps: predictor, scanner, completer
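As a sketch, the three steps can be put together into a minimal (non-probabilistic) Earley recognizer. The toy grammar, lexicon, and example sentence below are illustrative, not the grammar or corpus from the thesis:

```python
# A minimal sketch of the Earley recognizer (Earley 1970).
# Grammar and lexicon are illustrative, not the thesis grammar.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["NP", "PP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "PP": [["P", "NP"]],
}
LEXICON = {"the": "Det", "man": "N", "saw": "V", "dog": "N",
           "with": "P", "telescope": "N"}

def earley_recognize(words):
    tags = [LEXICON[w] for w in words]
    n = len(tags)
    # A state is (lhs, rhs, dot, origin); chart[i] holds states ending at i.
    chart = [set() for _ in range(n + 1)]
    chart[0].add(("S'", ("S",), 0, 0))          # dummy start state
    for i in range(n + 1):
        changed = True
        while changed:                           # iterate to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in GRAMMAR:
                    # PREDICTOR: expand the nonterminal after the dot
                    for prod in GRAMMAR[rhs[dot]]:
                        new = (rhs[dot], tuple(prod), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); changed = True
                elif dot == len(rhs):
                    # COMPLETER: advance states waiting on this nonterminal
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); changed = True
        if i < n:
            # SCANNER: match the next POS tag against the input
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == tags[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return ("S'", ("S",), 1, 0) in chart[n]

print(earley_recognize("the man saw the dog with the telescope".split()))
```

Note that the recognizer only answers yes/no; a full parser additionally keeps back-pointers to reconstruct trees, and this grammar gives the example sentence two of them (PP attached to the verb or to the noun).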
Stolcke's extension: each chart state is assigned two probabilities, an inner probability and a forward probability, which are updated differently in each step (predictor, scanner, completer). This stochastic extension is capable of resolving ambiguities (Stolcke 1993)
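The probability bookkeeping can be sketched as follows; the state layout and function names are my illustration of Stolcke's update rules, not his notation, and the tag-probability-1 assumption in the scanner is a simplification:

```python
from dataclasses import dataclass

# Sketch of the probability bookkeeping in Stolcke's extension (Stolcke 1993):
# every chart state carries a forward probability (summed probability of all
# derivation prefixes reaching the state) and an inner probability (probability
# of the substring spanned so far).

@dataclass
class PState:
    forward: float   # alpha: prefix probability up to this state
    inner: float     # gamma: probability of the spanned substring

def predict(parent: PState, rule_prob: float) -> PState:
    # PREDICTOR: the new state inherits the parent's forward mass,
    # scaled by the probability of the predicted rule.
    return PState(forward=parent.forward * rule_prob, inner=rule_prob)

def scan(state: PState) -> PState:
    # SCANNER: matching a terminal leaves both probabilities unchanged
    # (assuming the word's POS tag is given with probability 1).
    return PState(forward=state.forward, inner=state.inner)

def complete(parent: PState, child: PState) -> PState:
    # COMPLETER: multiply in the inner probability of the finished constituent.
    return PState(forward=parent.forward * child.inner,
                  inner=parent.inner * child.inner)

print(predict(PState(forward=1.0, inner=1.0), rule_prob=0.4))
```

In the full algorithm the forward probabilities are additionally summed over all states predicting the same rule, which is what makes prefix probabilities computable.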
Shallow trees are better? Deep trees always have smaller probabilities, since each extra rule application multiplies in a probability of at most 1.
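A quick numeric illustration of this bias, with made-up rule probabilities (not estimated from the thesis corpus): a tree's probability is the product of its rule probabilities, so the analysis that uses more rules can only come out smaller.

```python
from math import prod

# Illustrative SCFG rule probabilities (invented for the example).
rule_p = {"S -> NP VP": 1.0, "NP -> Det N": 0.6, "NP -> NP PP": 0.4,
          "VP -> V NP": 0.7, "VP -> VP PP": 0.3, "PP -> P NP": 1.0}

# A flat three-rule tree vs. a deeper six-rule tree (extra PP attachment).
shallow = prod(rule_p[r] for r in ["S -> NP VP", "NP -> Det N", "VP -> V NP"])
deep = prod(rule_p[r] for r in ["S -> NP VP", "NP -> Det N", "VP -> VP PP",
                                "VP -> V NP", "PP -> P NP", "NP -> Det N"])
print(shallow, deep)  # the deeper tree has the lower probability
```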
+ Basic Unification: a basic unification mechanism, based on agreement constraints. Full unification as described in (Jurafsky, Martin 2001), performed at each step, is too inefficient.
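The agreement-constraint check can be sketched as a flat feature-compatibility test, far simpler than full feature-structure unification. The feature names and values below are hypothetical, not the thesis feature set:

```python
# Sketch of agreement-constraint checking: each constituent carries a flat
# feature dict, and two constituents agree if every feature they share has
# the same value (features missing on one side are treated as compatible).
def agree(a: dict, b: dict) -> bool:
    return all(b.get(k, v) == v for k, v in a.items())

# e.g. in a Bulgarian NP, adjective and noun must match in gender and number
adj  = {"gender": "m", "number": "sg"}
noun = {"gender": "m", "number": "sg", "definite": True}
print(agree(adj, noun))   # features compatible, so the NP rule may apply
```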
The Grammar: two versions of the grammar, collected from a mini-corpus of sentences in the newspaper-article register
The mini-corpus: 5331 tokens, more than 450 sentences, grammatically and syntactically annotated
The PPs: two types: those modifying the verb (AdvPs) and those modifying the noun (PPs)
POS-tag ambiguities: 'shte' – future-tense auxiliary or particle; 'govoreshtiqt' – verb or adjective, as in 'govoreshtiqt student' ('the speaking student')
The Results: Precision: 42.42% Recall: 66.00% F-measure: 51.65%
How I define precision and recall: Precision = (number of correctly parsed sentences) / (number of sentences given any prediction); Recall = (number of sentences given any prediction) / (number of tested sentences)
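With these definitions, the reported F-measure is the usual harmonic mean of precision and recall, which a quick check confirms against the numbers above:

```python
# F-measure as the harmonic mean of precision and recall.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(42.42, 66.00), 2))  # 51.65, matching the reported value
```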
The PP-Attachment Problem (continued over the next four slides): How many of the correctly parsed sentences contain PPs? How many of the mis-parsed sentences contain PPs?
A considerable share of the mis-parsed sentences contain PPs: How many of the correctly parsed sentences contain PPs? 27.59%. How many of the mis-parsed sentences contain PPs? 32.61%. BUT: sentences that are not given any prediction also contain PPs, and AdvPs are sometimes not ambiguous, e.g. at the beginning of the sentence.
Conclusion: stochastic context-free grammars are not powerful enough to deal with the PP-attachment problem (at least with this approach).
Future Work: a clause splitter for Bulgarian; a better grammar (= a better corpus); a better unification processor; semantic constraints
Acknowledgements: [1] Krushkov, Hr., Modeling and Building Machine Dictionaries and Morphological Processors (in Bulgarian), PhD dissertation, Plovdiv University "P. Hilendarski", Plovdiv, 1997. [2] Tachev, G., A Stochastic Part-of-Speech Tagger (in Bulgarian), diploma thesis, Plovdiv University "P. Hilendarski", Plovdiv, 2001. [3] Tanev, Hr., Automatic Text Analysis and Ambiguity Resolution in Bulgarian (in Bulgarian), PhD dissertation, Plovdiv University "P. Hilendarski", Plovdiv, 2001. [4] Earley, J., An Efficient Context-Free Parsing Algorithm, Communications of the ACM, 13(2):94-102, 1970. [5] Stolcke, A., An Efficient Probabilistic Context-Free Parsing Algorithm That Computes Prefix Probabilities, Technical Report TR-93-065, International Computer Science Institute, Berkeley, CA, 1993. Revised 1994. [6] Jurafsky, D., Martin, J. H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, New Jersey, 2001.