This work explores the complexities of PP-attachment in Bulgarian through a customized extension of the Earley-Stolcke algorithm. As part of my bachelor's thesis at the University of Plovdiv, I present a detailed examination of PP-attachment ambiguity, the grammar rules involved, and the effectiveness of the implemented algorithm in resolving it. Despite careful design, the results show that common ambiguities remain unresolved, highlighting the limitations of stochastic context-free grammars in handling this phenomenon. The study also outlines future work, such as improving the grammar and enhancing parsing accuracy.
Atanas Georgiev Chanev, PhD student in Cognitive Sciences and Education, University of Trento. Bachelor's: FMI, University of Plovdiv, Bulgaria
A PP-Attachment Conundrum for Bulgarian: based on the parser I have implemented (an extension of the Earley-Stolcke algorithm) and the results I have obtained, i.e. the diploma work for my bachelor's degree
Note: I won't discuss algorithms designed specifically for PP attachment. I'll show that my approach fails to resolve PP-attachment ambiguities in most cases.
Contents:
  The problem
  The prerequisites
  The algorithm
  The grammar
  The results
  The PP-attachment problem
  Future work
  Acknowledgements
Slides: 28
The Problem: Parsing natural languages (Bulgarian). Shallow parsing vs. full parsing
What Is Syntax? POS tagging? Phrase structures? Grammatical relations? Grammatical functions?
Constituent Structures: Rules like S -> NP VP; NP -> NP PP; ... Problem: ambiguity. An approach to resolving ambiguity in Bulgarian: (Tanev 2001)
The Prerequisites: a morphological processor for Bulgarian (Krushkov 1997), a POS tagger (Tachev 2001), ... I describe the grammar in a separate section.
The Algorithm: the Earley algorithm (Earley 1970), with three steps: predictor, scanner, completer
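As a sketch, the three steps can be put together into a minimal (non-probabilistic) Earley recognizer. The toy grammar, lexicon, and example sentence below are illustrative, not the grammar or corpus from the thesis:

```python
# A minimal sketch of the Earley recognizer (Earley 1970).
# Grammar and lexicon are illustrative, not the thesis grammar.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["NP", "PP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "PP": [["P", "NP"]],
}
LEXICON = {"the": "Det", "man": "N", "saw": "V", "dog": "N",
           "with": "P", "telescope": "N"}

def earley_recognize(words):
    tags = [LEXICON[w] for w in words]
    n = len(tags)
    # A state is (lhs, rhs, dot, origin); chart[i] holds states ending at i.
    chart = [set() for _ in range(n + 1)]
    chart[0].add(("S'", ("S",), 0, 0))          # dummy start state
    for i in range(n + 1):
        changed = True
        while changed:                           # iterate to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in GRAMMAR:
                    # PREDICTOR: expand the nonterminal after the dot
                    for prod in GRAMMAR[rhs[dot]]:
                        new = (rhs[dot], tuple(prod), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); changed = True
                elif dot == len(rhs):
                    # COMPLETER: advance states waiting on this nonterminal
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); changed = True
        if i < n:
            # SCANNER: match the next POS tag against the input
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == tags[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return ("S'", ("S",), 1, 0) in chart[n]

print(earley_recognize("the man saw the dog with the telescope".split()))
```

Note that the recognizer only answers yes/no; a full parser additionally keeps back-pointers to reconstruct trees, and this grammar gives the example sentence two of them (PP attached to the verb or to the noun).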
Stolcke's extension: each chart state is assigned two probabilities, an inner probability and a forward probability, which are updated differently in each step (predictor, scanner, completer). This stochastic extension is capable of resolving ambiguities (Stolcke 1993)
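The probability bookkeeping can be sketched as follows; the state layout and function names are my illustration of Stolcke's update rules, not his notation, and the tag-probability-1 assumption in the scanner is a simplification:

```python
from dataclasses import dataclass

# Sketch of the probability bookkeeping in Stolcke's extension (Stolcke 1993):
# every chart state carries a forward probability (summed probability of all
# derivation prefixes reaching the state) and an inner probability (probability
# of the substring spanned so far).

@dataclass
class PState:
    forward: float   # alpha: prefix probability up to this state
    inner: float     # gamma: probability of the spanned substring

def predict(parent: PState, rule_prob: float) -> PState:
    # PREDICTOR: the new state inherits the parent's forward mass,
    # scaled by the probability of the predicted rule.
    return PState(forward=parent.forward * rule_prob, inner=rule_prob)

def scan(state: PState) -> PState:
    # SCANNER: matching a terminal leaves both probabilities unchanged
    # (assuming the word's POS tag is given with probability 1).
    return PState(forward=state.forward, inner=state.inner)

def complete(parent: PState, child: PState) -> PState:
    # COMPLETER: multiply in the inner probability of the finished constituent.
    return PState(forward=parent.forward * child.inner,
                  inner=parent.inner * child.inner)

print(predict(PState(forward=1.0, inner=1.0), rule_prob=0.4))
```

In the full algorithm the forward probabilities are additionally summed over all states predicting the same rule, which is what makes prefix probabilities computable.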
Shallow trees are better? Deep trees always have smaller probabilities, since each extra rule application multiplies in a probability of at most 1.
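A quick numeric illustration of this bias, with made-up rule probabilities (not estimated from the thesis corpus): a tree's probability is the product of its rule probabilities, so the analysis that uses more rules can only come out smaller.

```python
from math import prod

# Illustrative SCFG rule probabilities (invented for the example).
rule_p = {"S -> NP VP": 1.0, "NP -> Det N": 0.6, "NP -> NP PP": 0.4,
          "VP -> V NP": 0.7, "VP -> VP PP": 0.3, "PP -> P NP": 1.0}

# A flat three-rule tree vs. a deeper six-rule tree (extra PP attachment).
shallow = prod(rule_p[r] for r in ["S -> NP VP", "NP -> Det N", "VP -> V NP"])
deep = prod(rule_p[r] for r in ["S -> NP VP", "NP -> Det N", "VP -> VP PP",
                                "VP -> V NP", "PP -> P NP", "NP -> Det N"])
print(shallow, deep)  # the deeper tree has the lower probability
```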
+ Basic Unification: a basic unification mechanism, based on agreement constraints. Full unification as described in (Jurafsky, Martin 2001), performed at each step, is too inefficient.
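The agreement-constraint check can be sketched as a flat feature-compatibility test, far simpler than full feature-structure unification. The feature names and values below are hypothetical, not the thesis feature set:

```python
# Sketch of agreement-constraint checking: each constituent carries a flat
# feature dict, and two constituents agree if every feature they share has
# the same value (features missing on one side are treated as compatible).
def agree(a: dict, b: dict) -> bool:
    return all(b.get(k, v) == v for k, v in a.items())

# e.g. in a Bulgarian NP, adjective and noun must match in gender and number
adj  = {"gender": "m", "number": "sg"}
noun = {"gender": "m", "number": "sg", "definite": True}
print(agree(adj, noun))   # features compatible, so the NP rule may apply
```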
The Grammar: two versions of the grammar, collected from a mini-corpus of sentences in the newspaper-article register
The mini-corpus: 5331 tokens, more than 450 sentences, grammatically and syntactically annotated
The PPs: two types: those modifying the verb (AdvPs) and those modifying the noun (PPs)
POS-tag ambiguities: 'shte' – future-tense auxiliary or particle; 'govoreshtiqt' – verb or adjective, as in 'govoreshtiqt student' ('the speaking student')
The Results: Precision: 42.42% Recall: 66.00% F-measure: 51.65%
How I define precision and recall: Precision = (number of correctly parsed sentences) / (number of sentences given any prediction); Recall = (number of sentences given any prediction) / (number of tested sentences)
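With these definitions, the reported F-measure is the usual harmonic mean of precision and recall, which a quick check confirms against the numbers above:

```python
# F-measure as the harmonic mean of precision and recall.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(42.42, 66.00), 2))  # 51.65, matching the reported value
```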
The PP-Attachment Problem (continued over the next four slides): How many of the correctly parsed sentences contain PPs? How many of the mis-parsed sentences contain PPs?
A considerable share of the mis-parsed sentences contain PPs: How many of the correctly parsed sentences contain PPs? 27.59%. How many of the mis-parsed sentences contain PPs? 32.61%. BUT: sentences that are not given any prediction also contain PPs, and AdvPs are sometimes not ambiguous, e.g. at the beginning of the sentence.
Conclusion: stochastic context-free grammars are not powerful enough to deal with the PP-attachment problem (at least with this approach).
Future Work: a clause splitter for Bulgarian; a better grammar (= a better corpus); a better unification processor; semantic constraints
Acknowledgements: [1] Krushkov, Hr., Modeling and Building Machine Dictionaries and Morphological Processors (in Bulgarian), PhD dissertation, Plovdiv University "P. Hilendarski", Plovdiv, 1997. [2] Tachev, G., A Stochastic Part-of-Speech Tagger (in Bulgarian), diploma thesis, Plovdiv University "P. Hilendarski", Plovdiv, 2001. [3] Tanev, Hr., Automatic Text Analysis and Ambiguity Resolution in Bulgarian (in Bulgarian), PhD dissertation, Plovdiv University "P. Hilendarski", Plovdiv, 2001. [4] Earley, J., An Efficient Context-Free Parsing Algorithm, Communications of the ACM, 13(2):94-102, 1970. [5] Stolcke, A., An Efficient Probabilistic Context-Free Parsing Algorithm That Computes Prefix Probabilities, Technical Report TR-93-065, International Computer Science Institute, Berkeley, CA, 1993. Revised 1994. [6] Jurafsky, D., Martin, J. H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, New Jersey, 2001.