Comparative Structures in Croatian: MWU Approach Kristina Kocijan, Sara Librenjak

Department of Information and Communication Sciences, University of Zagreb; krkocijan@ffzg.hr, sara.librenjak@gmail.com; Europhras 2015, Malaga, Spain, 2015-07-01

Language of our work: Croatian • South Slavic language • Highly similar to Bosnian, Serbian and Montenegrin • Latin alphabet • Properties: • Highly inflected (7 cases) • Syntactically flexible (almost any word order possible) • Pronoun dropping • A challenge for computational processing

Computational approach to idioms • Comparative structures as a subtype of idiomatic structures • Two approaches to computational language processing • Statistical approach • Rule-based approach • Idioms • Highly specific part of language (i.e. replacing one word changes the whole meaning) • Statistical approach would yield imprecise results • Rule-based approach preferable, especially when dealing with inflected languages

Importance of idioms in computational processing of texts • Present in language, yet often ignored • Difficult to process – described only linguistically • Cause incomplete computational understanding of the language and imprecise translation • Lack of real data about their frequency • Why are they difficult to process? • Because of their multi-word nature • Because of their elusive semantic properties (meaning is not the sum of the words) • Because of their cultural and historical nuances, which make them very difficult to translate without special preparation

Croatian phraseology and comparisons • Well described linguistically (Croatian Dictionary of Idioms with ~2500 entries) • Lack of the systematic approach essential for text processing • Sorted into categories for the purposes of this work • Comparative structures as one of the main categories of idioms • Radi kao pčela (Working hard as a bee) • Puši kao Turčin (Smokes like a pipe, lit. like a Turk) • Brz poput strijele (Fast as an arrow) • Approximately 540 set comparative phrases in Croatian (Fink-Arsovski)

Comparisons in literature and beyond • Comparative structures (usporedbe or poredbe) mainly a feature of literary texts and newspapers • Filaković (2008) assumes their presence in works of fiction, based on an analysis of the works of the Croatian writer I. B. Mažuranić • Kovačević (2012) reports linguistic creativity in the use of comparative structures in newspaper articles • Mance and Trtanj (2010) note the usage of modern slang variants of the comparisons • No statistical data about their real usage in various types of text

Goals of this work • To build a tool for automated processing of comparative idioms in Croatian texts • To be able to recognize them in any type of text as multi-word units • Extract, describe and enumerate the structures • Collect statistical data about their frequency in different styles of text • Serve as an example for similar work in other languages • Be used as a tool in automated or semi-automated machine translation from Croatian to any language (given additional work)

NooJ – a tool for rule-based automated text processing • NooJ – free-to-use linguistic development environment for various kinds of rule-based automated text and corpora processing • http://nooj4nlp.net/ • Morphological, syntactic and semantic processing with options for translation and transformation of sentences • Ready-made resources for more than twenty languages: • Acadian, Arabic, Armenian, Belarusian, Bulgarian, Catalan, Croatian, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Serbian, Slovene, Spanish, Turkish, Vietnamese • Great tool for highly inflected languages

Methodology • Listing and categorizing the idioms • Definition and recognition of rules • Construction of training and testing corpora • Construction of grammars for processing texts • Using NooJ as a platform • Testing phase • Calculation of results

Listing and categorizing the idioms • Based on the Croatian Dictionary of Idioms and idioms manually found in a Croatian corpus • For the purposes of the computational approach, we defined five major categories • a) Noun phrase with an attribute or apposition • b) Verbal phrase with a direct object • c) Verbal phrase with an optional direct object which can disrupt the syntactic structure • d) Comparative structure (A/V as N) • e) Fixed phrase which doesn't change in any syntactic environment

Definition and recognition of rules • 312 different comparative constructions in our dictionary • Recognized in any form, tense, case and word order • Divided into 5 subcategories according to their syntactic properties • Adjective AS Noun = 89 • Noun AS Preposition = 9 • AS a Noun/Adjective = 49, of which: AS a Noun (7), AS a PP fixed phrase (37), AS a N + PP (5) • Verb AS Noun = 157 • AS IF Verb = 8
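As a quick consistency check on the figures above, the subcategory counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary names are made up for this example and are not part of the authors' NooJ resources.

```python
# Subcategory counts copied from the slide; the variable names are hypothetical.
SUBCATEGORY_COUNTS = {
    "Adjective AS Noun": 89,
    "Noun AS Preposition": 9,
    "AS a Noun/Adjective": 49,
    "Verb AS Noun": 157,
    "AS IF Verb": 8,
}

# Finer breakdown of the "AS a Noun/Adjective" subcategory.
AS_NOUN_ADJ_BREAKDOWN = {
    "AS a Noun": 7,
    "AS a PP fixed phrase": 37,
    "AS a N + PP": 5,
}


def total(counts):
    """Sum all counts in a category dictionary."""
    return sum(counts.values())


def shares(counts):
    """Return each category's share of the total, as a percentage."""
    whole = total(counts)
    return {name: 100.0 * n / whole for name, n in counts.items()}


if __name__ == "__main__":
    print("All constructions:", total(SUBCATEGORY_COUNTS))
    for name, pct in shares(SUBCATEGORY_COUNTS).items():
        print(f"{name}: {pct:.1f}%")
```

The five subcategories add up to the 312 constructions in the dictionary, and the three finer groups add up to the 49 entries of the "AS a Noun/Adjective" subcategory.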

Construction of training and testing corpora • First phase: training • A smaller corpus of sentences exclusively containing the structures in question (comparative structures with the words „kao” or „poput”) • Second phase: testing • After the completion of the grammars (NooJ files for processing texts), results are tested on a bigger corpus • Corpus 1: random texts from the Web corpus, covering different styles of text (2.2-million-word corpus) • Corpus 2: literary texts by mostly Croatian authors (658-thousand-word corpus)
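One way such a training set could be collected is by keeping only sentences that contain one of the two comparison markers. The sketch below is an assumption about how this might be done in plain Python, not the authors' actual procedure; the sentence splitter is deliberately naive and the function names are hypothetical.

```python
import re

# Comparison markers named on the slide, matched as whole words.
MARKER_RE = re.compile(r"\b(kao|poput)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def candidate_sentences(text):
    """Keep only sentences that contain 'kao' or 'poput'."""
    return [s for s in split_sentences(text) if MARKER_RE.search(s)]


if __name__ == "__main__":
    sample = "Radi kao pčela. Danas pada kiša. Brz poput strijele!"
    for sentence in candidate_sentences(sample):
        print(sentence)
```

A filter like this only finds candidates: a sentence with „kao” is not necessarily a set comparative phrase, which is why the rule-based grammars are still needed.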

Construction of grammars for processing texts • Grammar – a file constructed in the NooJ environment, made for syntactic processing of texts • Input, output, variables, nested grammars • Concordance with marked texts as an output

Adjective AS Noun • Recognizes: • Lijep kao slika (pretty as a picture) • Pijan kao smuk (drunk as a sponge) • Brz kao zec (fast as a bullet)
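The idea behind the Adjective AS Noun pattern can be illustrated outside NooJ with a toy lexicon-based matcher. This is a simplified stand-in, not the authors' grammar: the mini-lexicon, the handful of inflected forms, and all names are assumptions made for this example, and unlike the NooJ grammars it only handles the fixed order adjective + marker + noun.

```python
import re

# Hypothetical mini-lexicon mapping a few inflected forms to their lemmas.
ADJ_FORMS = {
    "lijep": "lijep", "lijepa": "lijep", "lijepo": "lijep",
    "pijan": "pijan", "pijana": "pijan",
    "brz": "brz", "brza": "brz", "brzo": "brz",
}
NOUN_FORMS = {
    "slika": "slika", "slike": "slika",
    "smuk": "smuk", "smuka": "smuk",
    "zec": "zec", "zeca": "zec",
}

# Set comparisons from the slide, stored as (adjective lemma, noun lemma).
IDIOMS = {("lijep", "slika"), ("pijan", "smuk"), ("brz", "zec")}
MARKERS = {"kao", "poput"}


def find_adj_as_noun(sentence):
    """Return (adjective lemma, marker, noun lemma) for each match."""
    tokens = re.findall(r"\w+", sentence.lower())
    hits = []
    for i in range(len(tokens) - 2):
        adj, marker, noun = tokens[i:i + 3]
        pair = (ADJ_FORMS.get(adj), NOUN_FORMS.get(noun))
        if marker in MARKERS and pair in IDIOMS:
            hits.append((pair[0], marker, pair[1]))
    return hits


if __name__ == "__main__":
    print(find_adj_as_noun("Bila je lijepa kao slika."))
```

Matching on lemmas rather than surface forms is what lets a single dictionary entry cover several genders and cases, which is the main reason a rule-based, morphology-aware tool suits a highly inflected language.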

Noun AS Preposition • Recognizes: • Mrak kao u rogu (pitch dark) • AS a Noun • Recognizes: • Kao drvena Marija (being stiff, unrelaxed) • Poput guske u magli (without thinking)

Verb AS Noun • Recognizes: • Ići kao po loju (go smoothly, slide like over the fat) • Šutjeti kao grob (be silent as a grave) • AS IF Verb • Recognizes: • Kao da je u zemlju propao (as if the Earth swallowed him) • Kao da je pao s Marsa (clueless, as if he came from Mars)

Example of results Comparative structure

Evaluation

Conclusions about comparison in Croatian • The number of comparative structures varies greatly between different types of texts • General texts (web corpus) – 1 per every 10,000 words • Literary texts (books by Croatian authors) – 1 per every 1,000 words • Confirmed the hypothesis that such structures belong mostly to the literary style • 10 times more frequent in books and works of fiction • Rare in other styles of writing due to the stylistic marking they bring to the text
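The comparison above rests on normalizing counts by corpus size. A minimal sketch of that arithmetic follows, using only the rounded rates and corpus sizes stated on the slides; the function names are hypothetical, and the "implied" occurrence counts are back-of-the-envelope figures derived from those rounded rates, not counts reported by the authors.

```python
def per_10k(occurrences, corpus_words):
    """Normalize a raw count to occurrences per 10,000 words."""
    return occurrences / corpus_words * 10_000


def implied_occurrences(rate_per_10k, corpus_words):
    """Rough number of occurrences a given rate implies for a corpus."""
    return rate_per_10k * corpus_words / 10_000


# Corpus sizes and rounded rates as stated on the slides.
WEB_WORDS = 2_200_000        # Corpus 1: web texts
LITERARY_WORDS = 658_000     # Corpus 2: literary texts
WEB_RATE = 1.0               # 1 per 10,000 words
LITERARY_RATE = 10.0         # 1 per 1,000 words = 10 per 10,000 words

if __name__ == "__main__":
    print("Rate ratio (literary / web):", LITERARY_RATE / WEB_RATE)
    print("Implied web occurrences:",
          implied_occurrences(WEB_RATE, WEB_WORDS))
    print("Implied literary occurrences:",
          implied_occurrences(LITERARY_RATE, LITERARY_WORDS))
```

Normalizing per 10,000 words is what makes the two corpora comparable despite the web corpus being more than three times larger.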

Thank you for your attention. Questions?
