LANGUAGE TECHNOLOGIES: Linguistics in the Computer World

LANGUAGE TECHNOLOGIES: Linguistics in the Computer World. Darinka Verdonik. Digital/computer age. Living in a digital, computer age: we are surrounded by digital machines: PCs, audio/video devices (MP3, DVD, CD), e-banking, domestic appliances (microwave oven, washing machine)...


Presentation Transcript


  1. LANGUAGE TECHNOLOGIES: Linguistics in the Computer World. Darinka Verdonik

  2. Digital/computer age • Living in a digital, computer age: • we are surrounded by digital machines: PCs, audio/video devices (MP3, DVD, CD), e-banking, domestic appliances (microwave oven, washing machine)... • How can a linguist use a PC? • searching for information and knowledge on the Internet • buying books, making reservations (e.g. at a library)... • contacts: e-mail, mailing lists, messenger, forums, chat rooms... • tools: writing and designing texts, preparing presentations and posters; e-dictionary, corpus, spell-checker, grammar checker, automatic summarization...

  3. Products • Speech synthesiser: Plattos • pronounces written text with a human-like voice

  4. Products • Speech recogniser: Broadcast News subtitle • transcribes the speech in a TV Broadcast News show as written text

  5. Products • Dialogue system: Auto Attendant • human-machine communication in an automatic telephone responder (dialogue translated from Slovenian) System: Welcome to the FERI portal. Choose the directory or a department. Caller: Directory. System: You have chosen the directory. Say the first and last name of the person. Caller: Darinka Verdonik. System: You have chosen Darinka Verdonik. Please wait a moment.

  6. Products • Dialogue system: Klepec

  7. Products • Machine translation: • translates text-to-text (just translation): Presis • translates speech-to-speech (recognition – translation – synthesis): Babilon, VoiceTran

  8. Products

  9. Products

  10. Products • National corpus: Fida/FidaPlus (www.fidaplus.net), Nova beseda (bos.zrc-sazu.si), BNC (http://www.natcorp.ox.ac.uk/)...

  11. Products • National corpus: FidaPlus (www.fidaplus.net)

  12. Products • Parallel corpus: Evrokorpus (http://www.gov.si/evrokor/)

  13. Products • Parallel corpus: Evrokorpus (http://www.gov.si/evrokor/)

  14. Algorithms • The heart of the technologies: programming, modeling, coding...

  15. Algorithms Examples: • Speech synthesis: • grapheme-to-phoneme conversion • modeling prosodic features • search algorithm(s) • Speech recognition: • acoustic modeling – calculating probabilities of phonemes (triphones) • language modeling – calculating probabilities of word sequences
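The language-modeling step named above can be illustrated with a toy bigram model. This is only a minimal sketch, not the method used in the systems from the slides: the function names, the toy sentences and the unsmoothed maximum-likelihood estimate are all assumptions chosen for brevity.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count histories and word pairs over whitespace-tokenised sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def bigram_probability(unigrams, bigrams, previous, word):
    """Maximum-likelihood estimate of P(word | previous); 0.0 for unseen histories."""
    if unigrams[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / unigrams[previous]

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "dober dan",
    "dober dan hotel habakuk",
    "dober večer",
]
unigrams, bigrams = train_bigram_model(toy_corpus)
p_dan = bigram_probability(unigrams, bigrams, "dober", "dan")
p_vecer = bigram_probability(unigrams, bigrams, "dober", "večer")
print(f"P(dan | dober) = {p_dan:.2f}, P(večer | dober) = {p_vecer:.2f}")
```

A recogniser would use such probabilities to prefer the more likely of two acoustically similar word sequences. Real systems train on far larger corpora and apply smoothing so that unseen word pairs do not get probability zero.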

  16. Language resources – spoken • Databases of spoken language: • define the type and number of texts to include: read phrases and/or sentences, speech in media (TV, radio), conversational speech... • recording • defining contextual tags: speakers (gender, dialect...), acoustic environment (channel, background, noises...), non-speech sounds (breathing, laughing...)... • defining linguistic tags: phonetic/orthographic transcription, lemma, POS and other morpho-syntactic tags... • segmentation, transcription (phonetic or orthographic) • annotation • coding (computer-readable form, e.g. XML) • optional: developing a user interface for searching through the database

  17. Language resources – spoken

  18. Language resources – spoken
      <?xml version="1.0" encoding="ISO-8859-2"?>
      <!DOCTYPE Trans SYSTEM "trans-13.dtd">
      <Trans scribe="Darinka" audio_filename="HOha50" version="25" version_date="051201">
        <Topics>
          <Topic id="to1" desc="jedro"/>
          <Topic id="to2" desc="uvod"/>
          <Topic id="to3" desc="zakljucek"/>
        </Topics>
        <Speakers>
          <Speaker id="spk1" name="Habakuk_receptor1" check="yes" type="female" dialect="native" accent="p-mariborsko" scope="local"/>
          <Speaker id="spk2" name="klicatelj39" check="yes" type="female" dialect="native" accent="p-celjsko" scope="local"/>
        </Speakers>
        <Episode>
          <Section type="report" topic="to2" startTime="0" endTime="6.26">
            <Turn startTime="0" endTime="2.246" speaker="spk1" mode="spontaneous" fidelity="medium" channel="telephone">
              <Sync time="0"/> dobro hotel Habakuk [ime] pri telefonu
            </Turn>
            <Turn speaker="spk2" mode="spontaneous" fidelity="medium" channel="telephone" startTime="2.246" endTime="5.424">
              <Sync time="2.246"/> ja <Event desc="marker" type="lexical" extent="previous"/> dober dan želim [priimek] [ime] je moje ime
            </Turn>
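A transcription coded this way can be read back with a standard XML parser. The sketch below is an assumption-laden illustration: it uses Python's `xml.etree.ElementTree`, a shortened and properly closed copy of the sample (the slide's excerpt is cut off before its closing tags), and a hypothetical helper `extract_turns` that returns speaker name, turn duration and plain text.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the slide's sample; attributes trimmed for brevity.
SAMPLE = """<Trans scribe="Darinka" audio_filename="HOha50">
  <Speakers>
    <Speaker id="spk1" name="Habakuk_receptor1" type="female"/>
    <Speaker id="spk2" name="klicatelj39" type="female"/>
  </Speakers>
  <Episode>
    <Section type="report" topic="to2" startTime="0" endTime="6.26">
      <Turn startTime="0" endTime="2.246" speaker="spk1">
        <Sync time="0"/> dobro hotel Habakuk [ime] pri telefonu
      </Turn>
      <Turn startTime="2.246" endTime="5.424" speaker="spk2">
        <Sync time="2.246"/> ja <Event desc="marker" type="lexical" extent="previous"/> dober dan
      </Turn>
    </Section>
  </Episode>
</Trans>"""

def extract_turns(xml_text):
    """Return (speaker name, duration in seconds, text) for every Turn element."""
    root = ET.fromstring(xml_text)
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    turns = []
    for turn in root.iter("Turn"):
        # itertext() also yields the text that follows <Sync/> and <Event/> tags.
        text = " ".join("".join(turn.itertext()).split())
        duration = float(turn.get("endTime")) - float(turn.get("startTime"))
        turns.append((names.get(turn.get("speaker")), round(duration, 3), text))
    return turns

turns = extract_turns(SAMPLE)
for speaker, duration, text in turns:
    print(f"{speaker} ({duration} s): {text}")
```

Keeping speaker metadata in a separate `<Speakers>` block and referring to it by `id` from each turn is what makes searches such as "all turns by female callers" straightforward.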

  19. Language resources – spoken

  20. Language resources – spoken

  21. Spoken corpora of the Slovenian language • BNSI Broadcast News (36 hours) • Slovenian Broadcast News Database (30 hours = 255,000 words) • Korpus govorjene slovenščine (90 min. = 15,000 words) – pilot corpus • Turdis (100 min. = 15,000 words) • ...

  22. Language resources – written • Corpora – huge e-collections of different texts (books, journals...): • define the type and number of texts to include: national/reference corpora, domain specific corpora, parallel corpora... • defining contextual tags: source, year of publication, language... • defining linguistic tags: lemma, POS, morpho-syntactic tags, phonetic transcription... • annotating • coding • optional – user interface for searching through the corpus

  23. Language resources – written
      <text lang="en-sl" id="orwl.T">
      <body>
      <tu lang="en-sl" id="orwl.1">
        <seg lang="en">
          <s id="Oen.1.1.1.1"><w>It</w> <w>was</w> <w>a</w> <w>bright</w> <w>cold</w> <w>day</w> <w>in</w> <w>April</w><c>,</c>
            <w>and</w> <w>the</w> <w>clocks</w> <w>were</w> <w>striking</w> <w>thirteen</w><c>.</c></s>
        </seg>
        <seg lang="sl">
          <s id="Osl.1.2.2.1"><w lemma="biti" function="Vcps-sma">Bil</w> <w lemma="biti" function="Vcip3s--n">je</w>
            <w lemma="jasen" function="Afpmsnn">jasen</w><c>,</c> <w lemma="mrzel" function="Afpmsnn">mrzel</w>
            <w lemma="aprilski" function="Aopmsn">aprilski</w> <w lemma="dan" function="Ncmsn">dan</w>
            <w lemma="in" function="Ccs">in</w> <w lemma="ura" function="Ncfpn">ure</w>
            <w lemma="biti" function="Vcip3p--n">so</w> <w lemma="biti" function="Vmps-pfa">bile</w>
            <w lemma="trinajst" function="Mcnpnl">trinajst</w><c>.</c></s>
        </seg>
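To show what the lemma and morpho-syntactic tags in such a parallel corpus make possible, here is a minimal sketch that reads one translation unit and lists word form, lemma and tag per language. It assumes Python's `xml.etree.ElementTree`, a shortened and closed copy of the sample, and a hypothetical helper name `read_translation_unit`.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt of one aligned translation unit from the slide.
SAMPLE = """<tu lang="en-sl" id="orwl.1">
  <seg lang="en"><s id="Oen.1.1.1.1"><w>It</w> <w>was</w> <w>a</w> <w>bright</w> <w>cold</w> <w>day</w></s></seg>
  <seg lang="sl"><s id="Osl.1.2.2.1"><w lemma="biti" function="Vcps-sma">Bil</w> <w lemma="biti" function="Vcip3s--n">je</w> <w lemma="jasen" function="Afpmsnn">jasen</w><c>,</c> <w lemma="mrzel" function="Afpmsnn">mrzel</w> <w lemma="dan" function="Ncmsn">dan</w></s></seg>
</tu>"""

def read_translation_unit(xml_text):
    """Map each segment language to a list of (word form, lemma, tag) tuples."""
    tu = ET.fromstring(xml_text)
    result = {}
    for seg in tu.iter("seg"):
        # Punctuation lives in <c> elements and is skipped here on purpose.
        result[seg.get("lang")] = [
            (w.text, w.get("lemma"), w.get("function")) for w in seg.iter("w")
        ]
    return result

unit = read_translation_unit(SAMPLE)
slovenian_lemmas = [lemma for _, lemma, _ in unit["sl"]]
verbs = [form for form, _, tag in unit["sl"] if tag.startswith("V")]
print(slovenian_lemmas)
print(verbs)
```

Because the Slovenian words carry a lemma, a search for `biti` finds both "Bil" and "je", which a plain string search for the word form would miss.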

  24. Language resources – written

  25. Language resources – written • Lexica – e-collections of words, usually with linguistic information added: • selecting word entries and preparing a word list • defining types of information included: lemma, POS, morpho-syntactic tags, phonetic transcription, semantic nets/word nets... • annotating • coding • optional – user interface for searching through the lexicon

  26. Language resources – written

  27. Language resources – written
      <ENTRYGROUP orthography="Abitanti">
        <ENTRY>
          <NOM class="CIT" />
          <LEMMA>Abitanti</LEMMA>
          <PHONETIC>a - b i - " t a: n - t i</PHONETIC>
        </ENTRY>
        <ENTRY>
          <NOM class="STR" />
          <LEMMA>Abitanti</LEMMA>
          <PHONETIC>a - b i - " t a: n - t i</PHONETIC>
        </ENTRY>
      </ENTRYGROUP>
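A lexicon in this form can be loaded into a lookup table keyed by orthography. The following is a sketch only: the `<LEXICON>` wrapper element and the function name `load_lexicon` are hypothetical, added so that the sample is a complete document; the entry structure is copied from the slide.

```python
import xml.etree.ElementTree as ET

# <LEXICON> is a hypothetical root element; the entry group is taken from the slide.
SAMPLE = '''<LEXICON>
  <ENTRYGROUP orthography="Abitanti">
    <ENTRY>
      <NOM class="CIT" />
      <LEMMA>Abitanti</LEMMA>
      <PHONETIC>a - b i - " t a: n - t i</PHONETIC>
    </ENTRY>
    <ENTRY>
      <NOM class="STR" />
      <LEMMA>Abitanti</LEMMA>
      <PHONETIC>a - b i - " t a: n - t i</PHONETIC>
    </ENTRY>
  </ENTRYGROUP>
</LEXICON>'''

def load_lexicon(xml_text):
    """Build a dict: orthography -> list of entries with class, lemma and phonetics."""
    root = ET.fromstring(xml_text)
    lexicon = {}
    for group in root.iter("ENTRYGROUP"):
        entries = []
        for entry in group.iter("ENTRY"):
            nom = entry.find("NOM")
            entries.append({
                "class": nom.get("class") if nom is not None else None,
                "lemma": entry.findtext("LEMMA"),
                "phonetic": entry.findtext("PHONETIC"),
            })
        lexicon[group.get("orthography")] = entries
    return lexicon

lexicon = load_lexicon(SAMPLE)
for entry in lexicon["Abitanti"]:
    print(entry["class"], entry["phonetic"])
```

A speech synthesiser could consult such a table before falling back to rule-based grapheme-to-phoneme conversion for words that are not listed.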

  28. Language resources – written

  29. Corpus linguistics Uses corpora for its research. Advantages: • Analysis of real texts that were actually written/spoken. • Ability to handle huge amounts of data – automatic searching, counting, sorting... • Statistical reliability – e.g. results of analysis can be expressed in %.

  30. Corpus linguistics Includes: • Building corpora: • what types and what amount of texts to include • what linguistic information to include • Developing tools for automatic search, sorting and counting. • Corpus analysis.

  31. Corpus linguistics Example of corpus analysis (Gorjanc, V., 2005. Uvod v korpusno jezikoslovje. Domžale, Izolit.)

  32. Corpus linguistics • Usability of a corpus in everyday work – similar to a dictionary, with the advantage of being up to date: • when writing or correcting texts, we can search for a word/phrase and see: • how often it is usually used • in what type of texts it is used • how it is usually used • what meaning it has in a context • what the most common collocations are • what the most common translations are • etc.
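The frequency, usage and collocation questions listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the interface of FidaPlus or any other corpus tool: the sample text is invented, tokenisation is a simple regular expression, and collocates are just raw counts within a two-word window.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def concordance(tokens, keyword, window=2):
    """Keyword-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

def collocates(tokens, keyword, window=2):
    """Count the words that occur within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Hypothetical mini-corpus, invented for illustration only.
text = ("Strong tea is served in the morning. She likes strong tea. "
        "A strong wind blew, but the tea stayed warm.")
tokens = tokenize(text)
frequency = tokens.count("tea")
share = 100 * frequency / len(tokens)
print(f"'tea': {frequency} hits, {share:.0f}% of {len(tokens)} tokens")
for left, keyword, right in concordance(tokens, "tea"):
    print(f"{left:>15} [{keyword}] {right}")
print(collocates(tokens, "tea").most_common(3))
```

Real corpus tools normally rank collocates with an association measure rather than raw counts, since frequent function words would otherwise dominate the list.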

  33. Conclusions • Linguistics in a computer world: • co-operates in the process of technological development, the results of which (if successful) will affect our everyday future (machine-mediated communication, human-machine communication, helping people with disabilities) • uses the products of technological development to achieve higher reliability of research and to develop new research methods and new linguistic tools

  34. Thank you for your attention. Questions? Slides available on: http://www.elektronika.uni-mb.si/Elektronika/Slo/staff/Staff_slo.php
