
Computational Linguistics


glain


Presentation Transcript


  1. Computational Linguistics: What is it and what (if any) are its unifying themes?

  2. Computational linguistics

  3. I often agree with XKCD…

  4. [Figure: fields placed on a scale from “more rigorous” to “less rigorous” (“more flakey”). Labels: linguistics?, computational linguistics, physics, chemistry, biology, neuropsychology, psychology, literary criticism]

  5. What defines the rigor of a field? • Whether results are reproducible • Whether theories are testable/falsifiable • Whether there is a common set of methods for similar problems • Whether approaches to problems can yield interesting new questions/answers

  6. Linguistics

  7. [Figure: fields placed on a scale from “more rigorous” to “less rigorous”. Labels: engineering, linguistics, sociology, literary criticism]

  8. The true situation with linguistics [Figure: subfields placed on a scale from “more rigorous” to “less rigorous”. Labels: “theoretical” linguistics (e.g. lexical-functional grammar); other areas of sociolinguistics (e.g. Deborah Tannen); some areas of sociolinguistics (e.g. Bill Labov); “theoretical” linguistics (e.g. minimalist syntax); experimental phonetics; historical linguistics; psycholinguistics]

  9. Okay, enough already: what is computational linguistics? • Text normalization/segmentation • Morphological analysis • Automatic word pronunciation prediction • Transliteration • Word-class prediction: e.g. part of speech tagging • Parsing • Semantic role labeling • Machine translation • Dialog systems • Topic detection • Summarization • Text retrieval • Bioinformatics • Language modeling for automatic speech recognition • Computer-aided language learning (CALL)

  10. Computational linguistics • Often thought of as natural language engineering • But there is also a serious scientific component to it.

  11. Why CL may seem ad hoc • Wide variety of areas (as in linguistics) • If it’s natural language engineering, the goal is often just to build something that works • Techniques tend to change in somewhat faddish ways… • For example: machine learning approaches fall in and out of favor

  12. Machine learning in CL • In general it’s a plus since it has meant that evaluation has become more rigorous • But it’s important that the field not turn into applied machine learning • For this to be avoided, people need to continue to focus on what linguistic features are important • Fortunately, this seems to be happening

  13. Some interesting themes… • Finite-state methods: • Many application areas • Raises interesting questions about how much of language is “regular” (in the sense of “finite state”) • Grammar induction: • Linguists have done a poor job at their stated goal of explaining how humans learn grammar • Computational models of language change: • Historical evidence for language change is only partial. There are many changes in language for which we have no direct evidence.

  14. Finite state methods • Used from the 1950’s onwards • Went out of fashion a bit during the 1980’s • Then a revival in the 1990’s with the advent of weighted finite-state methods

  15. Some applications • Analysis of word structure – morphology • Analysis of sentence structure • Part of speech tagging • Parsing • Speech recognition • Text normalization • Computational biology • …

  16. Regular languages • A regular language is a language over a finite alphabet that can be built from finite sets of strings by zero or more applications of the following operations: • Set union • Concatenation • Kleene closure (Kleene star)
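The three operations on slide 16 can be tried out directly. This is a minimal sketch that assumes Python's `re` module as a stand-in for a regular-expression compiler; the languages `A` and `B` and the helper `accepts` are made up for illustration. Python's `re` also offers non-regular features such as backreferences, so only the plain operators are used here.

```python
import re

# Two tiny finite languages, written as regular expressions (illustrative only).
A = "ab"
B = "c"

union = f"(?:{A}|{B})"       # set union: {"ab", "c"}
concat = f"(?:{A})(?:{B})"   # concatenation: {"abc"}
star = f"(?:{A})*"           # Kleene star: {"", "ab", "abab", ...}

def accepts(pattern: str, s: str) -> bool:
    """Return True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None

print(accepts(union, "c"), accepts(concat, "abc"), accepts(star, "abab"))
```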

  17. Finite state automata: formal definition Every regular language can be recognized by a finite-state automaton. Every finite-state automaton recognizes a regular language. (Kleene’s theorem)
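As a rough illustration of the automaton side of Kleene's theorem, the sketch below simulates a deterministic finite-state automaton. The function `dfa_accepts` and the transition table `AB_STAR` are hypothetical names, and the chosen language, (ab)*, is only an example.

```python
def dfa_accepts(transitions, start, finals, s):
    """Run a deterministic automaton over s; a missing transition means rejection."""
    state = start
    for ch in s:
        key = (state, ch)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in finals

# Automaton for (ab)*: state 0 is both the start state and the only final state.
AB_STAR = {(0, "a"): 1, (1, "b"): 0}

print(dfa_accepts(AB_STAR, 0, {0}, "abab"))
print(dfa_accepts(AB_STAR, 0, {0}, "aba"))
```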

  18. Representation of FSA’s: State Diagram

  19. Regular relations: formal definition

  20. Finite-state transducers

  21. An FST

  22. Composition • In addition to union, concatenation and Kleene closure, regular relations are closed under composition • Composition is to be understood here the same way as composition in algebra: • R1○R2 means take the output of R1 and feed it to the input of R2
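Composition can be sketched on small finite relations stored as sets of (input, output) pairs. This is an assumption made for illustration: real finite-state transducers can encode infinite relations, and the relations `R1` and `R2` below are invented rather than taken from the slides.

```python
def compose(r1, r2):
    """Return {(x, z)} such that (x, y) is in r1 and (y, z) is in r2 for some y."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}

# Hypothetical relations: R1 rewrites a->b and c->d; R2 uppercases b, d and e.
R1 = {("a", "b"), ("c", "d")}
R2 = {("b", "B"), ("d", "D"), ("e", "E")}

print(sorted(compose(R1, R2)))
```

The pair ("e", "E") drops out of the result because nothing in R1 produces "e" as output.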

  23. Composition: an illustration

  24. R1 as a transducer

  25. R2 as a transducer

  26. R1○R2

  27. Some things you can do with FSTs • Text analysis/normalization • Word segmentation • Abbreviation expansion • Digit-to-number-name mappings i.e. mapping from writing to language • Morphological analysis • Syntactic analysis • E.g. part-of-speech tagging • (With weights) pronunciation modeling and language modeling for speech recognition

  28. That’s fine for engineering but… • Does it really account for the facts? • Is morphology really regular? • Is the mapping between writing and speech really regular?

  29. What is morphology? • scripsērunt is third person, plural, perfect, active of scrībō (‘I write’) • Morphology relates word forms • the “lemma” of scripsērunt is scrībō • Morphology analyzes the structure of word forms • scripsērunt has the structure scrīb+s+ērunt

  30. Morphology is a relation • Imagine you have a Latin morphological analyzer comprising: • D: a relation that maps between surface form and decomposed form • L: a relation that maps between decomposed form and lemma • Then: • scripsērunt ○ D = scrīb+s+ērunt • scripsērunt ○ D ○ L = scrībō
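The chain scripsērunt ○ D ○ L on slide 30 can be mimicked with toy lookup tables. The dictionaries below cover only the one form given on the slide and stand in for what would be full transducers in a real analyzer; the helper `apply_relations` is a hypothetical name.

```python
# Toy stand-ins for the relations D and L, covering a single word form.
D = {"scripsērunt": "scrīb+s+ērunt"}   # surface form -> decomposed form
L = {"scrīb+s+ērunt": "scrībō"}        # decomposed form -> lemma

def apply_relations(form, *relations):
    """Feed form through each relation in turn, left to right."""
    for relation in relations:
        form = relation[form]
    return form

print(apply_relations("scripsērunt", D))
print(apply_relations("scripsērunt", D, L))
```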

  31. English regular plurals • cat + s = cats /s/ • dog + s = dogs /z/ • spouse + s = spouses /əz/ • This can be implemented by a rule that composes with the base word, inserting the relevant form of the affix at the end
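The choice among /s/, /z/ and /əz/ depends on the final sound of the base. The sketch below is one way to write that rule as a plain function rather than a transducer; the phoneme inventories and the list-of-phonemes input format are simplifying assumptions.

```python
# Simplified sound classes (assumed for illustration, not exhaustive).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_suffix(phonemes):
    """Pick the plural allomorph from the last phoneme of the base."""
    last = phonemes[-1]
    if last in SIBILANTS:
        return "əz"
    if last in VOICELESS:
        return "s"
    return "z"

print(plural_suffix(["k", "æ", "t"]))        # cat
print(plural_suffix(["d", "ɔ", "g"]))        # dog
print(plural_suffix(["s", "p", "aʊ", "s"]))  # spouse
```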

  32. Templatic affixes in Yowlumne Transducer for each affix transforms base into required templatic form and appends the relevant string.

  33. Subtractive morphology Transducer deletes final VC of the base…

  34. Bontoc infixation • Insert a marker “>” after the first consonant (if any) • Change “>” into the infix –um-
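The two steps on slide 34 can be sketched with two string rewrites applied in sequence, in the spirit of composing two transducers. The sketch reads “first consonant (if any)” as the word-initial consonant, treats a, e, i, o, u as the only vowels, and uses the textbook form fikas → fumikas; all three are assumptions not stated on the slide.

```python
import re

def mark(word):
    """Step 1: insert ">" after the word-initial consonant, or at the start if there is none."""
    return re.sub(r"^([^aeiou]?)", r"\1>", word, count=1)

def realize(marked):
    """Step 2: rewrite the marker ">" as the infix -um-."""
    return marked.replace(">", "um")

def infix_um(word):
    """Apply step 1, then step 2."""
    return realize(mark(word))

print(mark("fikas"))
print(infix_um("fikas"))
```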

  35. Side note … infixation in English • Kalama-f*****g-zoo

  36. Reduplication: Gothic Problem: mapping w to ww is not a regular relation

  37. Factoring Reduplication • Prosodic constraints • Copy verification transducer C

  38. Non-Exact Copies • Dakota (Inkelas & Zoll, 1999):

  39. Non-Exact Copies • Basic and modified stems in Sye (Inkelas & Zoll, 1999): “they will fall all over”

  40. Morphological Doubling Theory (Inkelas & Zoll, 1999) • Most linguistic accounts of reduplication assume that the copying is done as part of morphology • In MDT: • Reduplication involves doubling at the morphosyntactic level – i.e. one is actually simply repeating words or morphemes • Phonological doubling is thus expected, but not required

  41. Gothic Reduplication under Morphological Doubling Theory

  42. Summary • If Inkelas & Zoll are right then all morphology can be computed using regular relations • This in turn suggests that computational morphology has picked the right tool for the job

  43. Another Example: Linguistic analysis of text • Maps from the stuff you see on the page – e.g. text written in the standard orthography of a language – into linguistic units (words, morphemes, phonemes…) • For example: • I ate a 25kg bass • [aɪ ɛɪt ə twɛnti faɪv kɪləɡræm bæs] • This can be done using transducers • But is the mapping between writing and language really regular (finite-state)?

  44. Linguistic analysis of text • Abbreviation expansion • Disambiguation • Number expansion • Morphological analysis of words • Word pronunciation • …

  45. A transducer for number names Consider a machine that maps between digit strings and their reading as number names in English. 30,294,005,179,018,903.56 → thirty quadrillion, two hundred and ninety four trillion, five billion, one hundred seventy nine million, eighteen thousand, nine hundred three, point five six
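To make the digit-to-number-name mapping concrete, here is a procedural sketch rather than an actual transducer. It handles values up to the quadrillions, reads fractional digits one at a time, and omits “and” throughout, so its wording differs slightly from the reading on the slide. All function and table names are hypothetical.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]

def below_thousand(n):
    """Name an integer in the range 1..999."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return " ".join(words)

def number_name(digits):
    """Map a digit string such as "1,234.5" to an English number name."""
    integer, _, fraction = digits.replace(",", "").partition(".")
    n = int(integer)
    groups = []
    scale = 0
    while n > 0:
        n, chunk = divmod(n, 1000)   # peel off three digits at a time
        if chunk:
            groups.append((below_thousand(chunk) + " " + SCALES[scale]).strip())
        scale += 1
    name = ", ".join(reversed(groups)) if groups else "zero"
    if fraction:
        name += " point " + " ".join(ONES[int(d)] for d in fraction)
    return name

print(number_name("30,294,005,179,018,903.56"))
```

Numbers above the quadrillions would need more entries in `SCALES`; as written, the sketch raises an `IndexError` for them.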

  46. Mapping between speech and writing It seems obvious on the face of it that the mapping between speech and its written form is regular. After all, the words are ordered in the same way as in speech. Even the letters tend to be ordered in the same way as the sounds they represent.
