
Language and Speech Technology: Introduction

Language and Speech Technology: Introduction. Jan Odijk January 2011 LOT Winter School 2011. Overview. What is language and speech technology (LST)? (3-7) Major Subfields of LST (8-25) Characterization of the last 30 years (26-27) 80s (28-36), 90s (37-49), 00s (50-56)



Presentation Transcript


  1. Language and Speech Technology: Introduction Jan Odijk January 2011 LOT Winter School 2011

  2. Overview • What is language and speech technology (LST)? (3-7) • Major Subfields of LST (8-25) • Characterization of the last 30 years (26-27) • 80s (28-36), 90s (37-49), 00s (50-56) • Current Status (57-69) • CLARIN infrastructure (70-75) • This week’s programme (76)

  3. Language Technology • Language Technology is the study of computational systems that process natural language • Alternative names: • Human Language Technology (HLT) • Natural Language Processing (NLP)

  4. Speech Technology • Speech Technology is the study of computational systems that process speech • It is a part of Language Technology • Often, however, the term “Language Technology” is reserved for the study of computational systems that process written language

  5. Computational Linguistics • Computational Linguistics (CL) is the study of language from a computational perspective • Often used interchangeably with language technology • Often grouped under Artificial Intelligence (AI) , although CL predates AI • AI: the study and design of intelligent systems

  6. Computational Systems • Computational systems to process natural language do not exist naturally (except in the human brain) • They must be designed, implemented, and evaluated • Therefore it is a kind of engineering

  7. Computational Systems • LST is NOT • the study of processing of natural language by humans in • cognition, • (cognitive) psychology, • (psycho)linguistics • phonetics

  8. Language Technology Subfields • Orthographic processing • Text = sequence of characters • Tokenization • Text => sequence of tokens • Token = occurrence of a word form • Relatively simple for languages that use spaces and punctuation (dot, comma, etc.) to separate tokens • More difficult for languages such as Chinese and Thai, which do not mark word boundaries with spaces
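
The Text => sequence of tokens step can be sketched for a space-delimited language as follows. This is a minimal illustration, not a production tokenizer: the regular expression and the function name `tokenize` are assumptions made for the example, and the approach does not carry over to languages such as Chinese or Thai.

```python
import re

# Assumed pattern: a token is either a run of word characters
# or a single non-space, non-word symbol (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split a text (sequence of characters) into a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The man bought a present, for his wife."))
```

Abbreviations, numbers with decimal points, and clitics would need extra rules that this sketch leaves out.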

  9. Language Technology Subfields • Orthographic processing • Orthographic normalization • Token => (token, normalized token) • Normalized token = canonical orthographic representation for a set of orthographic variants • Examples: • Contemporary spelling variants: aktie => actie • Older spelling variants: vleesch => vlees • Typos: actei => actie • OCR errors: raarn => raam
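
The Token => (token, normalized token) mapping could be sketched as a simple lookup. The table below contains only the four examples from the slide; the names `VARIANTS` and `normalize` are hypothetical, and a real system would need a far larger lexicon or a learned model.

```python
# Toy variant table built from the slide's examples.
VARIANTS = {
    "aktie": "actie",    # contemporary spelling variant
    "vleesch": "vlees",  # older spelling variant
    "actei": "actie",    # typo
    "raarn": "raam",     # OCR error
}


def normalize(token: str) -> tuple[str, str]:
    """Return (token, normalized token); unknown tokens map to themselves."""
    return token, VARIANTS.get(token.lower(), token)


for tok in ["aktie", "vleesch", "actei", "raarn", "raam"]:
    print(normalize(tok))
```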

  10. Language Technology Subfields • Morphological processing • Lemmatization: token => (token, lemma) • Lemma = canonical orthographic representation for an inflectional paradigm • Often ambiguous • Examples • lemma(walked) = walk; lemma(men) = man • lemma(graven) = {graf, graaf, graven} (Dutch)

  11. Language Technology Subfields • Morphological processing • Inflection analysis/generation • Word form => (lemma, inflectional features) • Examples: • graven => (graf, PoS=Noun, number=plural) • graven => (graaf, PoS=Noun, number=plural) • graven => (graven, PoS=Verb, form=infinitive) • graven => (graven, PoS=Verb, form=indicative, tense=present, number=plural)
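
The ambiguity of inflection analysis can be made concrete with a small table-driven sketch. The entries for "graven" are taken from the slide; the entry for "men", the `Analysis` type, and the function names are assumptions added for illustration.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    features: dict


# Toy analysis table; one word form may have several analyses.
ANALYSES = {
    "graven": [
        Analysis("graf", "Noun", {"number": "plural"}),
        Analysis("graaf", "Noun", {"number": "plural"}),
        Analysis("graven", "Verb", {"form": "infinitive"}),
        Analysis("graven", "Verb",
                 {"form": "indicative", "tense": "present", "number": "plural"}),
    ],
    "men": [Analysis("man", "Noun", {"number": "plural"})],
}


def analyze(word_form: str) -> list[Analysis]:
    """Return every (lemma, PoS, features) analysis known for a word form."""
    return ANALYSES.get(word_form.lower(), [])


def lemmas(word_form: str) -> set[str]:
    """Lemmatization as a by-product: the set of lemmas over all analyses."""
    return {a.lemma for a in analyze(word_form)}


print(lemmas("graven"))
```

Choosing among the analyses requires context, which is the task of tagging.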

  12. Language Technology Subfields • Morphological processing • Compound processing • word form => ((word form, affix?)+, word form) • lemma => ((word form, affix?)+, lemma) • Examples: • vleeskoeienhouders => ([vlees, koeien], houders) ‘meat cow farmers’ • gebiedsbepaling => ([(gebied, s)], bepaling)
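
One possible way to compute such a segmentation is a recursive lexicon lookup that allows an optional linking element after each modifier. This is a sketch under strong assumptions: the five-word `LEXICON`, the single linker "s", and the function name `split_compound` are all invented for the example, and the first segmentation found is returned without any ranking.

```python
# Toy lexicon and linking elements (assumptions for this example only).
LEXICON = {"vlees", "koeien", "houders", "gebied", "bepaling"}
LINKERS = ("", "s")  # "" means no linking element


def split_compound(word: str):
    """Return (modifiers, head) for the first segmentation found, else None.

    modifiers is a list of (word form, linking element) pairs.
    """
    word = word.lower()
    for i in range(1, len(word)):
        part = word[:i]
        if part not in LEXICON:
            continue
        for linker in LINKERS:
            rest = word[i:]
            if not rest.startswith(linker):
                continue
            rest = rest[len(linker):]
            if rest in LEXICON:
                return [(part, linker)], rest
            deeper = split_compound(rest)
            if deeper is not None:
                modifiers, head = deeper
                return [(part, linker)] + modifiers, head
    return None


print(split_compound("gebiedsbepaling"))
print(split_compound("Vleeskoeienhouders"))
```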

  13. Language Technology Subfields • Morphological processing • Derivational morphology processing • word form => (prefix*, lemma, suffix*) • Example: • characterization => ([], characterize, [ation])

  14. Language Technology Subfields • (PoS-)tagging • Assignment of a grammatical tag to a token in context (tag=label for grammatical properties) • Token => (token, tag) in context • Usually assignment of PoS-tags • Often more detailed grammatical (inflectional) tags

  15. Language Technology Subfields • (PoS-)tagging • The context usually consists of: • Some preceding words and/or tags • Some following words • Examples: • (graven, Zij __ een graf) => Vindprespl • (graven, De __ zijn boos) => Npl

  16. Language Technology Subfields • Chunking • identifying major phrases in a sentence • Example • The man bought a present for his wife => • [NP The man] bought [NP a present] [PP for his wife]
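
Chunking is often done on PoS-tagged input. The following rule-based sketch reproduces the example above; the tag names (DET, POSS, ADJ, NOUN, PREP, VERB), the two chunk patterns, and the function name `chunk` are assumptions for illustration, not a description of any particular chunker.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP and PP chunks.

    Assumed patterns:
      NP = (DET | POSS)? ADJ* NOUN+
      PP = PREP NP
    Words outside any chunk are returned with the label None.
    """
    chunks, i = [], 0
    while i < len(tagged):
        is_pp = tagged[i][1] == "PREP"
        j = i + 1 if is_pp else i
        if j < len(tagged) and tagged[j][1] in ("DET", "POSS"):
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was found, so a chunk is complete
            words = " ".join(word for word, _ in tagged[i:k])
            chunks.append(("PP" if is_pp else "NP", words))
            i = k
        else:
            chunks.append((None, tagged[i][0]))
            i += 1
    return chunks


sentence = [("The", "DET"), ("man", "NOUN"), ("bought", "VERB"),
            ("a", "DET"), ("present", "NOUN"),
            ("for", "PREP"), ("his", "POSS"), ("wife", "NOUN")]
print(chunk(sentence))
```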

  17. Language Technology Subfields • Parsing • Assign a syntactic structure to a sentence • Example: The man bought a present for his wife => [S [subj/NP The man] [pred/VP bought [obj/NP a present] [pobj/PP for [obj/NP his wife]] ] ]

  18. Language Technology Subfields • Machine Translation • Automatic translation of an input text • Example • The man bought a present for his wife => • L’homme a acheté un cadeau pour sa femme

  19. Language Technology Subfields • Content extraction and processing • Named entity recognition • Question-answering • Information retrieval • Information extraction • Sentiment/ opinion mining • Reasoning/Inference on semantic representation • …

  20. Speech Technology Subfields • Speech Synthesis • Artificial production of human speech • Text => speech • Often called Text-To-Speech (TTS) • TTS system usually contains two components • Grapheme to Phoneme (G2P) component • Text => symbolic speech representation (phonetic representation) • Speech Synthesis component • Symbolic speech representation => speech

  21. Speech Technology Subfields • Speech Synthesis (cont.) • Term Speech Synthesis often reserved for this second component • Meaning => speech • Usually called Speech Generation, or Concept-To-Speech, or Data-to-Speech

  22. Speech Technology Subfields • Speech Recognition • Recognition of human speech • Audio containing speech => text • Often called automatic speech recognition (ASR) • Speech Understanding • Understanding of human speech • Audio containing speech => meaning or action

  23. Speech Technology Subfields • Speaker Recognition • Recognition of a speaker given a speech signal • Speech => person identity • Speaker Verification • Verification of the identity of a person • Speech + claimed identity => Boolean

  24. Speech Technology Subfields • Speech Compression • Reduction of the size of speech representations (speech encoding), or • Time-compression of speech representations (so that they sound faster to the listener)

  25. Related fields • Speech often used in dialogues • Study of spoken dialogues (human-human, human-machine) • Speech often combined with other modalities • Study of Multimodal Interaction • Speech as part of a man-machine interface • Study of Human-Machine Interaction

  26. Introduction • Three decades: • “80s”= 1980-1994 • “90s”= 1990-2005 • “00s” = 2000-2011

  27. Overview • 80s: Language Technology • 80s: Speech Technology • 90s Language and Speech Technology • 90s Commercial Activity • 90s Importance of Data • 00s Language and Speech Technology

  28. 80s: Language Technology • Focus on MT (in Europe) • Eurotra (Europe) • Rosetta (Philips, Netherlands) • Distributed Translation (BSO, Netherlands)

  29. 80s: Language Technology • Linguistic “Research Approach” • Focus on Research • not/less on Technology Development • Knowledge-based approach • hand-crafted lexicons and rules • based on a theory / grammatical formalism • Focus on linguistically interesting complex phenomena • less on phenomena that occur often • not strongly data-driven

  30. 80s: Language Technology • Focus on an idealized language • not on actual language use • no focus on robustness • Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms • no focus on developing a working system • no pragmatic solutions

  31. 80s: Language Technology • Little formal (quantitative) evaluation • only with test suites • constructed sentences illustrating linguistic phenomena • E.g. the HP Test Suite (Flickinger et al. 1987) • computational linguistics rather than language technology

  32. 80s: Language Technology Major Problems (from a technology point of view): • Ambiguity • Real • Temporary • Computational Complexity • computation-intensive grammar formalisms • Complexity of language • handcrafting lexicons and rules • requires linguistic and computational expertise • requires a lot of effort and time

  33. 80s: Language Technology • Major problems (cont.): • Idealized Language v. actual Language Use • Require large and rich lexicons, suited to the application domain: difficult/ large effort to make them, and to tune (adapt) to specific domains

  34. 80s: Speech Technology • Automatic Speech Recognition (ASR) • Statistical “Engineering Approach” • approach based on Noisy Channel Model • derive acoustic models from a lot of annotated speech examples • derive statistical language models from large text corpora (n-gram probabilities)
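
Deriving n-gram probabilities from a text corpus can be sketched for the bigram case with plain maximum-likelihood counts. The two-sentence corpus, the boundary markers `<s>` and `</s>`, and the function name `train_bigram_model` are assumptions for the example; practical language models add smoothing for unseen word pairs, which is omitted here.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Return prob(word, prev) = count(prev word) / count(prev)."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent (prev, word) pairs

    def prob(word, prev):
        if contexts[prev] == 0:
            return 0.0  # unseen context; a real model would smooth instead
        return bigrams[(prev, word)] / contexts[prev]

    return prob


prob = train_bigram_model(["the man bought a present", "the man left"])
print(prob("man", "the"))     # "the" is always followed by "man" in this corpus
print(prob("bought", "man"))  # "man" is followed by "bought" in one of two cases
```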

  35. 80s: Speech Technology • Focus on making (small) working systems • Statistical approach: system uses probabilities derived from data • Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks

  36. 80s: Speech Technology • Focus on real language use under realistic conditions • Progress made by making concrete systems and evaluating them rigorously

  37. 90s: Language Technology • Statistical MT • derive language models from monolingual corpora (probabilities of words and word sequences) • align “sentences” with their translations • derive translation model from parallel corpora: • estimate translation probabilities for words and word sequences from the aligned “sentences” • use these probabilities to compute translations for new “sentences”
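
In the usual noisy-channel formulation of statistical MT, the language model and the translation model combine as shown below. The symbols are assumptions introduced here, not taken from the slides: f is the source "sentence", e a candidate translation, P(e) the language model, and P(f | e) the translation model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

The denominator P(f) can be dropped because it does not depend on e. The same decomposition underlies the noisy channel model in speech recognition, with the acoustic signal in place of f and the word sequence in place of e.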

  38. 90s: Language Technology • Ambiguity: resolved by probabilities based on statistics • Computational Complexity • computationally feasible formalisms • proven in speech recognition • Complexity of language • language and translation model automatically derived from data • Strong focus on actual language use • Highly data driven • Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains easy once the data are available

  39. 90s: Language Technology • Rise of Internet • increasing need for information retrieval • approximated by search for word and word sequence strings • Information Retrieval • strongly statistically based • Limited linguistics • formal evaluation (recall, precision, F-score)
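
The evaluation measures named above can be computed as follows for a single query. This is a minimal sketch: the function name and the document identifiers are invented for the example, and the F-score shown is the balanced variant (F1), the harmonic mean of precision and recall.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for one query.

    retrieved: documents returned by the system
    relevant:  documents judged relevant (the gold standard)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d2"])
print(p, r, f)
```

In this example two of the four retrieved documents are relevant (precision 0.5) and both relevant documents are found (recall 1.0).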

  40. 90s: Language Technology • Resulted in • strongly data-driven approach in language technology • increasing use of machine learning techniques • explicit focus on formal, esp. quantitative evaluation • re-examination of simpler/computationally less intensive formalisms (finite-state) for syntax

  41. 90s: Speech Technology • Continued working under the established paradigm • increasingly improving performance and extending environments and application areas

  42. 90s: Companies • many companies active in Speech technology • IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel,.. • Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...

  43. 90s: Companies • many companies in Language technology • IBM, Microsoft, INSO, Novell, ... • GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...

  44. 90s: Companies • MT systems: • knowledge-based systems, • developed under an engineering approach • simple grammatical formalism, or pruning of the search space • to reduce ambiguity • to reduce computational resource requirements • to reduce hand-crafting of rules

  45. 90s: Companies • resulted in low quality MT systems • still useful in many circumstances • Differentiating factors • rapid adaptation to (multi-word) terms / vocabulary of new domain • good performance on named entity recognition

  46. 90s: Data • Knowledge-based NLP: researchers realized that cooperation on lexicons was required • ASR methodology requires a lot of data: • “There is no data like more data” • This led to • Data creation projects • Set-up of data distribution centers • Projects for developing standards for data

  47. 90s: Data • Projects • Lexicon projects • Multilex, • Genelex • Acquilex • Parole • WordNet, EuroWordNet • SpeechDat projects • SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, Orientel • National / Local projects • Spoken Dutch Corpus (Netherlands and Flanders)

  48. 90s: Data • Data distribution Centers are set up • LDC (1993) • ELRA (1995) • Standards: • TEI for text corpora • CES, XCES • Eagles, ISLE for grammatical properties

  49. Automating Data Production • Usually existing (imperfect) tools are used to create data (semi-)automatically • G2P for creating phonetic dictionaries • PoS-tagging for PoS-tagged text corpora • Parsers for treebanks • For bootstrapping annotations • Faster and more consistent results • Followed by (partial) manual correction

  50. 00s • Early 00s • Many data and research initiatives, nationally • Netherlands • IMIX 2001-2008 • STEVIN 2004-2011 • TST-Centrale (HLT Agency) 2005-.. • France • EVALDA • Technolangue
