CHARM Lecture 1: Outline of the Problem
The Problem 1: The Maltese Alphabet
A a (a), B b (be), Ċ ċ (ċe), D d (de), E e (e), F f (ef), Ġ ġ (ġe), G g (ge), Għ għ (ajn), H h (akka)
Ħ ħ (ħe), I i (i), Ie ie (ie), J j (je), K k (ke), L l (elle), M m (emme), N n (enne), O o (o), P p (pe)
Q q (qe), R r (erre), S s (esse), T t (te), U u (u), V v (ve), W w (we), X x (exxe), Ż ż (że), Z z (zej)
We will refer to ordinary characters that may stand for special Maltese characters (for example, H for Ħ) as charms.
The Problem 2: from KullĦadd
FIL-KRIZI li ghandna fit-turizmu fil-gzejjer taghna l-aghar li qed jintlaqtu huma l-lukandi tal tliet stilel. L-ahhar studju li sar mid-Deloitte ghall-Assocjazzjoni Maltija tal-Lukandi u Ristoranti jghidilna kif in-nuqqas tal turisti u z-zieda fl-ispejjez ghal dawn il-lukandi fissru li ghamlu telf tal 19.8% fir-rata tal qliegh taghhom u fosthom kien hemm min salva biss anki fl-aqwa tas-sajf permezz tal l-istudenti. L-istess studju juri li 70% tas-sidien tal dawn il-lukandi jibzghu li se jkomplu jbatu min-nuqqas tal turisti u se jkollhom hafna kmamar vojta fix-xhur li gejjin.
The Problem 3
Is there some way in which we can recover the special Maltese characters automatically? If so:
• What is the underlying algorithmic model?
• What knowledge must the program bring to bear?
• What resources are needed to build the knowledge base?
Noisy Channel Model for Sentence Translation (Brown et al. 1990)
[Diagram: noisy channel relating source sentence and target sentence; from Jurafsky & Martin]
Algorithmic Model
• The noisy channel model is domain independent.
• Brown et al. applied it to translation from a source language to a target language.
• We can apply it at the level of individual words.
Noisy Channel at Word Level
KullĦadd (source) → NOISY CHANNEL → KullHadd (target)
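The word-level picture can be written as the usual noisy channel decision rule. This is a sketch of the standard formulation, not something spelled out on the slides: s is a candidate source word, t the observed target word, S the set of candidates, and the channel term P(t | s) is an assumption brought in from the general model.

```latex
\hat{s} = \arg\max_{s \in S} P(s \mid t)
        = \arg\max_{s \in S} \frac{P(t \mid s)\, P(s)}{P(t)}
        = \arg\max_{s \in S} P(t \mid s)\, P(s)
```

P(t) can be dropped because it does not depend on s. If, in addition, every candidate in S is assumed equally likely to come out of the channel as t, then P(t | s) is constant and the rule reduces to choosing the s with the highest P(s), which is one way to read the argmax over P(s) in Step 3.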
Main Algorithm: Four Steps
1. See target word t.
2. Generate the set S of all possible source words for t.
3. Pick the most probable source word s in S.
4. Output s.
Step 1: See Target Word
• Preprocessing
  • noise
  • case
  • punctuation
  • hyphen
• Tokenisation
  • words
  • numbers
  • other
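The tokenisation part of Step 1 could be sketched as below. This is a minimal illustration only: the name `tokenise`, the regular expression, and the choice to treat hyphens and punctuation as separate "other" tokens are all assumptions, since the slides do not say how noise, case, or hyphens should be handled.

```python
import re

# Hypothetical token pattern (an assumption, not from the slides):
#   number: digits, optionally with . or , groups and a trailing %
#   word:   a run of letters (Unicode-aware, so Ħ, ċ, ż count as letters)
#   other:  any single remaining non-space character (hyphen, full stop, ...)
TOKEN_RE = re.compile(
    r"(?P<number>\d+(?:[.,]\d+)*%?)|(?P<word>[^\W\d_]+)|(?P<other>\S)"
)

def tokenise(text):
    """Return a list of (kind, token) pairs; kind is 'word', 'number' or 'other'."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

if __name__ == "__main__":
    for kind, token in tokenise("telf tal 19.8% fir-rata"):
        print(kind, token)
```

Only tokens tagged `word` would need to be passed on to Step 2; numbers and other tokens can be copied to the output unchanged.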
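Step 2: Generate S
• If t contains charms, generate
  S = { s | len(s) = len(t) and, for all 0 < i <= len(t), s[i] = t[i] or s[i] = m(t[i]) }
  where m(c) is the Maltese character that a charm c may stand for.
• If t contains no charms, t is its own only candidate, so S = { t }.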
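The set definition in Step 2 can be sketched in a few lines of Python. The table `CHARM_MAP` stands in for the mapping m and is an assumption (c, g, h, z and their capitals mapped to the dotted or barred Maltese letters); the function name `generate_sources` is likewise hypothetical.

```python
from itertools import product

# Assumed charm table m: each plain character maps to the Maltese
# character it may stand for. Not taken from the slides.
CHARM_MAP = {
    "c": "ċ", "g": "ġ", "h": "ħ", "z": "ż",
    "C": "Ċ", "G": "Ġ", "H": "Ħ", "Z": "Ż",
}

def generate_sources(t):
    """Return the set S of candidate source words for target word t."""
    # At each position keep t[i], and also offer m(t[i]) when t[i] is a charm.
    options = [(ch, CHARM_MAP[ch]) if ch in CHARM_MAP else (ch,) for ch in t]
    # The Cartesian product enumerates every combination of choices.
    return {"".join(chars) for chars in product(*options)}

if __name__ == "__main__":
    print(sorted(generate_sources("KullHadd")))
```

A word with k charms yields 2^k candidates, so S grows quickly for words such as ghandna; a word with no charms yields only itself.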
Step 3
• Pick the most probable source word s in S:
  return argmax(P(s)) for s in S
• This is covered in Lecture 2.
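As a placeholder until Lecture 2, the argmax in Step 3 might be sketched with a simple word-frequency estimate of P(s). Everything here is an assumption for illustration: the function name `pick_source`, the add-one smoothing, and above all the toy counts, which are invented and do not come from any corpus.

```python
from collections import Counter

def pick_source(candidates, counts):
    """Return the candidate s that maximises a smoothed unigram estimate of P(s)."""
    total = sum(counts.values())

    def prob(s):
        # Add-one smoothing (an assumption) so that unseen candidates
        # still receive a small non-zero probability.
        return (counts.get(s, 0) + 1) / (total + len(counts) + 1)

    # Sort first so that ties are broken the same way on every run.
    return max(sorted(candidates), key=prob)

# Toy counts, invented for illustration only.
toy_counts = Counter({"għandna": 5, "ħafna": 3, "tal": 9})

if __name__ == "__main__":
    S = {"ghandna", "għandna", "ġhandna", "ġħandna"}
    print(pick_source(S, toy_counts))  # prints għandna
```

With real data, `counts` would be built from a corpus of correctly spelled Maltese text; how the probabilities are actually estimated is left to Lecture 2.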