Hebrew Sentence Compression
Graduation project by: Parush Anat & Grisha Klots (http://www.cs.bgu.ac.il/~klotsg)
Supervised by: Yoav Goldberg & Dr. Michael Elhadad
Nov. 2010, CS BGU
A short example…
A “long” sentence may look like:
אתמול בשעה שש איה אפתה עוגת תפוחים טעימה ואני אכלתי חתיכה קטנה ממנה
(“Yesterday at six o'clock Aya baked a tasty apple cake and I ate a small piece of it”)
and can be compressed to:
אתמול איה אפתה עוגת תפוחים ואני אכלתי ממנה
(“Yesterday Aya baked an apple cake and I ate some of it”)
Sentence Compression – Why & How?
Motivations, implementation and some theoretical background
• The main motivation is automatic text summarization (academic papers, technical documents, textbooks and so on)
• We take a sentence-by-sentence approach – each sentence is compressed individually
• Our method is based on word deletion: words are removed to generate a shorter “version” of the sentence
Work process
• Our work consisted of two main phases:
  • Corpus Generation – developed in Python
  • Algorithm Implementation – developed in Java
• We implemented the algorithm developed by Ryan McDonald, described in his paper “Discriminative Sentence Compression with Soft Syntactic Evidence”
• The first implementation for Hebrew!
Phase #1 Sentence Generation – General
• Sentences were extracted from the “Haaretz” website
• A scoring method was employed to find pairs of “full” and “short” sentences
• All pairs were grouped into a single database (a large XML-like file)
• The XML file is then scanned again and filtered for irregularities and for words that are not in the Hebrew lexicon
• The final output is formatted to a predefined structure that serves as the input for the 2nd phase – Algorithm Implementation
(Figure: a sample article page, annotated with its Main Headline, Sub-headline and Body regions)
Phase #1 Sentence Generation – Extraction (Extractor)
• A scoring method ensures that only the best matches are returned (the required matching percentage varies)
• At this stage, we allowed clauses to change their relative position between the two versions of the sentence (the number of allowed changes also varies)
Phase #1 Sentence Generation – Filtration
• Due to strict rules imposed by the Algorithm Implementation, extensive filtration has to be performed:
  • Each word that appears in the short version must appear in the long one as well
  • No clause may change its position between the two sentences of a pair
• Taken together, the rules mean the short sentence must be a word-for-word subsequence of the long one (see the sketch below)
• More than 90% of the pairs fail this test! (The initial size of the DB was ~4,500 sentence pairs; filtration left us with only ~400 sentences to work with)
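The subsequence test is simple to state in code. A minimal sketch (the actual filter was part of the Python pipeline; this illustrative Java version and its naming are ours):

```java
// A minimal sketch of the core filtration test: every word of the short
// version must appear in the long version, in the same relative order.
class PairFilter {
    static boolean isValidPair(String[] longSent, String[] shortSent) {
        int j = 0;
        for (int i = 0; i < longSent.length && j < shortSent.length; i++)
            if (longSent[i].equals(shortSent[j])) j++;   // greedy left-to-right match
        return j == shortSent.length;                    // all short words matched, in order
    }
}
```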
Phase #1 Sentence Generation – Algorithm Input Formatting
Example (one cluster; each cluster ends with an END separator):

Short sentence: השוטר כופר בכל ההאשמות
(“The police officer denies all the charges”)
Its POS tags: NN VB DTT NN
Long sentence: עו"ד מיכאל בוסקילה המייצג את השוטר אמר כי החשוד כופר בכל ההאשמות נגדו
(“Attorney Michael Buskila, who represents the police officer, said that the suspect denies all the charges against him”)
Its POS tags: TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN
Kept indices: 5 9 10 11 (the 0-based positions, within the long sentence, of the words kept in the short one)
------------------END------------------
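Reading this format back only requires consuming a fixed number of lines per cluster. A minimal sketch, assuming the layout shown above (the class and field names are ours):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of a reader for the cluster format above: five data
// lines per cluster, followed by an END separator line.
class ClusterReader {
    static class Cluster {
        String[] shortWords, shortTags, longWords, longTags;
        int[] keptIndices;
    }

    static List<Cluster> read(String path) throws Exception {
        List<Cluster> clusters = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                Cluster c = new Cluster();
                c.shortWords = line.trim().split("\\s+");
                c.shortTags  = in.readLine().trim().split("\\s+");
                c.longWords  = in.readLine().trim().split("\\s+");
                c.longTags   = in.readLine().trim().split("\\s+");
                String[] idx = in.readLine().trim().split("\\s+");
                c.keptIndices = new int[idx.length];
                for (int i = 0; i < idx.length; i++)
                    c.keptIndices[i] = Integer.parseInt(idx[i]);
                in.readLine();                           // consume the END separator
                clusters.add(c);
            }
        }
        return clusters;
    }
}
```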
Phase #2 Algorithm Implementation
Phase #2 Algorithm Implementation
The “heart” of the algorithm – dynamic programming:

C[i, j] = max_{k < i} { C[k, j-1] + S(x, k, i) }

where x is the long sentence being compressed to a requested length, i is the index (in x) of the last word of the short sentence being built, j is the length of that short sentence, and C[i, j] is the maximum score of a compression of length j that ends at the i-th word of x.
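Read as code, the recurrence is a straightforward table fill followed by a backtrace. A minimal sketch (our own naming, not the project's actual source; score is stubbed out here and discussed on the Score slide below, and boundary handling is simplified):

```java
import java.util.Arrays;

class Compressor {

    // Stub: the real S(x, k, i) sums the weights of the features fired by
    // the pair (k, i); see the Score slide below.
    static double score(String[] x, int k, int i) { return 0.0; }

    // Returns the 0-based indices of the words of x kept in the compression.
    static int[] compress(String[] x, int requestedLength) {
        int n = x.length;
        // C[i][j] = best score of a compression of length j whose last kept
        // word is the i-th word of x (1-based; i = 0 is a virtual start).
        double[][] C = new double[n + 1][requestedLength + 1];
        int[][] back = new int[n + 1][requestedLength + 1];
        for (double[] row : C) Arrays.fill(row, Double.NEGATIVE_INFINITY);
        C[0][0] = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= Math.min(i, requestedLength); j++)
                for (int k = j - 1; k < i; k++) {   // C[i,j] = max_{k<i} { C[k,j-1] + S(x,k,i) }
                    double s = C[k][j - 1] + score(x, k, i);
                    if (s > C[i][j]) { C[i][j] = s; back[i][j] = k; }
                }
        int best = 1;                               // best last word for the requested length
        for (int i = 2; i <= n; i++)
            if (C[i][requestedLength] > C[best][requestedLength]) best = i;
        int[] kept = new int[requestedLength];
        for (int i = best, j = requestedLength; j >= 1; i = back[i][j], j--)
            kept[j - 1] = i - 1;                    // convert to 0-based word indices
        return kept;
    }
}
```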
Phase #2 Algorithm Implementation – basic terminology
• Feature – a string that characterizes a pair of words according to their syntactic analysis, their positions in the sentence and the words that lie between them
• For example:
  • pi:pj = getPOS(i):getPOS(j), e.g. “pi:pj = NN:VB”
  • for i < k < j: IsNeg = isNeg(getWord(k)), e.g. “IsNeg = True”, “IsNeg = False”
  • pi:pk:pj = getPOS(i):getPOS(k):getPOS(j)
Phase #2 Algorithm Implementation – basic terminology (cont.)
• Weights Vector – contains ordered pairs of <Feature, Weight> for all feature instances induced by different pairs of words
• For example: <<“pi:pj = NN:VB”, 100>, <“pi:pj = NN:DTT”, 5>>
• All the feature templates are hard-coded and predefined
• The Weights Vector is updated constantly during the learning phase
Phase #2 Algorithm Implementation – the Score function

C[i, j] = max_{k < i} { C[k, j-1] + S(x, k, i) }

S(x, k, i) returns the sum of the weights of the features fired by the k-th and i-th words of sentence x.
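To make this concrete, here is a minimal Java sketch of the feature extraction, the Weights Vector and S(x, k, i) together (our own naming and simplifications: the word and POS arrays are passed directly instead of the getWord/getPOS helpers, isNeg is a toy stand-in, and only the three templates shown earlier are implemented):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of feature extraction and scoring.
class Scorer {
    // The Weights Vector: feature string -> weight; unseen features score 0.
    Map<String, Double> weights = new HashMap<>();

    // Toy stand-in for the negation test on a dropped word (assumption).
    static boolean isNeg(String w) { return w.equals("לא"); }

    // Features fired by keeping word i immediately after word k,
    // thereby dropping every word strictly between them.
    List<String> features(String[] words, String[] pos, int k, int i) {
        List<String> feats = new ArrayList<>();
        feats.add("pi:pj=" + pos[k] + ":" + pos[i]);        // POS pair of the kept words
        for (int m = k + 1; m < i; m++) {                   // each dropped word in between
            feats.add("IsNeg=" + isNeg(words[m]));
            feats.add("pi:pk:pj=" + pos[k] + ":" + pos[m] + ":" + pos[i]);
        }
        return feats;
    }

    // S(x, k, i): the sum of the weights of all fired features.
    double score(String[] words, String[] pos, int k, int i) {
        double s = 0.0;
        for (String f : features(words, pos, k, i)) s += weights.getOrDefault(f, 0.0);
        return s;
    }
}
```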
Phase #2 Learning – Dynamic Programming (Part 1)
• For each cluster in the input file, we iterate over the list of kept indices, and for each two adjacent indices we extract their feature list. For example, the list 5 9 10 11 yields the pairs (5, 9), (9, 10) and (10, 11)
• For each feature in each list, we increase its weight by 1 in the Weights Vector
Phase #2 Learning – Dynamic Programming (Part 2)
• Now, we compress the long sentence to a new sentence having the length of the short one
• From the compressed sentence, we generate a new list of indices (as shown before) and extract its lists of features
• For each feature in each list, we decrease its weight by 1
• Together, Parts 1 and 2 form a perceptron-style update (see the sketch below)
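A minimal sketch of one such training step, assuming the Scorer and the DP decoder sketched on the earlier slides (decode here is a stub standing in for the compress routine wired to the current weights):

```java
// A minimal sketch of one learning step over a single training pair.
class Trainer {
    Scorer scorer = new Scorer();

    // Placeholder for the DP compressor from the earlier sketch, run with
    // the current weights; returns the 0-based indices of the kept words.
    int[] decode(String[] words, String[] pos, int length) {
        return new int[length];                      // stub for the sketch
    }

    void trainOnPair(String[] words, String[] pos, int[] goldIdx) {
        // Part 1: +1 for every feature along adjacent gold indices (e.g. 5 9 10 11).
        for (int t = 1; t < goldIdx.length; t++)
            for (String f : scorer.features(words, pos, goldIdx[t - 1], goldIdx[t]))
                scorer.weights.merge(f, 1.0, Double::sum);

        // Part 2: compress to the gold length with the current weights,
        // then -1 for every feature along the predicted indices.
        int[] predIdx = decode(words, pos, goldIdx.length);
        for (int t = 1; t < predIdx.length; t++)
            for (String f : scorer.features(words, pos, predIdx[t - 1], predIdx[t]))
                scorer.weights.merge(f, -1.0, Double::sum);
    }
}
```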
Results & Discussion – A short example
• Long sentence: כמו בפרשת ההכרה בישראל כמדינת העם היהודי העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(“As in the affair of the recognition of Israel as the state of the Jewish people, the timing of the government's decision to grant a special status to the capital comes across as a deliberate provocation”)
• Compressed sentence (original): העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(“The timing of the government's decision to grant a special status to the capital comes across as a deliberate provocation”)
• Compressed sentence (algorithm): העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת – identical to the original compression
Results & Discussion – And another one…
• Long sentence: למשרד המשפטים 60 יום לערער על ההחלטה שמציבה דילמה בפני ממשל אובמה
(“The Justice Ministry has 60 days to appeal the decision, which presents the Obama administration with a dilemma”)
• Compressed sentence (original): למשרד המשפטים 60 יום לערער על ההחלטה
(“The Justice Ministry has 60 days to appeal the decision”)
• Compressed sentence (algorithm): 60 יום לערער על ההחלטה ממשל אובמה
(“60 days to appeal the decision the Obama administration” – note that this output is not a grammatical Hebrew sentence)
Results & Discussion – A (very) basic results analysis
• We analyzed the compressions of 50 “unseen” sentences:
  • 8% matched the shortened version exactly
  • 25% differed by only one or two words from the shortened version
  • 37% were valid Hebrew sentences
  • 43% retained the general meaning of the original sentence
Results & Discussion – Future improvements
• Increase the DB size!!!
• Increase variety – use other sources of information
• Add more feature templates