
Hebrew Sentence Compression


Presentation Transcript


  1. Graduation project by: Parush Anat & Grisha Klots http://www.cs.bgu.ac.il/~klotsg Supervised by: Yoav Goldberg & Dr. Michael Elhadad Hebrew Sentence Compression Nov. 2010, CS BGU.

  2. A short example… A “long” sentence may look like: אתמול בשעה שש איה אפתה עוגת תפוחים טעימה ואני אכלתי חתיכה קטנה ממנה (“Yesterday at six o’clock Aya baked a tasty apple cake and I ate a small piece of it”), and it can be compressed to: אתמול איה אפתה עוגת תפוחים ואני אכלתי ממנה (“Yesterday Aya baked an apple cake and I ate from it”).

  3. Sentence Compression – Why & How? Motivations, implementation and some theoretical background • Automatic text summarization (academic and technical texts, textbooks and so on…) • A sentence-by-sentence approach – each sentence is compressed individually • Our method is based on word deletion: words are removed from the original sentence to generate a shorter “version” of it

  4. Work process • Our work consisted of two main phases: • Corpus Generation – Developed in Python • Algorithm Implementation – Developed in Java • We implemented the algorithm developed by Ryan McDonald and described in his paper: “Discriminative Sentence Compression with Soft Syntactic Evidence” • First time implemented in Hebrew!

  5. Phase #1 Sentence Generation – General • Sentences were extracted from the “Haaretz” website • A scoring method was employed to find pairs of “full” and “short” sentences • All pairs were grouped into a single database (a large XML-like file) • The XML file is scanned again and filtered for irregularities and for words that are not in the Hebrew lexicon • The final output is formatted into a predefined structure that serves as the input to Phase #2 – Algorithm Implementation
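
A minimal sketch of this flow in Python (the language Phase #1 was developed in), assuming a simple word-overlap score and a plain set as the Hebrew lexicon; the names, the threshold and the data shapes are illustrative guesses, not the project's actual code:

    def overlap_score(short_words, long_words):
        # Rough "matching percentage": share of the short sentence's words found in the long one.
        return sum(w in long_words for w in short_words) / max(len(short_words), 1)

    def build_pairs(candidates, lexicon, min_score=0.8):
        # candidates: iterable of (long_words, short_words) pairs; returns the pairs kept for Phase #2.
        pairs = []
        for long_words, short_words in candidates:
            if overlap_score(short_words, long_words) < min_score:
                continue                                  # keep only the best matches
            if any(w not in lexicon for w in long_words + short_words):
                continue                                  # drop pairs with out-of-lexicon words
            pairs.append((long_words, short_words))
        return pairs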

  6. [Article layout illustration: main headline, sub-headline, body]

  7. Phase #1 Sentence Generation – Extraction (Extractor) • A scoring method ensures that only the best matches are returned (the matching percentage varies) • At this stage we still allowed clauses to change their relative position between the two versions of the sentence (the permitted number of such changes also varies)
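
One illustrative way such a score could be computed, assuming the “matching percentage” is word overlap and that a small number of out-of-order matches stands in for clause movement; the details below are guesses, not the extractor's actual logic:

    def score_pair(long_words, short_words, max_moves=2):
        # Positions (first occurrence) of the short sentence's words inside the long one.
        positions = [long_words.index(w) for w in short_words if w in long_words]
        match_pct = len(positions) / max(len(short_words), 1)
        # A "move": a matched word appearing earlier in the long sentence than the
        # matched word that precedes it in the short sentence.
        moves = sum(1 for a, b in zip(positions, positions[1:]) if b < a)
        return match_pct if moves <= max_moves else 0.0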

  8. Phase #1 Sentence Generation – Filtration • Due to the strict rules imposed by the Algorithm Implementation, much filtration has to be performed • Each word that appears in the short version must appear in the long one as well. • No clause may change its position between the two sentences of a pair. • More than 90% of the pairs fail this test! • (The initial size of the DB was ~4,500 sentences. Filtration left us with only ~400 sentences to work with)
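
Both conditions together amount to requiring that the short sentence be an in-order subsequence of the long one, which can be checked as in this small sketch (the helper name is illustrative):

    def is_deletion_only(long_words, short_words):
        # True iff short_words can be obtained from long_words by deleting words only.
        remaining = iter(long_words)
        return all(w in remaining for w in short_words)   # greedy in-order subsequence check

    # The pair from slide 2 passes; a pair with a reordered clause would not.
    long_s = "אתמול בשעה שש איה אפתה עוגת תפוחים טעימה ואני אכלתי חתיכה קטנה ממנה".split()
    short_s = "אתמול איה אפתה עוגת תפוחים ואני אכלתי ממנה".split()
    assert is_deletion_only(long_s, short_s)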

  9. Phase #1 Sentence Generation – Algorithm Input Formatting. Example (short sentence, its POS tags, long sentence, its POS tags, indices of the retained words, and an END separator):
השוטר כופר בכל ההאשמות
NN VB DTT NN
עו"ד מיכאל בוסקילה המייצג את השוטר אמר כי החשוד כופר בכל ההאשמות נגדו
TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN
5 9 10 11
------------------END------------------
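
A hypothetical reader for this record format; the field order and the assumption that the indices are 0-based positions into the long sentence are inferred from the example above, not documented:

    def read_clusters(path):
        # Yield one dict per <long, short> pair from the Phase #1 output file.
        block = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                if line.startswith("----"):            # the END separator closes a record
                    short_w, short_pos, long_w, long_pos, kept = block
                    yield {
                        "short": short_w.split(),
                        "short_pos": short_pos.split(),
                        "long": long_w.split(),
                        "long_pos": long_pos.split(),
                        "kept": [int(i) for i in kept.split()],   # 0-based indices into the long sentence
                    }
                    block = []
                else:
                    block.append(line)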

  10. Phase #2 Algorithm Implementation

  11. Phase #2 Algorithm Implementation. The “heart” of the algorithm is dynamic programming. Compress(long sentence x, requested length): C[i,j] = max over k<i of { C[k,j-1] + S(x,k,i) }, where i is the position of a word in the long sentence, j is the length of the short sentence, and C[i,j] is the maximum score of a short sentence of length j whose last kept word is the i-th word of x.
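
A minimal sketch of this recurrence (in Python for brevity; the project's Phase #2 was written in Java), assuming 1-based word positions with a virtual start symbol at position 0, a target length of at most the sentence length, and a score function S as defined on slide 14; all names are illustrative:

    NEG_INF = float("-inf")

    def compress(x, target_len, S):
        # x = (words, pos_tags); returns the 0-based indices of the kept words.
        words, _ = x
        n = len(words)
        # C[i][j]: best score of a length-j compression whose last kept word is word i.
        C = [[NEG_INF] * (target_len + 1) for _ in range(n + 1)]
        back = [[None] * (target_len + 1) for _ in range(n + 1)]
        C[0][0] = 0.0                                 # position 0 is the virtual start
        for j in range(1, target_len + 1):
            for i in range(1, n + 1):
                for k in range(i):                    # C[i,j] = max_{k<i} C[k,j-1] + S(x,k,i)
                    if C[k][j - 1] == NEG_INF:
                        continue
                    score = C[k][j - 1] + S(x, k, i)
                    if score > C[i][j]:
                        C[i][j], back[i][j] = score, k
        # Pick the best final word, then follow the back-pointers to recover the kept positions.
        end = max(range(1, n + 1), key=lambda i: C[i][target_len])
        kept, i, j = [], end, target_len
        while j > 0:
            kept.append(i)
            i, j = back[i][j], j - 1
        return [p - 1 for p in reversed(kept)]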

  12. Phase #2 Algorithm Implementation - basic terminology • Feature – a string that characterizes a pair of words according to their syntactic analysis, their position in the sentence and the words that are between them. • For example: • pi:pj = getPOS(i):getPOS(j) • “pi:pj = NN : VB” • for i<k<j: • IsNeg = isNeg(getWord(k)) • “IsNeg = True”, “IsNeg = False” • pi:pk:pj = getPOS(i):getPOS(k):getPOS(j)
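
A sketch of how these templates could be rendered as feature strings for a pair of kept words (i, j) with the words between them dropped; the negation word list and all names are illustrative assumptions:

    NEGATION_WORDS = {"לא", "אין", "אל"}       # illustrative Hebrew negation words

    def is_negation_word(word):
        return word in NEGATION_WORDS

    def extract_features(words, pos, i, j):
        # Feature strings for keeping word j immediately after word i (dropping i+1..j-1).
        feats = [f"pi:pj = {pos[i]}:{pos[j]}"]              # POS bigram of the kept pair
        for k in range(i + 1, j):                           # every word dropped in between
            feats.append(f"IsNeg = {is_negation_word(words[k])}")
            feats.append(f"pi:pk:pj = {pos[i]}:{pos[k]}:{pos[j]}")
        return feats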

  13. Phase #2 Algorithm Implementation - basic terminology (cont.) • Weights Vector – Contains ordered pairs of <Feature, Weight> for all instances induced by different pairs of words • For example: • <<“pi:pj = NN:VB”,100>,<“pi:pj = NN:DTT”,5>> • All the feature templates are hard-coded and predefined. • The Weights Vector is updated constantly during the learning phase.

  14. Phase #2 Algorithm Implementation – the Score function C[i,j] = max over k<i of { C[k,j-1] + S(x,k,i) } where S(x,k,i) returns the sum of the weights of the features induced by the k-th and i-th words of sentence x
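
A minimal sketch of S, with the Weights Vector held as a Python dict from feature strings to weights (matching the <Feature, Weight> pairs on slide 13) and extract_features as sketched after slide 12; again a hypothetical rendering rather than the project's Java code:

    from collections import defaultdict

    def make_scorer(weights, extract_features):
        # weights: mapping from feature string to weight; returns the function S(x, k, i).
        def S(x, k, i):
            words, pos = x
            # Prepend a virtual start token so positions match the DP sketch above.
            w, p = ["<s>"] + list(words), ["<s>"] + list(pos)
            return sum(weights[f] for f in extract_features(w, p, k, i))
        return S

    weights = defaultdict(float)
    weights["pi:pj = NN:VB"] = 100.0           # toy values taken from slide 13
    weights["pi:pj = NN:DTT"] = 5.0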

  15. Phase #2 Learning – Dynamic Programming (Part 1) • For each cluster from the input file, we iterate over the list of indices and, for each pair of adjacent indices, we extract their feature list. For example: 5 9 10 11 • For each feature in each list, we increase its weight by 1 in the Weights Vector

  16. Phase #2 Learning – Dynamic Programming (Part 2) • Now, we compress the long sentence to a new sentence having the length of the short one • From the compressed sentence, we generate a new list of indices (as shown before) and extract the lists of features • For each feature in each list, we decrease its weight by 1
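
A hypothetical sketch combining slides 15 and 16 into a single perceptron-style update pass; compress, make_scorer and extract_features are the earlier sketches, clusters is the output of the reader sketched after slide 9, and the +1/-1 step sizes are as stated on the slides:

    def pair_features(words, pos, indices):
        # Feature lists for each pair of adjacent kept positions (e.g. 5 9 10 11).
        idx = [0] + [i + 1 for i in indices]           # shift to 1-based, add the virtual start
        w, p = ["<s>"] + list(words), ["<s>"] + list(pos)
        return [extract_features(w, p, a, b) for a, b in zip(idx, idx[1:])]

    def train_epoch(clusters, weights):
        # One pass of the +1 / -1 weight updates over all clusters.
        S = make_scorer(weights, extract_features)
        for c in clusters:
            gold = c["kept"]
            for feats in pair_features(c["long"], c["long_pos"], gold):
                for f in feats:
                    weights[f] += 1.0                  # slide 15: reward the gold features
            pred = compress((c["long"], c["long_pos"]), len(gold), S)
            for feats in pair_features(c["long"], c["long_pos"], pred):
                for f in feats:
                    weights[f] -= 1.0                  # slide 16: penalise the predicted features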

  17. Results & Discussion A short example • Long sentence: כמו בפרשת ההכרה בישראל כמדינת העם היהודי העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת (“As in the affair of the recognition of Israel as the state of the Jewish people, the timing of the government's decision to grant special status to the capital looks like a deliberate provocation”) • Shortened sentence (original): העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת (“The timing of the government's decision to grant special status to the capital looks like a deliberate provocation”) • Shortened sentence (algorithm): העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת (identical to the original shortened version)

  18. Results & Discussion And another one… • Long sentence: למשרד המשפטים 60 יום לערער על ההחלטה שמציבה דילמה בפני ממשל אובמה (“The Justice Ministry has 60 days to appeal the decision, which poses a dilemma for the Obama administration”) • Shortened sentence (original): למשרד המשפטים 60 יום לערער על ההחלטה (“The Justice Ministry has 60 days to appeal the decision”) • Shortened sentence (algorithm): 60 יום לערער על ההחלטה ממשל אובמה (“60 days to appeal the decision the Obama administration”)

  19. Results & Discussion A (very) basic results analysis • We analyzed the compression of 50 “unseen” sentences: • 8% matched exactly the shortened version • 25% differ by one or two words from the shortened version • 37% are valid Hebrew sentences • 43% retained the general notion of the original sentence

  20. Results & Discussion Future improvements • Increase DB size!!! • Increase variety – use other sources of information • Add more feature templates

  21. Thank you!
