
ISM@FIRE MET-2013

ISM@FIRE MET-2013. Amit Jain, Nitish Gupta, Sukomal Pal, Indian School of Mines, Dhanbad. Contents: Introduction to Morpheme; ISMStemmer; Result of MET at FIRE-2013; Problems in ISMStemmer; Conclusion.





Presentation Transcript


  1. ISM@FIRE MET-2013 • Amit Jain, Nitish Gupta, Sukomal Pal • Indian School of Mines, Dhanbad

  2. Contents • Introduction to Morpheme • ISMStemmer • Result of MET at FIRE-2013 • Problems in ISMStemmer • Conclusion

  3. Morpheme • In linguistics, a morpheme is the smallest grammatical unit in a language. • Every word comprises one or more morphemes. • Morphological analysis is the process of segmenting a word into its component morphemes, e.g. "unbreakable" comprises three morphemes: un- (a morpheme signifying "not"), -break- (the stem, a free morpheme), and -able (a morpheme signifying "can be done").

  4. Stemmer • Attempts to reduce word variants to their stem or root form. Example: education, educating and educative all reduce to educat. Reasons: • search engines are based on string matching • similarity of a document to a query is mostly determined by exact term overlap • vocabulary mismatch arises because natural-language documents use different forms of a word for the same content

  5. Why stemming? (contd…) Example: suppose we have to search for information about "education". Query: education • doc 1: For children education is very important • doc 2: What is the reason we educate children • doc 3: Government aims to make people educated • doc 4: Educating young minds is the job of a teacher

  6. Why stemming? (contd…) By stemming: original words: education, educate; stemmed word: educat. Query: education • doc 1: For children education is very important • doc 2: What is the reason we educate children • doc 3: Government aims to make people educated • doc 4: Educating young minds is the job of a teacher
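The effect on the four example documents can be sketched in a few lines of Python. This is only an illustration: the suffix list, `MIN_STEM` and the function names are hypothetical and are not taken from ISMStemmer.

```python
# Toy illustration of why stemming helps retrieval.
# SUFFIXES and MIN_STEM are hypothetical values chosen for this example only.
SUFFIXES = ["ing", "ion", "ed", "e"]
MIN_STEM = 3  # never strip a suffix if fewer than 3 characters would remain

DOCS = {
    1: "For children education is very important",
    2: "What is the reason we educate children",
    3: "Government aims to make people educated",
    4: "Educating young minds is the job of a teacher",
}


def toy_stem(word):
    """Lower-case the word and strip the longest matching toy suffix."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def search(query, stem=False):
    """Return ids of documents containing the (optionally stemmed) query term."""
    normalise = toy_stem if stem else str.lower
    target = normalise(query)
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if target in {normalise(token) for token in text.split()}
    )


print(search("education"))             # [1]
print(search("education", stem=True))  # [1, 2, 3, 4]
```

With exact string matching only doc 1 is retrieved; once query and documents are both reduced to the stem educat, all four documents match.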

  7. ISMStemmer • Approaches for stemming: • Language-based approach • Statistical approach • ISMStemmer is statistical • Based on suffix extraction • Suffixes identified by applying the Apriori algorithm (Agrawal and Srikant, 1994)

  8. ISMStemmer algorithm • Input: single-column refined file • Generate valid suffixes (Apriori algorithm) • Strip off valid suffixes to get stems • Word → stem: aborning → aborn, absolution → absolu, absorption → absorp, abuilding → abuild, acquisition → acquisi, activation → activa, added → add, addition → add, admiration → admira, admitted → admitt, admitting → admitt, agreed → agre, agreeing → agree, allotted → allott, allotting → allott, ambling → ambl, angling → angl
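The stripping step can be sketched as a longest-match suffix removal. This is a minimal sketch under assumptions: it uses only the five suffixes named on the Suffix Generation slide (ing, ed, tion, er, ment), and the `min_stem` guard and the function name are hypothetical, not part of ISMStemmer as described.

```python
# Minimal sketch of "strip off valid suffixes to get stems".
# VALID_SUFFIXES holds only the suffixes visible in the slides; the full list
# used by ISMStemmer is not given. min_stem is a hypothetical safeguard.
VALID_SUFFIXES = ["ing", "ed", "tion", "er", "ment"]


def strip_suffix(word, suffixes=VALID_SUFFIXES, min_stem=2):
    """Remove the longest valid suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["aborning", "absorption", "admitted", "agreed", "agreeing"]:
    print(w, "->", strip_suffix(w))
```

For these words the sketch gives aborn, absorp, admitt, agre and agree, matching the slide. It does not reproduce every pair: for "addition" it returns "addi" where the slide lists "add", so the actual stemmer presumably uses a longer suffix list or additional rules that the slides do not show.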

  9. Suffix Generation • Input is the single-column sorted refined file: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling • Reverse the unique sorted word file: dedda dettolla … noitidda noitulosba … gnidliuba gnieerga gnilgna … • Generate frequent suffixes (of length 1 character, 2 characters and so on). • Find valid suffixes whose frequency is above a pre-decided threshold value α. • Valid suffixes: ing, ed, tion, …, er, ment
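The level-wise search described on this slide might look roughly like the sketch below. It follows the Apriori idea that a suffix of length k+1 can only be frequent if its last k characters are already frequent. The function name, the `max_len` cap and the choice of "frequency >= alpha" as the threshold test are assumptions; the slides do not say how ISMStemmer handles these details.

```python
from collections import Counter

WORDS = """aborning absolution absorption abuilding acquisition activation
added addition admiration admitted admitting agreed agreeing allotted
allotting ambling angling""".split()


def frequent_suffixes(words, alpha, max_len=6):
    """Apriori-style, level-wise mining of frequent suffixes.

    Words are reversed so that suffixes become prefixes, as on the slide.
    Returns {suffix: frequency} for every suffix with frequency >= alpha.
    """
    reversed_words = sorted({w[::-1] for w in words})
    frequent = {}
    survivors = set()  # frequent reversed prefixes from the previous level
    for k in range(1, max_len + 1):
        counts = Counter()
        for rw in reversed_words:
            if len(rw) <= k:  # keep at least one character as the stem
                continue
            prefix = rw[:k]
            # Apriori pruning: extend only candidates whose shorter part survived.
            if k > 1 and prefix[:-1] not in survivors:
                continue
            counts[prefix] += 1
        survivors = {p for p, c in counts.items() if c >= alpha}
        if not survivors:
            break
        for p in survivors:
            frequent[p[::-1]] = counts[p]  # un-reverse to get the real suffix
    return frequent


result = frequent_suffixes(WORDS, alpha=4)
print(sorted(result.items(), key=lambda item: (-len(item[0]), item[0])))
```

On the 17 example words with α = 4 this finds tion (6 words), ing (7) and ed (4), but also their shorter tails such as ion, ng and g. How ISMStemmer decides which of the frequent suffixes count as valid is not stated in the slides, so that filtering step is left out here.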

  10. Evaluation of ISMStemmer • For evaluation, we participated in the Morpheme Extraction Task (MET) of FIRE-2013 • ISMStemmer was submitted • evaluated at IR Labs, DAIICT, Gujarat • tested on 5 languages of South Asian origin • gave efficient results for 3 of the languages

  11. MET Results (IR Evaluation)

  12. Results (Linguistic Evaluation) • Tamil: Precision: 80.22% (non-affixes: 80.22%); Recall: 18.86% (non-affixes: 18.86%); F-measure: 30.54% (non-affixes: 30.54%) • Bengali: Precision: 60.64% (non-affixes: 60.64%); Recall: 32.15% (non-affixes: 32.15%); F-measure: 42.02% (non-affixes: 42.02%)
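The reported F-measures are consistent with the usual balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not state the formula, so treating it as the one used by the evaluation is an assumption; the short check below simply recomputes it from the reported precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


tamil = f_measure(80.22, 18.86)
bengali = f_measure(60.64, 32.15)
print(f"Tamil:   {tamil:.2f}%")    # Tamil:   30.54%
print(f"Bengali: {bengali:.2f}%")  # Bengali: 42.02%
```

In both languages precision is much higher than recall, which pulls the F-measure down towards the recall value.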

  13. Post-hoc Analysis • Over-stemming: the three groups below should reduce to three different stems, but due to over-stemming all of them are reduced to acce • accent, accentual, accentuate → accent • accept, acceptant, acceptor → accept • access, accessible, accession → access • Stemming of named entities, e.g. Beijing → Beij
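One way such over-stemming can arise is when short, frequent character sequences are accepted as suffixes and stripping is applied repeatedly. The sketch below is purely illustrative: both suffix lists, the repeated stripping and the `min_stem` guard are hypothetical, and the slides do not say which suffixes actually caused the problem in ISMStemmer.

```python
# Hypothetical suffix lists, chosen only to illustrate over-stemming.
CONSERVATIVE = ["uate", "ible", "ual", "ant", "ion", "or"]
PERMISSIVE = CONSERVATIVE + ["nt", "pt", "ss"]  # short, over-eager suffixes

WORDS = [
    "accent", "accentual", "accentuate",
    "accept", "acceptant", "acceptor",
    "access", "accessible", "accession",
]


def stem(word, suffixes, min_stem=3):
    """Repeatedly strip the longest matching suffix until none applies."""
    changed = True
    while changed:
        changed = False
        for suffix in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


conservative_stems = {stem(w, CONSERVATIVE) for w in WORDS}
permissive_stems = {stem(w, PERMISSIVE) for w in WORDS}
print(sorted(conservative_stems))  # ['accent', 'accept', 'access']
print(sorted(permissive_stems))    # ['acce']
```

With the conservative list the nine words fall into three stems; adding the short suffixes collapses all of them into acce, so unrelated words would be conflated at retrieval time.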

  14. Analysis

  15. Future plan • Need to consider the prefix as well: clustering based on prefix • Identification of NEs (use of NERs) • …

  16. THANK YOU! Questions?
