ISM@FIRE MET-2013

ISM@FIRE MET-2013 • Amit Jain, Nitish Gupta, Sukomal Pal • Indian School of Mines, Dhanbad

Contents • Introduction to Morpheme • ISMStemmer • Result of MET at FIRE-2013 • Problems in ISMStemmer • Conclusion

Morpheme • In linguistics, a morpheme is the smallest grammatical unit in a language. • Every word comprises one or more morphemes. • Morphological analysis is the process of segmenting a word into its component morphemes. • e.g. "Unbreakable" comprises three morphemes: un- (a morpheme signifying "not"), -break- (the stem, a free morpheme), and -able (a morpheme signifying "can be done").

Stemmer • Attempts to reduce word variants to their stem or root form • Example – education, educating, educative will all reduce to educat • Reasons: • search engines are based on string matching • similarity of a document with respect to a query is mostly determined by exact term overlap • vocabulary mismatch arises because natural language documents use different forms of a word for the same content

Why stemming? (contd…)
Example – Suppose we have to search for some information about “education”
Query: education
• doc 1: For children education is very important
• doc 2: What is the reason we educate children
• doc 3: Government aims to make people educated
• doc 4: Educating young minds is the job of a teacher

Why stemming? (contd…)
By stemming: • Original words - education, educate • Stemmed word - educat
Query: education
• doc 1: For children education is very important
• doc 2: What is the reason we educate children
• doc 3: Government aims to make people educated
• doc 4: Educating young minds is the job of a teacher
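The effect of stemming on the four example documents can be sketched in a few lines of Python. This is only an illustration, not ISMStemmer itself: the suffix list `SUFFIXES`, the minimum stem length and the helper names `stem`, `tokens` and `search` are all assumptions chosen so that the slide's example words reduce to "educat".

```python
import re

# Hypothetical suffix list, picked only so the slide's example words share the stem "educat".
SUFFIXES = ("ion", "ing", "ed", "e")

def stem(token, min_stem_len=3):
    """Strip the longest listed suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token

def tokens(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def search(query, docs, use_stemming):
    """Return ids of documents sharing at least one (optionally stemmed) term with the query."""
    normalise = stem if use_stemming else (lambda t: t)
    wanted = {normalise(t) for t in tokens(query)}
    return [doc_id for doc_id, text in docs.items()
            if wanted & {normalise(t) for t in tokens(text)}]

# The four documents from the slide.
DOCS = {
    1: "For children education is very important",
    2: "What is the reason we educate children",
    3: "Government aims to make people educated",
    4: "Educating young minds is the job of a teacher",
}

print(search("education", DOCS, use_stemming=False))  # [1]
print(search("education", DOCS, use_stemming=True))   # [1, 2, 3, 4]
```

With exact string matching only doc 1 is retrieved; once query and documents are reduced to the common stem, all four match.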

ISMStemmer • Approaches for Stemming • Language based approach • Statistical approach • ISMStemmer is statistical • Based on suffix extraction • Suffixes identified by applying the Apriori algorithm (Agrawal and Srikant, 1994)

ISMStemmer algorithm
• Input: Single Column Refined File
• Generate valid suffixes (Apriori algorithm)
• Strip off valid suffixes to get stems
Word → Stem
aborning → aborn
absolution → absolu
absorption → absorp
abuilding → abuild
acquisition → acquisi
activation → activa
added → add
addition → add
admiration → admira
admitted → admitt
admitting → admitt
agreed → agre
agreeing → agree
allotted → allott
allotting → allott
ambling → ambl
angling → angl
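The stripping step can be sketched as follows. The slide does not say how ISMStemmer chooses between competing suffixes, so the longest-match rule, the `min_stem_len` guard and the function name `strip_suffix` are assumptions made for illustration, and the sketch is not expected to reproduce every stem listed above.

```python
def strip_suffix(word, valid_suffixes, min_stem_len=2):
    """Remove the longest valid suffix from word, if a long enough stem remains.

    Longest-match and min_stem_len are assumptions, not taken from the slides.
    """
    for suffix in sorted(valid_suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no valid suffix applies: the word is its own stem

# Three of the valid suffixes named on the slides.
valid = {"ing", "ed", "tion"}

for word in ["aborning", "absolution", "agreed", "agreeing"]:
    print(word, "->", strip_suffix(word, valid))
```

For these four words the sketch yields aborn, absolu, agre and agree, which agrees with the corresponding rows of the slide.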

Suffix Generation
• Input is Single Column Sorted Refined File: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
• Reverse the unique sorted word file: dedda dettolla … noitidda noitulosba … gnidliuba gnieerga gnilgna …
• Generate frequent suffixes (of length 1 character, 2 characters and so on).
• Find valid suffixes whose frequency is above a pre-decided threshold value α.
• Valid Suffixes: ing ed tion . . . er ment
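A level-wise, Apriori-style suffix search over the sample word list might look like the sketch below. It is written under several assumptions that the slides do not confirm: the threshold is a raw count (`alpha=4`), a suffix must be shorter than the word it is counted in, and only maximal frequent suffixes are reported as "valid". It slices word endings directly instead of reversing the file, which counts the same suffixes. The function names are hypothetical.

```python
from collections import Counter

def frequent_suffixes(words, alpha):
    """Find suffixes occurring in at least alpha distinct words, one length at a time."""
    frequent = {}
    pool = list(set(words))
    length = 1
    while True:
        # A suffix must leave a non-empty stem, so drop words that are too short.
        pool = [w for w in pool if len(w) > length]
        counts = Counter(w[-length:] for w in pool)
        level = {s: c for s, c in counts.items() if c >= alpha}
        if not level:
            break
        frequent.update(level)
        # Apriori pruning: a longer suffix can only be frequent if its own
        # shorter suffix is frequent, so keep only words ending in one of those.
        pool = [w for w in pool if w[-length:] in level]
        length += 1
    return frequent

def maximal_suffixes(frequent):
    """Keep frequent suffixes that are not the ending of a longer frequent suffix."""
    return {s for s in frequent
            if not any(t != s and t.endswith(s) for t in frequent)}

WORDS = ("aborning absolution absorption abuilding acquisition activation added "
         "addition admiration admitted admitting agreed agreeing allotted "
         "allotting ambling angling").split()

freq = frequent_suffixes(WORDS, alpha=4)
print(sorted(maximal_suffixes(freq)))  # ['ed', 'ing', 'tion']
```

On this 17-word sample the frequent suffixes are g, n, d, ng, on, ed, ing, ion and tion; the maximal ones coincide with the first three valid suffixes shown on the slide.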

Evaluation of ISMStemmer • For evaluation, ISMStemmer took part in the Morpheme Extraction Task (MET) of FIRE-2013 • ISMStemmer submitted • evaluated at IR Labs: DAIICT, Gujarat • tested on 5 languages of South Asian origin • gave good results for 3 of the languages

MET Results (IR Evaluation)

Results (Linguistic Evaluation)
• Tamil: Precision: 80.22% (non-affixes: 80.22%); Recall: 18.86% (non-affixes: 18.86%); F-measure: 30.54% (non-affixes: 30.54%)
• Bengali: Precision: 60.64% (non-affixes: 60.64%); Recall: 32.15% (non-affixes: 32.15%); F-measure: 42.02% (non-affixes: 42.02%)
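The reported F-measures are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not state which formula the evaluators used, so this is an assumption; the short check below recomputes both values from the reported precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in percent) as reported on the slide.
print(f"Tamil:   {f_measure(80.22, 18.86):.2f}")  # Tamil:   30.54
print(f"Bengali: {f_measure(60.64, 32.15):.2f}")  # Bengali: 42.02
```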

Post-hoc Analysis • Over-stemming • accent, accentual, accentuate – expected stem: accent • accept, acceptant, acceptor – expected stem: accept • access, accessible, accession – expected stem: access • due to over-stemming, all are reduced to acce • Stemming of Named Entities • Beijing → Beij

Analysis

Future plan • Need to consider the prefix as well (clustering based on prefix) • Identification of NEs (use of NERs) • ….

THANK YOU! Questions?
