
N-gram Based Indexing for Marathi Monolingual Search

Ashish Almeida and Pushpak Bhattacharyya, IIT Bombay


Presentation Transcript


  1. N-gram Based Indexing for Marathi Monolingual Search Ashish Almeida and Pushpak Bhattacharyya IIT Bombay

  2. Introduction • Word based techniques • Lexical analysis • Morphological Analysis • Language dependent • Dictionary • Spelling normalization • Stop-word elimination • Multi-word expressions • N-grams • Language independent • Easy to develop

  3. Related Work • Character n-gram Tokenization : McNamee • Significance of n-grams • Use overlapping n-grams for different languages • Tested on HAIRCUT system • Various aspects of n-gram Modeling • Defining generalized n-grams : K. Jarvelin • Defines n-gram and s-gram • Similarity measures for comparing n-grams

  4. Corpus • FIRE 2008 data set • http://www.isical.ac.in/~fire/ • Documents • 99,275 news articles • Years 2004-2007 • eSakal and Maharashtra Times • Queries • 50 training set (from FIRE 2008) • 50 test set (translated from English)

  5. Document <DOC> <DOCNO>MaharashtraC06E811B.txt</DOCNO> <TEXT> घटक चाचणीत सरकार अनुत्तीर्ण सरकारने आज शंभर दिवस पूर्ण केले आहेत. ठराविक काळानंतर होणाऱ्या घटक चाचणी परीक्षांत आपली तयारी आजमावण्याची संधी विद्यार्थ्यांना मिळते. सरकारच्या कामगिरीचे मूल्यमापनही त्याच धर्तीवर करावे, या हेतूने हे प्रगतिपुस्तक मांडले आहे. …. </TEXT> </DOC>

  6. Relevance Judgment • 50 new queries translated into Marathi for FIRE 2010 • Query no. 76 to query no. 125 • Marathi • 20,600 documents judged for FIRE 2010 • 621 relevant documents • 11 queries have no relevant documents

  7. Why N-grams • Vocabulary increases with the number of documents in the corpus • Dictionary size grows • N-gram based tokens • The number of distinct tokens is bounded by the choice of n • N-grams break words • Capture morphological changes • Combine parts of consecutive words • Resilient to spelling errors / spelling variations
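The claim that n-grams capture morphological changes can be illustrated with a small sketch. It compares two inflected forms of the same word by the overlap of their character 4-gram sets. The Dice coefficient used here is an illustrative choice, not a measure named in the slides, and the helper names `ngram_set` and `dice` are hypothetical.

```python
def ngram_set(word, n=4):
    """Return the set of character n-grams of a word padded with '_'."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=4):
    """Dice coefficient between the n-gram sets of two words."""
    x, y = ngram_set(a, n), ngram_set(b, n)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Two forms of the same noun taken from the sample document on slide 5.
# As whole-word tokens they do not match at all; as 4-grams they share
# most of the stem.
print(dice("सरकार", "सरकारने"))
```

A word-based index treats the two forms as unrelated terms unless a morphological analyzer maps them together, while the n-gram sets overlap without any language-specific processing.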

  8. Problems with Morph-analyzer • Accuracy of the morphological analyzer is limited • Cannot handle • Unknown words • Unknown suffixes • Not suitable for news domain IR • Many named entities • Computationally heavy

  9. Generating N-grams • For each document: • Get the article text • Remove punctuation marks • Replace each space with ‘_’ • Put ‘_’ before and after each sentence • Treat addresses, titles, etc. like sentences • While more than n-1 characters are left: • Select the first n characters as an n-gram • Remove the first character from the text
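The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration for a single sentence, not the authors' implementation: the function name `generate_ngrams` is hypothetical, "punctuation" is assumed to mean any character in a Unicode punctuation category, and the window is assumed to move over Unicode code points.

```python
import unicodedata


def generate_ngrams(sentence, n=4):
    """Generate overlapping character n-grams for one sentence."""
    # Remove punctuation marks (Unicode categories starting with "P").
    cleaned = "".join(
        ch for ch in sentence
        if not unicodedata.category(ch).startswith("P")
    )
    # Replace spaces with "_" and put "_" before and after the sentence.
    padded = "_" + "_".join(cleaned.split()) + "_"
    # While at least n characters are left, take the first n characters
    # and then drop the first character: a sliding window of width n.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for gram in generate_ngrams("या फुटबॉलपटुंना प्रसिद्धीची गरज आहे."):
    print(gram)
```

The list comprehension is equivalent to the slide's loop: each window start `i` corresponds to one "remove the first character" step.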

  10. Example: 4-grams • “या फुटबॉलपटुंना प्रसिद्धीची गरज आहे.” (These football players deserve fame) • “_या_फुटबॉलपटुंना_प्रसिद्धीची_गरज_आहे_” • 4-grams generated • _या_ , या_फ , ा_फु , _फुट , फुटब , ुटबॉ , ... , गरज_ , रज_आ , ज_आह , _आहे , आहे_ • Length of a word, counted in Unicode code points • प्रसिद्धी • प + ् + र + स + ि + द + ् + ध + ी • 9 characters
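The character count on this slide matches the number of Unicode code points, where vowel signs and the virama each count as one character. A short sketch, using only the standard `unicodedata` module, lists them for the example word; the helper name `code_points` is hypothetical.

```python
import unicodedata


def code_points(word):
    """Return (character, Unicode name) pairs for each code point in a word."""
    return [(ch, unicodedata.name(ch)) for ch in word]


word = "प्रसिद्धी"
for ch, name in code_points(word):
    print(f"U+{ord(ch):04X}  {name}")
print(len(word), "code points")
```

Counting code points rather than visible syllables is why a 4-gram such as ा_फु can start with a dependent vowel sign.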

  11. IR System • Terrier 2 • Open source • Modular • Easy to modify • Unicode ready * • Retrieval models used for evaluation (available in Terrier 2)

  12. Indexing • N-grams • Indexing • Modified tokenizer • Query processing • Index data structure sizes in Terrier for different n-grams

  13. Experiment 1. Baseline • Only word-based indexing and retrieval • No preprocessing

  14. Experiment: n-grams 2. Using basic n-grams (DFR_BM25) • MAP for n-grams of different lengths
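For readers unfamiliar with the metric, mean average precision (MAP) can be sketched as below. This is a generic illustration rather than the evaluation tool used for the experiments. The function names and the toy document identifiers are hypothetical, and queries with no relevant documents (11 of the FIRE 2010 queries, per slide 6) are assumed to be left out of the mean.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over queries that have relevant documents.

    runs:  {query_id: [doc_id, ...]} in ranked order
    qrels: {query_id: set of relevant doc_ids}
    """
    scored = [
        average_precision(runs.get(qid, []), rel)
        for qid, rel in qrels.items()
        if rel  # assumption: skip queries with no relevant documents
    ]
    return sum(scored) / len(scored) if scored else 0.0


# Toy data for illustration only.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"], "q3": ["d7"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}, "q3": set()}
print(mean_average_precision(runs, qrels))
```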

  15. Experiment: n-grams 3. Include small words • During indexing and retrieval • Identify and break n-gram overlap at text boundaries such as sentence ends, braces, quotation marks, and commas
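One possible reading of this experiment is sketched below. It rests on two assumptions that are not spelled out in the slides: every Unicode punctuation character is treated as a boundary (a simplification of the listed sentence ends, braces, quotation marks and commas), and a "small word" is a word too short to fill an n-gram even with its two `_` markers, which is then kept as its own token. The function name `boundary_aware_ngrams` is hypothetical.

```python
import unicodedata


def boundary_aware_ngrams(text, n=4):
    """Generate n-grams that never span a punctuation boundary."""
    tokens = []
    segment = []

    def flush():
        # Emit n-grams for the segment collected so far, then reset it.
        words = "".join(segment).split()
        segment.clear()
        if not words:
            return
        padded = "_" + "_".join(words) + "_"
        tokens.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
        # Assumed "small word" rule: keep words shorter than n - 2 whole.
        tokens.extend(f"_{w}_" for w in words if len(w) + 2 < n)

    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            flush()  # boundary reached: n-grams must not cross it
        else:
            segment.append(ch)
    flush()
    return tokens


print(boundary_aware_ngrams("ab, cd"))
print(boundary_aware_ngrams("a bc."))
```

With the comma treated as a boundary, no n-gram combines characters from "ab" and "cd", unlike the basic scheme where consecutive words always share n-grams.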

  16. Effect of length of n in n-gram on MAP (T+D+N)

  17. Combination of N-grams • Choose 2 different length n-grams • Indexing and retrieval • N-gram 1 : >=4 • N-gram 2 : < 4

  18. Conclusion • 4-gram performs best • Balance between precision and recall • Word-based MAP: 23.94% • 4-gram based MAP: 35.79% • Future work • Analysis of combinations of n-grams • Use skip-grams for Marathi • Experiments based on different word length criteria

  19. References • Paul McNamee and James Mayfield, Character N-gram Tokenization for European Language Text Retrieval, 2004 • Terrier, http://ir.dcs.gla.ac.uk/terrier/ • Anni Jarvelin, Antti Jarvelin, Kalervo Jarvelin, s-grams: Defining generalized n-grams for information retrieval, 2006 • Paul McNamee, Textual Representations for Corpus-Based Bilingual Retrieval, PhD Thesis, 2008

  20. Thank You !
