
Language Model in Turkish IR



  1. Language Model in Turkish IR Melih Kandemir F. Melih Özbekoğlu Can Şardan Ömer S. Uğurlu

  2. Outline • Indexing problem and proposed solution • Previous Work • System Architecture • Language Modeling Concept • Evaluation of the System • Conclusion

  3. Indexing Problem • “A Language Modeling Approach to Information Retrieval”, Jay M. Ponte and W. Bruce Croft, 1998 • The indexing model is important in probabilistic retrieval models • Current indexing models do not lead to improved retrieval results

  4. Indexing Problem • Failure is due to unwarranted assumptions: • 2-Poisson model • “Elite” documents • N-Poisson model • Mixture of more than 2 Poisson distributions

  5. Proposed Solution • Retrieval based on probabilistic language modeling • A language model is a probability distribution that captures the statistical regularities of the generation of language • A language model is inferred for each document

  6. Proposed Solution • Estimate the probability that each document generates the query • Documents are ranked according to these probabilities • Users have a reasonable idea of the terms likely to occur in relevant documents • tf and idf are integral parts of the language model

  7. Previous Work • Robertson–Sparck Jones model and Croft–Harper model • They focus on relevance • Fuhr integrated indexing and retrieval models. • Used statistics as heuristics • Wong and Yao used utility theory and information theory

  8. Previous Work • Kalt’s approach is the most similar • Maximum likelihood estimator is used • Collection statistics are integral parts of the model • Documents are members of language classes

  9. System Overview (architecture diagram) • Components: the user, the LM-Search UI, an Application Server containing the Indexer and the Query Evaluator, the Document Repository, and the Index DB (PostgreSQL, accessed over JDBC) • The two indexing methods produce different result sets

  10. System Architecture (indexing pipeline diagram) • Stemming & Term Selection on the Document Repository: No Stemming, First 5 characters, or Lemmatiser • Inverted Index Generation: tf.idf and Language Model statistics, stored in the Index DB

  11. tf.idf vs. Language Model (diagram) • The LM and tf.idf rankings yield different result sets • A GUI for seeing the differences between the results

  12. Vocabulary Extraction: Stemming & Term Selection • No stemmer • Turkish is agglutinative • Expectation: low precision • First 5 characters • As effective as more complex solutions • Lemmatiser • Expectation: high precision • Zemberek2 (MPL license) • Open source software • Java interface, easy to use • Finds the stems of words; the first valid stem is used • Word sense disambiguation (using WordNet or POS tagging) may be added in the future
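A minimal Java sketch of the three term-selection strategies on this slide. The Lemmatiser interface and its findFirstValidStem method are hypothetical stand-ins for the Zemberek2 integration, not Zemberek2's actual API.

```java
import java.util.Locale;

/** Sketch of the three vocabulary-extraction strategies (names are illustrative). */
public class TermSelection {

    private static final Locale TURKISH = new Locale("tr");

    /** Hypothetical stand-in for the Zemberek2 lemmatiser integration. */
    public interface Lemmatiser {
        /** Returns the first valid stem of the word, or null if no stem is found. */
        String findFirstValidStem(String word);
    }

    /** Strategy 1: no stemming; the lower-cased surface form is the index term. */
    public static String noStemming(String word) {
        return word.toLowerCase(TURKISH);
    }

    /** Strategy 2: first-5-character prefix stemming. */
    public static String firstFive(String word) {
        String w = word.toLowerCase(TURKISH);
        return w.length() <= 5 ? w : w.substring(0, 5);
    }

    /** Strategy 3: lemmatisation; falls back to the surface form when no stem is found. */
    public static String lemmatise(String word, Lemmatiser lemmatiser) {
        String w = word.toLowerCase(TURKISH);
        String stem = lemmatiser.findFirstValidStem(w);
        return stem != null ? stem : w;
    }
}
```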

  13. Database • Index DB (schema diagram)

  14. Language Modeling: Inverted Index Implementation • An example inverted index for m terms t1, t2, …, tm, each mapped to its postings list • If a document does not contain term t, its probability can still be estimated using cf_t • cf_t: frequency of t in the whole collection • f̄_t: mean term frequency of t in the documents containing it • p_avg(t): mean probability of t in the documents containing it
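A minimal in-memory Java sketch of such an inverted index, exposing the statistics this slide mentions (cf_t, f̄_t); in the actual system these live in the PostgreSQL Index DB, and all class and method names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative in-memory inverted index: term -> (docId -> term frequency). */
public class InvertedIndex {

    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private final Map<String, Integer> collectionFrequency = new HashMap<>(); // cf_t
    private long collectionSize = 0; // total number of term occurrences in the collection

    /** Records one occurrence of a term in a document. */
    public void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new HashMap<>())
                .merge(docId, 1, Integer::sum);
        collectionFrequency.merge(term, 1, Integer::sum);
        collectionSize++;
    }

    /** tf(t,d): raw term frequency of the term in the document. */
    public int termFrequency(String term, int docId) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(docId, 0);
    }

    /** df_t: number of documents containing the term. */
    public int documentFrequency(String term) {
        return postings.getOrDefault(term, Map.of()).size();
    }

    /** cf_t: frequency of the term in the whole collection (back-off statistic). */
    public int collectionFrequency(String term) {
        return collectionFrequency.getOrDefault(term, 0);
    }

    /** f̄_t: mean term frequency of the term in the documents that contain it. */
    public double meanTermFrequency(String term) {
        int df = documentFrequency(term);
        return df == 0 ? 0.0 : (double) collectionFrequency(term) / df;
    }

    /** cs: total number of tokens in the collection. */
    public long collectionSize() {
        return collectionSize;
    }
}
```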

  15. The Baseline Approach: tf.idf • We will use the traditional tf.idf term weighting approach as the baseline model • The term weight combines Robertson's tf score with the standard idf score
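The slide's formula images are not preserved, so the sketch below assumes a common form of Robertson's tf score, tf / (tf + 0.5 + 1.5 · dl / avgdl), and the standard idf = log(N / df); the exact variants used in the project may differ.

```java
/** Baseline tf.idf scoring sketch (assumed formulas; the slide's images are not preserved). */
public class TfIdfScorer {

    /** Robertson's tf score for a term in a document of length dl, with average length avgdl. */
    public static double robertsonTf(int tf, double dl, double avgdl) {
        return tf / (tf + 0.5 + 1.5 * dl / avgdl);
    }

    /** Standard idf score: log of (number of documents / document frequency). */
    public static double idf(long numberOfDocuments, int documentFrequency) {
        if (documentFrequency == 0) return 0.0;
        return Math.log((double) numberOfDocuments / documentFrequency);
    }

    /** tf.idf weight of one query term in one document. */
    public static double weight(int tf, double dl, double avgdl,
                                long numberOfDocuments, int documentFrequency) {
        return robertsonTf(tf, dl, avgdl) * idf(numberOfDocuments, documentFrequency);
    }
}
```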

  16. Language Modeling: Definition • An alternative approach to indexing and retrieval • Definition of a language model: a probability distribution that captures the statistical regularities of the generation of language • Intuition behind it: users have a reasonable idea of terms that are likely to occur in documents of interest and will choose query terms that distinguish these documents from others in the collection

  17. Language Modeling: The Approach • The following assumptions are not made: • Term distributions in the documents are parametric • Documents are members of pre-defined classes • Ranking uses the “query generation probability” rather than the “probability of relevance”

  18. Language Modeling : The Approach P(t | Md) : Probability that the term t is generated by the language model of document Md

  19. Language Modeling: Theory • Maximum likelihood estimate of the probability of term t under the term distribution for document d: p_ml(t | M_d) = tf(t,d) / dl_d • tf(t,d): raw term frequency of t in document d • dl_d: total number of terms in document d

  20. Language Modeling: Theory • An additional, more robust estimate drawn from a larger amount of data: p_avg(t) = ( Σ over documents d containing t of p_ml(t | M_d) ) / df_t • p_avg: mean probability of term t in the documents containing it • df_t: number of documents that contain term t

  21. Language Modeling: Theory • The risk function (geometric distribution): R_{t,d} = ( 1 / (1 + f̄_t) ) × ( f̄_t / (1 + f̄_t) )^tf(t,d) • f̄_t: mean term frequency of term t in the documents that contain it

  22. Language Modeling: The Ranking Formula • The probability of term t being produced by document d given the document model M_d: p(t | M_d) = p_ml(t | M_d)^(1 − R_{t,d}) × p_avg(t)^(R_{t,d}) if tf(t,d) > 0, and cf_t / cs otherwise (cs: total number of tokens in the collection) • The probability of producing Q for a given document model M_d: p(Q | M_d) = Π over t in Q of p(t | M_d) × Π over t not in Q of (1 − p(t | M_d))
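Since the formula images on slides 19-22 are not preserved, this Java sketch follows the formulas of Ponte & Croft (1998). The TermStats holder is an illustrative simplification, and the product over non-query terms is omitted for brevity.

```java
import java.util.List;

/** Sketch of the Ponte & Croft (1998) query-likelihood ranking (statistics assumed
 *  to be looked up from the index). */
public class LanguageModelScorer {

    /** Risk factor R_{t,d} = (1 / (1 + f̄_t)) * (f̄_t / (1 + f̄_t))^tf(t,d),
     *  where f̄_t is the mean term frequency of t in documents containing it. */
    static double risk(int tf, double meanTf) {
        return (1.0 / (1.0 + meanTf)) * Math.pow(meanTf / (1.0 + meanTf), tf);
    }

    /** p(t | M_d): mixture of the ML estimate and the mean probability, weighted by the risk;
     *  backs off to cf_t / cs when the document does not contain the term. */
    static double pTerm(int tf, long docLength, double pAvg, double meanTf,
                        long cf, long collectionSize) {
        if (tf == 0) {
            return (double) cf / collectionSize;
        }
        double pMl = (double) tf / docLength;      // p_ml(t | M_d) = tf(t,d) / dl_d
        double r = risk(tf, meanTf);
        return Math.pow(pMl, 1.0 - r) * Math.pow(pAvg, r);
    }

    /** log p(Q | M_d): sum of log p(t | M_d) over the query terms. The paper also multiplies
     *  (1 - p(t | M_d)) over non-query terms; that factor is omitted in this sketch. */
    static double logQueryLikelihood(List<TermStats> queryTerms, long docLength,
                                     long collectionSize) {
        double logP = 0.0;
        for (TermStats t : queryTerms) {
            logP += Math.log(pTerm(t.tf, docLength, t.pAvg, t.meanTf, t.cf, collectionSize));
        }
        return logP;
    }

    /** Per-term statistics for one (term, document) pair, looked up from the index. */
    static class TermStats {
        int tf;        // raw term frequency in the document
        double pAvg;   // mean probability of the term in documents containing it
        double meanTf; // mean term frequency of the term in documents containing it
        long cf;       // frequency of the term in the whole collection
    }
}
```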

  23. Language Modeling: Inverted Index Implementation • An example inverted index for m terms t1, t2, …, tm

  24. Evaluation • Perform recall/precision experiments • Recall/precision results • Non-interpolated average precision • Precision figures for the top N documents • For several values of N • R-Precision
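A minimal Java sketch, assuming each query comes with a ranked list of document ids and a set of relevant ids, of the measures listed on this slide: precision at N, non-interpolated average precision, and R-precision.

```java
import java.util.List;
import java.util.Set;

/** Sketch of the evaluation measures for a single query. */
public class RetrievalMetrics {

    /** Precision over the top N ranked documents. */
    public static double precisionAtN(List<Integer> ranking, Set<Integer> relevant, int n) {
        int cutoff = Math.min(n, ranking.size());
        if (cutoff == 0) return 0.0;
        long hits = ranking.subList(0, cutoff).stream().filter(relevant::contains).count();
        return (double) hits / n;
    }

    /** Non-interpolated average precision: mean of the precision values at the rank of each
     *  relevant document retrieved, divided by the total number of relevant documents. */
    public static double averagePrecision(List<Integer> ranking, Set<Integer> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int rank = 1; rank <= ranking.size(); rank++) {
            if (relevant.contains(ranking.get(rank - 1))) {
                hits++;
                sum += (double) hits / rank;
            }
        }
        return sum / relevant.size();
    }

    /** R-precision: precision at rank R, where R is the number of relevant documents. */
    public static double rPrecision(List<Integer> ranking, Set<Integer> relevant) {
        return precisionAtN(ranking, relevant, relevant.size());
    }
}
```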

  25. Other Metrics • Compare the baseline (tf.idf) results to our language model results • Percent change between the two result sets • I / D • I: the number of queries whose performance improved • D: the number of queries whose performance changed
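A small sketch of the two comparison measures, assuming per-query performance is summarized by a single number such as non-interpolated average precision.

```java
/** Sketch of the comparison measures between the tf.idf baseline and the language model. */
public class ResultComparison {

    /** Percent change of the LM figure relative to the baseline figure (assumes baseline > 0). */
    public static double percentChange(double baseline, double languageModel) {
        return 100.0 * (languageModel - baseline) / baseline;
    }

    /** I / D: queries whose performance improved over queries whose performance changed. */
    public static double improvedOverChanged(double[] baseline, double[] languageModel) {
        int improved = 0;
        int changed = 0;
        for (int i = 0; i < baseline.length; i++) {
            if (languageModel[i] != baseline[i]) changed++;
            if (languageModel[i] > baseline[i]) improved++;
        }
        return changed == 0 ? 0.0 : (double) improved / changed;
    }
}
```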

  26. Document Repository • Milliyet news archive (2001–2005) • XML file (1.1 GB) • 408,304 news articles • Ready for indexing • XML Schema ......(FIXME)

  27. Summary • Indexing and stemming • Zemberek2 lemmatiser • Java environment • Data • Milliyet news archive from 2001 to 2005 • Evaluation • The methods will be compared according to their recall/precision performance

  28. Conclusion • The first language modeling approach to Turkish IR • The LM approach is • Non-parametric • Based on fewer assumptions • More relaxed • Expected to perform better than the baseline tf.idf method

  29. Thanks for listening … Any Questions?
