
Lemur Toolkit Introduction



  1. Lemur Toolkit Introduction http://net.pku.edu.cn/~wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011

  2. Recap • Information Retrieval Models • Vector Space Model • Probabilistic models • Language model

  3. Some formulas for Sim(VSM) • Dot product • Cosine • Dice • Jaccard [Figure: document vector D and query vector Q in a term space with axes t1, t2, t3, separated by angle θ]

  4. BM25 (Okapi system) – Robertson et al. Considers tf, qtf, and document length: Score(Q,D) = Σ_{t∈Q} log((N − df + 0.5)/(df + 0.5)) × ((k1 + 1)·tf)/(k1·((1 − b) + b·dl/avdl) + tf) × ((k3 + 1)·qtf)/(k3 + qtf) • k1, k2, k3, b: parameters • qtf: query term frequency • dl: document length • avdl: average document length • The k1·((1 − b) + b·dl/avdl) term in the denominator performs document length normalization; the tf and qtf fractions are the TF factors (k2 weights an optional correction term based on query and document length, omitted here)
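To make the parameter roles concrete, here is a small illustrative Python sketch of that term weight (the default values k1=1.2, b=0.75, k3=7 are common choices, not taken from the slide):

    import math

    def bm25_term_weight(tf, qtf, df, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
        # Relevance weight: log((N - df + 0.5) / (df + 0.5))
        idf = math.log((N - df + 0.5) / (df + 0.5))
        # TF factor with document length normalization
        K = k1 * ((1 - b) + b * dl / avdl)
        tf_factor = (k1 + 1) * tf / (K + tf)
        # Query term frequency factor
        qtf_factor = (k3 + 1) * qtf / (k3 + qtf)
        return idf * tf_factor * qtf_factor

Summing this weight over all query terms gives the document score.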

  5. Standard Probabilistic IR [Figure: an information need is expressed as a query, which is matched against documents d1, d2, …, dn of the document collection]

  6. IR based on Language Model (LM) • A query generation process • For an information need, imagine an ideal document • Imagine what words could appear in that document • Formulate a query using those words [Figure: each document d1, d2, …, dn of the collection is viewed as generating the query]

  7. P(w|D) = (1-) P(w|D)+  P(w|C) Language Modeling for IR Estimate a multinomial probability distribution from the text Smooth the distribution with one estimated from the entire collection

  8. Query Likelihood • Estimate the probability that the document generated the query terms: P(Q|D) = Π_{q∈Q} P(q|D)
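A tiny illustrative sketch of query-likelihood scoring in Python, using the smoothed estimate from the previous slide (the word-count dictionaries and λ = 0.5 are made-up inputs):

    import math

    def query_log_likelihood(query_terms, doc_counts, coll_counts, lam=0.5):
        dl = sum(doc_counts.values())    # document length
        cl = sum(coll_counts.values())   # collection length
        score = 0.0
        for q in query_terms:
            p_doc = doc_counts.get(q, 0) / dl    # maximum-likelihood P(q|D)
            p_coll = coll_counts.get(q, 0) / cl  # collection model P(q|C)
            # Smoothing keeps the log defined for terms absent from D
            score += math.log((1 - lam) * p_doc + lam * p_coll)
        return score

Documents are ranked by this log probability, which is equivalent to ranking by P(Q|D).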

  9. Kullback-Leibler Divergence • Estimate models for both document and query and compare them: KL(Q‖D) = Σ_w P(w|Q)·log(P(w|Q) / P(w|D))
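A matching sketch of the KL score, assuming the query and document models are given as word-to-probability dictionaries (the smoothed document model above keeps the log defined):

    import math

    def kl_divergence(p_query, p_doc):
        # KL(Q || D) = sum over w of P(w|Q) * log(P(w|Q) / P(w|D))
        # A smaller divergence means the document model is closer to the query model.
        return sum(p * math.log(p / p_doc[w])
                   for w, p in p_query.items() if p > 0)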

  10. Question Among the three classic information retrieval models, which one is the best choice when designing your retrieval system? How can you tune the model parameters to achieve optimal performance? When you have a new idea on a retrieval problem, how can you prove it?

  11. A Brief History of IR Slides from Prof. Ray Larson University of California, Berkeley School of Information http://courses.sims.berkeley.edu/i240/s11/

  12. Experimental IR systems • Probabilistic indexing – Maron and Kuhns, 1960 • SMART – Gerard Salton at Cornell – Vector space model, 1970’s • SIRE at Syracuse • I3R – Croft • Cheshire I (1990) • TREC – 1992 • Inquery • Cheshire II (1994) • MG (1995?) • Lemur (2000?)

  13. Historical Milestones in IR Research • 1958 Statistical Language Properties (Luhn) • 1960 Probabilistic Indexing (Maron & Kuhns) • 1961 Term association and clustering (Doyle) • 1965 Vector Space Model (Salton) • 1968 Query expansion (Rocchio, Salton) • 1972 Statistical Weighting (Sparck-Jones) • 1975 2-Poisson Model (Harter, Bookstein, Swanson) • 1976 Relevance Weighting (Robertson, Sparck-Jones) • 1980 Fuzzy sets (Bookstein) • 1981 Probability without training (Croft)

  14. Historical Milestones in IR Research (cont.) • 1983 Linear Regression (Fox) • 1983 Probabilistic Dependence (Salton, Yu) • 1985 Generalized Vector Space Model (Wong, Raghavan) • 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.) • 1990 Latent Semantic Indexing (Dumais, Deerwester) • 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr) • 1992 TREC (Harman) • 1992 Inference networks (Turtle, Croft) • 1994 Neural networks (Kwok) • 1998 Language Models (Ponte, Croft)

  15. Information Retrieval – Historical View • Research: Boolean model, statistics of language (1950's); vector space model, probabilistic indexing, relevance feedback (1960's); probabilistic querying (1970's); fuzzy set/logic, evidential reasoning (1980's); regression, neural nets, inference networks, latent semantic indexing, TREC (1990's) • Industry: DIALOG, Lexis-Nexis, STAIRS (Boolean based) information industry (O($B)); Verity TOPIC (fuzzy logic); Internet search engines (vector space, probabilistic) (O($100B?))

  16. Research Systems Software • INQUERY (Croft) • OKAPI (Robertson) • PRISE (Harman) • http://potomac.ncsl.nist.gov/prise • SMART (Buckley) • MG (Witten, Moffat) • CHESHIRE (Larson) • http://cheshire.berkeley.edu • LEMUR toolkit • Lucene • Others

  17. Lemur Project Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman

  18. Zoology 101 • Lemurs are primates found only in Madagascar • 50 species (17 are endangered) • Ring-tailed lemurs • Lemur catta

  19. Zoology 101 • The indri is the largest type of lemur • When first spotted the natives yelled “Indri! Indri!” • Malagasy for "Look!  Over there!"

  20. About The Lemur Project The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project. The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search.

  21. Installation • http://www.lemurproject.org • Linux, OS/X: • Extract software/lemur-4.12.tar.gz • ./configure --prefix=/install/path • make • make install • Windows • Run software/lemur-4.12-install.exe • Documentation in windoc/index.html

  22. Installation • Use Lemur 4.12 • The Java Runtime (JDK 6) is needed for the evaluation tool • Add the binaries to the PATH environment variable • Linux: modify ~/.bash_profile • Windows: My Computer / Properties…

  23. Indexing • Document Preparation • Indexing Parameters • Time and Space Requirements

  24. Two Index Formats • KeyFile: term positions, metadata, offline incremental indexing, InQuery query language • Indri: term positions, metadata, fields / annotations, online incremental indexing, InQuery and Indri query languages

  25. Indexing – Document Preparation Document Formats: The Lemur Toolkit can inherently deal with several different document format types without any modification: • TREC Text • TREC Web • Plain Text • Microsoft Word(*) • Microsoft PowerPoint(*) • HTML • XML • PDF • Mbox (*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.

  26. Indexing – Document Preparation • If your documents are not in a format that the Lemur Toolkit can inherently process: • If necessary, extract the text from the document. • Wrap the plaintext in TREC-style wrappers: <DOC> <DOCNO>document_id</DOCNO> <TEXT> Index this document text. </TEXT> </DOC> – or – For more advanced users, write your own parser to extend the Lemur Toolkit.
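If the extraction step leaves you with plain text, a small script can do the wrapping; a minimal Python sketch, assuming one document per .txt file in a hypothetical plaintext/ directory, with the filename used as the document id:

    import glob, os

    # Wrap each plain-text file in TREC-style <DOC> markers so the
    # toolkit can index the result as class "trectext".
    with open("corpus.trectext", "w", encoding="utf-8") as out:
        for path in sorted(glob.glob("plaintext/*.txt")):
            doc_id = os.path.splitext(os.path.basename(path))[0]
            with open(path, encoding="utf-8") as f:
                text = f.read()
            out.write("<DOC>\n<DOCNO>%s</DOCNO>\n<TEXT>\n%s\n</TEXT>\n</DOC>\n"
                      % (doc_id, text))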

  27. Indexing - Parameters • Basic usage to build index: • IndriBuildIndex <parameter_file> • Parameter file includes options for • Where to find your data files • Where to place the index • How much memory to use • Stopword, stemming, fields • Many other parameters.

  28. Indexing – Parameters • The standard parameter file is specified as an XML document: <parameters> <option></option> <option></option> … <option></option> </parameters>

  29. Indexing – Parameters • Where to find your source files and what type to expect • BuildIndex: <dataFiles> names a file containing the list of data files to index • IndriBuildIndex: <parameters> <corpus> <path>/path/to/source/files</path> <class>trectext</class> </corpus> </parameters>

  30. Indexing - Parameters • The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index • If the index does not exist, a new one will be created • If the index already exists, new documents will be appended to it <parameters> <index>/path/to/the/index</index> </parameters>

  31. Indexing - Parameters • <memory> - defines a "soft limit" on the amount of memory the indexer should use before flushing its buffers to disk • Use K for kilobytes, M for megabytes, and G for gigabytes <parameters> <memory>256M</memory> </parameters>

  32. Indexing - Parameters • BuildIndex: stopwords are defined in a file, <stopwords>filename</stopwords> • IndriBuildIndex: <parameters> <stopper> <word>first_word</word> <word>next_word</word> … <word>final_word</word> </stopper> </parameters>

  33. Indexing – Parameters • Term stemming can be used while indexing as well via the <stemmer> tag. • Specify the stemmer type via the <name> tag within. • Stemmers included with the Lemur Toolkit include the Krovetz Stemmer and the Porter Stemmer. <parameters> <stemmer> <name>krovetz</name> </stemmer> </parameters>
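Putting the preceding slides together, a minimal complete IndriBuildIndex parameter file might look like this (the paths, memory size, and stemmer choice are illustrative, not required values):

    <parameters>
      <corpus>
        <path>/path/to/source/files</path>
        <class>trectext</class>
      </corpus>
      <index>/path/to/the/index</index>
      <memory>256M</memory>
      <stemmer>
        <name>krovetz</name>
      </stemmer>
    </parameters>

Running IndriBuildIndex with this file builds a stemmed index of the TREC-text corpus at the given path.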

  34. Retrieval • Parameters • Query Formatting • Interpreting Results

  35. Retrieval - Parameters • Basic usage for retrieval: • IndriRunQuery/RetEval <parameter_file> • Parameter file includes options for • Where to find the index • The query or queries • How much memory to use • Formatting options • Many other parameters.

  36. Retrieval - Parameters • The <index> parameter tells IndriRunQuery/RetEval where to find the repository. <parameters> <index>/path/to/the/index</index> </parameters>

  37. Retrieval - Parameters • The <query> parameter specifies a query • plain text or using the Indri query language <parameters> <query> <number>1</number> <text>this is the first query</text> </query> <query> <number>2</number> <text>another query to run</text> </query> </parameters> • Query file format: <DOC> <DOCNO> 1 </DOCNO> What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers? </DOC> <DOC> <DOCNO> 2 </DOCNO> I am interested in articles written either by Prieve or Udo Pooch. Prieve, B. Pooch, U. </DOC>

  38. Retrieval – Query Formatting • TREC-style topics cannot be processed directly by IndriRunQuery/RetEval • Format the queries accordingly: • Format by hand • Write a script to extract the fields (Python is handy; see the sketch below)
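A minimal sketch of such a script, assuming TREC-style topics with <num> and <title> fields and emitting an IndriRunQuery parameter file (the file names and the use of only the title field are assumptions):

    import re

    topics = open("topics.txt", encoding="utf-8").read()
    # TREC topics look like: <top> <num> Number: 301 <title> ... </top>
    numbers = re.findall(r"<num>\s*(?:Number:)?\s*(\d+)", topics)
    titles = re.findall(r"<title>\s*([^\n<]+)", topics)

    with open("queries.xml", "w", encoding="utf-8") as out:
        out.write("<parameters>\n")
        for num, title in zip(numbers, titles):
            out.write("<query>\n<number>%s</number>\n<text>%s</text>\n</query>\n"
                      % (num, title.strip()))
        out.write("</parameters>\n")

IndriRunQuery can then be pointed at queries.xml together with the index and formatting parameters.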

  39. Retrieval – Parameters To specify a maximum number of results to return, use the <count> tag: <parameters> <count>50</count> </parameters>

  40. Retrieval - Parameters • Result formatting options: • IndriRunQuery/RetEval has built in formatting specifications for TREC and INEX retrieval tasks

  41. Retrieval – Parameters • TREC – Formatting directives: • <runID>: a string specifying the id for a query run, used in TREC scorable output. • <trecFormat>: true to produce TREC scorable output, otherwise use false (default). <parameters> <runID>runName</runID> <trecFormat>true</trecFormat> </parameters>

  42. Outputting INEX Result Format • Must be wrapped in <inex> tags • <participant-id>: specifies the participant-id attribute used in submissions. • <task>: specifies the task attribute (default CO.Thorough). • <query>: specifies the query attribute (default automatic). • <topic-part>: specifies the topic-part attribute (default T). • <description>: specifies the contents of the description tag. <parameters> <inex> <participant-id>LEMUR001</participant-id> </inex> </parameters>

  43. Retrieval - Evaluation • To use trec_eval: • format IndriRunQuery results with appropriate trec_eval formatting directives in the parameter file: • <runID>runName</runID> • <trecFormat>true</trecFormat> • Resulting output will be in standard TREC format ready for evaluation: <queryID> Q0 <DocID> <rank> <score> <runID> 150 Q0 AP890101-0001 1 -4.83646 runName 150 Q0 AP890101-0015 2 -7.06236 runName

  44. Use RetEval for TF.IDF • First run ParseToFile to convert doc-formatted queries into parsed queries <parameters> <docFormat>web</docFormat> <outputFile>filename</outputFile> <stemmer>stemmername</stemmer> <stopwords>stopwordfile</stopwords> </parameters> • ParseToFile paramfile queryfile • http://www.lemurproject.org/lemur/parsing.html#parsetofile

  45. Use RetEval for TF.IDF • Then run RetEval <parameters> <index>index</index> <retModel>0</retModel> <!-- 0 for TF-IDF, 1 for Okapi, 2 for KL-divergence, 5 for cosine similarity --> <textQuery>query filename</textQuery> <resultCount>1000</resultCount> <resultFile>tfidf.res</resultFile> </parameters> • RetEval paramfile • http://www.lemurproject.org/lemur/retrieval.html#RetEval
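The two steps can be chained from Python; a minimal sketch, assuming both tools are on PATH and the parameter files from the two slides above exist under these hypothetical names:

    import subprocess

    # Parse the doc-formatted query file (writes the <outputFile> named above)
    subprocess.run(["ParseToFile", "parse_params.xml", "queries.doc"], check=True)
    # Run TF.IDF retrieval; results are written to tfidf.res
    subprocess.run(["RetEval", "reteval_params.xml"], check=True)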

  46. Evaluate Results • TREC qrels • Ground truth: judged by human assessors.

  47. ireval tool • java -jar "D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar" result qrels > pr.result

  48. Use Lemur API & Make Extension to Lemur

  49. Task When you have a new idea on a retrieval problem, how can you prove it?

  50. Introducing the API • Lemur “Classic” API • Many objects, highly customizable • May want to use this when you want to change how the system works • Support for clustering, distributed IR, summarization • Indri API • Two main objects (IndexEnvironment for indexing, QueryEnvironment for retrieval) • Best for integrating search into larger applications • Supports Indri query language, XML retrieval, “live” incremental indexing, and parallel retrieval
