Lemur Indri Search Engine
Yatish Hegde
03/03/2010
Background
• Open source text search engine
• Combines language modeling and inference networks
• Inquery query language
• API: accessible from C++, Java, C# and PHP
• Document formats: html, xml, txt, trectext, trecweb, ppt, doc*, ppt*
Resources
• Website: http://lemurproject.org
• Tutorials: http://sourceforge.net/apps/trac/lemur/wiki
• Forum: http://sourceforge.net/projects/lemur/forums
How to get started?
• Cygwin: http://cygwin.com (include the “perl”, “vi editor” and “make” packages while installing)
• Lemur Toolkit: http://sourceforge.net/projects/lemur/develop
• TREC Eval: http://trec.nist.gov/trec_eval/
Installing Lemur
Inside the Lemur directory, run:
• ./configure
• make
• make install
After installation:
• Build an index: IndriBuildIndex
• Run a query: IndriRunQuery
Building Index
• IndriBuildIndex <parameterFile>
• Parameter File
<parameters>
  <index>/home/lemur/testindex</index>
  <memory>1G</memory>
  <corpus>
    <path>/home/lemur/testdata/firstCorpus</path>
    <class>trectext</class>
  </corpus>
  <corpus>
    <path>/home/lemur/testdata/secondCorpus</path>
    <class>trecweb</class>
  </corpus>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
  <field>
    <name>p</name>
  </field>
</parameters>
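When several collections need indexing, the parameter file can be generated rather than written by hand. The following is a minimal Python sketch, assuming only the elements shown on this slide (index, memory, corpus, stemmer, field); the function name build_index_parameters and its arguments are hypothetical and not part of the Lemur toolkit.

```python
import xml.etree.ElementTree as ET


def build_index_parameters(index_path, corpora, memory="1G",
                           stemmer="krovetz", fields=()):
    """Return an IndriBuildIndex parameter file as an XML string.

    corpora is a list of (path, document_class) pairs,
    for example ("/home/lemur/testdata/firstCorpus", "trectext").
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(root, "memory").text = memory

    # One <corpus> element per collection, each with its own class.
    for path, doc_class in corpora:
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = doc_class

    if stemmer:
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer

    # One <field> element per field name to be indexed.
    for field_name in fields:
        ET.SubElement(ET.SubElement(root, "field"), "name").text = field_name

    # Pretty-print where available (ET.indent exists from Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_index_parameters(
        "/home/lemur/testindex",
        [("/home/lemur/testdata/firstCorpus", "trectext"),
         ("/home/lemur/testdata/secondCorpus", "trecweb")],
        fields=["p"],
    ))
```

Saving the printed text to a file and passing that file to IndriBuildIndex should reproduce the setup on this slide.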
Running Query
• IndriRunQuery <queryFile> <stopwordFile> <queryOptions>
• Query File
<parameters>
  <query>
    <number>701</number>
    <text>oil industry history</text>
  </query>
</parameters>
• Stop Word File
<parameters>
  <stopper>
    <word>the</word>
  </stopper>
</parameters>
• Query Options File
<parameters>
  <trecFormat>true</trecFormat>
  <index>/path/to/index</index>
  <count>1000</count>
</parameters>
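The three files above share the same <parameters> wrapper, so they are easy to produce from a script. A minimal Python sketch follows, assuming only the elements shown on this slide; the helper names query_file, stopword_file and options_file are hypothetical.

```python
import xml.etree.ElementTree as ET


def _to_xml(root):
    # Serialize without an XML declaration, as in the slide examples.
    return ET.tostring(root, encoding="unicode")


def query_file(queries):
    """queries: iterable of (number, text) pairs, one <query> each."""
    root = ET.Element("parameters")
    for number, text in queries:
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "number").text = str(number)
        ET.SubElement(query, "text").text = text
    return _to_xml(root)


def stopword_file(words):
    """words: iterable of stop words, all placed under one <stopper>."""
    root = ET.Element("parameters")
    stopper = ET.SubElement(root, "stopper")
    for word in words:
        ET.SubElement(stopper, "word").text = word
    return _to_xml(root)


def options_file(index_path, count=1000, trec_format=True):
    """Query options: output format, index location, results per query."""
    root = ET.Element("parameters")
    ET.SubElement(root, "trecFormat").text = "true" if trec_format else "false"
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(root, "count").text = str(count)
    return _to_xml(root)


if __name__ == "__main__":
    print(query_file([(701, "oil industry history")]))
    print(stopword_file(["the"]))
    print(options_file("/path/to/index"))
```

Each returned string can be written to its own file and passed to IndriRunQuery in the order shown on the slide.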
Converting Topic File into Query File
• Topic File
<top>
<num> Number: 301
<title> International Organized Crime
<desc> Description:
Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
<narr> Narrative:
A relevant document must as a minimum identify the organization and the type of illegal activity (e.g., Columbian cartel exporting cocaine). Vague references to international drug trade without identification of the organization(s) involved would not be relevant.
</top>
Converting Topic File into Query File
Perl Program:
• ./topicToQuery.pl [-t] [-d] <inputFile> <outputFile>
• ./topicToQuery.pl -h
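To show what such a conversion involves, here is a minimal Python sketch. It is not the topicToQuery.pl script: the function names are hypothetical, and the use_title and use_desc switches only assume that the script's -t and -d options select the title and description fields. Punctuation is stripped on the assumption that characters such as commas and parentheses should not reach the query parser.

```python
import re

# One <top> ... </top> block per topic.
TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
# A field runs from its tag to the next field tag (or the end of the block).
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|\Z)",
    re.DOTALL,
)
# Leading labels such as "Number:" or "Description:" are not query text.
LABEL_RE = re.compile(r"^\s*(Number|Topic|Description|Narrative):\s*")


def parse_topics(text):
    """Return a list of dicts with keys num, title, desc, narr."""
    topics = []
    for block in TOPIC_RE.findall(text):
        fields = {}
        for tag, value in FIELD_RE.findall(block):
            value = LABEL_RE.sub("", value)
            fields[tag] = " ".join(value.split())  # collapse whitespace
        topics.append(fields)
    return topics


def clean(text):
    """Keep letters, digits and single spaces only."""
    return " ".join(re.sub(r"[^A-Za-z0-9 ]+", " ", text).split())


def topics_to_query_xml(text, use_title=True, use_desc=False):
    """Build an IndriRunQuery query file from TREC topic text."""
    lines = ["<parameters>"]
    for topic in parse_topics(text):
        parts = []
        if use_title:
            parts.append(topic.get("title", ""))
        if use_desc:
            parts.append(topic.get("desc", ""))
        lines.append("  <query>")
        lines.append(f"    <number>{topic['num']}</number>")
        lines.append(f"    <text>{clean(' '.join(parts))}</text>")
        lines.append("  </query>")
    lines.append("</parameters>")
    return "\n".join(lines)
```

Calling topics_to_query_xml on the contents of a topic file yields one <query> element per <top> block, numbered with the topic number.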
TREC Eval
• make
• trec_eval -q -c -M1000 official_qrels query_results
• More Documentation: http://trecvid.nist.gov/trecvid.tools/trec_eval_video/README
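To make the two inputs concrete: a qrels line has four columns (topic, iteration, document number, relevance), and a results line in TREC format has six (topic, Q0, document number, rank, score, run tag). The Python sketch below reads both and computes average precision for one topic. It is an illustration only, with hypothetical function names; it is not a replacement for trec_eval, and its tie-breaking may differ from trec_eval's.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map each topic to the set of document numbers judged relevant."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = parts
        if int(relevance) > 0:
            relevant[topic].add(docno)
    return relevant


def read_results(lines):
    """Map each topic to its document numbers, ordered by descending score."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        topic, _q0, docno, _rank, score, _run_tag = parts
        scored[topic].append((float(score), docno))
    return {
        topic: [docno for _, docno in sorted(pairs, key=lambda p: -p[0])]
        for topic, pairs in scored.items()
    }


def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant document,
    divided by the total number of relevant documents for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

Averaging average_precision over all topics gives a mean average precision figure that can be compared against the value trec_eval reports.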
Lemur Search UI
• User Interface: http://sourceforge.net/apps/trac/lemur/wiki/The%20Lemur%20CGI%20Application
• What it looks like: http://sewell.syr.edu/lemur/lemur.cgi
Indri Query Language
• #combine( white house)
• #1(white house)
• #5(white house)
• #band(white house)
• #band(oil fields) #1(white house)
<parameters>
  <query>
    <number> 301 </number>
    <text>
      #combine( Identify organizations that participate in
        #max( #1( international criminal activity) international criminal activity )
        the activity and if possible collaborating organizations and the countries involved)
    </text>
  </query>
</parameters>
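Nested operators like the ones in the query above are easier to assemble programmatically than by hand. The following is a minimal Python sketch with hypothetical helper names: it only builds query strings and does not call Indri. In Indri, #N(...) is an ordered window (so #1 matches an exact phrase), #band is a Boolean AND, and #combine merges the beliefs of its arguments.

```python
def op(name, *args):
    """Format an Indri operator call, e.g. op("combine", "a", "b")."""
    return f"#{name}({' '.join(args)})"


def combine(*args):
    return op("combine", *args)


def ordered_window(n, *terms):
    # #N(...) requires the terms in order; n=1 means an exact phrase.
    return op(str(n), *terms)


def band(*args):
    return op("band", *args)


def phrase_or_terms(*terms):
    """The pattern from the slide: the best of the exact phrase
    and the individual terms, wrapped in #max."""
    return op("max", ordered_window(1, *terms), *terms)


if __name__ == "__main__":
    print(combine("white", "house"))
    print(ordered_window(5, "white", "house"))
    print(combine("organizations",
                  phrase_or_terms("international", "criminal", "activity"),
                  "countries"))
```

Because each helper returns a plain string, the results can be nested freely and placed directly in the <text> element of a query file.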
Contact
If you have questions, contact Yatish Hegde: yhegde@syr.edu