
Cheshire II: Features and Internals and Cheshire III overview

Cheshire II: Features and Internals and Cheshire III overview. Ray R. Larson School of Information Management and Systems University of California, Berkeley. Overview. Cheshire II feature overview Logistic Regression Ranking, Okapi BM-25 and Boolean Operations Fusion Operators


Presentation Transcript


  1. Cheshire II: Features and Internals and Cheshire III overview. Ray R. Larson, School of Information Management and Systems, University of California, Berkeley

  2. Overview • Cheshire II feature overview • Logistic Regression Ranking, Okapi BM-25 and Boolean Operations • Fusion Operators • Additions from INEX '03 • Element/Index level re-estimation of LR coefficients • Adhoc and Heterogeneous Track Methodology • Evaluation Results - Adhoc

  3. Overview of Cheshire II • It supports SGML and XML with components and component indexes • It is a client/server application • Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP, and SDLIP is also implemented • Server supports a Relational Database Gateway • Supports Boolean searching of all servers • Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search • Search engine supports “nearest neighbor” searches and relevance feedback • GUI on X Window displays and Windows NT • WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire • Scriptable clients using Tcl and Python • Stores SGML/XML as files or in a “Datastore” database

  4. Cheshire II Searching [Diagram; labels: Local, Remote, Z39.50, Internet, Images, Scanned Text]

  5. INEX Overview [Diagram; labels: UI or Scripts, Map Query, Map Results, INEX Search Engine, Local, Net]

  6. Boolean Search Capability • All Boolean operations are supported • “zfind author x and (title y or subject z) not subject A” • Named sets are supported and stored on the server • Boolean operations between stored sets are supported • “zfind SET1 and subject widgets or SET2” • Nested parentheses and truncation are supported • “zfind xtitle Alice#”
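
The Boolean queries on this slide can be pictured as set algebra over postings. The following is a minimal sketch, not Cheshire II code: the `POSTINGS` data, the `lookup` helper, and the `NAMED_SETS` dictionary are all hypothetical, and reading "not" as set difference is an assumption.

```python
# Hypothetical postings: index name -> term -> set of record ids.
POSTINGS = {
    "author": {"x": {1, 2, 3}},
    "title": {"y": {2, 5}},
    "subject": {"z": {3, 4}, "a": {3}, "widgets": {2, 7}},
}

NAMED_SETS = {}  # stands in for result sets stored on the server


def lookup(index, term):
    """Return the record ids posted under `term` in `index` (empty if absent)."""
    return set(POSTINGS.get(index, {}).get(term.lower(), set()))


# zfind author x and (title y or subject z) not subject A
NAMED_SETS["SET1"] = (
    lookup("author", "x") & (lookup("title", "y") | lookup("subject", "z"))
) - lookup("subject", "A")

# A shortened form of the stored-set query: zfind SET1 and subject widgets
combined = NAMED_SETS["SET1"] & lookup("subject", "widgets")

print(NAMED_SETS["SET1"], combined)
```

Keeping each result as a set makes operations between stored sets the same operation as operations between index lookups.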

  7. Probabilistic Retrieval • Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time • The Z39.50 “relevance” operator is used to indicate probabilistic search • Probabilistic searching can be performed on any index: • zfind topic @ “cheshire cats, looking glasses, march hares and other such things” • zfind title @ caucus races • Boolean and Probabilistic elements can be combined: • zfind topic @ government documents and title guidebooks

  8. Probabilistic Retrieval: Logistic Regression. The probability of relevance is estimated by logistic regression, with the coefficient values fitted on a sample set of documents. At retrieval time the probability estimate is obtained by $P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}$, where $\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i$, for the six attribute measures $X_i$ shown on the next slide.

  9. Probabilistic Retrieval: Logistic Regression attributes • Average Absolute Query Frequency • Query Length • Average Absolute Component Frequency • Document Length • Average Inverse Component Frequency • Inverse Component Frequency • Number of Terms in common between query and Component (logged)
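
The two slides above describe a logistic-regression estimate built from six attribute measures. The sketch below shows only the generic logistic combination step. The function name `lr_probability`, the intercept argument, and every number in the example are placeholders for illustration; the fitted coefficient values are not given in this transcript.

```python
import math


def lr_probability(attributes, coefficients, intercept):
    """Logistic-regression estimate of P(relevant | query, component)."""
    if len(attributes) != len(coefficients):
        raise ValueError("one coefficient is needed per attribute")
    # Linear combination of the attribute measures gives the log odds.
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, attributes))
    # The logistic function maps log odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder numbers only, chosen to exercise the function.
x = [0.5, 1.0, 0.2, 3.0, 1.5, 0.7]
b = [1.0, -0.5, 0.8, -0.1, 0.4, 0.3]
p = lr_probability(x, b, intercept=-1.0)
print(round(p, 3))
```

Components are then ranked by this probability, so only the ordering it induces matters for retrieval.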

  10. Combining Boolean and Probabilistic Search Elements • Two original approaches: • Boolean approach • Non-probabilistic “Fusion Search”: the set-merger approach is a weighted merger of document scores from separate Boolean and probabilistic queries
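
A weighted merger of Boolean and probabilistic results could look roughly like the sketch below. The slide does not give the weighting scheme, so the linear mix, the `boolean_weight` parameter, and scoring a Boolean match as 1.0 are all assumptions made for illustration.

```python
def weighted_merge(boolean_hits, prob_scores, boolean_weight=0.5):
    """Merge a Boolean result set with probabilistic scores into one ranking.

    boolean_hits: set of document ids matched by the Boolean query.
    prob_scores:  dict of document id -> probabilistic score in [0, 1].
    """
    merged = {}
    for doc in set(boolean_hits) | set(prob_scores):
        boolean_score = 1.0 if doc in boolean_hits else 0.0
        prob_score = prob_scores.get(doc, 0.0)
        merged[doc] = (boolean_weight * boolean_score
                       + (1.0 - boolean_weight) * prob_score)
    # Highest merged score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


ranked = weighted_merge({"d1", "d3"}, {"d1": 0.4, "d2": 0.8})
print(ranked)
```

A document found by both queries rises above one found by either alone, which is the practical effect of merging the two score sources.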

  11. Okapi BM25 • $\sum_{T \in Q} w^{(1)} \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$ • Where: • Q is a query containing terms T • K is k1((1-b) + b.dl/avdl) • k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 • tf is the frequency of the term in a specific document • qtf is the frequency of the term in a topic from which Q was derived • dl and avdl are the document length and the average document length measured in some convenient unit • w(1) is the Robertson-Sparck Jones weight.
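
The definitions on this slide match the standard Okapi BM25 scoring function, which can be sketched as follows. The function names are hypothetical, and the form of the Robertson-Sparck Jones weight used here (the variant with no relevance information) is an assumption, since the slide names the weight without defining it.

```python
import math


def rsj_weight(n_docs, doc_freq):
    """Robertson-Sparck Jones weight, assuming no relevance information."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_tf, doc_tf, doc_freq, n_docs, dl, avdl,
               k1=1.2, b=0.75, k3=7.0):
    """Sum the BM25 contribution of every query term found in the document."""
    K = k1 * ((1 - b) + b * dl / avdl)  # document-length normalisation
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(n_docs, doc_freq[term])
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


score = bm25_score({"cat": 1}, {"cat": 1}, {"cat": 10},
                   n_docs=100, dl=50, avdl=50)
print(score)
```

With a document of average length and tf = qtf = 1, both saturation factors equal 1, so the score reduces to the term weight alone.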

  12. Merging and Ranking Operators • Extends the capabilities of merging to include merger operations in queries like Boolean operators • Fuzzy Logic Operators (not used for INEX) • !FUZZY_AND • !FUZZY_OR • !FUZZY_NOT • Containment operators: Restrict components to or with a particular parent • !RESTRICT_FROM • !RESTRICT_TO • Merge Operators • !MERGE_SUM • !MERGE_MEAN • !MERGE_NORM • !MERGE_CMBZ
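
The slide names the merge operators without giving their formulas. The sketch below shows one plausible reading of two of them, summing and averaging per-document scores across result sets. The function names `merge_sum` and `merge_mean` and the choice to count a missing document as zero are assumptions, not the documented behaviour of !MERGE_SUM or !MERGE_MEAN.

```python
from collections import defaultdict


def merge_sum(*result_sets):
    """Add up each document's scores across all result sets."""
    totals = defaultdict(float)
    for result_set in result_sets:
        for doc, score in result_set.items():
            totals[doc] += score
    return dict(totals)


def merge_mean(*result_sets):
    """Average each document's score over the number of result sets.

    A document missing from a result set is treated as scoring 0 there.
    """
    totals = merge_sum(*result_sets)
    return {doc: total / len(result_sets) for doc, total in totals.items()}


title_hits = {"a": 1.0}
topic_hits = {"a": 2.0, "b": 1.0}
print(merge_sum(title_hits, topic_hits), merge_mean(title_hits, topic_hits))
```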

  13. INEX '04 Fusion Search [Diagram; labels: Subquery (x4), Comp. Query Results (x2), Fusion/Merge, Final Ranked List] • Merge multiple ranked and Boolean index searches within each query and multiple component search result sets • Major components merged are articles, body, sections, subsections, and paragraphs

  14. New LR Coefficients • Estimates using INEX '03 relevance assessments for: • b1 = Average Absolute Query Frequency • b2 = Query Length • b3 = Average Absolute Component Frequency • b4 = Document Length • b5 = Average Inverse Component Frequency • b6 = Number of Terms in common between query and Component

  15. SGML/XML Support • Underlying native format for all data is SGML or XML • The DTD defines the file format for each file • Full SGML/XML parsing • SGML/XML Format Configuration Files define the database • USMARC DTD and MARC to SGML conversion (and back again) • Access to full-text via special SGML/XML tags

  16. Indexing • Any SGML/XML tagged field or attribute can be indexed: • B-Tree and Hash access via Berkeley DB (Sleepycat) • Stemming, keyword, exact keys and “special keys” • Mapping from any Z39.50 Attribute combination to a specific index • Underlying postings information includes term frequency for probabilistic searching • Component extraction with separate component indexes
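
To make "postings information includes term frequency" concrete, here is a minimal in-memory sketch of a keyword index that stores a term frequency per record. It is not the Berkeley DB layout Cheshire II uses; the tokenisation rule and the `build_index` name are assumptions, and stemming is left out.

```python
import re
from collections import Counter, defaultdict


def build_index(records):
    """Map each term to {record id: term frequency in that record}."""
    index = defaultdict(dict)
    for rec_id, text in records.items():
        terms = re.findall(r"[a-z0-9]+", text.lower())
        for term, tf in Counter(terms).items():
            index[term][rec_id] = tf
    return dict(index)


index = build_index({1: "Alice in Wonderland", 2: "Alice, Alice!"})
print(index["alice"])
```

The stored frequencies are what ranking methods such as logistic regression or BM25 read at query time, while Boolean search needs only the record ids.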

  17. XML Element Extraction • A new search “ElementSetName” is XML_ELEMENT_ • Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request • The matching elements are extracted from the records matching the search and delivered in a simple format.

  18. XML Extraction % zselect sherlock 372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372} % zfind topic mathematics {OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}} % zset recsyntax XML % zset elementset XML_ELEMENT_Fld245 % zdisplay {OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} { <RESULT_DATA DOCID="1"> <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"> <Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245> </ITEM> <RESULT_DATA> … etc…
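
The effect of the element-set request in the session above can be imitated on the client side with Python's standard library. This is only an illustration of pulling matching elements out of a record, not how the server implements XML_ELEMENT_ requests. The `extract_elements` helper and the trimmed sample record are hypothetical, and `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical MARC-style record for illustration.
RECORD = (
    "<USMARC><VarFlds><VarDFlds><Titles>"
    "<Fld245><a>Information systems :</a><b>theory and practice /</b></Fld245>"
    "</Titles></VarDFlds></VarFlds></USMARC>"
)


def extract_elements(record_xml, path):
    """Return every element matching `path`, serialised back to XML text."""
    root = ET.fromstring(record_xml)
    return [ET.tostring(element, encoding="unicode")
            for element in root.findall(path)]


for item in extract_elements(RECORD, ".//Fld245"):
    print(item)
```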

  19. SGML/XML Support • Configuration files for the Server are SGML/XML: • They include elements describing all of the data files and indexes for the database. • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

  20. SGML/XML Support • Example XML record for a DL document <ELIB-BIB> <BIB-VERSION>ELIB-v1.0</BIB-VERSION> <ID>756</ID> <ENTRY>June 12, 1996</ENTRY> <DATE>June 1996</DATE> <TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE> <ORGANIZATION>University of California</ORGANIZATION> <TYPE>report</TYPE> <AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL> <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL> <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL> <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL> <PROJECT>SNEP</PROJECT> <SERIES>Vol 3</SERIES> <PAGES>40</PAGES> <TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF> <PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF> </ELIB-BIB> Ray R. Larson

  21. SGML Support • Example SGML/MARC Record <USMARC Material="BK" ID="00000003"><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Managementinformation systems.</a></Fld650> ... Ray R. Larson

  22. SGML/XML Support TREC document… <DOC> <DOCNO>FT931-3566</DOCNO> <PROFILE>_AN-DCPCCAA3FT</PROFILE> <DATE>930316 </DATE> <HEADLINE> FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda </HEADLINE> <BYLINE> By ROBERT GRAHAM </BYLINE> <TEXT> OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties ... </TEXT> <XX> …

  23. Companies:- </XX> <CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. </CO> <XX> Countries:- </XX> <CN>ITZ Italy, EC. </CN> <XX> Industries:- </XX> <IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. </IN> <XX> Types:- </XX> …

  24. <TP>CMMT Comment &amp; Analysis. GOVT Legal issues. </TP> <PUB>The Financial Times </PUB> <PAGE> London Page 4 </PAGE> </DOC>

  25. SGML/XML Support • INEX Document <article> <fno>C1050</fno> <doi>10.1041/C1050s-2000</doi> <fm> <hdr><hdr1><ti>COMPUTING IN SCIENCE &amp; ENGINEERING</ti> <crt><issn>1521-9615</issn>/00/$10.00 <cci><onm>&copy; 2000 IEEE</onm></cci></crt></hdr1> <hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi> <pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt> <pp>pp. 50-59</pp></hdr2> </hdr> <tig><atl>The Decompositional Approach to Matrix Computation</atl> <pn>pp. 50-59</pn></tig> <au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm><aff><onm>University of Maryland</onm></aff></au> <fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig> <abs><p>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions.</p> </abs> </fm> <bdy> <sec><st></st> <ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1> <fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. 
Courtesy of John Wiley &amp; Sons.</fgc></fig> <fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig> <p>The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. … Ray R. Larson

  26. SGML/XML Support …<sec><st>CONCLUSION</st> <ip1>The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms&mdash;such as tridiagonal and Hessenberg forms&mdash;have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices. <ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1> </sec> </bdy> <bm> <ack><h>Acknowledgment</h> <ip1><it>This work was supported by the National Science Foundation under Grant No. 970909-8562.</it></ip1> </ack> <bib><bibl><h>References</h> <bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti> <obi>John Wiley &amp; Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb> <bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti> <obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb> <bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi> <au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. 
II, Linear Algebra,</ti> <obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb> <bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au> <obi>et al.,</obi><atl>"Matrix Eigensystem Routines&mdash;Eispack Guide Extension,"</atl> <ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt> <yr>1977.</yr></pdt></bb> <bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi> <ti>LINPACK User's Guide,</ti> <obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb> … Ray R. Larson

  27. SGML/XML Support • INEX CAS Query <?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE inex_topic SYSTEM "topic.dtd"> <inex_topic topic_id="70" query_type="CAS" ct_no="49"> <title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title> <description>Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries</description> <narrative>To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.</narrative> <keywords>information retrieval,digital libraries</keywords> </inex_topic> Ray R. Larson

  28. SGML/XML Support • Configuration files for the Server are also SGML/XML: • They include tags describing all of the data files and indexes for the database. • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

  29. Cheshire Configuration Files <!-- ******************************************************************* --> <!-- ************************* TREC INTERACTIVE TEST DB **************** --> <!-- ******************************************************************* --> <!-- This is the config file for the Cheshire II TREC interactive Database --> <DBCONFIG> <DBENV>/projects/is240/GroupX/indexes </DBENV> <!-- --> <!-- TREC TEST DATABASE FILEDEF --> <!-- --> <!-- The Interactive TREC Financial Times datafile --> <FILEDEF TYPE=SGML> <DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH> <!-- filetag is the "shorthand" name of the file --> <FILETAG> trec </FILETAG> <!-- filename is the full path name of the main data directory --> <FILENAME> /projects/is240/ft </FILENAME> <CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE> <!-- fileDTD is the full path name of the file's DTD --> <FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD> <!-- assocfil is the full path name of the file's Associator --> <ASSOCFIL> ft.assoc </ASSOCFIL> <!-- history is the full path name of the file's history file --> <HISTORY> cheshire_index/TESTDATA.history </HISTORY> … Ray R. Larson

  30. Indexing • Any SGML/XML tagged field or attribute can be indexed: • B-Tree and Hash access via Berkeley DB (Sleepycat) • Stemming, keyword, exact keys and “special keys” • Mapping from any Z39.50 Attribute combination to a specific index • Underlying postings information includes term frequency for probabilistic searching. • SGML may include address of full-text for indexing • New indexes can be easily added, or old ones deleted

  31. Bitmapped Indexes • Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value • Only one bit per record stored in the index • Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
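
The one-bit-per-record idea can be sketched with Python integers standing in for bitmaps. This is an illustration of the technique only: the helper names are hypothetical, and the on-demand fetching of bitmap blocks mentioned on the slide is not modelled.

```python
def build_bitmaps(values):
    """Build one bitmap per distinct value; bit i is set if record i has it."""
    bitmaps = {}
    for record_no, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << record_no)
    return bitmaps


def matching_records(bitmap):
    """List the record numbers whose bit is set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# A low-cardinality field, such as a document type, for four records.
doc_types = ["report", "book", "report", "map"]
bitmaps = build_bitmaps(doc_types)

either = bitmaps["report"] | bitmaps["map"]   # Boolean OR
both = bitmaps["report"] & bitmaps["book"]    # Boolean AND
print(matching_records(either), matching_records(both))
```

Boolean operators become single bitwise operations, which is why the approach suits fields with few values spread over very many records.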

  32. <!-- The following are the index definitions for the file --><INDEXES><!-- ******************************************************************* --><!-- ************************* DOC NO. ********************************* --><!-- ******************************************************************* --><!-- The following provides document number access. --><INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE><INDXNAME> cheshire_index/trec.docno.index </INDXNAME><INDXTAG> docno </INDXTAG><INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP><INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP><INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP><INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY> </INDEXDEF>… Ray R. Larson

  33. <!-- ******************************************************************* --> <!-- ************************* TOPIC *********************************** --> <!-- ******************************************************************* --> <!-- The following is the primary index for probabilistic searches --> <!-- It includes headlines, datelines, bylines, and full text --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> cheshire_index/trec.topic.index </INDXNAME> <INDXTAG> topic </INDXTAG> <INDXMAP> <USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXMAP> <USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> … <STOPLIST> cheshire_index/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>HEADLINE </FTAG> <FTAG>DATELINE </FTAG> <FTAG>BYLINE </FTAG> <FTAG>TEXT </FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson
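
The INDXMAP entries on slides 32 and 33 pair Z39.50 attribute combinations with an index. A lookup table gives the general idea, under stated assumptions: the tuple layout, the `resolve_index` helper, and the exact-match rule are hypothetical, and how the server treats unspecified attributes is not stated in the slides.

```python
# (USE, RELAT, POSIT, STRUCT) -> index tag; None marks an attribute that the
# INDXMAP entry does not specify. Values are taken from the entries above.
INDEX_MAP = {
    (12, None, None, 1): "docno",
    (12, None, None, 2): "docno",
    (12, None, None, 6): "docno",
    (29, None, 3, 6): "topic",
    (29, 102, 3, 6): "topic",
}


def resolve_index(use, relation=None, position=None, structure=None):
    """Return the index tag for an attribute combination, or None if unmapped."""
    return INDEX_MAP.get((use, relation, position, structure))


print(resolve_index(29, relation=102, position=3, structure=6))
print(resolve_index(12, structure=2))
```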

  34. Cheshire II – EVI Generation • Entry Vocabulary Indexes can improve access to data with controlled index terms • Define basis for clustering records. • Select field to form the basis of the cluster. • Evidence Fields to use as contents of the pseudo-documents. • During indexing cluster keys are generated with basis and evidence from each record. • Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields. • Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
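
The clustering steps listed on this slide can be sketched as grouping records by a basis field and pooling their evidence fields into one pseudo-document per basis value. The record layout, the field names, and the lowercase normalisation of the basis key are assumptions for illustration, not the Cheshire II implementation.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group records by basis value and pool their evidence text."""
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if basis is None:
            continue  # records without a basis value form no cluster key
        key = basis.strip().lower()  # assumed normalisation of the basis
        for field in evidence_fields:
            clusters[key].extend(record.get(field, []))
    # Sorting mirrors the "sorted and merged on basis" step.
    return {key: " ".join(texts) for key, texts in sorted(clusters.items())}


records = [
    {"class": "QA76", "titles": ["Information systems"],
     "subjects": ["Management information systems"]},
    {"class": "qa76 ", "titles": ["Database design"], "subjects": []},
    {"class": "Z699", "titles": ["Information retrieval"],
     "subjects": ["Information storage and retrieval systems"]},
]
pseudo_docs = build_pseudo_documents(records, "class", ["titles", "subjects"])
print(pseudo_docs)
```

Indexing the pooled text lets a free-text query be matched against each class, which is how an entry vocabulary can suggest controlled terms.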

  35. EVI/Cluster Definitions <!-- ************************* CLUSTER ********************************* --> <!-- *********************** DEFINITIONS ******************************* --> <CLUSTER> <clusname> classcluster </clusname> <cluskey normal=CLASSCLUS> <tagspec> <FTAG>FLD950 </FTAG> <s> ^a </s> </tagspec> </cluskey> <stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist> <clusmap> <from> <tagspec> <ftag>FLD245</ftag><s>^[ab]</s> <ftag>FLD440</ftag><s>^a</s> <ftag>FLD490</ftag><s>^a</s> <ftag>FLD830</ftag><s>^a</s> <ftag>FLD740</ftag><s>^a</s> </tagspec></from> <to> <tagspec> <ftag>titles</ftag> </tagspec></to> <from> <tagspec> <ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from> <to> <tagspec> <ftag>subjects</ftag> </tagspec></to> <summarize> <maxnum> 5 </maxnum> <tagspec> <ftag>subjsum</ftag> </tagspec></summarize> </clusmap> </CLUSTER> Ray R. Larson

  36. Component Extraction and Indexing • Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document. • Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of document or components as the result)
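
Treating an element as a document in its own right can be illustrated with the standard library. The sketch covers only the single-element case; components defined as a range from a start tag to an end tag are not modelled. The `extract_components` helper, the component naming scheme, and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical article in the style of the INEX documents.
DOC = (
    "<article><bdy>"
    "<sec><st>Intro</st><p>One.</p></sec>"
    "<sec><st>CONCLUSION</st></sec>"
    "</bdy></article>"
)


def extract_components(doc_id, xml_text, tag):
    """Return one indexable unit per `tag` element, linked to its parent doc."""
    root = ET.fromstring(xml_text)
    components = []
    for number, element in enumerate(root.iter(tag), start=1):
        # Collapse all text inside the element into one whitespace-clean string.
        text = " ".join(" ".join(element.itertext()).split())
        components.append(
            {"doc": doc_id, "component": f"{tag}[{number}]", "text": text}
        )
    return components


components = extract_components("C1050", DOC, "sec")
print(components)
```

Keeping the parent document id on every component is what allows component-level and document-level results to be combined in one search.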

  37. Component Definitions <COMPONENTS> <COMPONENTDEF> <COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC> </COMPSTARTTAG> <COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC> </COMPENDTAG> <COMPONENTINDEXES> <!-- First index def --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> TESTDATA/comp1index1.author … </INDEXDEF> </COMPONENTDEF> </COMPONENTS> Ray R. Larson

  38. Result Formatting (Display) <DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS> <DISPLAY> <FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT> <convert function="TAGSET-G"> <clusmap> <from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from> <to> <tagspec> <ftag>28</ftag> </tagspec></to> <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from> <to> <tagspec> <ftag>5</ftag> </tagspec></to> </clusmap> </convert> </FORMAT> </DISPLAY> Ray R. Larson

  39. INEX Configuration Example <!-- ******************************************************************* --> <!-- ********************* Config for INEX evaluation ****************** --> <!-- ******************************************************************* --> <!-- This is the config file for the Cheshire II TREC interactive Database --> <!-- new version uses proximity indexes... --> <DBCONFIG> <DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV> <!-- --> <!-- INEX TEST DATABASE FILEDEF --> <!-- --> <FILEDEF TYPE=XML> <DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH> <!-- filetag is the "shorthand" name of the file --> <FILETAG> INEX </FILETAG> <!-- filename is the full path name of the main data directory --> <FILENAME> inex-1.3/xml </FILENAME> <CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE> <!-- fileDTD is the full path name of the file's DTD --> <FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD> <SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT> <!-- assocfil is the full path name of the file's Associator --> <ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL> <!-- history is the full path name of the file's history file --> <HISTORY> inex.history </HISTORY> Ray R. Larson

  40. INEX Configuration Example <!-- The following are the index definitions for the file --> <INDEXES> <!-- ******************************************************************* --> <!-- ************************* DOC NO. ********************************* --> <!-- ******************************************************************* --> <!-- The following provides document number access. --> <INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE> <INDXNAME> indexes/docno.index </INDXNAME> <INDXTAG> docno </INDXTAG> <INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP> <INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP> <INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG> doi </FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  41. INEX Configuration Example <!-- ******************************************************************* --> <!-- ********************** PERSONAL AUTHOR/BYLINE ********************* --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/pauthor.index </INDXNAME> <INDXTAG> pauthor </INDXTAG> <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --> <!-- the appropriate Z39.50 BIB1 attribute numbers --> <INDXMAP> <USE> 1 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXMAP> <USE> 1004 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The stoplist for this file --> <STOPLIST> indexes/authorstoplist </STOPLIST> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>au</S><S>snm</S> <FTAG>fm</FTAG><S>au</S><S>fnm</S> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  42. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************* TITLE/HEADLINE ************************** --> <!-- ******************************************************************* --> <!-- The following provides keyword title access --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM> <INDXNAME> indexes/title.index </INDXNAME> <INDXTAG> title </INDXTAG> <INDXMAP> <USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXMAP> <USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXMAP> <USE> 6 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/titlestoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  43. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************* TOPIC *********************************** --> <!-- ******************************************************************* --> <!-- The following is the primary index for probabilistic searches --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM> <INDXNAME> indexes/topic.index </INDXNAME> <INDXTAG> topic </INDXTAG> <INDXMAP> <USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> … <INDXMAP> <USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> <FTAG>abs</FTAG> <FTAG>bdy</FTAG> <FTAG>bibl</FTAG><S>bb</S><S>atl</S> <FTAG>app</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  44. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************** DATE *********************************** --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR> <INDXNAME> indexes/date.index </INDXNAME> <INDXTAG> date </INDXTAG> <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --> <!-- the appropriate Z39.50 BIB1 attribute numbers --> <INDXMAP> <USE> 30 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXMAP> <USE> 30 </USE><POSIT> 3 </posit> <struct> 5 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>hdr2</FTAG><s>yr</s> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  45. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************** JOURNAL ******************************* --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/journal.index </INDXNAME> <INDXTAG> journal </INDXTAG> <INDXMAP> <USE> 1022 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXMAP> <USE> 1022 </USE><POSIT> 3 </posit> <struct> 5 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>hdr1</FTAG><s>ti</s> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  46. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************* KEYWORDS********************************* --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/keywords.index </INDXNAME> <INDXTAG> kwd </INDXTAG> <INDXMAP> <USE> 3121 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>kwd</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  47. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************* ABSTRACT********************************* --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/abstract.index </INDXNAME> <INDXTAG> abstract </INDXTAG> <INDXMAP> <USE> 62 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>abs</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  48. INEX Configuration Example <!-- The following index has contents of the SEQUENCE attribute of the --> <!-- au (author) tag: either "first" or "additional" --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/author_seq.index </INDXNAME> <INDXTAG> author_seq </INDXTAG> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  49. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************* Bib author Forename ******************** --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/bib_author_fnm.index </INDXNAME> <INDXTAG> bib_author_fnm </INDXTAG> <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson

  50. INEX Configuration Example <!-- ******************************************************************* --> <!-- ************************* Bib author surname ******************** --> <!-- ******************************************************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/bib_author_snm.index </INDXNAME> <INDXTAG> bib_author_snm </INDXTAG> <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY> </INDEXDEF> Ray R. Larson
