Principles of Information Retrieval. Lecture 21: XML Retrieval. Prof. Ray Larson, University of California, Berkeley, School of Information. Tuesday and Thursday 10:30 am - 12:00 pm, Spring 2007. http://courses.ischool.berkeley.edu/i240/s07
Mini-TREC • Proposed Schedule • February 15 – Database and previous queries • February 27 – Report on system acquisition and setup • March 8 – New queries for testing • April 19 – Results due (next Thursday) • April 24 or 26 – Results and system rankings • May 8 – Group reports and discussion
Announcement • No Class on Tuesday (April 17th)
Today • Review • Geographic Information Retrieval • GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K. • XML and Structured Element Retrieval • INEX • Approaches to XML retrieval Credit for some of the slides in this lecture goes to Marti Hearst
Today • Review • Geographic Information Retrieval • GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K. • Web Crawling and Search Issues • Web Crawling • Web Search Engines and Algorithms Credit for some of the slides in this lecture goes to Marti Hearst
Introduction • What is Geographic Information Retrieval? • GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval. • It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.
Example: Results display from CheshireGeo: http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html
Other Convex, Conservative Approximations • 1) Minimum bounding circle (3) • 2) MBR: minimum aligned bounding rectangle (4) • 3) Minimum bounding ellipse (5) • 4) Rotated minimum bounding rectangle (5) • 5) 4-corner convex polygon (8) • 6) Convex hull (varies) • Presented in order of increasing quality; the number in parentheses denotes the number of parameters needed to store the representation (after Brinkhoff et al., 1993b)
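As an illustration (not part of the original slides), a minimal Python sketch of how a few of these approximations and their "false area" can be computed with the Shapely library; the example polygon and the definition of false area used here (extra area relative to the approximation) are assumptions made for the sketch:

# Sketch: geometric approximations of a region and the extra ("false") area
# each approximation introduces, using the Shapely library.
from shapely.geometry import Polygon

# A toy, non-convex region standing in for a real geographic object.
region = Polygon([(0, 0), (4, 0), (4, 4), (2, 1), (0, 4)])

mbr = region.envelope                             # axis-aligned minimum bounding rectangle
rotated_mbr = region.minimum_rotated_rectangle    # rotated minimum bounding rectangle
hull = region.convex_hull                         # convex hull

def false_area_fraction(approx, true_region):
    """Fraction of the approximation's area that lies outside the true region."""
    return (approx.area - true_region.area) / approx.area

for name, approx in [("MBR", mbr), ("Rotated MBR", rotated_mbr), ("Convex hull", hull)]:
    print(name, round(false_area_fraction(approx, region), 3))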
Our Research Questions • Spatial Ranking • How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions? • Geometric Approximations & Spatial Ranking: • How do different geometric approximations affect the rankings? • MBRs: the most popular approximation • Convex hulls: the highest quality convex approximation
Probabilistic Models: Logistic Regression attributes • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore) • Where: Range for all variables is 0 (not similar) to 1 (same)
CA Named Places in the Test Collection (complex polygons): Counties, Cities, Bioregions, National Parks, National Forests, Water QCB Regions
CA Counties – Geometric Approximations: MBRs vs. Convex Hulls (maps not reproduced). Average false area of approximation: MBRs 94.61%, Convex Hulls 26.73%
Test Collection Query Regions: CA Counties • 42 of 58 counties are referenced in the test collection metadata • 10 counties randomly selected as query regions to train the LR model • 32 counties used as query regions to test the model
LR model • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • Where: Range for all variables is 0 (not similar) to 1 (same)
Some of our Results • Results are reported as mean average query precision: the average of the precision values after each new relevant document is observed in a ranked list • For metadata indexed by CA named place regions, these results suggest: • Convex hulls perform better than MBRs, an expected result given that the convex hull is a higher quality approximation • A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls, which is interesting • Since any approximation other than the MBR requires greater expense, this suggests that exploring new ranking methods based on the MBR is a good way to go • (Results table for all metadata in the test collection not reproduced)
Some of our Results (continued) • Mean average query precision for metadata indexed by CA named place regions, and for all metadata in the test collection (tables not reproduced) • BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa
Shorefactor Model • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore) • Where: Range for all variables is 0 (not similar) to 1 (same)
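A minimal sketch (not from the original slides) of how these three variables and the resulting probability of relevance could be computed, using the Shapely library for the geometry; the coefficient values and the land polygon are illustrative placeholders, not the values estimated in the study:

# Sketch: logistic regression spatial ranking with the overlap and shorefactor
# variables described above. Coefficients b0..b3 are illustrative placeholders.
import math
from shapely.geometry import Polygon

def onshore_fraction(geom, land):
    """Fraction of a geometry's area that lies on the land polygon."""
    return geom.intersection(land).area / geom.area

def spatial_relevance(query, candidate, land, b=(0.0, 1.0, 1.0, 1.0)):
    overlap = query.intersection(candidate).area
    x1 = overlap / query.area                        # overlap / area of query region
    x2 = overlap / candidate.area                    # overlap / area of candidate GIO
    x3 = 1 - abs(onshore_fraction(query, land) - onshore_fraction(candidate, land))
    log_odds = b[0] + b[1] * x1 + b[2] * x2 + b[3] * x3
    return 1 / (1 + math.exp(-log_odds))             # logistic transform to a probability

land = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
query = Polygon([(1, 1), (5, 1), (5, 5), (1, 5)])
candidate = Polygon([(3, 3), (8, 3), (8, 8), (3, 8)])
print(round(spatial_relevance(query, candidate, land), 3))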
Some of our Results, with Shorefactor • These results suggest: • Addition of the shorefactor variable improves the model (LR 2), especially for MBRs • The improvement is not as dramatic for convex hull approximations, because the problem that shorefactor addresses is not that significant when areas are represented by convex hulls • (Mean average query precision for all metadata in the test collection; table not reproduced)
Results for All Data - MBRs (precision-recall curves; figure not reproduced)
Results for All Data - Convex Hull (precision-recall curves; figure not reproduced)
XML Retrieval • The following slides are adapted from presentations at INEX 2003-2005 and at the INEX Element Retrieval Workshop in Glasgow 2005, with some new additions for general context, etc.
INEX Organization Organized By: • University of Duisburg-Essen, Germany • Norbert Fuhr, Saadia Malik, and others • Queen Mary University of London, UK • Mounia Lalmas, Gabriella Kazai, and others • Supported By: • DELOS Network of Excellence in Digital Libraries (EU) • IEEE Computer Society • University of Duisburg-Essen
XML Retrieval Issues • Using Structure? • Specification of Queries • How to evaluate?
Cheshire SGML/XML Support • Underlying native format for all data is SGML or XML • The DTD defines the database contents • Full SGML/XML parsing • SGML/XML format configuration files define the database location and indexes • Various format conversions and utilities are available for Z39.50 support (MARC, GRS-1)
SGML/XML Support • Configuration files for the Server are SGML/XML: • They include elements describing all of the data files and indexes for the database. • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
Indexing • Any SGML/XML tagged field or attribute can be indexed: • B-Tree and Hash access via Berkeley DB (Sleepycat) • Stemming, keyword, exact keys and “special keys” • Mapping from any Z39.50 Attribute combination to a specific index • Underlying postings information includes term frequency for probabilistic searching • Component extraction with separate component indexes
XML Element Extraction • A new element set name for searches: XML_ELEMENT_ • Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request • The matching elements are extracted from the records matching the search and delivered in a simple format
XML Extraction
% zselect sherlock 372
{Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {
<RESULT_DATA DOCID="1">
  <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
    <Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
  </ITEM>
</RESULT_DATA>
… etc…
TREC3 Logistic Regression • Probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients • At retrieval the probability estimate is obtained by applying the fitted model (sketched below) to the six X attribute measures shown on the next slide
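The formula itself did not survive on this slide; the general logistic regression form consistent with the description above (a reconstruction, not copied from the slide) is:

\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}

where the X_i are the six attribute measures and the b_i are the coefficients estimated from the training sample.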
TREC3 Logistic Regression • The six attribute measures: • Average Absolute Query Frequency • Query Length • Average Absolute Component Frequency • Document Length • Average Inverse Component Frequency • Number of Terms in both query and Component
Okapi BM25 • The score sums, over the query terms, a term weight scaled by within-document and within-query frequency factors (the standard form is sketched below) • Where: • Q is a query containing terms T • K is k1((1 - b) + b · dl/avdl) • k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 • tf is the frequency of the term in a specific document • qtf is the frequency of the term in a topic from which Q was derived • dl and avdl are the document length and the average document length measured in some convenient unit • w(1) is the Robertson-Sparck Jones weight
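The scoring formula is missing from this slide; the standard BM25 formulation consistent with the definitions above (a reconstruction) is:

\mathrm{BM25}(Q, D) = \sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}

with K = k_1\bigl((1 - b) + b \cdot dl/avdl\bigr).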
Combining Boolean and Probabilistic Search Elements • Two original approaches: • Boolean approach • Non-probabilistic “Fusion Search”: a set-merger approach performing a weighted merger of document scores from separate Boolean and probabilistic queries
INEX ‘04 Fusion Search • (Diagram: multiple subqueries produce component query result sets, which are fused/merged into a final ranked list) • Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets • Major components merged are Articles, Body, Sections, Subsections, Paragraphs
Merging and Ranking Operators • Extends the merging capabilities so that merge operations can be used within queries, like Boolean operators • Fuzzy Logic Operators (not used for INEX) • !FUZZY_AND • !FUZZY_OR • !FUZZY_NOT • Containment operators: restrict components to those contained in (or containing) a particular parent • !RESTRICT_FROM • !RESTRICT_TO • Merge Operators • !MERGE_SUM • !MERGE_MEAN • !MERGE_NORM • !MERGE_CMBZ
New LR Coefficients • Estimates using INEX ‘03 relevance assessments for: • b1 = Average Absolute Query Frequency • b2 = Query Length • b3 = Average Absolute Component Frequency • b4 = Document Length • b5 = Average Inverse Component Frequency • b6 = Number of Terms in common between query and Component
INEX CO Runs • Three official, one later run - all Title-only • Fusion - Combines Okapi and LR using the MERGE_CMBZ operator • NewParms (LR)- Using only LR with the new parameters • Feedback - An attempt at blind relevance feedback • PostFusion - Fusion of the new LR coefficients and Okapi
Query Generation - CO • #162 TITLE = Text and Index Compression Algorithms • QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms}) • @+ is Okapi, @ is LR • !MERGE_CMBZ is a normalized score summation and enhancement
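A minimal sketch (an illustration, not the Cheshire implementation) of the kind of normalized score summation and enhancement that MERGE_CMBZ is described as performing, here read as a CombMNZ-style combination:

# Sketch: merge several ranked result lists by normalizing scores to [0, 1],
# summing them, and boosting items that appear in more than one list.
def normalize(results):
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    span = (hi - lo) or 1.0
    return {doc: (score - lo) / span for doc, score in results.items()}

def merge_cmbz(*result_sets):
    merged, counts = {}, {}
    for results in result_sets:
        for doc, score in normalize(results).items():
            merged[doc] = merged.get(doc, 0.0) + score
            counts[doc] = counts.get(doc, 0) + 1
    # enhancement: boost documents retrieved by multiple subqueries
    return sorted(((merged[d] * counts[d], d) for d in merged), reverse=True)

okapi_scores = {"doc1": 12.3, "doc2": 8.1, "doc3": 2.2}
lr_scores = {"doc2": 0.81, "doc4": 0.55}
print(merge_cmbz(okapi_scores, lr_scores))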
INEX CO Runs • Mean average precision, strict quantization: FUSION = 0.0642, NEWPARMS = 0.0582, FDBK = 0.0415, POSTFUS = 0.0690 • Mean average precision, generalized quantization: FUSION = 0.0923, NEWPARMS = 0.0853, FDBK = 0.0390, POSTFUS = 0.0952
INEX VCAS Runs • Two official runs • FUSVCAS - Element fusion using LR and various operators for path restriction • NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction
Query Generation - VCAS • #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)] • Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles})) • Target elements: sec|ss1|ss2|ss3
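A minimal sketch of one plausible reading of the containment restriction in this query: section-level component hits are kept only when their containing article also matched the article-level subquery. The exact directionality and score handling of Cheshire's RESTRICT operators are not shown on these slides, so this is an assumption:

# Sketch: restrict section-level component results to those whose containing
# article also matched the article-level query (one reading of !RESTRICT_FROM).
def restrict_from(article_hits, component_hits):
    """article_hits: {doc_id: score}; component_hits: {(doc_id, xpath): score}."""
    return {
        (doc_id, xpath): score
        for (doc_id, xpath), score in component_hits.items()
        if doc_id in article_hits
    }

articles = {"A123": 0.74, "A456": 0.51}
sections = {("A123", "/article[1]/bdy[1]/sec[2]"): 0.63,
            ("A999", "/article[1]/bdy[1]/sec[1]"): 0.58}
print(restrict_from(articles, sections))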
VCAS Results • Mean average precision, generalized quantization: FUSVCAS = 0.0321, NEWVCAS = 0.0270 • Mean average precision, strict quantization: FUSVCAS = 0.0601, NEWVCAS = 0.0569
Heterogeneous Track • Approach uses Cheshire's Virtual Database options • Primarily a version of distributed IR • Each collection indexed separately • Search via Z39.50 distributed queries • Z39.50 attribute mapping used to map query indexes to appropriate elements in a given collection • Only LR used; collection results merged using the probability of relevance for each collection result
INEX 2005 Approach • Used only Logistic regression methods • “TREC3” with Pivot • “TREC2” with Pivot • “TREC2” with Blind Feedback • Used post-processing for specific tasks
Logistic Regression • Probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients • At retrieval the probability estimate is obtained from some set of m statistical measures, Xi, derived from the collection and query (the same general form as the TREC3 model shown earlier)
TREC2 Algorithm • The ranking formula combines term frequency statistics for the query, the document, and the collection, computed over the matching terms (formula not reproduced on this slide)
Blind Feedback • Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model. For each term t, the document counts form the contingency table:

                    Relevant    Not relevant       Total
  Term t present    Rt          Nt - Rt            Nt
  Term t absent     R - Rt      N - Nt - R + Rt    N - Nt
  Total             R           N - R              N
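The weight formula is not reproduced on the slide; the classic Robertson/Sparck Jones relevance weight corresponding to this table, with the usual 0.5 smoothing, is:

w_t = \log \frac{(R_t + 0.5)\,(N - N_t - R + R_t + 0.5)}{(N_t - R_t + 0.5)\,(R - R_t + 0.5)}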
Blind Feedback • Top x new terms taken from top y documents • For each term in the top y assumed relevant set… • Terms are ranked by termwt and the top x selected for inclusion in the query
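A minimal sketch of this selection step, using the Robertson/Sparck Jones weight from the previous slide as the term weight; the slides do not spell out the exact "termwt" used, so that choice is an assumption here:

# Sketch: blind feedback term selection -- treat the top y documents as relevant,
# weight each of their terms with the Robertson/Sparck Jones weight, and add
# the top x new terms to the query.
import math
from collections import Counter

def rsj_weight(rt, r, nt, n):
    return math.log(((rt + 0.5) * (n - nt - r + rt + 0.5)) /
                    ((nt - rt + 0.5) * (r - rt + 0.5)))

def expand_query(query_terms, ranked_docs, doc_freq, n_docs, top_y=10, top_x=5):
    assumed_relevant = ranked_docs[:top_y]       # each doc is its collection of terms
    r = len(assumed_relevant)
    rt = Counter(t for doc in assumed_relevant for t in set(doc))
    weights = {t: rsj_weight(rt[t], r, doc_freq.get(t, rt[t]), n_docs)
               for t in rt if t not in query_terms}
    new_terms = sorted(weights, key=weights.get, reverse=True)[:top_x]
    return list(query_terms) + new_terms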
Pivot method • Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod) • Used 0.50 as pivot for all cases • For TREC3 and TREC2 runs all component results weighted by article-level results for the matching article
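One common formulation of this kind of document pivot (an assumption; the exact combination is not given on the slide) interpolates each component score with the score of its containing article:

score'(c) = pivot \cdot score(\mathrm{article}(c)) + (1 - pivot) \cdot score(c), \quad pivot = 0.5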