Principles of Information Retrieval. Lecture 21: XML Retrieval. Prof. Ray Larson, University of California, Berkeley, School of Information. Tuesday and Thursday 10:30 am - 12:00 pm, Spring 2007. http://courses.ischool.berkeley.edu/i240/s07
Mini-TREC • Proposed Schedule • February 15 – Database and previous queries • February 27 – Report on system acquisition and setup • March 8 – New queries for testing • April 19 – Results due (next Thursday) • April 24 or 26 – Results and system rankings • May 8 – Group reports and discussion
Announcement • No Class on Tuesday (April 17th)
Today • Review • Geographic Information Retrieval • GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K. • XML and Structured Element Retrieval • INEX • Approaches to XML retrieval Credit for some of the slides in this lecture goes to Marti Hearst
Today • Review • Geographic Information Retrieval • GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K. • Web Crawling and Search Issues • Web Crawling • Web Search Engines and Algorithms Credit for some of the slides in this lecture goes to Marti Hearst
Introduction • What is Geographic Information Retrieval? • GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval. • It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.
Example: Results display from CheshireGeo: http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html
Other Convex, Conservative Approximations • 1) Minimum bounding circle (3) • 2) MBR: minimum aligned bounding rectangle (4) • 3) Minimum bounding ellipse (5) • 4) Rotated minimum bounding rectangle (5) • 5) 4-corner convex polygon (8) • 6) Convex hull (varies) • Presented in order of increasing quality; the number in parentheses denotes the number of parameters needed to store the representation (after Brinkhoff et al., 1993b)
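As an illustration (not part of the original slides), a minimal Python sketch of how a few of these approximations and their "false area" can be computed with the Shapely library; the example polygon and the definition of false area used here (extra area relative to the approximation) are assumptions made for the sketch:

# Sketch: geometric approximations of a region and the extra ("false") area
# each approximation introduces, using the Shapely library.
from shapely.geometry import Polygon

# A toy, non-convex region standing in for a real geographic object.
region = Polygon([(0, 0), (4, 0), (4, 4), (2, 1), (0, 4)])

mbr = region.envelope                             # axis-aligned minimum bounding rectangle
rotated_mbr = region.minimum_rotated_rectangle    # rotated minimum bounding rectangle
hull = region.convex_hull                         # convex hull

def false_area_fraction(approx, true_region):
    """Fraction of the approximation's area that lies outside the true region."""
    return (approx.area - true_region.area) / approx.area

for name, approx in [("MBR", mbr), ("Rotated MBR", rotated_mbr), ("Convex hull", hull)]:
    print(name, round(false_area_fraction(approx, region), 3))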
Our Research Questions • Spatial Ranking • How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions? • Geometric Approximations & Spatial Ranking: • How do different geometric approximations affect the rankings? • MBRs: the most popular approximation • Convex hulls: the highest quality convex approximation
Probabilistic Models: Logistic Regression attributes • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore) • Where: Range for all variables is 0 (not similar) to 1 (same)
CA Named Places in the Test Collection (complex polygons): Counties, Cities, Bioregions, National Parks, National Forests, Water QCB Regions
CA Counties – Geometric Approximations: MBRs vs. Convex Hulls (maps not reproduced). Average false area of approximation: MBRs 94.61%, Convex Hulls 26.73%
Test Collection Query Regions: CA Counties • 42 of 58 counties are referenced in the test collection metadata • 10 counties randomly selected as query regions to train the LR model • 32 counties used as query regions to test the model
LR model • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • Where: Range for all variables is 0 (not similar) to 1 (same)
Some of our Results • Results are reported as mean average query precision: the average of the precision values after each new relevant document is observed in a ranked list • For metadata indexed by CA named place regions, these results suggest: • Convex hulls perform better than MBRs, an expected result given that the convex hull is a higher quality approximation • A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls, which is interesting • Since any approximation other than the MBR requires greater expense, this suggests that exploring new ranking methods based on the MBR is a good way to go • (Results table for all metadata in the test collection not reproduced)
Some of our Results (continued) • Mean average query precision for metadata indexed by CA named place regions, and for all metadata in the test collection (tables not reproduced) • BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa
Shorefactor Model • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore) • Where: Range for all variables is 0 (not similar) to 1 (same)
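A minimal sketch (not from the original slides) of how these three variables and the resulting probability of relevance could be computed, using the Shapely library for the geometry; the coefficient values and the land polygon are illustrative placeholders, not the values estimated in the study:

# Sketch: logistic regression spatial ranking with the overlap and shorefactor
# variables described above. Coefficients b0..b3 are illustrative placeholders.
import math
from shapely.geometry import Polygon

def onshore_fraction(geom, land):
    """Fraction of a geometry's area that lies on the land polygon."""
    return geom.intersection(land).area / geom.area

def spatial_relevance(query, candidate, land, b=(0.0, 1.0, 1.0, 1.0)):
    overlap = query.intersection(candidate).area
    x1 = overlap / query.area                        # overlap / area of query region
    x2 = overlap / candidate.area                    # overlap / area of candidate GIO
    x3 = 1 - abs(onshore_fraction(query, land) - onshore_fraction(candidate, land))
    log_odds = b[0] + b[1] * x1 + b[2] * x2 + b[3] * x3
    return 1 / (1 + math.exp(-log_odds))             # logistic transform to a probability

land = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
query = Polygon([(1, 1), (5, 1), (5, 5), (1, 5)])
candidate = Polygon([(3, 3), (8, 3), (8, 8), (3, 8)])
print(round(spatial_relevance(query, candidate, land), 3))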
Some of our Results, with Shorefactor • These results suggest: • Addition of the shorefactor variable improves the model (LR 2), especially for MBRs • The improvement is not as dramatic for convex hull approximations, because the problem that shorefactor addresses is not that significant when areas are represented by convex hulls • (Mean average query precision for all metadata in the test collection; table not reproduced)
Results for All Data - MBRs (precision-recall curves; figure not reproduced)
Results for All Data - Convex Hull (precision-recall curves; figure not reproduced)
XML Retrieval • The following slides are adapted from presentations at INEX 2003-2005 and at the INEX Element Retrieval Workshop in Glasgow 2005, with some new additions for general context, etc.
INEX Organization Organized By: • University of Duisburg-Essen, Germany • Norbert Fuhr, Saadia Malik, and others • Queen Mary University of London, UK • Mounia Lalmas, Gabriella Kazai, and others • Supported By: • DELOS Network of Excellence in Digital Libraries (EU) • IEEE Computer Society • University of Duisburg-Essen
XML Retrieval Issues • Using Structure? • Specification of Queries • How to evaluate?
Cheshire SGML/XML Support • Underlying native format for all data is SGML or XML • The DTD defines the database contents • Full SGML/XML parsing • SGML/XML format configuration files define the database location and indexes • Various format conversions and utilities are available for Z39.50 support (MARC, GRS-1)
SGML/XML Support • Configuration files for the Server are SGML/XML: • They include elements describing all of the data files and indexes for the database. • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
Indexing • Any SGML/XML tagged field or attribute can be indexed: • B-Tree and Hash access via Berkeley DB (Sleepycat) • Stemming, keyword, exact keys and “special keys” • Mapping from any Z39.50 Attribute combination to a specific index • Underlying postings information includes term frequency for probabilistic searching • Component extraction with separate component indexes
XML Element Extraction • A new element set name for searches: XML_ELEMENT_ • Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request • The matching elements are extracted from the records matching the search and delivered in a simple format
XML Extraction
% zselect sherlock 372
{Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} {
<RESULT_DATA DOCID="1">
  <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
    <Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
  </ITEM>
</RESULT_DATA>
… etc…
TREC3 Logistic Regression • Probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients • At retrieval the probability estimate is obtained by applying the fitted model (sketched below) to the six X attribute measures shown on the next slide
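The formula itself did not survive on this slide; the general logistic regression form consistent with the description above (a reconstruction, not copied from the slide) is:

\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}

where the X_i are the six attribute measures and the b_i are the coefficients estimated from the training sample.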
TREC3 Logistic Regression • The six attribute measures: • Average Absolute Query Frequency • Query Length • Average Absolute Component Frequency • Document Length • Average Inverse Component Frequency • Number of Terms in both query and Component
Okapi BM25 • The score sums, over the query terms, a term weight scaled by within-document and within-query frequency factors (the standard form is sketched below) • Where: • Q is a query containing terms T • K is k1((1 - b) + b · dl/avdl) • k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 • tf is the frequency of the term in a specific document • qtf is the frequency of the term in a topic from which Q was derived • dl and avdl are the document length and the average document length measured in some convenient unit • w(1) is the Robertson-Sparck Jones weight
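The scoring formula is missing from this slide; the standard BM25 formulation consistent with the definitions above (a reconstruction) is:

\mathrm{BM25}(Q, D) = \sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}

with K = k_1\bigl((1 - b) + b \cdot dl/avdl\bigr).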
Combining Boolean and Probabilistic Search Elements • Two original approaches: • Boolean approach • Non-probabilistic “Fusion Search”: a set-merger approach performing a weighted merger of document scores from separate Boolean and probabilistic queries
INEX ‘04 Fusion Search • (Diagram: multiple subqueries produce component query result sets, which are fused/merged into a final ranked list) • Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets • Major components merged are Articles, Body, Sections, Subsections, Paragraphs
Merging and Ranking Operators • Extends the merging capabilities so that merge operations can be used within queries, like Boolean operators • Fuzzy Logic Operators (not used for INEX) • !FUZZY_AND • !FUZZY_OR • !FUZZY_NOT • Containment operators: restrict components to those contained in (or containing) a particular parent • !RESTRICT_FROM • !RESTRICT_TO • Merge Operators • !MERGE_SUM • !MERGE_MEAN • !MERGE_NORM • !MERGE_CMBZ
New LR Coefficients • Estimates using INEX ‘03 relevance assessments for: • b1 = Average Absolute Query Frequency • b2 = Query Length • b3 = Average Absolute Component Frequency • b4 = Document Length • b5 = Average Inverse Component Frequency • b6 = Number of Terms in common between query and Component
INEX CO Runs • Three official, one later run - all Title-only • Fusion - Combines Okapi and LR using the MERGE_CMBZ operator • NewParms (LR)- Using only LR with the new parameters • Feedback - An attempt at blind relevance feedback • PostFusion - Fusion of the new LR coefficients and Okapi
Query Generation - CO • #162 TITLE = Text and Index Compression Algorithms • QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms}) • @+ is Okapi, @ is LR • !MERGE_CMBZ is a normalized score summation and enhancement
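A minimal sketch (an illustration, not the Cheshire implementation) of the kind of normalized score summation and enhancement that MERGE_CMBZ is described as performing, here read as a CombMNZ-style combination:

# Sketch: merge several ranked result lists by normalizing scores to [0, 1],
# summing them, and boosting items that appear in more than one list.
def normalize(results):
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    span = (hi - lo) or 1.0
    return {doc: (score - lo) / span for doc, score in results.items()}

def merge_cmbz(*result_sets):
    merged, counts = {}, {}
    for results in result_sets:
        for doc, score in normalize(results).items():
            merged[doc] = merged.get(doc, 0.0) + score
            counts[doc] = counts.get(doc, 0) + 1
    # enhancement: boost documents retrieved by multiple subqueries
    return sorted(((merged[d] * counts[d], d) for d in merged), reverse=True)

okapi_scores = {"doc1": 12.3, "doc2": 8.1, "doc3": 2.2}
lr_scores = {"doc2": 0.81, "doc4": 0.55}
print(merge_cmbz(okapi_scores, lr_scores))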
INEX CO Runs • Mean average precision, strict quantization: FUSION = 0.0642, NEWPARMS = 0.0582, FDBK = 0.0415, POSTFUS = 0.0690 • Mean average precision, generalized quantization: FUSION = 0.0923, NEWPARMS = 0.0853, FDBK = 0.0390, POSTFUS = 0.0952
INEX VCAS Runs • Two official runs • FUSVCAS - Element fusion using LR and various operators for path restriction • NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction
Query Generation - VCAS • #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)] • Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles})) • Target elements: sec|ss1|ss2|ss3
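A minimal sketch of one plausible reading of the containment restriction in this query: section-level component hits are kept only when their containing article also matched the article-level subquery. The exact directionality and score handling of Cheshire's RESTRICT operators are not shown on these slides, so this is an assumption:

# Sketch: restrict section-level component results to those whose containing
# article also matched the article-level query (one reading of !RESTRICT_FROM).
def restrict_from(article_hits, component_hits):
    """article_hits: {doc_id: score}; component_hits: {(doc_id, xpath): score}."""
    return {
        (doc_id, xpath): score
        for (doc_id, xpath), score in component_hits.items()
        if doc_id in article_hits
    }

articles = {"A123": 0.74, "A456": 0.51}
sections = {("A123", "/article[1]/bdy[1]/sec[2]"): 0.63,
            ("A999", "/article[1]/bdy[1]/sec[1]"): 0.58}
print(restrict_from(articles, sections))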
VCAS Results • Mean average precision, generalized quantization: FUSVCAS = 0.0321, NEWVCAS = 0.0270 • Mean average precision, strict quantization: FUSVCAS = 0.0601, NEWVCAS = 0.0569
Heterogeneous Track • Approach uses Cheshire's Virtual Database options • Primarily a version of distributed IR • Each collection indexed separately • Search via Z39.50 distributed queries • Z39.50 attribute mapping used to map query indexes to appropriate elements in a given collection • Only LR used; collection results merged using the probability of relevance for each collection result
INEX 2005 Approach • Used only Logistic regression methods • “TREC3” with Pivot • “TREC2” with Pivot • “TREC2” with Blind Feedback • Used post-processing for specific tasks
Logistic Regression • Probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients • At retrieval the probability estimate is obtained from some set of m statistical measures, Xi, derived from the collection and query (the same general form as the TREC3 model shown earlier)
TREC2 Algorithm • The ranking formula combines term frequency statistics for the query, the document, and the collection, computed over the matching terms (formula not reproduced on this slide)
Blind Feedback • Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model. For each term t, the document counts form the contingency table:

                    Relevant    Not relevant       Total
  Term t present    Rt          Nt - Rt            Nt
  Term t absent     R - Rt      N - Nt - R + Rt    N - Nt
  Total             R           N - R              N
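The weight formula is not reproduced on the slide; the classic Robertson/Sparck Jones relevance weight corresponding to this table, with the usual 0.5 smoothing, is:

w_t = \log \frac{(R_t + 0.5)\,(N - N_t - R + R_t + 0.5)}{(N_t - R_t + 0.5)\,(R - R_t + 0.5)}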
Blind Feedback • Top x new terms taken from top y documents • For each term in the top y assumed relevant set… • Terms are ranked by termwt and the top x selected for inclusion in the query
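A minimal sketch of this selection step, using the Robertson/Sparck Jones weight from the previous slide as the term weight; the slides do not spell out the exact "termwt" used, so that choice is an assumption here:

# Sketch: blind feedback term selection -- treat the top y documents as relevant,
# weight each of their terms with the Robertson/Sparck Jones weight, and add
# the top x new terms to the query.
import math
from collections import Counter

def rsj_weight(rt, r, nt, n):
    return math.log(((rt + 0.5) * (n - nt - r + rt + 0.5)) /
                    ((nt - rt + 0.5) * (r - rt + 0.5)))

def expand_query(query_terms, ranked_docs, doc_freq, n_docs, top_y=10, top_x=5):
    assumed_relevant = ranked_docs[:top_y]       # each doc is its collection of terms
    r = len(assumed_relevant)
    rt = Counter(t for doc in assumed_relevant for t in set(doc))
    weights = {t: rsj_weight(rt[t], r, doc_freq.get(t, rt[t]), n_docs)
               for t in rt if t not in query_terms}
    new_terms = sorted(weights, key=weights.get, reverse=True)[:top_x]
    return list(query_terms) + new_terms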
Pivot method • Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod) • Used 0.50 as pivot for all cases • For TREC3 and TREC2 runs all component results weighted by article-level results for the matching article
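One common formulation of this kind of document pivot (an assumption; the exact combination is not given on the slide) interpolates each component score with the score of its containing article:

score'(c) = pivot \cdot score(\mathrm{article}(c)) + (1 - pivot) \cdot score(c), \quad pivot = 0.5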