Analyzing Document Retrievability in Patent Retrieval Settings

Shariq Bashir and Andreas Rauber
DEXA 2009, Linz, Austria, 31 August – 4 September
Department of Software Technology and Interactive Systems, Vienna University of Technology, Austria
{bashir, rauber}@ifs.tuwien.ac.at

Motivation
Patent retrieval is an emerging and challenging area.
Patents are legal documents used to protect inventions.
• Patents are complex:
• They are long documents.
• They contain complex vocabulary.
• They have a complex structure and technical content.
• Patent writers often intentionally use vague words and expressions to get their patents through examination.
• This creates serious word mismatch problems.
• Relevant patents may not be findable through the queries they are relevant to.
• Users (attorneys, patent examiners) mostly use hundreds of queries for
• Patent retrieval differs from web retrieval:
• Patent retrieval is a recall-oriented domain.
• Finding all relevant patents is considered more important than finding only a small set of top-ranked relevant patents.
• Example: a single prior-art patent can invalidate a new patent application, but can we find such a patent with a given retrieval model?

Motivation
• Role of the retrieval system in accessing information
• There is always debate about the quality of user queries.
• Rather than debating query quality, in this paper we examine the role of retrieval systems in accessing information.
• Can we access all information using a given retrieval model?
• How much does a retrieval system's bias restrict our access to information?
• Are there subsets of a given collection that cannot be found?
• How easily can we find information in a given retrieval system?

Document Retrievability (aka Findability)
• We measure the effectiveness of retrieval systems using a findability measure.
• Findability measure
• Measures how easily a retrieval model can find all documents.
• Findability is measured over the top-c results (e.g. c = 35, c = 80).
• It can show which retrieval system is better for finding patents.
• It can identify high- and low-findability subsets of the collection.
• It can identify non-findable subsets of the collection.

Computing Findability Measure
Given a collection of documents $D$ and a large set of queries $Q$, the findability of a document $d \in D$ is the number of queries in $Q$ for which $d$ appears in the top-$c$ results:
$r(d) = \sum_{q \in Q} f(k_{dq}, c)$
Here $k_{dq}$ is the rank of $d \in D$ in the result list of query $q \in Q$, and $f(k_{dq}, c)$ returns 1 if $k_{dq} \le c$ and 0 otherwise.
Example: if a document d1 is findable in the top-c results of query q1, that query contributes 1 to its findability score r(d1).
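The counting described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `findability`, the toy `rankings` dictionary and its document ids are assumptions, not part of the original experiments. Each ranked list is assumed to contain a document at most once.

```python
from collections import defaultdict


def findability(rankings, c):
    """Count, for each document, the queries that rank it within the top c.

    rankings: dict mapping a query to its list of document ids, best first
              (hypothetical input format chosen for this sketch).
    c:        rank cut-off.
    """
    scores = defaultdict(int)
    for ranked_docs in rankings.values():
        # Documents at ranks 1..c satisfy k_dq <= c, so f(k_dq, c) = 1.
        for doc_id in ranked_docs[:c]:
            scores[doc_id] += 1
    return dict(scores)


# Toy data: three queries over four documents, cut-off c = 2.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d1", "d4"],
    "q3": ["d2", "d3", "d1"],
}
r = findability(rankings, c=2)
print(r)                 # findability score per document
print(r.get("d4", 0))    # documents never seen in a top-c list score 0
```

Documents that never reach the top c of any query are simply absent from the result, which is why the lookup uses a default of 0.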

Our Contribution
• Findability is measured with a single score across all queries.
• We consider the relevance of queries, analyzing:
• Findability across all queries
• Findability considering only queries that the document is relevant for
• Findability for queries that the document is NOT relevant for
• Characteristics of high- and low-findability documents
• To what extent we can increase the findability of documents
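A minimal sketch of how the three findability views listed on this slide could be counted side by side. The names `findability_by_relevance` and `relevant_pairs`, and the toy data, are hypothetical; relevance is assumed to be given as a set of (query, document) pairs.

```python
from collections import Counter


def findability_by_relevance(rankings, relevant_pairs, c):
    """Return three Counters: findability over all queries, over relevant
    queries only, and over irrelevant queries only.

    rankings:       dict mapping a query to its ranked list of document ids.
    relevant_pairs: set of (query, doc_id) pairs judged relevant (assumed input).
    c:              rank cut-off.
    """
    all_q, rel_q, irr_q = Counter(), Counter(), Counter()
    for query, ranked_docs in rankings.items():
        for doc_id in ranked_docs[:c]:
            all_q[doc_id] += 1
            if (query, doc_id) in relevant_pairs:
                rel_q[doc_id] += 1
            else:
                irr_q[doc_id] += 1
    return all_q, rel_q, irr_q


# Toy data: each document is relevant to exactly one query.
rankings = {
    "q1": ["d1", "d2"],
    "q2": ["d2", "d1"],
    "q3": ["d2", "d3"],
}
relevant_pairs = {("q1", "d1"), ("q2", "d2"), ("q3", "d3")}
all_q, rel_q, irr_q = findability_by_relevance(rankings, relevant_pairs, c=2)
print(all_q["d2"], rel_q["d2"], irr_q["d2"])
```

In this toy data, d2 appears in three top-c lists but is relevant to only one of the queries, so most of its findability comes from irrelevant queries.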

Experiment Setup
• Retrieval models used:
• TFIDF, BM25, BM25F, Exact Match
• Patents from the US Patent and Trademark Office website, http://www.uspto.gov
• USPC class 433 (Dentistry domain)
• For query generation, we used only the Claim section.
• For indexing and searching, we used all sections:
• Title, Abstract, Claim, Background Summary, Description, Captions
• We used a rank cut-off factor of c = 35.

Query Generation
• Queries are based on a patent invalidity search scenario.
• Extract all single terms with term frequency > 2 in the Claim section of each individual patent.
• Single terms are expanded into two- and three-term combinations.
• A query is considered relevant for a patent if all its terms appear at least 3 times in the document.
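The query generation steps on this slide can be sketched as follows. The tokenizer, the function names and the sample claim text are assumptions made for illustration; the slides do not say how text was tokenized or whether stop words and stemming were applied, so none of that is modeled here.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumed tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def generate_queries(claim_text, min_tf=3, max_terms=3):
    """Build one-, two- and three-term queries from a claim section.

    Only terms with term frequency > 2 (i.e. >= min_tf) are kept.
    """
    counts = Counter(tokenize(claim_text))
    terms = sorted(t for t, n in counts.items() if n >= min_tf)
    queries = []
    for size in range(1, max_terms + 1):
        queries.extend(combinations(terms, size))
    return queries


def is_relevant(query, document_text, min_tf=3):
    """A query is relevant if every term occurs at least min_tf times."""
    counts = Counter(tokenize(document_text))
    return all(counts[term] >= min_tf for term in query)


# Hypothetical claim text from a dentistry patent.
claim = "dental implant screw dental implant thread dental implant"
queries = generate_queries(claim)
print(queries)
print(is_relevant(("dental", "implant"), claim))
```

The number of combinations grows quickly with the number of frequent terms, which is one reason a frequency threshold on the claim section keeps the query set manageable.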

Term Frequency > 2

Results with Single-Term Queries (rank cut-off = 35, BM25)

Results with Two-Term Queries (rank cut-off = 35, BM25)

Results with Three-Term Queries (rank cut-off = 35, BM25)

Some Low Findable Patents with BM25

Some High Findable Patents with BM25

Conclusion
• We analyze patent retrieval with a findability measure.
• We differentiate findability using relevant and irrelevant queries.
• Our results indicate that:
• With well-known retrieval models, some patents cannot be found in the top-c results.
• Large retrieval patents are more findable from irrelevant queries than from relevant queries.
• There is a lot of noise in the top-c results of queries.
• Future work
• To handle word mismatch, we need an efficient query expansion technique.
• Individual patents have different findability scores in different retrieval models.
• Example: patents that have low findability in Model A can have high findability in Model B.
• We need an efficient fusion technique.

Thank You

