
Analyzing Document Retrievability in Patent Retrieval Settings

Analyzing Document Retrievability in Patent Retrieval Settings. Shariq Bashir and Andreas Rauber. DEXA 2009, Linz, Austria, 31 August – 4 September. Department of Software Technology and Interactive Systems, Vienna University of Technology, Austria. {bashir, rauber}@ifs.tuwien.ac.at



Presentation Transcript


  1. Analyzing Document Retrievability in Patent Retrieval Settings. Shariq Bashir and Andreas Rauber. DEXA 2009, Linz, Austria, 31 August – 4 September. Department of Software Technology and Interactive Systems, Vienna University of Technology, Austria. {bashir, rauber}@ifs.tuwien.ac.at

  2. Motivation
  • Patent retrieval is an emerging and challenging area. Patents are legal documents, used to protect inventions.
  • Patents are complex:
    • Patents are very long documents.
    • They contain complex vocabulary.
    • They contain complex structure and technical content.
    • Patent writers often intentionally use vague words and expressions in order to get their patents past the examination test.
    • This creates serious word-mismatch problems.
    • Relevant patents may not be findable with their relevant queries.
    • Users (attorneys, patent examiners) typically issue hundreds of queries for finding relevant patents.
  • Patent retrieval is different from web retrieval:
    • Patent retrieval is a recall-oriented domain.
    • Finding all relevant patents is considered more important than finding only a small set of top relevant patents.
    • Example: a single prior-art patent can invalidate a new patent application, but can we find such a patent with a given retrieval model?

  3. Motivation
  • Role of the retrieval system in accessing information:
    • There is always debate about the quality of user queries.
    • Therefore, rather than arguing about the quality of user queries, in this paper we examine the role of the retrieval system in accessing information.
  • Can we access all information using a given retrieval model?
  • How much does a retrieval system's bias restrict our access to information?
  • Are there subsets of a given collection that cannot be found?
  • How easily can we find information with a given retrieval system?

  4. Document Retrievability (aka Findability)
  • We measure retrieval system effectiveness using a findability measure.
  • Findability measure:
    • Measures how easily a retrieval model can find all documents.
    • Findability is measured within the top-c results (e.g. c = 35, c = 80).
    • It can show which retrieval system is better at finding patents.
    • It can identify high- and low-findability subsets in the collection.
    • It can identify non-findable subsets in the collection.

  5. Computing the Findability Measure
  • Given a collection of documents D and a large set of queries Q, the findability of a document d is the number of times d can be accessed in the top-c results over all queries in Q:
    • r(d) = Σ_{q ∈ Q} f(k_dq, c)
    • where k_dq is the rank of d ∈ D for query q ∈ Q, and f(k_dq, c) returns 1 if k_dq <= c, and 0 otherwise.
  • Example: if a document d1 appears in the top-c results of query q1, that query contributes 1 to r(d1).
  • A minimal code sketch of this computation follows.
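
A minimal Python sketch of this computation, assuming a `rankings` structure that maps each query to its ranked list of document ids (best first); all names are illustrative and not taken from the authors' implementation:

    # Findability r(d): how often document d appears in the top-c results
    # over a large query set Q (see the formula on slide 5).
    from collections import defaultdict

    def findability(rankings, c=35):
        """rankings: dict mapping query -> ordered list of document ids."""
        r = defaultdict(int)
        for q, ranked_docs in rankings.items():
            # f(k_dq, c) = 1 iff d is ranked within the top-c results of q
            for d in ranked_docs[:c]:
                r[d] += 1
        return dict(r)

    # Usage (toy data): findability({"q1": ["d3", "d1"], "q2": ["d1"]}, c=35)
    # -> {"d3": 1, "d1": 2}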

  6. Our Contribution
  • Findability is usually measured as a single score across all queries.
  • We take the relevance of queries into account, analyzing:
    • findability across all queries,
    • findability considering only the queries that a document is relevant for,
    • findability for the queries that a document is NOT relevant for,
    • characteristics of high- and low-findability documents,
    • to what extent we can increase the findability of documents.
  • A sketch of this relevance-based split is shown below.
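
The relevance-based split can be sketched as follows, again with illustrative names: `qrels` is assumed to map each query to the set of documents it is relevant for (here this would come from the term-frequency criterion described on slide 8):

    from collections import defaultdict

    def split_findability(rankings, qrels, c=35):
        """Findability counted separately over relevant and non-relevant queries."""
        r_rel = defaultdict(int)     # retrievals by queries the document is relevant for
        r_nonrel = defaultdict(int)  # retrievals by queries it is NOT relevant for
        for q, ranked_docs in rankings.items():
            relevant_docs = qrels.get(q, set())
            for d in ranked_docs[:c]:
                if d in relevant_docs:
                    r_rel[d] += 1
                else:
                    r_nonrel[d] += 1
        return dict(r_rel), dict(r_nonrel)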

  7. Experiment Setup
  • Retrieval models used: TFIDF, BM25, BM25F, Exact Match (a sketch of BM25 scoring is given below).
  • Patents from the US Patent and Trademark Office website http://www.uspto.gov
  • USPC class 433 (Dentistry domain).
  • For query generation we used only the Claim section.
  • For indexing and searching we used all sections: Title, Abstract, Claim, Background Summary, Description, Captions.
  • We used a rank cut-off factor of c = 35.
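
For reference, a sketch of BM25 scoring, the model used for most of the reported results. This is the standard Robertson formulation; the parameters k1 = 1.2 and b = 0.75 are common defaults and an assumption here, since the slides do not report them:

    import math

    def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, num_docs,
                   k1=1.2, b=0.75):
        """doc_tf: term -> frequency in the document; df: term -> document frequency."""
        score = 0.0
        for term in query_terms:
            tf = doc_tf.get(term, 0)
            if tf == 0 or term not in df:
                continue
            # Smoothed inverse document frequency
            idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency component with document-length normalisation
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
            score += idf * norm
        return score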

  8. Query Generation
  • Queries are based on a patent invalidity search scenario.
  • Extract all single terms with term frequency > 2 from each individual patent's claim section.
  • Single terms are expanded into two- and three-term combinations.
  • A query is considered relevant for a patent if all of its terms appear at least 3 times in that document.
  • A code sketch of this procedure is shown below.
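
A rough sketch of this query-generation procedure, assuming each claim section is available as a list of tokens; names and details are illustrative and may differ from the authors' implementation:

    from collections import Counter
    from itertools import combinations

    def generate_queries(claim_tokens):
        """Single terms with tf > 2 in the claim section, plus all
        two- and three-term combinations of those terms."""
        tf = Counter(claim_tokens)
        frequent = sorted(t for t, n in tf.items() if n > 2)
        queries = [(t,) for t in frequent]
        queries += list(combinations(frequent, 2))
        queries += list(combinations(frequent, 3))
        return queries

    def is_relevant(query, doc_tf):
        """A query is relevant for a document if every query term
        occurs at least 3 times in that document."""
        return all(doc_tf.get(t, 0) >= 3 for t in query)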

  9. Term Frequency > 2 (figure slide)

  10. Results with Single-Term Queries (rank cut-off = 35, BM25)

  11. Results with Two-Term Queries (rank cut-off = 35, BM25)

  12. Results with Three-Term Queries (rank cut-off = 35, BM25)

  13. Some Low-Findability Patents with BM25

  14. Some High-Findability Patents with BM25

  15. Conclusion
  • We analyze patent retrieval with a findability measure.
  • We differentiate findability over relevant and irrelevant queries.
  • Our results indicate that:
    • With well-known retrieval models, some patents cannot be found in the top-c results at all.
    • A large number of patents are more findable from irrelevant queries than from relevant queries.
    • There is a lot of noise in the top-c results of queries.
  • Future work:
    • To handle word mismatch, we need an effective query expansion technique.
    • Individual patents have different findability scores under different retrieval models; for example, patents with low findability in model A can have high findability in model B.
    • We therefore need an effective fusion technique.

  16. Thank You
