Identification of Low/High Retrievable Patents using Content-Based Features

Shariq Bashir and Andreas Rauber • PaIR’09, Hong Kong, China, 6th November, 2009 • Department of Software Technology and Interactive Systems, Vienna University of Technology, Austria • {bashir, rauber}@ifs.tuwien.ac.at

Patent Retrieval • Patent retrieval is a recall-oriented domain • Retrieving all relevant patents is considered more important than viewing only the top-ranked patents of queries • Challenges in Patent Retrieval • Complex content and technical structure • Acronyms and new terminology • Use of many vague terms for narrowing the scope of the invention • Writers use their own terminology to get patents through examination • These factors have a non-trivial effect on the findability of patents

Retrieval Systems Evaluation • Conventionally, retrieval systems are evaluated with metrics such as Average Precision, Q-measure, and Normalized Discounted Cumulative Gain • These metrics cannot answer: • What can we find and what can’t we find? • Which patents are easy to find? • Which patents are hard to find? • Which retrieval system is better at placing patents in the top-ranked results of queries? • Here, retrieval systems are evaluated using the concept of retrievability • Retrievability analyzes how easily users can find documents in a given system

Retrievability Measurement • Measures how likely each individual document in the collection (D) is to be retrieved within the top c results of queries • Defined for each $d \in D$ as $r(d) = \sum_{q \in Q} f(k_{dq}, c)$ • c denotes the rank down to which a user is willing to proceed • $k_{dq}$ is the rank of document $d \in D$ in the results of query $q \in Q$ • $f(k_{dq}, c)$ is a cost function that returns 1 if $k_{dq} \le c$, and 0 otherwise
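The retrievability count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `retrievability`, the dictionary layout of `ranked_results`, and the optional `collection` argument are hypothetical, and each document is assumed to appear at most once per ranked list.

```python
from collections import defaultdict


def retrievability(ranked_results, c, collection=None):
    """Count, for every document, the queries that rank it within the top c.

    ranked_results: dict mapping a query id to a list of document ids,
                    ordered from rank 1 downwards (assumed layout).
    c:              rank cutoff the user is willing to proceed to.
    collection:     optional iterable of all document ids, so that documents
                    never retrieved are reported with a score of 0.
    """
    scores = defaultdict(int)
    if collection is not None:
        for doc in collection:
            scores[doc] = 0
    for ranking in ranked_results.values():
        # Documents at positions 1..c satisfy k_dq <= c, so f(k_dq, c) = 1.
        for doc in ranking[:c]:
            scores[doc] += 1
    return dict(scores)
```

For example, with three queries and c = 2, a document that appears in the top two results of all three queries receives a score of 3, while a document that only ever appears at rank 3 receives 0.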

Retrieval Systems Evaluation • Bias in Retrieval Systems • Different retrieval systems show a large bias toward a subset of the collection • A large number of patents have very low retrievability scores • Some patents cannot be found via any query • Patents have different retrievability scores in different systems • Main factors behind low retrievability • Short queries • System bias • Term mismatch between document and query

Our Contribution Motivation: • Patents have different retrievability scores in different systems • Need to understand which factors affect retrievability • Based on different factors, can we identify high/low retrievable patents a priori? Contribution: • Retrievability is analyzed with content-based features • Other than Patent Length, the following features are considered: • Rare Terms Ratio • Average Terms Frequencies • Frequent Terms Count • Average Terms Probabilities in Related Patents • Average Terms Probabilities in Whole Collection • Patents are automatically classified as low/high retrievable based on text features

Experimental Setup • Dataset • Patents downloaded from http://www.uspto.gov/ (US Patent and Trademark Office website) • United States Patent Classification (USPC) Classes 422 and 423 are used for experiments • Total Patents = 54,353 • Average Size = 3,317.41 words (without stop-word removal) • Retrieval Systems • TFIDF • Exact Match • OKAPI BM25 • Jelinek-Mercer (JM) • Dirichlet (Bayesian) Smoothing (DirS) • Absolute Discounting (AbsDis) • Two-Stage Smoothing (Two-Stage)

Findability Distribution of Patents with Different Retrieval Systems

Experimental Analysis • Content-Based Feature Analysis • 800 low and 800 high retrievable patents are picked at random from each retrieval system for analysis • We consider a patent low retrievable if it has r(d) < 300, whereas patents with r(d) >= 700 are considered high retrievable • Features • Rare Terms Ratio (RTR) • Average Terms Frequencies (ATF) • Average Terms Probabilities in Related Patents (ATPrd) • Average Terms Probabilities in Whole Collection (ATP) • Frequent Terms Count (FTC) • Patent Length (PL) • Features RTF, ATPrd, ATP are further computed with two-, three-, and four-term combinations
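The low/high labelling and random sampling described on this slide could look roughly as follows. The thresholds 300 and 700 and the group size of 800 come from the slide; the function names, the `scores` dictionary, and the fixed random seed are assumptions made for this sketch.

```python
import random

LOW_BELOW = 300   # r(d) < 300  -> low retrievable (from the slide)
HIGH_FROM = 700   # r(d) >= 700 -> high retrievable (from the slide)


def label(r_d):
    """Return 'low', 'high', or None for patents between the two thresholds."""
    if r_d < LOW_BELOW:
        return "low"
    if r_d >= HIGH_FROM:
        return "high"
    return None


def sample_groups(scores, n_per_class=800, seed=0):
    """Draw up to n_per_class low and high retrievable patents at random.

    scores: dict mapping patent id -> r(d) for one retrieval system.
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    low = sorted(d for d, r in scores.items() if label(r) == "low")
    high = sorted(d for d, r in scores.items() if label(r) == "high")
    return (rng.sample(low, min(n_per_class, len(low))),
            rng.sample(high, min(n_per_class, len(high))))
```

Patents with 300 <= r(d) < 700 fall into neither group and are left out of the analysis sets in this sketch.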

Rare Terms Ratio (RTR): • Which systems are worst at finding patents with a large share of rare terms? • We consider a term “rare” if its collection frequency is less than 200 • Patents with a large RTR could indicate • A new invention • Presence of hidden information • TFIDF is worst at finding patents with a large RTR [Reason: IDF] • In Two-Stage, some patents with a large RTR are highly retrievable
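A minimal sketch of the RTR feature, assuming it is the fraction of a patent's distinct terms whose collection frequency is below the threshold. The slide gives only the rarity threshold (200), so the exact ratio definition, the tokenised input format, and the function names are assumptions.

```python
from collections import Counter


def collection_frequencies(docs):
    """Total number of occurrences of each term over all patents.

    docs: dict mapping patent id -> list of tokens (assumed layout).
    """
    cf = Counter()
    for tokens in docs.values():
        cf.update(tokens)
    return cf


def rare_terms_ratio(tokens, cf, threshold=200):
    """Share of a patent's distinct terms that are rare in the collection."""
    terms = set(tokens)
    if not terms:
        return 0.0
    rare = sum(1 for t in terms if cf[t] < threshold)
    return rare / len(terms)
```

On a toy collection the threshold has to be lowered; with the full 54,353-patent collection the default of 200 matches the slide.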

Average Terms Frequencies (ATF) • Helpful for understanding the effect of term frequencies on ranking • We consider both rare terms and all terms • BM25, JM and Two-Stage make patents with smaller ATF more findable • DirS and AbsDis make patents with larger ATF more findable
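One plausible reading of ATF is the mean within-patent frequency over the patent's distinct terms, optionally restricted to rare terms. The slide does not give the formula, so this definition, the function name, and the `only` parameter are assumptions.

```python
from collections import Counter


def average_term_frequency(tokens, only=None):
    """Mean number of occurrences per distinct term in one patent.

    tokens: list of tokens of the patent.
    only:   optional set of terms to restrict to (e.g. the rare terms).
    """
    counts = Counter(tokens)
    if only is not None:
        counts = {t: n for t, n in counts.items() if t in only}
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)
```

Passing the set of rare terms as `only` gives the rare-terms variant; omitting it averages over all terms.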

Average Terms Probabilities in Related Patents (ATPrd) • Helpful for understanding whether a system makes more findable those patents • whose terms also appear in their related patents (stronger clusters), • or those in weaker clusters • We consider the top-35 most similar patents • TFIDF, JM, AbsDis, and DirS all make stronger clusters more retrievable • In BM25 and Two-Stage, weaker clusters have high findability • Thus BM25 and Two-Stage are suitable for finding patents that frequently use alternative terms rather than the terms that appear in their related patents

Average Terms Probabilities in Whole Collection (ATP) • In ATPrd we used the 35 most similar patents • In this feature, the whole collection is considered • Useful for understanding the effect of Inverse Document Frequency (IDF) • TFIDF, DirS, AbsDis and JM make patents with larger ATP more retrievable • BM25 and Two-Stage make patents with smaller ATP more findable • Suitable for finding patents that frequently use new terminology
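Both ATPrd and ATP can be sketched with one helper that averages term probabilities against a reference set of patents: the top-35 similar patents for ATPrd, or the whole collection for ATP. The maximum-likelihood estimate used here, the averaging over distinct terms, and the function name are assumptions, since the slides do not state the exact formula.

```python
from collections import Counter


def average_term_probability(tokens, reference_docs):
    """Mean probability of a patent's distinct terms in a reference set.

    tokens:         list of tokens of the patent being scored.
    reference_docs: list of token lists (related patents for ATPrd,
                    or every patent in the collection for ATP).
    """
    counts = Counter(t for doc in reference_docs for t in doc)
    total = sum(counts.values())
    terms = set(tokens)
    if not terms or total == 0:
        return 0.0
    # Maximum-likelihood estimate P(t | reference) = count(t) / total tokens.
    return sum(counts[t] / total for t in terms) / len(terms)
```

A patent whose vocabulary is common in the reference set gets a high value; a patent that relies on new terminology gets a value close to 0.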

Patent Length (PL) • Useful for understanding whether a system makes • longer patents more findable, • or shorter patents more findable • In BM25 and Two-Stage • Only those longer patents that have smaller ATF values are retrievable (due to the effect of length normalization) • In AbsDis and DirS, some shorter patents with higher ATF have higher findability

Experimental Analysis • Automatic Retrievability Classification • For automatic identification of low/high retrievable patents, we build a classification model • Content-based features are used for learning the model • 1600 random (800 low and 800 high) retrievable patents are used for learning the classification model • A further 1600 random (800 low and 800 high) retrievable patents are used for testing classification accuracy • We used the J48 classifier implemented in WEKA • For most systems, with both CQG approaches, our classification achieves more than 80% accuracy
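To illustrate the train/test idea without WEKA, the following sketch trains a one-level decision tree (a decision stump) chosen by information gain and measures accuracy on held-out patents. This is not J48: J48 builds a full, pruned C4.5 tree and uses gain ratio. The feature-vector layout and all function names are assumptions.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * math.log2(p)
    return h


def best_split(rows, labels):
    """Return (feature index, threshold, gain) with the highest information gain."""
    base = entropy(labels)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring feature values
            left = [y for row, y in zip(rows, labels) if row[j] <= thr]
            right = [y for row, y in zip(rows, labels) if row[j] > thr]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if base - weighted > best[2]:
                best = (j, thr, base - weighted)
    return best


def majority(labels):
    """Most frequent label in a non-empty list."""
    return max(set(labels), key=labels.count)


def train_stump(rows, labels):
    """Learn a single-split classifier; returns a function row -> label."""
    j, thr, _ = best_split(rows, labels)
    if j is None:  # no split improves on the majority class
        fallback = majority(labels)
        return lambda row: fallback
    left = majority([y for row, y in zip(rows, labels) if row[j] <= thr])
    right = majority([y for row, y in zip(rows, labels) if row[j] > thr])
    return lambda row: left if row[j] <= thr else right


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)
```

In the setting of the slide, `rows` would hold the content-based feature values of the 1600 training patents and `accuracy` would be evaluated on the separate 1600 test patents.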


Conclusions and Future Directions • Patent retrieval is a recall-oriented retrieval domain • Patents have complex content and technical structure • This affects the findability of patents • Retrieval systems are evaluated using the concept of findability measurement • High and low retrievable patents are analyzed using text-based features • Based on text features, high and low retrievable patents are identified a priori Future Directions: • Automatic Low/High Retrievable Patent Identification • Useful for patent examiners for analyzing the contents of very low or very high retrievable patents

Conclusions Future Directions: • Query Results Merging • Separate low and high retrievable patents • Then query both the low and the high retrievable patents • Can increase retrievability • Retrieval Systems Fusion • Query with all systems • Merge only those systems in which the patents are highly retrievable

Thank You

Queries Generation • Controlled Query Generation (CQG) • Queries are generated based upon Prior-Art Search • Two methods are used: • Without Patents Relatedness, and with Patents Relatedness • Query Generation combining Frequent Terms (QG-FT) • Each patent is considered as a Query Patent for Prior-Art Search • Single frequent terms are extracted with Minimum Document Frequency > 2 • Single frequent terms are combined for constructing longer queries • [Figure: Patent (A) shown next to a set of other patents] Use Patent (A) as a query for searching related documents.
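Combining single frequent terms into longer queries can be sketched with `itertools.combinations`. The query sizes of two to four terms follow the term combinations mentioned elsewhere in the deck; the function name and the assumption that the frequent terms have already been extracted are illustrative.

```python
from itertools import combinations


def combine_terms(frequent_terms, sizes=(2, 3, 4)):
    """Build multi-term queries from a patent's single frequent terms.

    frequent_terms: iterable of terms already selected for the query patent.
    sizes:          query lengths to generate.
    """
    terms = sorted(set(frequent_terms))  # sorted so the output order is stable
    queries = []
    for k in sizes:
        queries.extend(combinations(terms, k))
    return queries
```

The number of queries grows quickly with the number of frequent terms (for n terms, C(n,2) + C(n,3) + C(n,4) queries), so a real run would likely cap the number of terms per patent.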

Queries Generation • Controlled Query Generation (CQG) • Query Generation using Document Relatedness (QG-DR) • In QG-FT, queries are generated from single patents • Term mismatch can affect results • In this approach, query terms are selected from related patents • (Step 1): For each patent, group related patents in set (R) • (Step 2): Then, using R and the whole collection, construct a language model for finding dominant terms • where Pjm(t|R) is the probability of term t in set R, and Pjm(t|corpus) is the probability of term t in the whole collection • (Step 3): Combine single terms into two-, three-, and four-term combinations for constructing longer queries
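Step 2 can be sketched as follows. The scoring formula did not survive on the slide, so this sketch assumes one plausible choice: a Jelinek-Mercer smoothed probability of the term in R, divided by its probability in the whole collection. The smoothing weight `lam`, the ratio itself, and all function names are assumptions, not the authors' stated method.

```python
from collections import Counter


def ml_probs(docs):
    """Maximum-likelihood term probabilities over a list of token lists."""
    counts = Counter(t for tokens in docs for t in tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def dominant_terms(related_docs, collection_docs, lam=0.5, top_n=10):
    """Rank terms of the related set R by how strongly they stand out.

    Assumed score: P_jm(t|R) / P(t|corpus), where
    P_jm(t|R) = (1 - lam) * P_ml(t|R) + lam * P(t|corpus).
    """
    p_r = ml_probs(related_docs)
    p_c = ml_probs(collection_docs)
    scores = {}
    for t, p in p_r.items():
        p_corpus = p_c.get(t)
        if not p_corpus:
            continue  # term unseen in the collection statistics; skip it
        scores[t] = ((1 - lam) * p + lam * p_corpus) / p_corpus
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

The selected dominant terms would then be combined into two-, three-, and four-term queries as described in Step 3.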

Queries Generation • Controlled Query Generation (CQG) • Descriptions of Query Sets used for Retrievability Analysis
