Patent Search QUERY Log Analysis Shariq Bashir Department of Software Technology and Interactive Systems Vienna University of Technology
General Theme • In automatic evaluation of IR systems, query generation plays an important role. • In general, the space of possible queries is very large. • We need to understand how to generate reasonable queries. • In this work, we study this issue with the help of a Patent Search Query Log.
Automatic Query Generation for Analysis • Motivation/Problem • Patents contain a large number of terms. • Analyzing IR systems using all combinations of terms is a difficult task. • It demands a large amount of processing time. • It can give a misleading picture. • A large share of term combinations are never used by users. • Question: • How do we generate reasonable queries?
Query Log of Patent Search • A Patent Search Query Log can help in generating queries for analysis. • Patent search users are experienced, and we can use their experience for effective query generation. • In Query Log analysis, on one side we have the Query Patents and on the other side their Query Logs. • This helps us in • Understanding the types of terms that are mostly used for searching patents. • Pruning irrelevant terms.
Applications of Query Log Analysis • Analyzing Bias of Retrieval Systems (Findability of Documents). • Selecting Terms for Query Expansion. • Learning to Rank for Prior-Art Search.
Experiments (Query Log Dataset) • Patent Search Query Logs can be downloaded from the USPTO portal (http://portal.uspto.gov/external/portal/pair). • They cannot be downloaded as a whole; each log must be downloaded manually, one patent at a time. • They are available only in scanned format and need OCR to convert them to digital text. • Further cleansing is needed to remove noise from the queries. • Some queries contain reference numbers of past queries. • Many queries contain numbers, such as • Patent application numbers • IPC classes
Experiments (Query Log Dataset) • The Query Logs of 242 patents are used for analysis. • 15013 queries. • We considered only the text queries for analysis.
Query Log Analysis • Given a Query Log, we analyze it on the basis of the following factors. • Term frequencies of query terms. • Does the frequency of terms in patents matter in query formulation? • Proximity/closeness of query terms in the patent text. • Confidence of query terms in similar IPC Classes. • Number of retrieved documents. [Diagram: Query Patent (Y) gives All Terms of Query Patent; Query Log of (Y) gives All Terms of Query Log; understanding the difference between all terms of the patents and only the Query Log terms feeds Automatic Query Generation.]
Term Frequencies in Patents (1) • All Terms of Query Patents: • A large percentage of terms in patents have low frequency. • Only a very small percentage of terms have a frequency > 10.
Term Frequencies in Patents (1) • [Percentage/out of Total Terms] Selected in Queries: • A high percentage of higher-frequency terms are selected in queries. • Only a very small percentage of lower-frequency terms (<= 5) are selected. Note from the previous slide that almost 75% of terms in patents have a frequency <= 5.
Term Frequencies in Patents (1) • [Percentage/out of Query Terms] Appearing in Query Log: • Higher-frequency terms appear more often in the Query Log than lower-frequency terms (<= 5).
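The [Percentage/out of Total Terms] measure from the frequency slides can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `selection_rate_by_frequency`, the bucket boundaries, and the toy tokens are all assumptions made for the example.

```python
from collections import Counter

def selection_rate_by_frequency(patent_tokens, query_terms, buckets):
    """For each (low, high) frequency bucket, return the fraction of distinct
    patent terms in that bucket that also occur in the query log.
    Hypothetical helper; buckets are inclusive on both ends."""
    freq = Counter(patent_tokens)
    query_terms = set(query_terms)
    rates = {}
    for low, high in buckets:
        in_bucket = [t for t, f in freq.items() if low <= f <= high]
        if not in_bucket:
            rates[(low, high)] = None  # no patent term falls in this bucket
            continue
        selected = sum(1 for t in in_bucket if t in query_terms)
        rates[(low, high)] = selected / len(in_bucket)
    return rates

# Toy data (invented): two frequent terms, three rare ones.
patent_tokens = ["valve"] * 12 + ["piston"] * 11 + ["seal"] * 3 + ["bolt"] * 2 + ["ring"]
query_terms = ["valve", "piston", "seal"]
rates = selection_rate_by_frequency(
    patent_tokens, query_terms, [(1, 5), (6, 10), (11, 10**9)]
)
```

In the toy data both high-frequency terms appear in the query log, while only one of the three low-frequency terms does, which mirrors the direction of the trend reported on the slides.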
Term Proximity/Closeness in Query Log (2) • Proximity refers to the closeness of two terms in the patent text. • It helps in understanding whether term proximity matters in query formulation. • Proximity of terms is calculated with two approaches • Minimum distance between two terms. • Co-occurrence frequency using a window size. • Term pairs are selected in two ways • All term pairs of the Query Patent. • Only term pairs that appeared in the Query Log.
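The two proximity measures named above can be sketched as follows. The slides do not give exact definitions, so this is one plausible reading: distance is measured in token positions, and co-occurrence frequency counts the occurrences of the first term that have the second term within the window. All function names are hypothetical.

```python
from collections import defaultdict

def positions(tokens):
    """Map each term to the list of token positions where it occurs."""
    pos = defaultdict(list)
    for i, t in enumerate(tokens):
        pos[t].append(i)
    return pos

def min_distance(tokens, a, b):
    """Smallest gap in token positions between any occurrence of a and of b.
    Returns None if either term is missing from the text."""
    pos = positions(tokens)
    if a not in pos or b not in pos:
        return None
    return min(abs(i - j) for i in pos[a] for j in pos[b])

def cooccurrence_frequency(tokens, a, b, window):
    """Assumed definition: number of occurrences of a that have at least
    one occurrence of b within `window` tokens."""
    pos = positions(tokens)
    b_positions = pos.get(b, [])
    return sum(
        1 for i in pos.get(a, [])
        if any(abs(i - j) <= window for j in b_positions)
    )

# Toy sentence (invented) standing in for a patent text.
tokens = "the fuel pump feeds the engine while a second pump cools the engine block".split()
d = min_distance(tokens, "pump", "engine")
c3 = cooccurrence_frequency(tokens, "pump", "engine", window=3)
c2 = cooccurrence_frequency(tokens, "pump", "engine", window=2)
```

Running both functions over all term pairs of a patent and over only the pairs seen in its query log gives the two distributions that the following slides compare.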
Term Proximity/Closeness in Query Log • With Minimum Distance: • Pairs with a lower minimum distance make up a larger percentage of the Query Log than pairs with a higher minimum distance. • This indicates that users focus more on terms that are closer together in the text. • Among all term pairs of the patents, 71% have a minimum distance > 7.
Term Proximity/Closeness in Query Log • With Co-Occurrence Frequency (Window Size = 14): • Pairs with higher co-occurrence make up a larger percentage (90%) of the Query Log than pairs with lower co-occurrence (10%). • Almost 75% of all term pairs of the patents have a co-occurrence frequency <= 1.
Frequency in Similar IPC Classes • Query Patents fall into many IPC Classes. • Patent users are usually experienced, so their terms are more target-oriented. • We need to check the frequency of Query Log term pairs in similar IPC Classes. • Freq(IPC Classes) = Freq / |qd| • Freq = frequency in similar IPC Classes • |qd| = total number of retrieved documents.
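A minimal sketch of the ratio Freq / |qd| follows. It assumes that "frequency in similar IPC Classes" means the number of retrieved documents that share at least one IPC class with the Query Patent; the slides do not spell this out, and the function name `ipc_support` and the toy class sets are illustrative only.

```python
def ipc_support(retrieved_ipc, query_ipc):
    """Freq / |qd| under the assumption stated above.
    retrieved_ipc: one collection of IPC classes per retrieved document.
    query_ipc: IPC classes of the query patent."""
    if not retrieved_ipc:
        return 0.0  # nothing retrieved, so no support
    query_ipc = set(query_ipc)
    # Freq: retrieved documents sharing at least one class with the query patent
    freq = sum(1 for classes in retrieved_ipc if query_ipc & set(classes))
    return freq / len(retrieved_ipc)

# Toy result list (invented): four retrieved documents and their classes.
retrieved_ipc = [{"F02M"}, {"F02M", "F16K"}, {"A61K"}, {"G06F"}]
support = ipc_support(retrieved_ipc, {"F02M", "F02D"})
```

Here two of the four retrieved documents share a class with the query patent, so the support is 0.5.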
Support in IPC Classes • The analysis indicates that Query Log term pairs have higher support in similar IPC Classes than all term pairs of the patents.
Number of Retrieved Documents • The number of retrieved documents denotes how many patents contain the query terms. • The more common the query terms are, the larger the number of retrieved documents. • This factor is analyzed with • All term pairs of the patent • All term pairs of the Query Log
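Counting retrieved documents for a term pair can be sketched as below. The sketch assumes conjunctive (AND) matching, so a patent counts only if it contains every query term; the slides do not state the retrieval model, and `retrieved_count` and the toy collection are hypothetical.

```python
def retrieved_count(collection, terms):
    """Number of documents that contain every term in `terms`
    (assumed AND semantics). Each document is a list of tokens."""
    terms = set(terms)
    return sum(1 for doc in collection if terms <= set(doc))

# Toy collection (invented): four tiny "patents".
docs = [
    ["fuel", "pump", "engine"],
    ["fuel", "valve"],
    ["pump", "engine", "seal"],
    ["engine", "block"],
]
n_pair = retrieved_count(docs, ["pump", "engine"])
n_single = retrieved_count(docs, ["engine"])
n_none = retrieved_count(docs, ["fuel", "seal"])
```

Applying the same count to all term pairs of a patent and to the pairs seen in its query log yields the two sets of values compared on the next slide.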
Number of Retrieved Documents • The analysis indicates that term pairs from the Query Log retrieve a smaller number of patents than all term pairs of the patents.
Conclusion • For automatic IR system evaluation, query generation is an important factor. • We believe that past Query Logs can help us understand this problem. • Across the statistical factors examined, there is a large difference between random queries and user queries. • We can take these factors into account when generating automatic queries.