This study explores novel query expansion methods for patent retrieval, comparing existing approaches with a new SynSet technique. Results show significant improvements in high-rank retrievals, but challenges in overall ranking and recall. The SynSet method, leveraging probabilistic weights, emerges as the most effective and efficient QE approach tested. Future work includes deeper analysis, real examiner query application, and method combination for optimization.
A Study on Query Expansion Methods for Patent Retrieval
24 October 2011
Walid Magdy, Gareth Jones
Centre for Next Generation Localisation, School of Computing, Dublin City University
Outline / Agenda
• What is the Problem? (Motivation)
• Why Patents? (Patent Characteristics)
• Current Solutions (Prior Work)
• Testing Existing Approaches (Applying Standard QE)
• New Approach (Novel Method)
• Results (Outcome)
• Conclusion (Findings)
Why Patents?
• Challenging wording
• Using vague and general terms
• Strange combination of terms
• No defined query (what words to select for search?)
• Low retrieval effectiveness
• Recall-oriented IR task
Hypothesis: QE → better query/document match → better results
Prior Work
• Pseudo Relevance Feedback (PRF) (Kishida K, NTCIR-3; Itoh H, NTCIR-4)
  • QE using Rocchio formula: no significant improvement
  • QE using Taylor formula: no significant improvement
  • Reweighting query terms using PRF: no significant improvement
• Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al., NTCIR-5)
  • QE for individual claims from the same patent topic: significant improvement, but not applicable to other patent search tasks
• Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)
  • Enrich queries to improve the retrievability of patents with a low chance of retrieval, but not tested on a real patent search task
Testing QE for Prior-Art Patent Search
• CLEF-IP 2010: 1.35M patents from the EPO, 1.35K English patent topics
• Collection contains EN/FR/DE patents, with translations of titles and claims in the three languages
• Expand query by: PRF vs. WordNet
• Baseline (BL): (Magdy et al., 2011) without citation extraction (full patent description section as query)
• MAP and PRES were used for evaluation
• BL: 0.14 MAP, 0.486 PRES
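For reference, MAP can be computed as in the minimal sketch below. The topic ids, document ids, and function names are invented for illustration and are not taken from the CLEF-IP data or from the authors' evaluation scripts. PRES is not shown.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all topics in qrels."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(topic, []), rel)
                for topic, rel in qrels.items())
    return total / len(qrels)


# Toy data (hypothetical): two topics, ranked result lists, relevance sets.
runs = {"t1": ["d3", "d1", "d7", "d2"], "t2": ["d9", "d4"]}
qrels = {"t1": {"d1", "d2"}, "t2": {"d4", "d5"}}
map_score = mean_average_precision(runs, qrels)
```

Because each relevant hit is weighted by the precision at its rank, MAP mostly rewards relevant documents near the top of the list.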
Applying Pseudo Relevance Feedback
• PRF implemented in Indri was used
• Different numbers of feedback (FB) terms and documents were tested
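The role of the two PRF parameters (number of feedback documents and number of feedback terms) can be illustrated with a deliberately simplified sketch. This is a plain term-frequency stand-in and not Indri's feedback implementation. The function name `prf_expand` and the toy documents are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def prf_expand(query: List[str], ranked_docs: List[List[str]],
               fb_docs: int = 2, fb_terms: int = 2) -> List[str]:
    """Append the fb_terms most frequent new terms from the top fb_docs documents.

    Simplified frequency-based PRF for illustration only (hypothetical helper).
    """
    counts = Counter()
    for doc in ranked_docs[:fb_docs]:
        counts.update(term for term in doc if term not in query)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:fb_terms]
    return query + [term for term, _ in best]


# Toy initial ranking (hypothetical): each document is a list of stemmed terms.
ranked_docs = [
    ["wast", "heat", "stream", "remov"],
    ["heat", "stream", "filter"],
    ["engin", "piston"],
]
expanded = prf_expand(["wast", "heat"], ranked_docs, fb_docs=2, fb_terms=2)
```

The expansion terms come only from the top-ranked documents, so if those documents are not relevant the added terms are noise.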
Using WordNet for Expansion
• Expand query terms using synonyms and hyponyms of nouns and verbs
• Apply QE to a sample of 100 topics, then apply the best combination to the full set of 1.35K topics
Standard QE Approaches
• PRF:
  • Significant degradation in retrieval effectiveness
  • This can be expected due to the low initial retrieval precision
• WordNet:
  • Statistically significant degradation of results, but with some successful instances (31% of topics)
  • Large reduction in retrieval speed, since the average query size is at least 5 times larger (34 times larger for NS+NH+VS+VH)
• A new effective and efficient QE method is required!
Automatically Generated SynSet
Pipeline: English fields + French transl. → Align Sentences → Remove Stopwords → Stem Words → Align Terms → EN→FR terms dic. and FR→EN terms dic. → EN→EN terms dic. → Backoff
Example sentence pair:
EN: process for eliminating foreign matter from a waste heat stream
FR: procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue
After stopword removal and stemming:
EN: process elimin foreign matter wast heat stream
FR: procéd élimin mati étrangèr cour chaleur perdu
Example dictionary entries:
elimin: élimin 0.71, elimin 0.13
élimin: remov 0.71, elimin 0.14
elimin: remov 0.85, elimin 0.15
elimin: remov 0.6, elimin 0.16
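One plausible reading of the EN→EN step is a pivot through French: the score of an English candidate is the sum over French terms of P(fr | en) × P(candidate | fr). The sketch below illustrates that idea only. The slide does not confirm this as the authors' exact procedure, and the backoff step is not modelled. The function name `compose`, the `top_k` cut-off, the stem "supprim", and all probabilities are invented for illustration.

```python
from collections import defaultdict
from typing import Dict

TransTable = Dict[str, Dict[str, float]]


def compose(en_fr: TransTable, fr_en: TransTable, top_k: int = 5) -> TransTable:
    """Build EN->EN candidate sets by pivoting through French (hypothetical helper)."""
    synsets: TransTable = {}
    for en_term, fr_probs in en_fr.items():
        scores = defaultdict(float)
        for fr_term, p_fr in fr_probs.items():
            for en_cand, p_en in fr_en.get(fr_term, {}).items():
                scores[en_cand] += p_fr * p_en  # sum over pivot terms
        # Keep the top_k candidates, ties broken alphabetically.
        kept = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        total = sum(score for _, score in kept)
        if total > 0.0:
            # Renormalise so the kept weights sum to 1.
            synsets[en_term] = {term: score / total for term, score in kept}
    return synsets


# Toy translation tables; the numbers are made up, not the slide's values.
en_fr = {"elimin": {"élimin": 0.8, "supprim": 0.2}}
fr_en = {"élimin": {"remov": 0.5, "elimin": 0.5}, "supprim": {"remov": 1.0}}
synsets = compose(en_fr, fr_en)
```

With these toy tables, "remov" receives 0.8 × 0.5 + 0.2 × 1.0 = 0.6 and "elimin" receives 0.8 × 0.5 = 0.4, so a term's own stem can remain in its SynSet alongside its synonyms.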
SynSet QE Results
• 8M parallel EN/FR sentences were extracted from the EPO patent collection to generate the SynSets
• Two runs were performed:
  • Expanding the query using SynSet without weights (Usynset)
  • Using the SynSet probabilities as weights for the terms in the query
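The difference between the two runs can be sketched as follows. The SynSet entry reuses the example weights for "elimin" (remov 0.6, elimin 0.16). The helper names are hypothetical. Rendering the result with Indri's `#syn`, `#wsyn`, and `#combine` operators is an assumption, since the slides do not state how the expanded queries were formulated.

```python
from typing import Dict, List, Tuple

SynSets = Dict[str, Dict[str, float]]
Group = List[Tuple[str, float]]


def expand(query: List[str], synsets: SynSets, weighted: bool) -> List[Group]:
    """Return one group of (term, weight) alternatives per query term."""
    groups = []
    for term in query:
        entries = dict(synsets.get(term, {}))
        entries.setdefault(term, 1.0)  # assumption: always keep the original term
        groups.append([(t, w if weighted else 1.0) for t, w in entries.items()])
    return groups


def to_indri(groups: List[Group], weighted: bool) -> str:
    """Render groups as an Indri-style query string (assumed formulation)."""
    parts = []
    for group in groups:
        if len(group) == 1:
            parts.append(group[0][0])
        elif weighted:
            parts.append("#wsyn( " + " ".join(f"{w} {t}" for t, w in group) + " )")
        else:
            parts.append("#syn( " + " ".join(t for t, _ in group) + " )")
    return "#combine( " + " ".join(parts) + " )"


synsets = {"elimin": {"remov": 0.6, "elimin": 0.16}}
query = ["elimin", "wast"]
unweighted_q = to_indri(expand(query, synsets, weighted=False), weighted=False)
weighted_q = to_indri(expand(query, synsets, weighted=True), weighted=True)
```

Here `unweighted_q` is `#combine( #syn( remov elimin ) wast )` and `weighted_q` is `#combine( #wsyn( 0.6 remov 0.16 elimin ) wast )`. In both cases only terms that have a SynSet entry grow, which keeps the expanded query short.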
SynSet Expansion
• Significantly better MAP, but significantly worse PRES, i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and lower recall
• Some topics were improved (34% of topics), but some were degraded (39% of topics)
• Significantly more efficient than PRF and WordNet (query size is only 60% larger)
A Deeper Look at SynSet
• No features were found with high correlation to SynSet QE success
• The initial retrieval quality of the BL does not relate to the performance of QE
Conclusions
• PRF is not effective for patent prior-art search
• WordNet QE for patent search:
  • Leads to overall significant degradation of retrieval
  • Has some positive impact on the retrieval of some topics
  • High computational cost
• SynSet QE for patent search:
  • The most effective and efficient QE technique among those tested
  • Significant improvement for very high ranks, but significant degradation of overall ranking and recall
  • No indication of when it fails/succeeds
• SynSet can be used as a lexical resource for patent examiners
Future Work
• More analysis to better understand when QE fails/succeeds
• Applying SynSet to real patent examiners' queries rather than automatically formulated queries
• Combining different QE methods
• Alternative methods for query modification, for example query reduction (QR)
Please Check in CIKM Poster Session
• Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.
• Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.
Thank you