
Applying IR to Patent Search: TREC 2009 Chem Track Experience & A Recent Advancement in IR Theory



Presentation Transcript


  1. Applying IR to Patent Search: TREC 2009 Chem Track Experience & A Recent Advancement in IR Theory
  Le Zhao and Jamie Callan
  Language Technologies Institute, School of Computer Science, Carnegie Mellon University
  2010-11-16

  2. Prior Art task 2009
  • Task: patent as query, citations as relevant results
  • Our approach:
    • Date filtering (Prior) [Aleksandr Belinskiy, mailing-list communication]
      • Query patent: multiple priority dates, use the latest priority date
      • Result patent: multiple dates, use the publication date
    • Weighted bag-of-words queries (Relevant Art)
      • Title + Claims
      • Description: too long, used only to weight terms (similar to selecting keywords)

  3. Indri Query Example
  #filrej( #dateafter(07/07/1994) #weight( 0.6 #combine( detergent compositions) 0.4 #weight( 16 1 14 bleaching 12 agent 11 composition 11 oxygen 11 7 10 4 8 u 8 2 7 o 7 available 6 claims 5 triazacyclononane 5 silver 5 coating 5 clo 5 organic 5 3 4 mn 4 minutes 3 co 3 mniii 3 0 3 5 3 bispyridylamine 3 description 3 n 3 containing 3 described 3 releasing 3 method 2 mixtures 2 time 2 compound 2 mixture 2 dentate 2 remainder 2 rate 2 mniv 2 source 2 tri 2 making 2 sprayed 2 intimate 2 completely 2 oac 2 cl 2 trimethyl 2 selected 2 premixed 2 bleach 2 dispersing 2 compositions 2 pf 2 released 2 perchlorate 2 oil 2 10 2 di 2 group 2 methyl 2 release 2 non 2 cobalt 2 consisting 2 interval 2 process 2 paraffin 2 particles 2 present 1 claim 1 perhydrate 1 nh 1 salt 1 copper 1 total 1 corrosion 1 bispyridyl 1 chlorate 1 bi 1 8 1 dry 1 measured 1 partially 1 mnivbipy 1 och 1 trisdipyridylamine 1 comprises 1 mnivn 1 isothiocyanato 1 ligands 1 combination 1 triglycerides 1 bis 1 amine 1 6 1 bipy 1 binuclear 1 pyridylamine 1 mniiimniv 1 relasing 1 inorganic 1 mixed 1 precursor 1 iron 1 hydrogenated 1 peroxyacid 1 additional 1 inhibitor 1 tetra 1 tris 1 level 1 derivatives 1 provided 1 diglycerides 1 gluconate 1 mono 1 wholly 1 complexed 1 catalyst ) ) )
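
A minimal sketch of how such a query could be assembled (hypothetical helper name and toy input, not code from the talk; assumes terms and counts have already been extracted from the title, claims, and description):

def build_indri_query(priority_date, title_claim_terms, description_term_counts,
                      w_main=0.6, w_desc=0.4):
    # Date filter: #filrej rejects result patents dated after the query
    # patent's latest priority date, as described on the previous slide.
    date_filter = f"#dateafter({priority_date})"
    # Title + claims form the main #combine part of the query.
    main_part = "#combine( " + " ".join(title_claim_terms) + " )"
    # The description is too long to use directly; its term counts only
    # serve as weights inside a #weight operator.
    desc_part = "#weight( " + " ".join(
        f"{count} {term}"
        for term, count in sorted(description_term_counts.items(),
                                  key=lambda kv: -kv[1])) + " )"
    return (f"#filrej( {date_filter} "
            f"#weight( {w_main} {main_part} {w_desc} {desc_part} ) )")

# Toy example:
print(build_indri_query("07/07/1994",
                        ["detergent", "compositions"],
                        {"bleaching": 14, "agent": 12, "oxygen": 11}))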

  4. MAP Performance (2009)
  • [Chart: MAP of citation finding; labels include "Result Patent Citations", "Query Patent Content", "75%", "Term Weighting"]
  • Relational retrieval / citation finding: (Ni Lao and William Cohen, ML 2010)
  • Main Points: the optimal term weight (theory), failures of current IR models (practice), root cause of the problems, the beginning of a solution

  5. The Optimal Term Weighting
  • Binary Independence Model [Robertson and Spärck Jones 1976]: "Relevance Weight", "Term Relevance"
  • P(t | R) is effectively the only part about relevance
  • [Formula labels on the slide: Odds P(t | R); idf (sufficiency)]
  • Main Points: the optimal term weight (theory), failures of current IR models (practice), root cause of the problems, the beginning of a solution
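
For reference, the Robertson and Spärck Jones (1976) relevance weight behind this slide can be written in its standard form (not taken verbatim from the slide; \bar{R} denotes the non-relevant documents):

w_t = \log \frac{P(t \mid R)\,(1 - P(t \mid \bar{R}))}{P(t \mid \bar{R})\,(1 - P(t \mid R))}
    = \log \frac{P(t \mid R)}{1 - P(t \mid R)} + \log \frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})}

The first term is the odds of P(t | R), the only part that depends on relevance; the second term behaves like idf, since P(t | \bar{R}) is roughly the term's overall document frequency.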

  6. Definition of Necessity P(t | Rq)
  • Directly calculated given relevance judgements for q
  • [Diagram: Collection; Relevant (q); Docs that contain t; in the example P(t | Rq) = 0.4]
  • Term Necessity == term recall == 1 - mismatch
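
A minimal sketch of this calculation (hypothetical data structures, not code from the talk; assumes the documents judged relevant for a query are available as sets of terms):

def term_necessity(term, relevant_docs):
    # P(t | Rq): the fraction of documents judged relevant for query q
    # that contain term t at least once (term recall, 1 - mismatch).
    if not relevant_docs:
        return 0.0
    containing = sum(1 for doc_terms in relevant_docs if term in doc_terms)
    return containing / len(relevant_docs)

# Toy example: 2 of the 5 relevant documents contain "prognosis",
# matching the P(t | Rq) = 0.4 figure on the slide.
relevant_docs = [{"prognosis", "party"}, {"viability", "party"}, {"third", "party"},
                 {"political", "party"}, {"prognosis", "third", "party"}]
print(term_necessity("prognosis", relevant_docs))  # 0.4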

  7. Without Necessity
  • The emphasis problem for idf-only term weighting: high-idf terms in the query get emphasized
  • Example query: "prognosis/viability of a political third party in U.S." (Topic 206)
  • Affects tf*idf, Okapi BM25, and Language Models, i.e. all models that use idf-only term weighting
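
A toy illustration of this emphasis effect (invented collection statistics, not numbers from the talk): with idf-only weighting, the rarest query terms dominate regardless of how often they actually occur in relevant documents.

import math

N = 500_000  # toy collection size (invented)
df = {"prognosis": 800, "viability": 1_500, "political": 60_000,
      "third": 120_000, "party": 90_000, "u.s.": 200_000}  # invented document frequencies

idf = {t: math.log(N / d) for t, d in df.items()}
for term, weight in sorted(idf.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} idf = {weight:.2f}")
# "prognosis" and "viability" receive by far the largest weights, so documents
# that merely contain them in any sense rise to the top of the ranking.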

  8. Ground Truth
  • [Figure: TREC 4 topic 206, emphasis]

  9. Indri Top Results
  1. (ZF32-220-147) Recession concerns lead to a discouraging prognosis for 1991
  2. (AP880317-0017) Politics … party … Robertson's viability as a candidate
  3. (WSJ910703-0174) political parties …
  4. (AP880512-0050) there is no viable opposition …
  5. (WSJ910815-0072) A third of the votes
  6. (WSJ900710-0129) politics, party, two thirds
  7. (AP880729-0250) third ranking political movement…
  8. (AP881111-0059) political parties
  9. (AP880224-0265) prognosis for the Sunday school
  10. (ZF32-051-072) third party provider
  (Google and Bing still have top-10 false positives. Emphasis is also a problem for large search engines!)

  10. Without Necessity
  • The emphasis problem for idf-only term weighting: high-idf terms in the query get emphasized
  • "prognosis/viability of a political third party in U.S." (Topic 206)
  • False positives throughout the rank list, especially detrimental at top rank
  • Ignoring term recall hurts precision at all recall levels
  • Affected models: BIM, and also the more advanced tf*idf, Okapi BM25, and language models that use tf
  • How significant is the emphasis problem?

  11. Failure Analysis of 44 Topics from TREC 6-8
  • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)
  • [Chart: failure categories, with remedies labeled "Necessity term weighting" and "Term expansion"; common basis: the term mismatch problem]
  • Main Points: the optimal term weight (theory), failures of current IR models (practice), root cause of the problems, the beginning of a solution

  12. Given True Necessity
  • +100% over BIM (in precision at all recall levels) [Robertson and Spärck Jones 1976]
  • +30-80% over Language Model and BM25 (in MAP) [Zhao and Callan 2010]
  • This is the limit of necessity term weighting alone
  • Solving mismatch would give more gain!

  13. The Mismatch Problem Causes the Emphasis Problem
  • Emphasis problem: high mismatch & high idf
  • Solving mismatch solves the emphasis problem!

  14. Expansion for Individual Terms
  • This works great (an Indri rendering is sketched after this slide): (prognosis OR viability OR possibility OR impossibility OR future) AND (political third party)
  • Even just this is better than the original: (prognosis OR viability) AND (political third party)
  • Main Points: the optimal term weight (theory), failures of current IR models (practice), root cause of the problems, the beginning of a solution
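
One way to express this per-term expansion in Indri syntax (a sketch using Indri's #syn synonym operator, not a query from the talk):

#combine( #syn( prognosis viability possibility impossibility future ) political third party )

Here #syn makes the expansion set behave as a single query term, so the expanded concept is weighted like one term rather than five.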

  15. WikiQuery Summary
  • A tool to easily create such complex queries
  • To easily modify them and see the results
  • To store high-quality queries
  • To share with others
  • To collaboratively & iteratively build a perfect query
  • The optimal term weight (theory) == term mismatch + idf
  • Failures of current IR models (practice): Emphasis (64%) + Mismatch (27%)
  • Root cause of the problems: Mismatch
  • The beginning of a solution: WikiQuery

  16. Feedback
  • Questions? Comments? Ideas?
  • Want to be users? (Le Zhao: lezhao@cs.cmu.edu)
  • Le Zhao and Jamie Callan. Term Necessity Prediction. CIKM 2010.

  17. Failure Analysis – False Positives (EP topic 3)
  • Q: Oxygen-releasing (controlled release) bleaching agent, with a non-paraffin oil organic silver coating agent, and an additional corrosion inhibitor compound
  • Top-ranked results (cited means relevant), shown as Relevance: [Title]: [Summary of invention]
    • NR: Controlled release laundry bleach product (+ 2 more similar)
    • NR: Bleach activation: improved bleach catalyst for low temperatures
    • NR: Accelerated release laundry bleach product
    • R: Bleach activation: activated by a catalytic amount of a transition metal complex
    • R: Concentrated detergent powder compositions: a surfactant, a detergency builder, enzymes, a peroxygen compound bleach and a manganese complex as effective bleach catalyst

  18. Learning (false positives)
  • Bag-of-words retrieval will fail in many cases
  • Most false positives have reasonably relevant descriptions
  • The most important part of a query patent is its novel part, typically a small part of the document
  • Human intervention needed?

  19. Failure Analysis – Misses (EP topic 3)
  • The misses are cited in the content of the query patent: "other catalyst examples include EPxxxx, USxxxx …"
  • These citations concern an unimportant area of the patent
  • The same mentions also include the relevant patents that were returned

  20. Learning (misses)
  • Patents cite related prior patents
  • Many citations are mentions of prior art made by the query patent, about an unimportant part of the patent
  • If these count as relevant, all the false positives could be relevant too
  • Will a patent cite other patents that may invalidate itself? Is there a mechanism to ensure that? An increased application fee?
  • For evaluation: what should count as relevant?
    • Use the whole original reference list?
    • Or only use citations added by others?
  • Patents have well-marked search reports: for EP the X and Y categories, for US the * marks, judged by the patent offices
  • Of EP topics 1-6, only topic 6 has 4 X/Y citations, but many more applicant citations
  • We need better test sets
