Genetic Learning for Information Retrieval


Presentation Transcript


  1. Genetic Learning for Information Retrieval
     Andrew Trotman, Computer Science
     365 * 24 * 60 / 40 = 13,140

  2. Genetic Learning
     • The Core Algorithm
       • Crossover, Mutation, Reproduction
       • Fitness proportionate selection
     • Genetic Algorithms
       • Chromosome is an array
     • Genetic Programming
       • Chromosome is an abstract syntax tree
     [Figure: crossover (X) of {A B C D E F} with {1 2 3 4 5 6}]
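The three operators named on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the function names, the single crossover point, and the per-gene mutation rate are all assumptions, and the two parents are the arrays shown on the slide.

```python
import random

def one_point_crossover(a, b, point):
    """Swap the tails of two equal-length array chromosomes at `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate, new_gene, rng):
    """Replace each gene with new_gene(rng) with probability `rate`."""
    return [new_gene(rng) if rng.random() < rate else g for g in chromosome]

def select(population, fitnesses, rng):
    """Fitness proportionate (roulette wheel) selection of one individual."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off

# The two parents from the slide, crossed at the midpoint (an assumed point).
child1, child2 = one_point_crossover(list("ABCDEF"), [1, 2, 3, 4, 5, 6], 3)
print(child1, child2)
```

Reproduction, the third operator, is simply copying a selected individual into the next generation unchanged.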

  3. Information Retrieval (Text)
     • Online Systems
       • Dialog, LexisNexis, etc.
     • Web Systems
       • Alta Vista, Excite, Google, etc.
     • Scientific Literature Systems
       • CiteSeer, PubMed, BioMedNet, etc.
     • Question:
       • How should scientific literature be ranked?
       • Less time searching / More time researching
       • Higher exposure for “good” work

  4. How Google Works
     • PageRank
       • Document ranking from PageRank
       • A document’s PageRank is some factor (d) of the rank of incoming citations
       • A document’s influence is some factor of its rank and its outgoing citations
     • Characteristics of Scientific Literature
       • Citations unidirectional (backwards in time)
       • 12 month publication cycle
       • Scientific citation “cliques”
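The slide only sketches PageRank in words. One common formulation, shown here as an assumption rather than as what the talk used, gives each document a base share (1 - d) / N plus d times the rank passed along its incoming citations, with each citing document splitting its rank evenly across its outgoing citations. The damping value 0.85 and the iteration count are conventional defaults, and the three-document citation graph is made up for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each document to the list of documents it cites."""
    docs = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(docs)
    rank = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        # Rank held by documents with no outgoing citations is shared evenly.
        dangling = sum(rank[doc] for doc in docs if not links.get(doc))
        new_rank = {doc: (1.0 - d) / n + d * dangling / n for doc in docs}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += d * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical citation graph: A and B both cite C, and C cites nothing.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
print(ranks)
```

Because scientific citations only point backwards in time, older papers accumulate rank and the newest papers have no incoming citations at all, which is the difficulty the second half of the slide points at.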

  5. How IR works
     • Indexing
       • Build the dictionary
       • Construct the Postings (<d,f> pairs)
     • Searching
       • Look up terms in dictionary
       • Boolean resolution
       • Rank on density (probability, vector space, etc.)
     • Performance
       • Recall and precision
     [Figure: records, dictionary, and postings]
       Record1: Of Otago
       Record2: Otago University
       Record3: Otago
       Record4: Of
       OF <1,1><4,1>
       OTAGO <2,1><3,1>
       UNIVERSITY <2,1>
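The indexing and Boolean steps can be illustrated with a small inverted index. This is a sketch under assumptions: the function names are hypothetical, tokenisation is plain whitespace splitting, and the record texts are chosen so that the output reproduces the postings listed on the slide (the record text in the transcript is partly garbled, so the records below should not be read as the slide's exact content).

```python
from collections import Counter, defaultdict

def build_index(records):
    """Return {term: [(doc_id, frequency), ...]} with postings in doc_id order."""
    index = defaultdict(list)
    for doc_id in sorted(records):
        counts = Counter(records[doc_id].upper().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)

def boolean_and(index, terms):
    """Boolean resolution: documents that contain every query term."""
    sets = [{d for d, _ in index.get(t.upper(), [])} for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

# Assumed record texts that yield the <d,f> postings shown on the slide.
records = {1: "Of", 2: "Otago University", 3: "Otago", 4: "Of"}
index = build_index(records)
print(index)
print(boolean_and(index, ["otago", "university"]))
```

Ranking would then order the documents that survive Boolean resolution, using the frequencies stored in the postings.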

  6. Structured-IR
     • Sci-Lit documents have structure
       • Title, abstract, conclusions, etc.
     • <d,f> becomes <d,p,f>
     [Figure: tag numbering and example documents]
       doc:1 docid:2 place:3 name:4 cntry:5 sport:6 rank:7
       <doc><docid>1</docid><place><name>University of Otago</name></place><cntry>New Zealand</cntry></doc>
       <doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>
       <doc><docid>3</docid><place><name>University of Otago</name><rank>top</rank></place></doc>
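To make the step from <d,f> to <d,p,f> concrete, the sketch below indexes the three example documents from the slide and records, for each term, the number of the tag that directly contains it. Taking p to be the innermost enclosing tag is an assumption (a full path would be another reasonable reading), as are the function name and the decision not to index the docid text itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Tag numbering as given on the slide.
TAG_IDS = {"doc": 1, "docid": 2, "place": 3, "name": 4,
           "cntry": 5, "sport": 6, "rank": 7}

def index_structured(xml_docs):
    """Return {term: [(doc_id, tag_id, frequency), ...]}."""
    index = defaultdict(list)
    for xml in xml_docs:
        root = ET.fromstring(xml)
        doc_id = int(root.findtext("docid"))
        counts = Counter()
        for element in root.iter():
            if element.tag == "docid":
                continue  # the identifier is metadata, not searchable text
            for term in (element.text or "").upper().split():
                counts[(term, TAG_IDS[element.tag])] += 1
        for (term, tag_id), freq in sorted(counts.items()):
            index[term].append((doc_id, tag_id, freq))
    return dict(index)

docs = [
    "<doc><docid>1</docid><place><name>University of Otago</name></place>"
    "<cntry>New Zealand</cntry></doc>",
    "<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>",
    "<doc><docid>3</docid><place><name>University of Otago</name>"
    "<rank>top</rank></place></doc>",
]
structured = index_structured(docs)
print(structured["OTAGO"])
```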

  7. Using Structure in Ranking
     • Documents have structure
       • Title, Abstract, Conclusions, etc.
     • Weight each structure on “importance”
       • Title higher than Abstract higher than …
     • How to choose the weights
       • Specified in the query (XIRQL)
       • Query feedback
       • Learn with a Genetic Algorithm
     • Adapt ranking model to use structure
       • Each tree node is a locus
       • Weights are genes
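One simple way to read "each tree node is a locus, weights are genes" is that the chromosome holds one weight per structure, and a term occurrence contributes its frequency multiplied by the weight of the structure it occurs in. The sketch below uses that reading with a plain weighted sum. The actual probabilistic and vector space adaptations from the talk are not given in the transcript, and the postings and weights here are hypothetical.

```python
def rank_documents(postings, weights):
    """Rank documents by the weighted sum of <d,p,f> term frequencies."""
    scores = {}
    for doc_id, tag_id, freq in postings:
        scores[doc_id] = scores.get(doc_id, 0.0) + weights[tag_id] * freq
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical <d,p,f> postings for one query term.
postings = [(1, 4, 1), (2, 5, 2), (3, 4, 1), (3, 7, 1)]

flat = {4: 1.0, 5: 1.0, 7: 1.0}        # every structure weighted equally
name_heavy = {4: 3.0, 5: 1.0, 7: 1.0}  # a chromosome that favours tag 4

print(rank_documents(postings, flat))
print(rank_documents(postings, name_heavy))
```

A genetic algorithm would treat a weight table such as `name_heavy` as one individual and use a retrieval quality measure over the training queries as its fitness.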

  8. Experiment Results
     • 50 training queries
     • 50 evaluation queries
     • 25 generations
     • Probabilistic IR
       • 75.5% queries improved
       • 6.7% increase in MAP (8.8% max)
     • Vector Space IR
       • 61% queries improved
       • 4.7% increase in MAP (5.4% max)
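The results are reported as changes in MAP, mean average precision. For reference, here is a sketch of the usual definition: for each query, average the precision measured at the position of every relevant document retrieved, divide by the number of relevant documents, then take the mean over queries. The function names and the two toy queries are assumptions, and the numbers are unrelated to the experiment on the slide.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the positions k of relevant documents."""
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked_results, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries with made-up rankings and relevance judgements.
runs = [(["a", "b", "c"], {"a", "c"}), (["b", "a"], {"a"})]
print(mean_average_precision(runs))
```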

  9. Ranking Algorithms
     • Multitude exist
       • Probability, vector space, Boolean
       • Several published nomenclatures
       • Over 100,000 “published” algorithms
     • Purpose
       • Put relevant documents first
       • Sorting
       • Performance measures with precision
     • Sources
       • Some guy thought it up

  10. Experiment Results
     • 50 training queries
     • 50 evaluation queries
     • 31 runs
     • Weekend time limit
     • Compare to Probabilistic
       • 67% queries improved
       • 15% increase in MAP

  11. Function Comparison
     • Compared functions: Vector Space, Probability, Learned
     • Learned:
       wdq = Σ t∈q (((((((((U / sqrt(sqrt(nt))) / (mq / sqrt((((Lq / (sqrt(sqrt(Ld)) / sqrt((U / nc)))) * min(mq, N)) / sqrt(((((((Tmax / sqrt(U)) / sqrt((((log2(sqrt(nt)) / sqrt(nt)) / sqrt(Umax)) / (M / nc)))) / sqrt((U / nc))) - uq) / mq) / sqrt(nt))))))) / sqrt((log(Tmax) / nc))) / sqrt(nt)) / sqrt(nt)) / sqrt((Lq / sqrt(((sqrt((sqrt(sqrt(Ld)) / sqrt((min(mq, sqrt((((log(Tmax) / nc) / sqrt(Umax)) / (mq / sqrt(((N * min((sqrt(nc) / sqrt(U)), Ld)) / sqrt(N))))))) / sqrt(Ld))))) / sqrt((Tmax / nc))) / sqrt(nt)))))) / sqrt((min(mq, N) / nc))) / sqrt((log(Tmax) / nc))) / sqrt(nt))
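A learned function like this one is a genetic programming chromosome, that is, an abstract syntax tree. The sketch below shows one way such a tree could be represented and evaluated. The nested-tuple encoding, the protected division and logarithm (returning a fixed value instead of raising an error), and the variable values are all assumptions. The tree is the innermost fragment `U / sqrt(sqrt(nt))` of the expression above, and the slide does not define what the variables mean.

```python
import math

def evaluate(node, env):
    """Evaluate a tree of nested tuples: (operator, child, ...) or a leaf."""
    if isinstance(node, str):
        return env[node]          # variable leaf
    if isinstance(node, (int, float)):
        return float(node)        # constant leaf
    op, *children = node
    args = [evaluate(child, env) for child in children]
    if op == "/":                 # protected division (assumed convention)
        return args[0] / args[1] if args[1] != 0 else 1.0
    if op == "*":
        return args[0] * args[1]
    if op == "-":
        return args[0] - args[1]
    if op == "min":
        return min(args)
    if op == "sqrt":              # protected: square root of the magnitude
        return math.sqrt(abs(args[0]))
    if op == "log":               # protected: 0.0 for non-positive input
        return math.log(args[0]) if args[0] > 0 else 0.0
    raise ValueError(f"unknown operator: {op}")

# U / sqrt(sqrt(nt)), with made-up values for U and nt.
tree = ("/", "U", ("sqrt", ("sqrt", "nt")))
print(evaluate(tree, {"U": 8.0, "nt": 16.0}))
```

Crossover in genetic programming swaps subtrees between two such trees, which is how expressions can grow as deep as the one on this slide.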

  12. Conclusions
     • Using document structure improved ranking
     • Structure weights can be learned with a GA
     • GP can be used to learn ranking functions
     Speculation
     • Combining GA and GP to learn a structure ranking algorithm will do better than either GA or GP alone

  13. Questions?

  14. Random Numbers
     • Are your results an artifact of your random number generator?
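One way to probe this question is to repeat a run under several explicit seeds and look at the spread of outcomes, rather than trusting a single run. The sketch below shows the pattern only: `run_experiment` is a hypothetical stand-in that returns the mean of some random draws, where a real study would run the full learning experiment and return its evaluation score.

```python
import random
import statistics

def run_experiment(seed):
    """Stand-in for one learning run; all randomness comes from `seed`."""
    rng = random.Random(seed)  # private generator, no shared global state
    return sum(rng.random() for _ in range(1000)) / 1000

results = [run_experiment(seed) for seed in range(10)]
print(statistics.mean(results), statistics.stdev(results))
```

If a reported improvement is no larger than the spread across seeds, it may be an artifact of the generator or of one lucky seed.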
