This document outlines a comprehensive framework for modern information retrieval systems. It details components such as the CPU, RAM, Operating System, and database technologies used, including specific languages for programming. Key indexing and ranking methodologies are presented, alongside examples of weight calculation for terms within documents. This report emphasizes the importance of query handling, recall, and precision metrics in determining search outcomes. Additionally, it highlights issues related to query length and the impact of stop words on indexing efficiency.
Modern Information Retrieval
Group 3: 87070300 陳國富, 87068800 王俊傑, 87070600 夏希璿
Our Environment
• CPU: Duron 700
• RAM: 320 MB
• OS: Microsoft Windows XP Professional
• Database: MySQL
• Programming Language: PHP
• Storage Device: 30 GB 7200 rpm hard disk
Framework
[Figure 1: System architecture diagram. Labels: WWW documents, read, Index Processor, store, Database, query, Ranking Processor, report, search results]
Indexing Processing
• Read the documents
• Remove stop words
• Compute a weight for each word
• Drop words whose weight is too small
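The first two steps above can be sketched as follows. This is a minimal illustration in Python rather than the PHP the group used; the stop-word list, the `tokenize` rule, and the function names are assumptions, since the slides do not show them.

```python
import re

# Hypothetical stop-word list for illustration only; the slides do not give the real one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (an assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


terms = remove_stop_words(tokenize("The stock market in Taiwan is open"))
```

Removing stop words before weighting keeps very common words out of the index, so they never reach the weight calculation at all.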
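Indexing Processing (Cont.)
Formula: Weight = Wn + ln(T / Ts)
• Wn: number of times the index term occurs in the document
• T: total number of documents
• Ts: number of documents that contain the index term
• Index terms whose weight is too small are removed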
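As a rough sketch of the formula above, the following Python computes Weight = Wn + ln(T / Ts) for every term in every document. The function name, the input layout, and the `min_weight` threshold are assumptions; the slides do not say what counts as "too small".

```python
import math
from collections import Counter


def index_weights(docs, min_weight=0.0):
    """Compute Weight = Wn + ln(T / Ts) for each term of each document.

    docs: dict mapping a document id to its list of terms (stop words removed).
    min_weight: hypothetical pruning threshold; the slides give no value.
    """
    total_docs = len(docs)                      # T
    doc_freq = Counter()                        # Ts for each term
    for terms in docs.values():
        doc_freq.update(set(terms))

    weights = {}
    for doc_id, terms in docs.items():
        kept = {}
        for term, wn in Counter(terms).items():  # wn is Wn for this document
            weight = wn + math.log(total_docs / doc_freq[term])
            if weight >= min_weight:             # drop terms with small weight
                kept[term] = weight
        weights[doc_id] = kept
    return weights


docs = {
    "d1": ["stock", "stock", "taiwan"],
    "d2": ["stock", "play"],
    "d3": ["play", "play", "play"],
}
weights = index_weights(docs)
```

Because Wn is counted per document, the same term receives a different weight in each document where it appears, while ln(T / Ts) is shared across documents.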
Indexing Processing (Cont.)
• Examples:
• "Play" weight: 5.965
• "Taiwan" weight: 25.745
• "Stock" weight: 13.922
• Each index term has a different weight in each document
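Search & Ranking
• Let Query = (Q1, Q2, Q3, ..., Qn) be the query entered by the user, where n is the number of words in the query, and let (D1, D2, D3, ..., Dm) be the retrieved documents, where m is the number of results.
• wij is the weight of Qi in Dj.
• Wj = Σi wij = the total weight of all query words in Dj.
• DOCWj = the total weight of all index terms in document j.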
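Search & Ranking (Cont.)
• Results are ranked by the following criteria:
• The number of query words the document contains
• The share of the document's weight that the query accounts for = (Wj / DOCWj)
• The document's total index weight (DOCWj)
• Used as a ranking criterion only when the query has more than 2 words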
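One way to read the criteria above is as a lexicographic sort key. The sketch below rests on several assumptions the slides do not confirm: that the criteria apply in the listed order, that larger values rank higher, and that the "more than 2 words" condition applies to DOCWj. The function name `rank` and the input layout are also hypothetical.

```python
def rank(query_terms, weights):
    """Return the ids of matching documents, best first.

    weights: dict mapping a document id to {term: weight} for that document.
    """
    scored = []
    for doc_id, doc_weights in weights.items():
        matched = [q for q in set(query_terms) if q in doc_weights]
        if not matched:
            continue                                # document is not retrieved
        w_j = sum(doc_weights[q] for q in matched)  # Wj
        doc_w_j = sum(doc_weights.values())         # DOCWj
        key = [len(matched), w_j / doc_w_j]
        if len(query_terms) > 2:                    # assumed reading of the last bullet
            key.append(doc_w_j)
        scored.append((tuple(key), doc_id))
    # Assumed direction: larger values rank higher for every criterion.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc_id for _, doc_id in scored]


example_weights = {
    "a": {"stock": 3.0, "taiwan": 1.0},
    "b": {"stock": 2.0},
    "c": {"play": 5.0},
}
two_word_result = rank(["stock", "taiwan"], example_weights)
one_word_result = rank(["stock"], example_weights)
```

In the one-word query, document "b" ranks ahead of "a" because the query word accounts for all of its weight, even though "a" has the larger raw weight for that word.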
Search & Ranking (Cont.)
Recall   Precision (without document weighting)   Precision (with document weighting)
0%       44.50%                                   47.77%
10%      29.03%                                   28.80%
20%      20.39%                                   21.26%
30%      16.89%                                   17.10%
40%      15.10%                                   15.97%
50%      12.47%                                   12.79%
60%      9.02%                                    9.00%
70%      6.20%                                    6.29%
80%      4.35%                                    4.38%
90%      2.42%                                    2.58%
100%     0.29%                                    0.29%
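To compare the two precision columns on this slide with a single number, one common summary is the mean precision over the 11 recall levels. The slides do not report this figure; the sketch below only averages the values from the slide, and the variable names are assumptions.

```python
# Precision (%) at recall 0%, 10%, ..., 100%, copied from the slide.
without_weighting = [44.50, 29.03, 20.39, 16.89, 15.10, 12.47,
                     9.02, 6.20, 4.35, 2.42, 0.29]
with_weighting = [47.77, 28.80, 21.26, 17.10, 15.97, 12.79,
                  9.00, 6.29, 4.38, 2.58, 0.29]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


avg_without = mean(without_weighting)
avg_with = mean(with_weighting)
gain = avg_with - avg_without  # positive if document weighting helps on average
```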
Search & Ranking (Cont.)
Recall   With document weighting   Without document weighting
0%       38.16                     32.39
10%      16.78                     17.17
20%      10.03                     8.52
30%      7.18                      6.82
40%      6.45                      4.96
50%      3.97                      3.43
60%      3.05                      3.04
70%      2.43                      2.27
80%      1.99                      1.97
90%      0.97                      0.69
100%     0.01                      0.01
Search & Ranking (Cont.)
Recall   Precision
0%       61.22%
10%      45.64%
20%      36.99%
30%      30.98%
40%      29.30%
50%      25.13%
60%      17.34%
70%      11.69%
80%      7.73%
90%      4.83%
100%     0.68%
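Conclusion
• Precision is lower for longer queries
• The system cannot analyze each word in the query
• The words in a query do not necessarily carry equal weight
• Example: "Actions Against International Terrorists"
• The key words are "Against" and "Terrorists"
• The choice of index terms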