Chap 14 Ranking Algorithm Advisor: Dr. 黃三益 Students: 吳金山, 鄭菲菲
Outline • Introduction • Ranking models • Selecting ranking techniques • Data structures and algorithms • The creation of an inverted file • Searching the inverted file • Stemmed and unstemmed query terms • A Boolean system with ranking • Pruning
Introduction • Boolean systems • Providing powerful on-line search capabilities for librarians and other trained intermediaries • Providing very poor service for end-users who use the system infrequently • The ranking approach • Inputting a natural language query without Boolean syntax • Producing a list of ranked records that “answer” the query • More oriented toward end-users
Introduction (cont.) • The natural language/ranking approach is more effective for end-users • Results are ranked based on co-occurrence of query terms, modified by statistical term-weighting • It eliminates the often-wrong Boolean syntax used by end-users • It provides some results even if a query term is incorrect
Figure 14.1 Statistical ranking
• Simple match (AND the binary vectors, then count the matches):
Query (1 1 0 1 0 1 1)
Rec1 (1 1 0 1 0 1 0) → (1 1 0 1 0 1 0) = 4
Rec2 (1 0 1 1 0 0 1) → (1 0 0 1 0 0 1) = 3
Rec3 (1 0 0 0 1 0 1) → (1 0 0 0 0 0 1) = 2
• Weighted match (sum the record's term weights at the query-term positions):
Query (1 1 0 1 0 1 1)
Rec1 (2 3 0 5 0 3 0) → (2 3 0 5 0 3 0) = 13
Rec2 (2 0 4 5 0 0 1) → (2 0 0 5 0 0 1) = 8
Rec3 (2 0 0 0 2 0 1) → (2 0 0 0 0 0 1) = 3
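The two matching schemes in Figure 14.1 can be sketched directly; the function names here are illustrative, not from the chapter:

```python
def simple_match(query, record):
    """Count terms shared by the binary query and record vectors
    (logical AND, then sum)."""
    return sum(q & r for q, r in zip(query, record))

def weighted_match(query, record_weights):
    """Sum the record's term weights at every position where the
    query term is present."""
    return sum(w for q, w in zip(query, record_weights) if q)

query = [1, 1, 0, 1, 0, 1, 1]
rec1_binary = [1, 1, 0, 1, 0, 1, 0]
rec1_weights = [2, 3, 0, 5, 0, 3, 0]

print(simple_match(query, rec1_binary))     # 4, as in Figure 14.1
print(weighted_match(query, rec1_weights))  # 13, as in Figure 14.1
```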
Ranking models • Two types of ranking models • Ranking the query against individual documents • Vector space model • Probabilistic model • Ranking the query against entire sets of related documents
Ranking models (cont.) • Vector space model • Using cosine correlation to compute similarity • Early experiments • SMART system (overlap similarity function) • Results • Within-document frequency weighting > no term weighting • Cosine correlation with frequency term weighting > overlap similarity function • Salton & Yang (1973) (relying on term importance within an entire collection) • Results • Significant performance improvement using the within-document frequency weighting + the inverse document frequency (IDF)
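A minimal sketch of the vector space model's cosine correlation over tf × IDF vectors (the toy collection and the log-based IDF form are assumptions for illustration; exact weighting formulas vary across systems):

```python
import math

def idf(num_docs, doc_freq):
    # Inverse document frequency: terms in fewer documents weigh more.
    return math.log(num_docs / doc_freq)

def cosine(vec_a, vec_b):
    # Cosine correlation between two term-weight vectors.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Within-document frequency * IDF for a toy 3-document collection:
# term 1 appears in 1 document, term 2 appears in 2 documents.
query_vec = [1 * idf(3, 1), 1 * idf(3, 2)]
doc_vec = [2 * idf(3, 1), 0 * idf(3, 2)]
print(cosine(query_vec, doc_vec))
```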
Ranking models (cont.) • Probabilistic model • Terms appearing in previously retrieved relevant documents were given a higher weight • Croft and Harper (1979) • Probabilistic indexing without any relevance information • Assuming all query terms have equal probability • Deriving a term-weighting formula
Ranking models (cont.) • Probabilistic model • Croft (1983) • Incorporating within-document frequency weights • Using a tuning factor K • Result • Significant improvement over both the IDF weighting alone and the combination weighting
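A sketch of the kind of combination weight Croft (1983) describes: within-document frequency normalized by the document's maximum term frequency, smoothed by a tuning factor K, then scaled by IDF. The exact formula below is an assumed, commonly cited form, not necessarily Croft's original:

```python
def combined_weight(tf, max_tf, idf_value, k=0.5):
    """Hypothetical Croft-style combination weight.  The tuning
    factor K (0 <= K <= 1) controls how much the within-document
    frequency matters: a higher K flattens frequency differences."""
    normalized_tf = k + (1 - k) * (tf / max_tf)
    return normalized_tf * idf_value

# A rare occurrence vs. the document's most frequent term:
print(combined_weight(1, 10, 2.0, k=0.5))   # small tf contribution
print(combined_weight(10, 10, 2.0, k=0.5))  # full tf contribution
```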
Other experiments involving ranking • Direct comparison of similarity measures and term-weighting schemes • 4 types of term frequency weightings (Sparck Jones, 1973) • Term frequency within a document • Term frequency within a collection • Term postings within a document (a binary measure) • Term postings within a collection • Indexing was taken from manually extracted keywords • Results • Using the term frequency (or postings) within a collection always improved performance • Using term frequency (or postings) within a document improved performance only for some collections
Other experiments involving ranking (cont.) • Harman (1986) • Four term-weighting factors • (a) The number of matches between a document & a query • (b) The distribution of a term within a document collection • IDF & noise measure • (c) The frequency of a term within a document • (d) The length of the document • Results • Using the single measures alone, the distribution of the term within the collection (b) > the within-document frequency (c) • Combining the within-document frequency with either the IDF or noise measure > using the IDF or noise alone
Other experiments involving ranking (cont.) • Ranking based on document structure • Not only using weights based on term importance both within an entire collection and within a given document (Bernstein and Williamson, 1984) • but also using the structural position of the term • Summary versus text paragraphs • In SIBRIS, increasing term-weights for terms in titles of documents and decreasing term-weights for terms added to a query from a thesaurus
Selecting ranking techniques • Term-weighting based on the distribution of a term within a collection always improves performance • Within-document frequency + IDF weight often provides even more improvement • The within-document frequency can be combined with the IDF measure in several ways • Adding additional weight for document structure • E.g., higher weightings for terms appearing in the title or abstract vs. those appearing only in the text • Relevance weighting (Chap 11)
The creation of an inverted file • Implications for supporting inverted file structures • Only the record id has to be stored (smaller index) • Using strategies that increase recall at the expense of precision • The inverted file is usually split into two pieces for searching • The dictionary, containing the term along with statistics about that term (such as number of postings and IDF) and a pointer to the location of the postings file for that term • The postings file, containing the record ids and the weights for all occurrences of the term
The creation of an inverted file (cont.) • 4 major options for storing weights in the postings file • Store the raw frequency • Slowest search • Most flexible • Store a normalized frequency • Not suitable for use with the cosine similarity function • Updating would not change the postings
The creation of an inverted file (cont.) • Store the completely weighted term • Any of the combination weighting schemes are suitable • Disadvantage: updating requires changing all postings • If no within-record weighting is used, then the postings records do not have to store weights
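The two-piece dictionary/postings structure above can be sketched in memory; this stores raw within-document frequencies as the weights (the first of the four options), and the record layout is an assumption for illustration:

```python
import math
from collections import defaultdict

def build_inverted_file(docs):
    """Build a dictionary (per-term statistics plus an offset into the
    postings list) and a postings list of (record_id, raw_frequency)
    pairs, mirroring the split described above."""
    num_docs = len(docs)
    term_freqs = defaultdict(lambda: defaultdict(int))
    for rec_id, terms in docs.items():
        for term in terms:
            term_freqs[term][rec_id] += 1

    dictionary, postings = {}, []
    for term in sorted(term_freqs):
        by_record = term_freqs[term]
        offset = len(postings)
        postings.extend(sorted(by_record.items()))
        dictionary[term] = {
            "num_postings": len(by_record),
            "idf": math.log(num_docs / len(by_record)),
            "offset": offset,
        }
    return dictionary, postings

docs = {1: ["ranking", "model"], 2: ["ranking", "ranking", "boolean"]}
dictionary, postings = build_inverted_file(docs)
print(dictionary["ranking"])  # in both records, so its IDF is 0.0
```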
Searching the inverted file • Figure 14.4, flowchart of the search engine: query → parser → dictionary lookup (dictionary entry) → get weights (record numbers on a per-term basis) → accumulator (record numbers, total weights) → sort by weight → ranked record numbers
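The Figure 14.4 flow can be sketched as follows; the dictionary-entry shape ("idf", "offset", "num_postings") and the sample index are assumptions for this sketch:

```python
def search(query_terms, dictionary, postings):
    """Look up each parsed query term, fetch its postings and weights,
    accumulate a total weight per record id, then sort by weight to
    produce the ranked record numbers."""
    accumulators = {}
    for term in query_terms:
        entry = dictionary.get(term)
        if entry is None:  # unknown term contributes nothing
            continue
        start = entry["offset"]
        end = start + entry["num_postings"]
        for rec_id, weight in postings[start:end]:
            accumulators[rec_id] = (
                accumulators.get(rec_id, 0.0) + weight * entry["idf"]
            )
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

dictionary = {
    "ranking": {"idf": 0.4, "offset": 0, "num_postings": 2},
    "model": {"idf": 1.1, "offset": 2, "num_postings": 1},
}
postings = [(1, 2), (2, 1), (1, 1)]
print(search(["ranking", "model"], dictionary, postings))
```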
Searching the inverted file (cont.) • Reducing the inefficiencies of this technique • Minimize I/O: a single read for all the postings of a given term, then separate the buffer into record ids and weights • Time savings can be gained at the expense of some memory space • Direct access to memory rather than through hashing • A final major bottleneck can be the sort step of the “accumulators” for large data sets • A fast sort of thousands of records is very time consuming
Stemmed and unstemmed query terms • If query terms are automatically stemmed in a ranking system, users generally get better results (Frakes, 1984; Candela, 1990) • In some cases, a stem is produced that leads to improper results • The original record terms are not stored in the inverted file; only their stems are used
Stemmed and unstemmed query terms (cont.) • Harman & Candela (1990) • Two separate inverted files could be created and stored • Stemmed terms serve normal queries; unstemmed terms serve “don't stem” queries • A hybrid inverted file • Saves no space in the dictionary part • Saves considerable storage (vs. two versions of the postings) • At the expense of some additional search time
A Boolean system with ranking • SIRE system • Full Boolean capability + a variation of the basic search process • Accepts queries that are either Boolean logic strings or natural language queries (implicit OR) • Major modification to the basic search process • Postings from the query terms are merged before ranking is done • Performance • Faster response time for Boolean queries • No increase in response time for natural language queries
Pruning • A major time bottleneck in the basic search process: the sort of the accumulators for large data sets • Changed search algorithm with pruning: 1. Sort all query terms (stems) by decreasing IDF value 2. Do a binary search for the first term (i.e., the highest IDF) and get the address of the postings list for that term 3. Read the entire postings file for that term into a buffer 4. Add the term weights for each record id into the contents of the unique accumulator for that record id
Pruning (cont.) • Check the IDF of the next query term. If its IDF >= 1/3 of the maximum IDF of any term in the data set, repeat steps 2, 3, and 4; otherwise repeat steps 2, 3, and 4, but do not add weights to zero-weight accumulators • Sort the accumulators with nonzero weights to produce the final ranked record list • If a query has only high-frequency terms, then pruning cannot be done.
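The pruned search above can be sketched as follows; the `index` mapping of term → (idf, postings) is an assumed in-memory stand-in for the dictionary and postings file:

```python
def pruned_search(query_terms, index, max_idf_in_collection):
    """Process query terms in decreasing IDF order.  Once a term's IDF
    falls below one third of the collection's maximum IDF, weights are
    only added to accumulators that already exist (no new accumulators
    are created for such low-IDF, high-frequency terms)."""
    terms = sorted(
        (t for t in query_terms if t in index),
        key=lambda t: index[t][0],
        reverse=True,
    )
    accumulators = {}
    for term in terms:
        idf, plist = index[term]
        prune = idf < max_idf_in_collection / 3.0
        for rec_id, weight in plist:
            if rec_id in accumulators:
                accumulators[rec_id] += weight * idf
            elif not prune:  # only high-IDF terms open new accumulators
                accumulators[rec_id] = weight * idf
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

index = {
    "rare": (3.0, [(1, 1), (2, 1)]),
    "common": (0.5, [(2, 1), (3, 1)]),  # 0.5 < 3.0 / 3, so pruned
}
print(pruned_search(["common", "rare"], index, max_idf_in_collection=3.0))
```

Record 3 never gets an accumulator here, because its only query term is pruned; record 2 still collects the pruned term's weight because the rare term opened its accumulator first.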