
Toward Consistent Evaluation of Relevance Feedback Approaches in Multimedia Retrieval


Presentation Transcript


  1. Toward Consistent Evaluation of Relevance Feedback Approaches in Multimedia Retrieval. Xiangyu Jin, James French, Jonathan Michel. July 2005

  2. Outline • Motivation & Contributions • RF (Relevance Feedback) in MMIR • PE (Performance Evaluation) Problems • Rank Normalization • Experimental Results • Conclusions

  3. Motivation • RF in MMIR is a cross-disciplinary research area: (1). CV & PR [Rui 98] [Porkaew 99] (2). Text IR [Rocchio 71] [Williamson 78] (3). DB & DM [Ishikawa 98] [Wu 00] [Kim 03] (4). HCI & Psychology ... • Groups from these different backgrounds follow different traditions and hold different evaluation standards, which makes it hard to: (1). study the relations among them, and (2). compare their performance fairly.

  4. Motivation • (1). Different testbeds • Dataset: COREL [Muller 03], TRECVid. "Every evaluation is done on a different image subset, thus making comparison impossible." • Groundtruth: manually judged [Kim 03] (TRECVid, pure human labeling); auto-judged [Rui 98] [Porkaew 99] (MARS as reference system); semi-auto-judged [Liu 01] (MSRA MiAlbum, system-assisted human labeling).

  5. Motivation • (2). Different methodology • System-oriented vs. user-oriented: the user-oriented method is not ideal for comparison, since user experience varies from person to person and from time to time. • Normalized rank vs. non-normalized rank: rank normalization is generally accepted in text IR [Williamson 78], but not in MMIR.

  6. Problems & Contributions • Prob 1. It is hard to study the relations among RF approaches. Cont. 1. Briefly summarize RF algorithms according to how they implement multi-query retrieval, so that each approach can be treated as a special case under the same framework and their intrinsic relations can be studied. • Prob 2. It is hard to compare RF performance fairly. Cont. 2. Critique the PE methodology of the listed works, demonstrate an example of how to fairly compare three typical RF approaches on large-scale testbeds (both text and image), and show that improper PE methodology can lead to different conclusions.

  7. Where are we? • Motivation & Contributions • RF (Relevance Feedback) in MMIR • PE (Performance Evaluation) Problems • Rank Normalization • Experimental Results • Conclusions

  8. RF in MMIR (framework) General RF model in distance-based IR: both documents and queries can be abstracted as points in some space. [Figure: relevant (rel-doc) and irrelevant (irel-doc) documents plotted as points.]

  9. RF in MMIR (framework) General RF model in distance-based IR: the distance between each pair of points is defined by some distance function D(q,d) (assume D is a metric).

  10. RF in MMIR (framework) General RF model in distance-based IR: retrieval can be interpreted as retrieving the document points in the neighborhood of the query points (nearest-neighbor search).

  11. RF in MMIR (framework) General RF model in distance-based IR: RF can be interpreted as a process that moves and reshapes the query region so that it fits the region of the space the user is interested in.

  12. RF in MMIR (framework) General RF model in distance-based IR. [Diagram: Query Set → Search Engine → Results → Feedback Examples → back to Query Set.] Feedback examples are used to modify the query points in the query set. The search engine can handle multiple query points, hence the search results are modified by changes to the query set.

  13. RF in MMIR (framework) General RF model in distance-based IR. In the above discussion, D can only handle a single query point. We need to extend D(q,d) to D'(Q,d) so that it can handle a query set Q. Two possible solutions: Combine Queries and Combine Distances.

  14. RF in MMIR (framework) General RF model in distance-based IR. Assumptions: (1). Our focus in RF research is on how to handle multiple query points, i.e., given D, how to construct D'. (2). The retrieval result is presented as a ranked list. (3). The user selects feedback examples from the retrieval result.

  15. RF in MMIR (framework) Combine Queries Approach. [Diagram: Query Set → combined into a single query → Search Engine → Results.] A single query point is generated from the query set by some algorithm; the synthetic query is then issued to the search engine.

  16. RF in MMIR (framework) Combine Queries Approach. f is a function that maps Q to a single query point q, e.g. q = f(Q) = (Σ wi qi) / (Σ wi), where qi is a query point in the query set and wi is its corresponding weight.
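A minimal sketch of this combine-queries mapping, assuming f is the weighted centroid above; the function name, NumPy usage, and example numbers are illustrative, not taken from the paper:

```python
import numpy as np

def combine_queries(query_points, weights):
    """f: collapse a query set Q into a single query point (weighted centroid).

    query_points: array of shape (k, dim), one row per query point q_i
    weights:      array of shape (k,), the corresponding weights w_i
    """
    Q = np.asarray(query_points, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * Q).sum(axis=0) / w.sum()

# Example: the original query plus two positive feedback examples.
q_new = combine_queries([[5.8, 150.0], [6.1, 160.0], [6.0, 155.0]],
                        weights=[1.0, 0.5, 0.5])
```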

  17. RF in MMIR (framework) Combine Queries Approach • Combine-queries feedback modifies the query region by the following mechanisms: • (1). Move the query center by f, so that the query region is moved. • (2). Modify the distance function D, so that the query region is reshaped. • Usually the distance function is defined as a squared distance D(q,d) = (q-d)^T M (q-d), where M is the distance matrix. • Query-point-movement (QPM) [Rocchio 71]: M is an identity matrix. • Re-weighting (standard deviation approach) [Rui 98]: M is a diagonal matrix. • MindReader [Ishikawa 98]: M is a symmetric matrix with det(M) = 1.
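A sketch of the generalized squared distance with the three choices of M listed above; the function name and example values are illustrative, not the authors' code:

```python
import numpy as np

def squared_distance(q, d, M=None):
    """Generalized squared distance D(q, d) = (q - d)^T M (q - d).

    M = None (identity)        -> query-point-movement: circular query region
    M = diagonal matrix        -> re-weighting: axis-aligned ellipse
    M = full symmetric matrix
        with det(M) = 1        -> MindReader: arbitrarily oriented ellipse
    """
    diff = np.asarray(q, dtype=float) - np.asarray(d, dtype=float)
    if M is None:
        M = np.eye(diff.size)
    return float(diff @ np.asarray(M) @ diff)

# Re-weighting example: weight the height axis much more than the weight axis.
D = squared_distance([6.0, 155.0], [5.9, 170.0], M=np.diag([10.0, 0.1]))
```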

  18. RF in MMIR (framework) Combine Distances Approach. [Diagram: Query Set → Search Engine (one search per query point) → Mid-results → Fusion → Results.] Each query point is issued to search separately; the mid-results are then combined in post-processing with some merging algorithm.

  19. RF in MMIR (framework) Combine Distances Approach. The distances are combined using a weighted power mean: D'(Q,d) = ( (Σ wi D(qi,d)^α) / (Σ wi) )^(1/α); define D'(Q,d) = 0 if both α < 0 and D(qi,d) = 0 for some qi in Q. The query-center movement and the modification of the distance function are a hidden (implicit) process.

  20. RF in MMIR (framework) Combine Distances Approach • Query-expansion [Porkaew 99]: α = 1 (arithmetic mean). • FALCON [Wu 00]: α > 0 gives a fuzzy AND merge; α < 0 gives a fuzzy OR merge.
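A sketch of the combine-distances merge as a weighted power mean, covering the α settings above; the function names and sample points are illustrative, and the zero-distance special case follows the reconstruction on slide 19:

```python
import numpy as np

def combined_distance(query_points, weights, doc, alpha, dist):
    """Combine-distances feedback: D'(Q, d) as a weighted power mean of D(q_i, d).

    alpha = 1 -> arithmetic mean (query expansion in MARS)
    alpha > 0 -> fuzzy-AND-like merge (FALCON)
    alpha < 0 -> fuzzy-OR-like merge  (FALCON): being close to ANY q_i is enough
    """
    w = np.asarray(weights, dtype=float)
    dists = np.array([dist(q, doc) for q in query_points], dtype=float)
    if alpha < 0 and np.any(dists == 0.0):
        return 0.0  # special case: the document coincides with some query point
    return float(((w * dists ** alpha).sum() / w.sum()) ** (1.0 / alpha))

# Usage with a plain squared Euclidean distance as D(q, d):
euclid2 = lambda q, d: float(np.sum((np.asarray(q) - np.asarray(d)) ** 2))
score = combined_distance([[6.0, 155.0], [5.2, 110.0]], [1.0, 1.0],
                          [5.3, 112.0], alpha=-1.0, dist=euclid2)
```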

  21. RF in MMIR (framework) Mixed Approach: Q-Cluster [Kim 03]. Feedback examples are clustered. (1). The cluster centers (denoted as a set C) are used for combine-distances feedback (via FALCON's fuzzy OR merge). (2). Each cluster center uses its own distance function Di; Di is trained using MindReader on the query points in cluster i. Extremely complex!

  22. RF in MMIR (example) An illustrative example. Suppose we have a 2D database of the weights and heights of people. [Figure: example data points, e.g. Steve and Mike.]

  23. RF in MMIR (example) Combine Queries Approaches. Query-point-movement: Rocchio's method. [Figure: height/weight plane showing the initial query region and the new query region.] M is an identity matrix; the query region is a circle.

  24. RF in MMIR (example) Combine Queries Approaches. Re-weighting: standard deviation approach. Band query: we want to find "tall" people, whose height is around 6 feet. The region the user is interested in is a narrow band in the height/weight plane; no matter how you move a circle, it cannot fit this region.

  25. RF in MMIR (example) Combine Queries Approaches. Re-weighting: standard deviation approach. Solution: extend M to a diagonal matrix, so that the query region becomes an ellipse aligned to the axes. The larger the variance of the feedback examples along an axis, the smaller the weight given to that axis (see the sketch below).
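A minimal sketch of this re-weighting idea, assuming (as in the standard deviation approach) that each axis weight is the inverse of that axis's standard deviation over the relevant examples; the clipping constant eps and the sample data are illustrative details:

```python
import numpy as np

def reweighting_matrix(relevant_examples, eps=1e-6):
    """Diagonal distance matrix M for standard-deviation re-weighting.

    Axes along which the relevant examples vary a lot (large sigma) get small
    weights; tightly clustered axes (small sigma) get large weights.
    """
    X = np.asarray(relevant_examples, dtype=float)   # shape (n_examples, dim)
    sigma = X.std(axis=0)
    weights = 1.0 / np.maximum(sigma, eps)           # avoid division by zero
    return np.diag(weights)

# "Tall people" band query: heights cluster near 6 ft while weights vary widely,
# so the height axis receives a much larger weight than the weight axis.
M = reweighting_matrix([[6.0, 150.0], [6.1, 190.0], [5.9, 120.0]])
```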

  26. RF in MMIR (example) Combine Queries Approaches. MindReader. Diagonal query: we want to find people with a "good shape," whose height/weight ratio varies within a range. Since re-weighting can only form ellipses aligned to the axes, it cannot fit this diagonally oriented region well.

  27. RF in MMIR (example) Combine Queries Approaches. MindReader. Solution: extend M to a symmetric matrix with det(M) = 1. Now the query region can be an arbitrarily oriented ellipse (see the sketch below).
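A sketch of one way to fit such an M, following MindReader's idea of inverting the covariance of the relevant examples and rescaling so that det(M) = 1; the regularization term reg is an illustrative safeguard for the few-example case, not part of the original algorithm:

```python
import numpy as np

def mindreader_matrix(relevant_examples, reg=1e-3):
    """Symmetric distance matrix M with det(M) = 1, MindReader style.

    M is proportional to the inverse covariance of the relevant examples,
    so the query region stretches along directions of high variance.
    """
    X = np.asarray(relevant_examples, dtype=float)
    dim = X.shape[1]
    C = np.cov(X, rowvar=False) + reg * np.eye(dim)   # regularize: few examples
    M = np.linalg.inv(C)
    return M / np.linalg.det(M) ** (1.0 / dim)        # rescale so det(M) == 1

M = mindreader_matrix([[5.5, 130.0], [6.0, 160.0], [6.3, 180.0]])
```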

  28. RF in MMIR (example) Combine Distances Approaches. Query-expansion in MARS. Triangle query: the region the user is interested in is arbitrarily shaped.

  29. RF in MMIR (example) Combine Distances Approaches. Query-expansion in MARS. Solution: implicitly change the distance function by averaging the distances to the query points. This is the special case where α = 1 (arithmetic mean).

  30. RF in MMIR (example) Combine Distances Approaches. FALCON. Disjoint query: the region the user is interested in is not contiguous. Suppose the user is interested in two types of people, either small or large. This is common in a multimedia database, where low-level features cannot reflect high-level semantic clusters.

  31. RF in MMIR (example) Combine Distances Approaches. FALCON. Solution: combine small regions (not necessarily circles) around each query point into a non-contiguous query region; α is negative (fuzzy OR merge).

  32. RF in MMIR (example) Mixed Approach. Q-Cluster. Idea: combine small ellipses into a non-contiguous region; use MindReader to construct each ellipse and FALCON's fuzzy OR merge to combine them.

  33. Where are we? • Motivation & Contributions • RF (Relevance Feedback) in MMIR • PE (Performance Evaluation) Problems • Rank Normalization • Experimental Results • Conclusions

  34. PE Problems • There are many kinds of PE problems; we only list several. • (1). Dataset • (2). Comparison • (3). Impractical parameter settings • We give examples only from the previously listed works, but this does not mean ONLY these works have PE problems.

  35. PE Problems • Dataset Problems • (1). Unverified assumptions in a simulated environment • An algorithm is proposed based on some assumption, e.g., re-weighting [Rui 98] requires that ellipse queries exist, and MindReader [Ishikawa 98] requires that diagonal queries exist. • In the MARS works [Rui 98] [Porkaew 99], the groundtruth is generated by their own retrieval system with an arbitrary distance function, so an "ellipse" query already exists. It is not astonishing that re-weighting outperforms the Rocchio method in this environment. • We are not arguing that these approaches are not useful, but such a PE tells us very little (since the outcome is a high-probability event).

  36. PE Problems • Dataset Problems • (2). Real data that is not typical of the application • Small scale (1k images), highly structured (strong linear relations), low dimensional (2D). • E.g., MindReader [Ishikawa 98] is evaluated on a highly structured 2D dataset (the Montgomery County dataset), where the task favors their approach. • MMIR usually employs very high dimensional features, and only a few dozen examples are available as feedback. In this case it is extremely hard to mine the relations among hundreds of dimensions from so few training examples, and it is risky to learn a more "free" and "powerful" distance function. • It is highly possible that the user's intention is overwhelmed by noise and wrong knowledge is learned.

  37. PE Problems • Comparison Problems • (1). Misused comparison • Analogy: an author proposes a modification to quicksort, but instead of comparing the new sort with quicksort, he compares it to bubble sort. • For example, Q-Cluster is a modification of FALCON's fuzzy OR merge, but [Kim 03] compared it to Q-Expansion and QPM. It is not astonishing that their approach performs much better, since the COREL database favors any fuzzy-OR-like approach.

  38. PE Problems • Comparison Problems • (2). Unfair comparison • How should the training samples (feedback examples) be treated in the evaluation? E.g., FALCON shifts them to the head of the result list while Rocchio does not, yet any method can do this in post-processing! • Directly comparing approaches that process the feedback examples inconsistently results in an unfair comparison. Both the FALCON [Wu 00] and Q-Cluster [Kim 03] papers have this problem.

  39. PE Problems • Impractical Parameter Settings: assuming a "diligent" user • (1). Asking the user to judge too many results: the re-weighting work asks the user to look through the top 1100 retrieved results to find feedback examples. • (2). Asking the user to click/select too many times: [Kim 03] and [Porkaew 99] feed back all relevant images in the top 100 retrieved results. Remember, COREL has only 100 relevant images per query; this could explain their conclusion that the improvement appears mostly in the FIRST iteration! • (3). Feeding back over too many iterations: [Wu 00] runs feedback for over 30 iterations.

  40. Where are we? • Motivation & Contributions • RF (Relevance Feedback) in MMIR • PE (Performance Evaluation) Problems • Rank Normalization • Experimental Results • Conclusions

  41. Rank Normalization • Rank normalization: re-rank the retrieval result according to the feedback examples. Although rank normalization is generally accepted in text IR [Rank-Norm], it is seldom given enough attention in MMIR. • Rank-shifting: shift the feedback examples to the head of the refined result even if they are not there. Easy to implement; fair for cross-system comparison, unfair for cross-iteration comparison. • Rank-freezing [Rank-Norm]: freeze the ranks of the feedback examples during the refinement process. Harder to implement; fair for both cross-system and cross-iteration comparison. (A code sketch of both schemes follows the illustrations below.)

  42. Rank Normalization Rank-Shifting. Previous result: 3 1 4 5 7 9 6 2 8; feedback examples (judged rel-docs): 3, 4, 6.

  43. Rank Normalization Rank-Shifting. Previous result: 3 1 4 5 7 9 6 2 8; feedback examples: 3, 4, 6. Refined result (before rank-shifting): 2 3 8 4 5 9 1 6 7.

  44. Rank Normalization Rank-Shifting. Refined result (before rank-shifting): 2 3 8 4 5 9 1 6 7. Rank-shifting in progress: feedback examples 3, 4, 6 are moved to the head of the list.

  45. Rank Normalization Rank-Shifting. Refined result (before rank-shifting): 2 3 8 4 5 9 1 6 7. Refined result (after rank-shifting): 3 4 6 2 8 5 9 1 7.

  46. Rank Normalization Rank-Freezing. Previous result: 3 1 4 5 7 9 6 2 8; feedback examples: 3, 4, 6. Refined result (before rank-freezing): 2 3 8 4 5 9 1 6 7. Rank-freezing in progress: feedback examples 3, 4, 6 are placed back at their previous ranks (positions 1, 3, and 7).

  47. Rank Normalization Rank-Freezing. Refined result (before rank-freezing): 2 3 8 4 5 9 1 6 7. Refined result (after rank-freezing): 3 2 4 8 5 9 6 1 7.
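A minimal sketch of both normalization schemes, reproducing the example above; the function names are illustrative, and both lists are assumed to rank the same set of documents:

```python
def rank_shift(previous, refined, feedback):
    """Move the feedback examples to the head, keeping the rest in refined order."""
    fb = set(feedback)
    head = [d for d in previous if d in fb]   # feedback docs, in previous-result order
    tail = [d for d in refined if d not in fb]
    return head + tail

def rank_freeze(previous, refined, feedback):
    """Keep each feedback example at its rank from the previous result."""
    fb = set(feedback)
    frozen = {i: d for i, d in enumerate(previous) if d in fb}
    rest = iter(d for d in refined if d not in fb)
    return [frozen[i] if i in frozen else next(rest) for i in range(len(refined))]

previous = [3, 1, 4, 5, 7, 9, 6, 2, 8]
refined  = [2, 3, 8, 4, 5, 9, 1, 6, 7]
feedback = [3, 4, 6]
assert rank_shift(previous, refined, feedback)  == [3, 4, 6, 2, 8, 5, 9, 1, 7]
assert rank_freeze(previous, refined, feedback) == [3, 2, 4, 8, 5, 9, 6, 1, 7]
```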

  48. Where are we? • Motivation & Contributions • RF (Relevance Feedback) in MMIR • PE (Performance Evaluation) Problems • Rank Normalization • Experimental Results • Conclusions

  49. Experimental Results • Testbeds • (1). CBIR: a basic CBIR system on a 3K-image database (COREL) • (2). Text IR: Lucene on TREC-3

  50. Experimental Results • Feedback approaches: (1). QPM, query-point-movement; (2). AND, FALCON with α = 1 (Q-Expansion); (3). OR, FALCON with α = -1. • Rank normalization: (1). without rank normalization; (2). rank-shifting; (3). rank-freezing.
