
Matchin: Eliciting User Preferences with an Online Game


Presentation Transcript


  1. Matchin: Eliciting User Preferences with an Online Game. By Severin Hacker and Luis von Ahn. Presented by Daniel Sigalov.

  2. Agenda • GOAL: Eliciting user preferences for large datasets and creating global rankings of images. • The new method: Don’t ask users which image they prefer; ask which image a random person would prefer. • Implemented by a new online game called Matchin. • Discussion: today’s algorithms vs. the new one. • Additional information can be learned about a user from only a few clicks made on random pictures.

  3. Introduction This paper introduces a game that asks two randomly paired players: which of these two images do you think your partner prefers? • If both partners click on the same image, both obtain points. • If they click on different images, neither of them receives points.

  4. What do we learn from the game? • It is possible to extract a global “beauty” ranking within a large collection of images. • After a person has played the game on a small number of image pairs, it is possible to extract that person’s general image preferences. • We use the players’ preferences between images to create a simple gender model. • Wider implications of such a two-player game.

  5. Classification Methods (definitions): • The Judges – the people who give ratings • The judgments/decisions – the ratings of the judges.

  6. 1. Absolute vs. Relative Judgments • Absolute judgment – a score assigned to an item (for example, a star rating from 1 to 5). • Disadvantages of absolute judgments: • Lack of calibration • Limited resolution • Relative judgment – a comparison between items (A is better than B). • Advantages: • Easy to make • Does not need to change after new information is received

  7. 2. Total vs. Partial Judgments • Total judgments – the judges are required to make judgments about all 𝑛 images. • Disadvantage • Requires comparing every image with every other image, i.e., on the order of 𝑛² comparisons. Therefore infeasible for large datasets. • Partial judgments – the judges make judgments about only part of the database. • Disadvantage • Must deal with incomplete data • Advantages • Eases the effort required of each user • Doesn’t limit the size of the database

  8. 3. Selected Access vs. Predefined Access • Selected access – the judges are allowed to search for particular items and then rate them. • Advantage • Judges can focus on rating the things in which they are most interested. • Disadvantages • Easy to maliciously manipulate the results. • Imbalance in the number of ratings each picture receives; weighting the ratings becomes extremely difficult. • Predefined access – the judges are given images to rate in a random, predefined sequence and cannot influence which images they rate. • Advantage • The possibility of cheating decreases as the amount of data increases.

  9. 4. “I Like” vs. “Others Like” • “What do you like?” • Considers only your own opinion. • “What do you think others will like?” • Also considers the opinions of your friends and acquaintances, in combination with external information.

  10. 5. Direct vs. Indirect • Indirect methods - infer “beauty” through meta-information • Disadvantages • once the methods are known, their ratings can quickly be subjected to cheating • Direct methods - ask the judges about the “beauty” of an image.

  11. Flickr • Flickr has developed its own algorithm to rank images, partly based on meta-data. • Ultimately, it is not clear whether “interestingness” measures the interestingness of the image or the popularity of its author. • http://www.flickr.com/

  12. Voting • Users vote on images, using either approval/disapproval or a rating scale (e.g., 1 to 5 stars). • Users can search for particular items and vote on them (selected access). • Probably the most frequently used method on the Web, e.g., Digg, YouTube, IMDb.

  13. Hot or Not? • Ranks pictures (of people only). • Uses a rating scale from 1 to 10. • The images are presented in random order (harder to cheat). • http://www.hotornot.com/

  14. The methods on popular sites

  15. OUR GAME - Matchin: The Mechanism • A 2-player game played over the Internet. • The player is matched randomly with another person who is visiting the game’s Web site at the same time. • The players play several rounds. In each round, they see the same two images and both are asked to click on the image that they think their partner likes better. If they agree, they both receive points.

  16. OUR GAME - Matchin: The Mechanism (cont.) • To score more points, players have to consider not only their own preferences but also their partner’s. • Every game takes 2 minutes. One pair of images, or “one round”, takes two to five seconds, so a typical game consists of 30-60 rounds. • At the end of the game, the two players can review their decisions and chat with each other. • All clicks are recorded and stored in the database.

  17. Fun Fun Fun • Matchin follows the spirit of GWAP (games with a purpose). • The game has been played by tens of thousands of players for long periods of time. • Scoring functions tried: • Linear scoring function. • Exponential scoring function – more excitement among the players, but the rewards became too high. • Sigmoid function – maximum at 1,000 points; creates an artificial ladder from which players can fall due to mistakes.

  18. The Scoring System • Players are given more points for consecutive agreements. • Matchin uses a sigmoid function for scoring. • The first matches earn only a few points, but the score rises steeply until about the seventh consecutive match, after which the growth of the function levels off. (A sketch of such a curve is shown below.)
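
As an illustration only, here is a minimal Python sketch of what such a streak-based sigmoid scoring curve could look like. The 1,000-point maximum and the seventh-match midpoint come from the slides; the steepness constant is an assumption, not a value from the game.

```python
import math

def round_score(streak, max_points=1000.0, midpoint=7.0, steepness=1.0):
    """Points for the current agreement, given the number of consecutive
    agreements so far. A logistic curve: early matches in a streak are worth
    little, growth is fastest around the midpoint (here the 7th match), and
    the value saturates at max_points. Constants are illustrative only."""
    return max_points / (1.0 + math.exp(-steepness * (streak - midpoint)))

# The 1st, 7th and 12th consecutive agreements: small, ~500, close to 1000.
print([round(round_score(n)) for n in (1, 7, 12)])
```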

  19. Let’s play! • http://www.gwap.com/gwap/gamesPreview/matchin/

  20. The numbers • The game was launched on the GWAP Web site on May 15, 2008. Within only four months, 86,686 games had been played by 14,993 players. • In total, there have been 3,562,856 individual decisions (clicks) on images. Since the release of the game, there has been on average a click every three seconds. • The game is both very enjoyable and works well for collecting large amounts of data.

  21. Implementing the methods • All of the judgments should truly reflect the judge’s opinion. • The judgments should be robust in the sense that they should still be considered “valid” after a long time. • Relative judgment – Matchin asks every user to choose between 2 images. • Predefined access – Matchin doesn’t allow the user to choose which images to rate; therefore the ratings are credible. • Partial judgment – Matchin asks every user to rate only a small number of images. • “Others Like” – Matchin considers not only the player’s opinion, but also the opinions of others. • Direct judgment – Matchin doesn’t use meta-information.

  22. Data Storage • An individual decision/record is stored in the form: <id, game_id, player, better, worse, time, waiting_time> • id – a number assigned to identify the decision • game_id – the ID of the game • player – the ID of the player who made the decision in this record • better – the ID of the image the player considered better • worse – the ID of the other image • time – the date and time when the decision was made • waiting_time – the amount of time the player looked at the two images before making a decision. (A sketch of this record as a data structure follows below.)
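
A minimal sketch of that record as a Python data structure. The field names follow the slide; the concrete types are assumptions, since the paper only names the fields.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Decision:
    """One recorded click in Matchin (field types are assumed, not specified)."""
    id: int              # identifies the decision
    game_id: int         # ID of the game
    player: int          # ID of the player who made the decision
    better: int          # ID of the image the player considered better
    worse: int           # ID of the other image
    time: datetime       # date and time when the decision was made
    waiting_time: float  # seconds the player looked at the pair before clicking
```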

  23. GLOBAL RANKINGS • We examine several methods to combine the relative judgments into a global ranking. • For the global ranking, we consider the data as a multi-digraph. • The nodes 𝑉 are the images. A decision to prefer image 𝑖∈𝑉 over an image 𝑗∈𝑉 is represented by a directed edge (𝑖,𝑗). • The goal: to produce a global ranking over all of the 𝑛 images. • The methods use a ranking function 𝑓 that first maps every image to a real number, called its rank value, and then sort by it. An image 𝑖 is ahead of a different image 𝑗 if its rank value is larger: 𝑓(𝑖) > 𝑓(𝑗).

  24. Ranking Methods • We will compare three different ranking functions: 1. Empirical Winning Rate (EWR) 2. ELO rating 3. TrueSkill rating • Then we will present a new algorithm: • Relative SVD

  25. 1. Empirical Winning Rate (EWR) • EWR is the number of times an image was preferred over a different image, divided by the total number of comparisons in which it was included. • In graph terms, the EWR of an image is just its out-degree divided by its degree: EWR(𝑖) = deg_out(𝑖) / deg(𝑖). (A short sketch follows below.) • Problems • Images with a low degree might get an artificially high or low EWR. • Does not take the quality of the competing image into account.
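
A small sketch (not code from the paper) that computes the EWR from (better, worse) pairs, treating each decision as a directed edge from the preferred image to the other one.

```python
from collections import defaultdict

def empirical_winning_rate(decisions):
    """decisions: iterable of (better, worse) image-ID pairs.
    Returns EWR(i) = out_degree(i) / degree(i) for every image that appears."""
    wins = defaultdict(int)    # out-degree: comparisons won
    total = defaultdict(int)   # degree: comparisons participated in
    for better, worse in decisions:
        wins[better] += 1
        total[better] += 1
        total[worse] += 1
    return {img: wins[img] / total[img] for img in total}

# Global ranking: sort images by decreasing rank value.
# ewr = empirical_winning_rate(decisions)
# ranking = sorted(ewr, key=ewr.get, reverse=True)
```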

  26. 2. ELO rating • In this model, each player’s rating is first initialized to a fixed starting value on a common scale, and over time it should come to reflect the player’s true skill. • If a player wins, his/her ELO rating goes up; otherwise it goes down, according to how surprising the win or loss is. • Initialize each image’s ELO rating to the same fixed starting value. • Before each comparison, we compute the expected scores using the standard ELO formula: 𝐸_𝑖 = 1 / (1 + 10^((𝑅_𝑗 − 𝑅_𝑖)/400)) and 𝐸_𝑗 = 1 − 𝐸_𝑖.

  27. 2. ELO rating (cont.) • After we know which image won, we update both pictures’ ELO ratings according to the rule 𝑅_𝑖 ← 𝑅_𝑖 + 𝐾(𝑆_𝑖 − 𝐸_𝑖), where 𝑆_𝑖 is 1 if image 𝑖 won and 0 otherwise. A large value of K makes the scores more sensitive to “winning” or “losing” a single comparison. For all the following experiments, 𝐾 = 16. (A sketch of this update appears below.) • The new ELO rating of the image is used as its ranking function value. • Problem • The ELO rating system assumes that all players have the same variance in their performance.
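
A sketch of the standard ELO update for one comparison, assuming the usual 400-point logistic scale; apart from K = 16, the constants are the textbook defaults rather than values taken from the paper.

```python
def expected_score(r_i, r_j):
    """Standard ELO expected score of image i against image j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / 400.0))

def elo_update(r_winner, r_loser, k=16.0):
    """Update both ratings after the first image beats the second.
    K controls how sensitive the ratings are to a single outcome."""
    e_winner = expected_score(r_winner, r_loser)
    e_loser = 1.0 - e_winner
    return (r_winner + k * (1.0 - e_winner),  # winner's actual score is 1
            r_loser + k * (0.0 - e_loser))    # loser's actual score is 0
```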

  28. 3. TrueSkill rating • Every player has 2 variables: a skill and a variance around that skill. • Every player’s skill is modeled as a normally distributed random variable centered around a mean 𝜇 with a per-player variance 𝜎². • A player’s particular performance in a game is then drawn from a normal distribution with mean equal to the skill and a per-game variance 𝛽², where 𝛽 is a constant.

  29. 3. TrueSkill rating • When image 𝑖 wins over image 𝑗, the means and variances of both images are updated using correction terms built from 𝜑, the standard normal probability density, and 𝛷, the cumulative distribution function. (A sketch of the standard two-player update follows below.) • As our ranking function, we use the conservative skill estimate 𝜇 − 3𝜎, which is approximately the first percentile of the image’s quality. • Thus, with very high probability the image’s true quality is above the conservative skill estimate.
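
A sketch of the textbook two-player, no-draw TrueSkill update (not the paper's code), built from the standard normal pdf and cdf, together with the conservative estimate used for ranking. The numerical clamping is an assumption added for robustness.

```python
import math

def _pdf(x):   # standard normal probability density phi(x)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):   # standard normal cumulative distribution Phi(x)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def trueskill_update(mu_w, sigma_w, mu_l, sigma_l, beta):
    """Update (mean, std-dev) of the winning and losing image after one comparison."""
    c = math.sqrt(2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)     # mean-correction factor
    w = v * (v + t)           # variance-correction factor
    mu_w_new = mu_w + (sigma_w ** 2 / c) * v
    mu_l_new = mu_l - (sigma_l ** 2 / c) * v
    sigma_w_new = sigma_w * math.sqrt(max(1.0 - (sigma_w ** 2 / c ** 2) * w, 1e-12))
    sigma_l_new = sigma_l * math.sqrt(max(1.0 - (sigma_l ** 2 / c ** 2) * w, 1e-12))
    return mu_w_new, sigma_w_new, mu_l_new, sigma_l_new

def conservative_estimate(mu, sigma):
    """Ranking value: with high probability the true quality lies above this."""
    return mu - 3.0 * sigma
```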

  30. COLLABORATIVE FILTERING • The new goal: finding out not only general preferences, but also each individual’s preferences. This will allow us to: • Recommend images to each user based on his/her preferences. • Compare users (which users are similar?) • Compare images (which images are similar?) • New collaborative filtering algorithm called “Relative SVD” based on comparative judgments

  31. Relative SVD • Each user 𝑖 is described by a feature vector 𝑢_𝑖 of length 𝐾. The user feature vectors are stored in a matrix 𝑈. • Each image 𝑗 is described by a feature vector 𝑣_𝑗 of length 𝐾. The image feature vectors are stored in a matrix 𝑉. • The amount by which user 𝑖 “likes” image 𝑗 is equal to the dot product of their feature vectors: 𝑢_𝑖 ⋅ 𝑣_𝑗.

  32. Relative SVD • We interpret the data gathered from our game as a set 𝐷 of triplets (𝑖,𝑗,𝑘): user 𝑖 preferred image 𝑗 over image 𝑘. • We set the target preference value to 1 for all triplets, i.e., we want 𝑢_𝑖 ⋅ (𝑣_𝑗 − 𝑣_𝑘) = 1. • The error for a particular decision, between a sample from the training data and our model, can be written as 𝑒_𝑖𝑗𝑘 = 1 − 𝑢_𝑖 ⋅ (𝑣_𝑗 − 𝑣_𝑘). • The sum of squared errors is 𝐸 = Σ_{(𝑖,𝑗,𝑘)∈𝐷} 𝑒_𝑖𝑗𝑘². • The goal is to find the feature matrices 𝑈 and 𝑉 that minimize the total sum of squared errors.

  33. Relative SVD • In order to minimize this error, we compute the partial derivatives of the squared error with respect to each feature vector, e.g. ∂𝑒_𝑖𝑗𝑘²/∂𝑢_𝑖 = −2 𝑒_𝑖𝑗𝑘 (𝑣_𝑗 − 𝑣_𝑘). • Applying ordinary gradient descent with a step size of 𝜂 while adding a regularization penalty with parameter 𝜆, we obtain the following update equations for the feature vectors: 𝑢_𝑖 ← 𝑢_𝑖 + 𝜂 (𝑒_𝑖𝑗𝑘 (𝑣_𝑗 − 𝑣_𝑘) − 𝜆 𝑢_𝑖), 𝑣_𝑗 ← 𝑣_𝑗 + 𝜂 (𝑒_𝑖𝑗𝑘 𝑢_𝑖 − 𝜆 𝑣_𝑗), 𝑣_𝑘 ← 𝑣_𝑘 − 𝜂 (𝑒_𝑖𝑗𝑘 𝑢_𝑖 + 𝜆 𝑣_𝑘).

  34. Relative SVD algorithm 1. Initialize the image feature vectors and the user feature vectors with random values. Set 𝜆, 𝜂 to small positive values. 2. Loop until converged: a. Iterate through all training examples (𝑖,𝑗,𝑘) ∈ 𝐷. i. Compute the error 𝑒_𝑖𝑗𝑘. ii. Update 𝑢_𝑖. iii. Update 𝑣_𝑗. iv. Update 𝑣_𝑘. b. Compute the model error. • Experimentation showed 𝐾 = 60, 𝜂 = 0.02 and 𝜆 = 0.01 are good values. • After the user and image vectors have been computed, predicting a user’s preference between two images is easy: simply compare the dot products with the corresponding feature vectors. (A short sketch of this procedure follows below.)
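
A minimal stochastic-gradient sketch of Relative SVD under the formulation above (target value 1 per triplet). The hyperparameters K, η and λ follow the slide; the fixed epoch count, initialization scale and random seed are assumptions standing in for the unspecified convergence test.

```python
import numpy as np

def train_relative_svd(triplets, n_users, n_images,
                       K=60, eta=0.02, lam=0.01, epochs=30, seed=0):
    """triplets: list of (i, j, k) meaning user i preferred image j over image k.
    Pushes u_i . (v_j - v_k) towards 1 for every observed decision."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, K))   # user feature vectors
    V = 0.1 * rng.standard_normal((n_images, K))  # image feature vectors
    for _ in range(epochs):
        for i, j, k in triplets:
            u_i, v_j, v_k = U[i].copy(), V[j].copy(), V[k].copy()
            e = 1.0 - u_i @ (v_j - v_k)           # error for this decision
            U[i] += eta * (e * (v_j - v_k) - lam * u_i)
            V[j] += eta * (e * u_i - lam * v_j)
            V[k] += eta * (-e * u_i - lam * v_k)
    return U, V

def prefers_j_over_k(U, V, i, j, k):
    """Predicted preference of user i: True if image j is preferred to image k."""
    return U[i] @ (V[j] - V[k]) > 0.0
```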

  35. ANALYSIS • Comparison of the 3 existing models and the new Relative SVD. • Split our data into a training set and a test set. • Train all four models on the training data. • Use the learned models to predict users’ behavior on the test data.

  36. Analysis table Testing error of different ranking algorithms as we increase the number of judgments in the training data. • For all models, the error decreases as we use more training data. • With little training data, ELO works best, while EWR and Relative SVD perform worst. • As the amount of training data increases, Relative SVD beats all the other models.

  37. Analysis graph The learning curve for several learning algorithms.

  38. Do humans learn while playing the game? • Do the players learn which types of images are generally preferred? If they adapt too much, it could have unwanted effects. • Test of the agreement rate of first-time players vs. experienced players: • First-time players agree 69.0% of the time with their partner. • Experienced players agree 71.8% of the time with their partner. • Conclusion: the players only marginally adapt to the game. • Additional check: do people learn during the game? • Measure the agreement rate in the first half compared to the second half of the game. • Analysis of 100 games showed the agreement rate goes down from 67% to 64%.

  39. Let’s play again! • http://www.gwap.com/gwap/

  40. Gender Prediction • The gender of 2,594 players was known from their profile settings. • Of these players, 68% are male and 32% female. • To find a pair of images (𝐴,𝐵) that has a strong gender bias, we compute the conditional entropy 𝐻[𝐺|𝑋] = Σ_𝑥 𝑃(𝑋=𝑥) 𝐻[𝐺|𝑋=𝑥], where 𝐺 is the gender and 𝑋 is the player’s decision (𝐴>𝐵 means that image 𝐴 was considered better than image 𝐵); 𝐻[𝐺|𝑋=𝑥] is the entropy of 𝐺 given that 𝑋 takes the value 𝑥, and 𝐻[𝐺|𝑋] averages it over all values 𝑥 that 𝑋 may take. • A pair (𝐴,𝐵) has a large gender bias (and is therefore good for gender determination) if the conditional entropy 𝐻[𝐺|𝑋] is small: learning the decision tells us a lot about the gender. • For the class conditionals, two ELO predictors were trained, one on male players only and one on female players only. We then compute 𝐻[𝐺|𝑋] for many pairs of images and select the pairs for which 𝐻[𝐺|𝑋] is smaller than a fixed threshold value. • To predict the gender of a new user, we sample 10 of these gender-biased pairs and ask the user to choose the image they prefer for each pair. We then choose the gender that maximizes the likelihood of the observed decisions. (A sketch of both computations follows below.)
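
A sketch, under the assumptions above, of the two ingredients: the conditional entropy H[G|X] of a single image pair, and the maximum-likelihood gender decision from a user's answers. The class-conditional click probabilities are assumed to come from the two gender-specific ELO predictors; this is an illustration, not the paper's implementation.

```python
import math

def conditional_entropy(p_male, p_a_given_male, p_a_given_female):
    """H[G | X] for one image pair (A, B), where X is the decision (A or B)
    and G is the gender. Small values mean the pair is informative."""
    p_female = 1.0 - p_male
    h = 0.0
    for p_a_m, p_a_f in ((p_a_given_male, p_a_given_female),
                         (1.0 - p_a_given_male, 1.0 - p_a_given_female)):
        p_x = p_male * p_a_m + p_female * p_a_f       # P(X = x)
        if p_x == 0.0:
            continue
        for p_x_given_g, p_g in ((p_a_m, p_male), (p_a_f, p_female)):
            p_g_given_x = p_x_given_g * p_g / p_x     # Bayes' rule
            if p_g_given_x > 0.0:
                h -= p_x * p_g_given_x * math.log2(p_g_given_x)
    return h

def predict_gender(decision_likelihoods):
    """decision_likelihoods: list of (P(decision | male), P(decision | female))
    for the 10 sampled pairs. Returns the gender maximizing the likelihood."""
    log_m = sum(math.log(p_m) for p_m, _ in decision_likelihoods)
    log_f = sum(math.log(p_f) for _, p_f in decision_likelihoods)
    return "male" if log_m >= log_f else "female"
```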

  41. Which one do you pick? • The test achieved a total accuracy of 78.3% in predicting the players’ gender.

  42. Best & Worst Images • Top-ranked images by the different global ranking algorithms: • Nature pictures are ranked higher than pictures depicting humans. • Among the 100 highest-ranked pictures there is not a single picture in which a human is the dominant element. • Animal pictures are also preferred over pictures depicting humans: exotic animals like pandas, tigers, chameleons, fish and butterflies. • Pets are also ranked high, but usually below the exotic animals. • Pictures of flowers, churches, bridges and famous tourist attractions. • The worst pictures: • Almost all were taken indoors and include a person. • Many of these pictures are blurry or too dark. • Some of the worst pictures are screenshots or pictures of documents or text. • The pictures that made it into the top 100 are neither provocative nor offensive. Why?

  43. DISCUSSION & CONCLUSION • Collaborative filtering indicates that there are substantial differences among players in judging images, and taking those differences into account can greatly help in predicting the users’ behavior on new images. • We can predict with a probability of 83% which of two images a known player will prefer, compared to only 70% if we do not know the player beforehand. • Players do not learn much about their partner’s preferences by playing the game • More experienced players had about the same error rate as new players

  44. CONCLUSION • The paper provides a new method to elicit user preferences. • For two images, we ask users not which one they prefer, but rather which one a random person would prefer. • We compared several algorithms for combining these relative judgments into a total ordering and found that they can correctly predict a user’s behavior in 70% of the cases. • We described a new algorithm called Relative SVD to perform collaborative filtering on pair-wise relative judgments. Relative SVD outperforms other ordering algorithms that do not distinguish among individual players in predicting a known player’s behavior. • We saw a gender test that asks users to make a few relative judgments and, based only on these judgments, can predict a random user’s gender with an accuracy of about 80%.

  45. Future work • Generalize the game to other kinds of media • Short videos or songs. • Generalize the game to other types of questions • “Which image do you think your partner thinks is more interesting?” • “Given that your partner is female, which image do you think she prefers?” • It remains to be investigated how much other personal information can be gathered in the same way as our gender test does.

  46. Questions Can I play again? Pleassssse!!!
