
How Many Participants is Enough in a Usability Test? Dr. Bob Bailey www.webusability.com



Presentation Transcript


  1. How Many Participants is Enough in a Usability Test? Dr. Bob Bailey, www.webusability.com

  2. How many usability test participants do you think is the correct number? ____________

  3. Number of Participants • Having the appropriate number of subjects will accomplish the goals of a usability test as efficiently as possible • If too many are used • Increased cost • Increased development time • If too few are used • Fail to detect some important problems • Reduce the usability of the product

  4. Factors Influencing the Number of Participants • Phase in the development cycle • Design approach used (user-centered?) • The product’s life cycle • Prototype (fidelity level?) • New system • Existing system • Complexity of the product • Number of features • Number of fields/screens/windows/pages • Traditional, new or unique technologies

  5. Factors Influencing the Number of Participants (continued) • Testers/Evaluators • Usability testing experience • Domain knowledge • Users • Diverse nature of the population (unique segments) • Required domain knowledge (much, little) • Frequency of performance in actual system • Number of users in the population (e.g., unique monthly visitors)

  6. Unique Audience for Certain Federal Government Sites (1 Month)* • Treasury 11,700,000 • Department of Defense 8,300,000 • Health and Human Services 7,300,000 • NASA 5,200,000 • Department of Education 4,300,000 • Executive Branch 2,680,000 • Department of State 2,100,000 • Department of Labor 1,990,000 • Department of Energy 1,600,000 • FirstGov 1,380,000 • Central Intelligence Agency 914,000 • National Archives 894,000 *Nielsen//NetRatings, February 2003

  7. Factors Influencing the Number of Participants (continued) • Overall task complexity • Easier: Find the per diem rate for Chicago • Harder: Determine how much tax you owe • Repercussions of task failure • Lose money • Lose time • Lose a life

  8. Usability Testing Categories • Automated evaluations • Inspection evaluations • Expert reviews • Heuristic evaluations • Cognitive walkthroughs • Human performance testing • Usability lab (local) • Online (remote) • Operational evaluations • Online intercept surveys (ACSI) • Web analytics

  9. Web Analytics • The direct monitoring and analysis of online user behavior and interactions with a website • Commercial web analytics products: • CoreMetrics Online Analytics • NetGenesis • WebTrends 7 • Omniture SiteCatalyst • WebSideStory • These toolkits anonymously capture and analyze website traffic volumes and track visitor behavior • Typical features include • Capturing and reporting of navigation paths through the site • Referral analysis, e.g., where users came from, what content they view, etc. • Geographic trends analysis (origin of user access)

  10. Usability Testing Categories • Automated evaluations • Inspection evaluations • Expert reviews • Heuristic evaluations • Cognitive walkthroughs • Human performance testing • Usability lab (local) • Online (remote) • Operational evaluations • Online intercept surveys (ACSI) • Web analytics

  11. Testing Levels • Level 1: Traditional inspection evaluations • Primary focus: To identify usability issues • Evaluation methods • Heuristic evaluations • Cognitive walkthroughs • Level 2: Algorithmic expert reviews with scenarios • Primary focus: To identify usability issues • Evaluators • Focus on ‘algorithmic’ (not heuristic) issues, e.g., black text on white background • Use scenarios to stay focused on the most important tasks to identify usability issues • Level 3: Usability tests • Primary focus: To identify usability issues while observing participants • Use available participants • Use a set of test scenarios • Require participants to think aloud during testing (discussions) • Secondary objective: Collecting quantitative data

  12. Testing Levels (continued) • Level 4: Usability tests • Primary focus: To identify usability issues while observing participants • Use representative participants • Use representative test scenarios • Participants provide feedback at the end of a scenario or end of the test • Objectives include • Collecting quantitative data to set a baseline, or compare with a baseline • Collecting qualitative data • Use findings to identify usability problems • Level 5: Usability tests (rigorous, tightly controlled) • Primary focus: To compare an existing product (website) with • A set of objectives • A previous iteration of the same product • A competing product • Use truly representative participants • Use truly representative test scenarios • Collect quantitative data to use for • Demonstrating the performance level (summative test) • Comparing with other test results • Secondary objective: To identify usability problems

  13. How Many Participants are Needed? The Short Answer • Level 1: At least 4 evaluators • Level 2: At least 4 expert evaluators • Level 3: 6 participants (or as many as you can get) • Level 4: 12-15 participants • Level 5: 20 participants per group

  14. Usability Testing, Levels 1 and 2: Inspection Evaluations

  15. Evaluator Effect • Having multiple evaluators • Evaluating the same interface using the same method, and • Detecting markedly different sets of problems • One study found that the ‘evaluator effect’ for the three main evaluation methods was about the same (“they were all equally poor”)

  16. The Evaluator Effect (continued) • The ‘evaluator effect’ existed for • Both novice and experienced evaluators • Both cosmetic and severe problems • Both problem detection and severity assessment, and • Both simple and complex systems • The average agreement between any 2 evaluators ranged from 5% to 65%

  17. Determining the Number of Evaluators (Virzi, 1990; Lewis, 1993) • The formula for calculating the number of evaluators needed to find a specific percentage of ‘problems’: 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • Problem discovery in inspection evaluations is consistent with this cumulative binomial probability formula

  18. What is ‘p’? • Assume that all inspection evaluators together find 100 unique usability issues (duplicates are eliminated) • Assume that the average number of issues found by each evaluator was 30 • In this case: p = .30 (30/100) • How many evaluators will be needed to find 90% of the 100 usability issues?

  19. Using 5 Evaluators and p = .30 • 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • 1 - (1 - .30)^5 • 1 - (.7)^5 • 1 - .17 • .83, or 83%

  20. Using 6 Evaluators and p = .30 • 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • 1 - (1 - .30)^6 • 1 - (.7)^6 • 1 - .12 • .88, or 88%

  21. Using 7 Evaluators and p = .30 • 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • 1 - (1 - .30)^7 • 1 - (.7)^7 • 1 - .08 • .92, or 92%
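The arithmetic on the last three slides is easy to check in code. The following is a minimal Python sketch (not part of the original deck) that evaluates 1 - (1 - p)^n for p = .30; it reproduces the 83%, 88%, and 92% figures above and reports the smallest number of evaluators that reaches the 90% target.

```python
def discovery_rate(p: float, n: int) -> float:
    """Expected proportion of problems found by n evaluators (Virzi, 1990; Lewis, 1993)."""
    return 1 - (1 - p) ** n

p = 0.30       # mean detection probability from the example above
target = 0.90  # goal: find 90% of the usability issues

for n in range(1, 9):
    print(f"{n} evaluators -> {discovery_rate(p, n):.0%} of problems found")

# Smallest n that reaches the 90% target (7 evaluators when p = .30)
n = 1
while discovery_rate(p, n) < target:
    n += 1
print(f"Need {n} evaluators to reach {target:.0%} discovery")
```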

  22. You Only Need to Test With 5 Users (Nielsen, 2000) • Nielsen suggested that “elaborate usability tests are a waste of resources” • “Collecting data from a single test subject enables a designer to learn almost a third of all there is to know about the usability of the design”

  23. Only Five … (Nielsen continues) • “Testing a second potential user adds some new insights, but not nearly as much as did the first user • “When the third, fourth and fifth users are added, less and less new information is learned • “After the fifth user, you are wasting your time by observing the same findings repeatedly but not learning much new”

  24. Iterative Design • The following graphic “clearly shows that you need to test with at least 15 users to discover all the usability problems in the design” • Nielsen’s problem-discovery curve [graph not reproduced in the transcript]

  25. Average Detection RateVariations of ‘p’ • Spool (2001 - Live websites) - Average detection rate: 0.08 • Law and Hvannberg (2004 - Prototype): 0.09 • Lewis (1994 - Prototype): 0.16 • Walker, Takayama and Landay (2000 - Prototypes): 0.21 • Nielsen and Landauer (1993 - Prototypes): 0.31 • Virzi (1990 - Prototypes): 0.36 • Woolrych and Cockton (2002 - Prototypes): 0.43 • Nielsen (1992 - Prototypes) • Novice evaluators: 0.29 • Usability specialists with no domain experience: 0.46 • Usability specialists with domain experience: 0.61 • Jacobsen, Hertzum and John (1998 - Prototypes): 0.52
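Plugging these published detection rates into the same cumulative binomial formula shows how strongly the required number of evaluators depends on p. The sketch below is illustrative only; the handful of rates it uses are taken from the list above, and the 90% target is the same one used earlier.

```python
import math

def evaluators_needed(p: float, target: float = 0.90) -> int:
    """Smallest n such that 1 - (1 - p)^n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

rates = {
    "Spool 2001 (live websites)": 0.08,
    "Lewis 1994 (prototype)": 0.16,
    "Nielsen and Landauer 1993 (prototypes)": 0.31,
    "Virzi 1990 (prototypes)": 0.36,
}
for study, p in rates.items():
    print(f"{study}: p = {p:.2f} -> {evaluators_needed(p)} evaluators for 90% discovery")
```

With p = .08 the formula calls for roughly 28 evaluators, while p = .36 needs only 6, which is why the choice of detection rate matters so much.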


  27. The True Number of Problems • Some people erroneously assume that the total number of issues found by all evaluators is identical to the total number of problems in the interface • Because the interface contains problems that no evaluator found, the true p-value is far less than the original .31 [reduce by half] • And only about half of the proposed usability issues are actually usability problems [reduce by half, again]

  28. Usability Testing, Level 3

  29. Level 3 Usability Testing • Primary focus: To identify usability issues while observing participants • Use fairly representative participants • Use test scenarios (presented by UTE) • Have participants think aloud during testing (many probes and discussions) • Collect ‘soft’ quantitative data to help identify scenarios with the most usability problems

  30. Usability Testing, Levels 4 and 5 [Now we need to get serious about the number of participants!]

  31. Level 4 Usability Testing • Primary focus • To compare against usability objectives • To set a baseline • To further identify usability issues • Use representative participants (based on a screener) • Use test scenarios (presented by UTE) • Participants provide observations or feedback at the end • Of a scenario, or • Of the test • Enables • The effective collection of both quantitative and qualitative data • Making inferences from the sample to the population

  32. Samples and Populations • Testers usually do not have access to an entire population of users • Too large • Not willing to be measured • Measurement process is too • Expensive or • Time-consuming • To estimate some population characteristic (e.g., the average time to click a link) • Take a sample, and • Compute a quantity (a statistic) • ‘Samples’ help testers understand the characteristics of a ‘population’

  33. Confidence Intervals • One good way to determine how well a sample reflects the population is to use the concept of ‘confidence intervals’ • A confidence interval provides an estimate of the range of values that will most likely contain the true value, or true population value • Generally, the range of values includes those that have a 95% chance of being the true population value

  34. Using Confidence Intervals • You just finished a usability test • You had 5 participants attempt a task in a new web application • All 5 of the participants completed the task • You announce this success to the development team and your supervisor • Your supervisor asks, “OK, this is great with 5 users, but what are the chances that 50 or 1000 or 10,000 will have the same 100% completion rate?”

  35. Using Confidence Intervals (continued) • If five out of five users complete a task, you can be 95% confident that • The completion rate in the overall user population could be as high as 100% • But it also could be as low as 48% • In other words, when this web application is used by real users • All of them could successfully complete the task, or • Over half (52%) could fail the task
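The 48%-to-100% range above is consistent with an exact (Clopper-Pearson) binomial confidence interval. As an illustrative sketch, and not necessarily the method used in the deck, the Python code below computes that interval with scipy.stats.beta; other interval methods, such as adjusted Wald or Wilson, give slightly different bounds.

```python
from scipy.stats import beta

def exact_ci(successes: int, trials: int, confidence: float = 0.95):
    """Clopper-Pearson (exact) confidence interval for a completion rate."""
    alpha = 1 - confidence
    lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    upper = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lower, upper

low, high = exact_ci(5, 5)
print(f"5 of 5 completions: 95% CI runs from {low:.0%} to {high:.0%}")  # about 48% to 100%
```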

  36. Confidence Level • The confidence level is the percent likelihood statement that accompanies the width of the confidence interval • It is usually set at 95% • A confidence level of 95% means that if the test were repeated many times, about 5 out of 100 of the resulting intervals would NOT contain the true population value

  37. 95% Confidence Interval (N=5)

  38. 95% Confidence Interval (N=10)

  39. Usability Testing Level 5 • Primary focus: To compare an existing product (website) • Against a set of objectives • With a previous iteration of the same product • With a competing product • Secondary objective: To identify usability issues • The test procedure is very tightly controlled • Use truly representative participants • Use test scenarios • Collect quantitative and qualitative data • Demonstrates the performance level (summative test) • Allows comparison with other test results • Suggests some changes to the product

  40. Power of the Test • The ‘power’ of a statistical test is a measure of its ability to detect a difference when there is one • Sample size is one of the main factors used to determine the power of a test

  41. Comparing the Average ‘Click Time’ with a Competitor • After posting the website with an average click time of 10 seconds, management found a competing website that had reduced the average click time to 8 seconds • The website was redesigned with the goal of reducing the average click time to 6 seconds

  42. The ‘Null’ and Alternative Hypotheses • The null hypothesis • Example: The average time to click on the correct link will be the same when compared with the competitor’s homepage • The alternative hypothesis • Example: The average time to click on the correct link when using the redesigned website will be reliably faster than the competitor’s
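One way to test this alternative hypothesis is an independent-samples t-test on click times collected from the two sites. The sketch below is purely illustrative: the click-time samples are invented for the example, and SciPy's ttest_ind stands in for whatever analysis the test team would actually run.

```python
from scipy import stats

# Hypothetical click times in seconds (invented data, for illustration only)
redesigned = [5.8, 6.4, 7.1, 5.2, 6.9, 6.0, 5.5, 7.3, 6.2, 5.9]
competitor = [8.1, 7.6, 9.0, 8.4, 7.9, 8.8, 7.2, 8.5, 9.3, 7.7]

# One-sided test: is the redesigned site reliably faster than the competitor's?
# (the alternative="less" argument requires SciPy 1.6 or later)
t_stat, p_value = stats.ttest_ind(redesigned, competitor, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the redesign is reliably faster.")
else:
    print("Fail to reject the null hypothesis.")
```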

  43. Required Number of Participants • Enough to be reasonably sure that you can detect a reliable difference if one exists • But not so many participants that small and unimportant differences are detected

  44. Increasing Power • Consider ways to increase statistical power so that testers do not miss something important • Improving power • Increase the sample size - Keep in mind that if power is already high, increasing the sample size will do little or nothing • Increase the alpha level (.05 rather than .01) • Increase the acceptable effect size (try to identify larger differences) • Narrow the variance • Select randomly from actual users • Use the same tester for all participants • Exert greater control over all variables while testing • Make measurements that are more realistic and precise • Use ‘same subject’ (within subjects) tests where possible

  45. Power-Sample Size Calculator: Participants Required per Group • http://www.health.ucalgary.ca/~rollin/stats/ssize/n1.html
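Calculators like the one linked above typically implement the standard normal-approximation formula for comparing two means. The sketch below is an assumption-laden illustration rather than the calculator's own code: it uses scipy.stats.norm for the z-values, and the effect size, alpha, and power settings are examples you would replace with your own.

```python
import math
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per group for a two-sample comparison of means.
    effect_size is Cohen's d: (mean difference) / (pooled standard deviation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Example: hoping to detect a 2-second improvement when click times vary by about 3 seconds
print(n_per_group(effect_size=2 / 3))  # roughly 36 per group
```

With those example settings the answer lands in the same range as the roughly 35 participants per group quoted on the final slide; a larger expected effect or a lower required power would shrink that number.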

  46. How Many Participants are Needed? The Final Answer • Level 1: 4-8 evaluators • Level 2: 4-8 expert evaluators • Level 3: 4-8 participants in each iteration • Level 4: 12-15 participants, or the number needed to ensure acceptable confidence intervals • Level 5: About 35 participants per group, or the number needed to ensure sufficient testing power
