Explore the process of evaluating information visualizations through usability tests, controlled experiments, expert reviews, and statistical analyses to determine the impact of visualization tools on user performance. Learn the principles of empirical experiments, cause and effect analysis, and the role of statistics in drawing meaningful conclusions from data.
Evaluating Visualizations cs5764: Information Visualization Chris North
Evaluating Visualizations • Usability Test • Observation, problem identification • Controlled Experiment • Formal controlled scientific experiment • Comparisons, statistical analysis • Expert Review • Examination by visualization expert • Heuristic Evaluation • Principles, Guidelines • Algorithmic
Projects • Implementation projects: • Small usability test of implementation • Short usability report • Experiment projects: • Main controlled experiment • Experiment materials and raw data • Then data analysis
Usability test vs. Controlled Expm. • Usability test: • Formative: helps guide design • Single UI, early in design process • Few users • Usability problems, incidents • Qualitative feedback from users • Controlled experiment: • Summative: measure final result • Compare multiple UIs • Many users, strict protocol • Independent & dependent variables • Quantitative results, statistical significance
What is Science? • Measurement • Modeling
Scientific Method • Form Hypothesis • Collect data • Analyze • Accept/reject hypothesis • How to “prove” a hypothesis in science? • Easier to disprove things, by counterexample • Null hypothesis = opposite of hypothesis • Disprove the null hypothesis • Hence, the hypothesis is supported
Empirical Experiment • Typical question: • Which visualization is better in which situations? (e.g., Spotfire vs. TableLens)
Cause and Effect • Goal: determine “cause and effect” • Cause = visualization tool (Spotfire vs. TableLens) • Effect = user performance time on task T • Procedure: • Vary cause • Measure effect • Problem: random variation • Cause = vis tool OR random variation? • [Diagram: real world → collected data, with random variation in between → uncertain conclusions]
Stats to the Rescue • Goal: • Measured effect unlikely to result by random variation • Hypothesis: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Null hypothesis: • Visualization tool has no effect (e.g. Spotfire = TableLens) • Hence: Cause = random variation • Stats: • If null hypothesis true, then measured effect occurs with probability < 5% • But measured effect did occur! (e.g. measured effect >> random variation) • Hence: • Null hypothesis unlikely to be true • Hence, hypothesis likely to be true
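The null-hypothesis logic on this slide can be illustrated with a small permutation-style simulation; this is only a sketch of the reasoning, not the slide's prescribed test, and the task times, group sizes, and tool labels below are invented for illustration.

```python
# Hedged sketch: how often would a gap this large appear by random variation alone?
import random

spotfire  = [52, 61, 48, 55, 67, 59, 63, 50]   # task times (secs), invented
tablelens = [41, 38, 45, 36, 49, 40, 44, 39]   # task times (secs), invented

observed = abs(sum(spotfire)/len(spotfire) - sum(tablelens)/len(tablelens))

pooled = spotfire + tablelens
n = len(spotfire)
trials, count = 10000, 0
for _ in range(trials):
    random.shuffle(pooled)                      # pretend the tool had no effect
    diff = abs(sum(pooled[:n])/n - sum(pooled[n:])/n)
    if diff >= observed:
        count += 1

p = count / trials                              # chance of seeing this gap by luck
print(f"observed difference = {observed:.1f}s, p ~ {p:.4f}")
# If p < 0.05, the null hypothesis (tool has no effect) is unlikely to be true.
```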
Variables • Independent Variables (what you vary), and treatments (the variable values): • Visualization tool • Spotfire, TableLens, Excel • Task type • Find, count, pattern, compare • Data size (# of items) • 100, 1000, 1000000 • Dependent Variables (what you measure) • User performance time • Errors • Subjective satisfaction (survey) • HCI metrics
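Crossing the independent variables gives the cells (conditions) of the experimental design. A minimal sketch, using the treatment lists from this slide; the crossing itself is illustrative, not part of the original slide.

```python
# Each combination of treatments is one cell of the design.
from itertools import product

tools = ["Spotfire", "TableLens", "Excel"]
tasks = ["find", "count", "pattern", "compare"]
sizes = [100, 1000, 1000000]

conditions = list(product(tools, tasks, sizes))
print(len(conditions), "cells, e.g.", conditions[0])   # 36 cells
```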
Example: 2 x 3 design • n users per cell • [Table: rows = Ind Var 1 (Vis. Tool), columns = Ind Var 2 (Task Type); each cell holds the measured user performance times (dep var)]
Groups • “Between-subjects” variable • 1 group of users for each variable treatment • Group 1: 20 users, Spotfire • Group 2: 20 users, TableLens • Total: 40 users, 20 per cell • “Within-subjects” (repeated) variable • All users perform all treatments • Counter-balance order effects • Group 1: 20 users, Spotfire then TableLens • Group 2: 20 users, TableLens then Spotfire • Total: 40 users, 40 per cell
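One simple way to counter-balance a within-subjects variable is to alternate the presentation order across participants so that order effects cancel out. A minimal sketch, assuming two tools and alternating assignment; participant IDs are hypothetical.

```python
# Counterbalancing sketch: even-numbered participants see Spotfire first,
# odd-numbered participants see TableLens first.
tools = ["Spotfire", "TableLens"]

def assign_order(participant_id: int) -> list[str]:
    return tools if participant_id % 2 == 0 else list(reversed(tools))

for pid in range(4):
    print(pid, assign_order(pid))
# 0 ['Spotfire', 'TableLens']
# 1 ['TableLens', 'Spotfire']
# ...
```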
Issues • Eliminate or measure extraneous factors • Randomization • Fairness • Identical procedures, … • Bias • User privacy, data security • IRB (Institutional Review Board)
Procedure • For each user: • Sign legal forms • Pre-Survey: demographics • Instructions • Do not reveal true purpose of experiment • Training runs • Actual runs • Give task, measure performance • Post-Survey: subjective measures • Repeat for all n users
Data • Measured dependent variables • Spreadsheet:
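The slide's spreadsheet screenshot is not reproduced here; one common layout for the raw measurements is "long" format, with one row per user, tool, and task. A hedged sketch with invented column names and values:

```python
# Long-format raw data: one row per trial, dependent variables as columns.
import pandas as pd

data = pd.DataFrame([
    {"user": 1, "tool": "Spotfire",  "task": "find",  "time_secs": 52, "errors": 1},
    {"user": 1, "tool": "TableLens", "task": "find",  "time_secs": 41, "errors": 0},
    {"user": 2, "tool": "Spotfire",  "task": "count", "time_secs": 67, "errors": 2},
    {"user": 2, "tool": "TableLens", "task": "count", "time_secs": 49, "errors": 0},
])
print(data)
```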
Step 1: Visualize it • Dig out interesting facts • Qualitative conclusions • Guide stats • Guide future experiments
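A quick plot of the raw data (before any statistics) might look like the sketch below, assuming the invented long-format DataFrame from the previous example; with real data all 40 values per tool would appear.

```python
# Look at the full distributions, not just the averages.
import matplotlib.pyplot as plt

groups = [data.loc[data["tool"] == t, "time_secs"] for t in ["Spotfire", "TableLens"]]
plt.boxplot(groups)
plt.xticks([1, 2], ["Spotfire", "TableLens"])
plt.ylabel("Task time (secs)")
plt.title("Raw performance times by tool")
plt.show()
```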
Step 2: Stats • [Table: average user performance times (dep var), rows = Ind Var 1 (Vis. Tool), columns = Ind Var 2 (Task Type)]
TableLens better than Spotfire? • Problem with Averages: lossy • Compares only 2 numbers • What about the 40 data values? (Show me the data!) • [Bar chart: average performance time (secs), Spotfire vs. TableLens]
The real picture • Need stats that compare all data • [Chart: all individual performance times (secs) for Spotfire and TableLens]
Statistics • t-test • Compares 1 dep var on 2 treatments of 1 ind var • ANOVA: Analysis of Variance • Compares 1 dep var on n treatments of m ind vars • Result: • p = probability that difference between treatments is random (null hypothesis) • “statistical significance” level • typical cut-off: p < 0.05 • Hypothesis confidence = 1 - p
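Both tests named on this slide are available in scipy. A minimal sketch on the invented data from earlier; note that the ANOVA shown is one-way (a full factorial ANOVA over several independent variables would typically use a package such as statsmodels).

```python
# t-test: 1 dep var, 2 treatments of 1 ind var.
# One-way ANOVA: generalizes to n treatments (here still 2, for illustration).
from scipy import stats

spot = data.loc[data["tool"] == "Spotfire",  "time_secs"]
tab  = data.loc[data["tool"] == "TableLens", "time_secs"]

t_stat, p_ttest = stats.ttest_ind(spot, tab)
f_stat, p_anova = stats.f_oneway(spot, tab)

print(f"t-test  p = {p_ttest:.3f}")
print(f"ANOVA   p = {p_anova:.3f}")
# p < 0.05 is the usual cut-off for calling the difference statistically significant.
```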
p < 0.05 • Woohoo! • Found a “statistically significant” difference • Averages determine which is ‘better’ • Conclusion: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Vis Tool has an effect on user performance for task T … • “95% confident that TableLens better than Spotfire …” • NOT “TableLens beats Spotfire 95% of time” • 5% chance of being wrong! • Be careful about generalizing
p > 0.05 • Hence, no difference? • Vis Tool has no effect on user performance for task T…? • Spotfire = TableLens ? • NOT! • Did not detect a difference, but could still be different • Potential real effect did not overcome random variation • Provides evidence for Spotfire = TableLens, but not proof • Boring, basically found nothing • How? • Not enough users • Need better tasks, data, …
Data Mountain • Robertson, “Data Mountain” (Microsoft)
Comparison of Info Vis Systems • Kobsa
Cleveland’s Rules for Secondary Tasks • Chewar et al.
Usability test vs. Controlled Expm. • Usability test: • Formative: helps guide design • Single UI, early in design process • Few users • Usability problems, incidents • Qualitative feedback from users • Controlled experiment: • Summative: measure final result • Compare multiple UIs • Many users, strict protocol • Independent & dependent variables • Quantitative results, statistical significance
Usability Test Setup • Set of benchmark tasks • Easy to hard, specific to open-ended • Coverage of different UI features • E.g. “find the 5 most expensive houses for sale” • Consent forms • Not needed unless video-taping user’s face (new rule) • Experimenters: • Facilitator: instructs user • Observers: take notes, collect data, video tape screen • Executor: run the prototype if faked • Users • 3-5 users, quality not quantity
Usability Test Procedure • Goal: mimic real life • Do not cheat by showing them how to use the UI! • Initial instructions • “We are evaluating the system, not you.” • Repeat: • Give user a task • Ask user to “think aloud” • Observe, note mistakes and problems • Avoid interfering, hint only if completely stuck • Interview • Verbal feedback • Questionnaire • ~1 hour / user
Usability Lab • E.g. McBryde 102
Data • Note taking • E.g. “&%$#@ user keeps clicking on the wrong button…” • Verbal protocol: think aloud • E.g. user expects that button to do something else… • Rough quantitative measures • HCI metrics: e.g. task completion time, .. • Interview feedback and surveys • Video-tape screen & mouse • Eye tracking, biometrics?
Analyze • Initial reaction: • “stupid user!”, “that’s developer X’s fault!”, “this sucks” • Mature reaction: • “how can we redesign UI to solve that usability problem?” • the user is always right • Identify usability problems • Learning issues: e.g. can’t figure out or didn’t notice feature • Performance issues: e.g. arduous, tiring to solve tasks • Subjective issues: e.g. annoying, ugly • Problem severity: critical vs. minor
Cost-Importance Analysis • Importance 1-5: (task effect, frequency) • 5 = critical, major impact on user, frequent occurrence • 3 = user can complete task, but with difficulty • 1 = minor problem, small speed bump, infrequent • Ratio = importance / cost • Sort by this • 3 categories: Must fix, next version, ignored
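The ranking step is simple arithmetic: compute importance / cost for each problem and sort. A minimal sketch; the problems, importance scores, and cost estimates below are invented for illustration.

```python
# Rank usability problems by importance / cost, highest ratio first.
problems = [
    {"problem": "Zoom feature not discoverable", "importance": 5, "cost": 2},
    {"problem": "Legend labels truncated",        "importance": 3, "cost": 1},
    {"problem": "Ugly default color scheme",      "importance": 1, "cost": 3},
]

for p in problems:
    p["ratio"] = p["importance"] / p["cost"]

for p in sorted(problems, key=lambda p: p["ratio"], reverse=True):
    print(f'{p["ratio"]:.2f}  {p["problem"]}')
```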
Refine UI • Simple solutions vs. major redesigns • Solve problems in order of: importance/cost • Example: • Problem: user didn’t know he could zoom in to see more… • Potential solutions: • Better zoom button icon, tooltip • Add a zoom bar slider (like Moosburg) • Icons for different zoom levels: boundaries, roads, buildings • NOT: more “help” documentation!!! You can do better. • Iterate • Test, refine, test, refine, test, refine, … • Until? Meets usability specification
Project revisited • For implementation projects: • Informal test • A few users • Not (tainted) info vis students • 102 lab not required • Simple data collection • Biometrics optional! • 1 iteration • Exploit this opportunity to improve your design • For experiment projects: • See controlled experiments