
Experimental methods: A review






Presentation Transcript


  1. Experimental methods: A review (Source: Tombaugh & Dillon’s “A Practical Introduction to Experimental Design in CHI Research”, 1993)

  2. Empirical methods in HCI • Heuristic evaluation • Controlled laboratory experiments • Quasi-experiments • Ethnographic observation • Task analysis • User studies/user testing

  3. Experiment Design is similar to user interface design: • Iterative: conceptualize, pilot, improve, run • The perfect experiment does not exist! • Experiments involve tradeoffs: • cost vs. running the ideal experiment • saving time vs. running ideal experiment • sometimes there are conflicting guidelines

  4. What IS an experiment? • All experiments have control conditions (comparisons are made) • Hypotheses are tested • Assignment is random (If assignment is not random, then it’s a quasi-experiment) • In the successful experiment, conclusions about causality can be made (no bias or confounds) • Can be replicated!!

  5. Criteria for expt design • External validity • Internal validity • Reliability

  6. Overview, experiment design (flowchart): Determine research problem → Design experiment (consider validity, etc.; pilot!!) → Is the design OK? NO: revise the design. YES: collect data.

  7. Choices in experiment design • Variables (to manipulate and measure) • Design (within-Ss, between-Ss, mixed) • Controls (what will be compared) • Sample (how subjects will be chosen) • Task (many considerations) • Stimuli

  8. Variables • IV: Independent variable (What is manipulated; “factor”, “treatment”) • DV: Dependent variable (What is measured) • Extraneous variable: (Any variable, other than the IV, that might affect the DV) • Confound: An extraneous variable that co-varies with the IV

  9. Confounds • Example from handout: Compare Screen A vs. Screen B • Screen A is used in a room w/ windows • Screen B is used in a room w/out windows • What can you conclude if performance is better with Screen B than Screen A? • Another example: The dreaded “end of the semester” effect

  10. Between subjects designs • Each person is tested in one condition

  11. Avoiding confounds • Subjects are randomly assigned to conditions. • If individual differences are likely to be important, subjects can be matched on important characteristics.
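Random assignment is easy to get right in software. A minimal Python sketch (illustrative only; the function name and the two screen conditions are invented for the example, not taken from the source):

```python
import random

def assign_conditions(subjects, conditions, seed=0):
    """Randomly assign subjects to conditions, keeping group sizes
    as equal as possible: shuffle once, then deal round-robin."""
    rng = random.Random(seed)          # fixed seed so the assignment is reproducible
    pool = list(subjects)
    rng.shuffle(pool)
    return {c: pool[i::len(conditions)] for i, c in enumerate(conditions)}

groups = assign_conditions(range(20), ["Screen A", "Screen B"])
# Each condition receives 10 subjects, chosen at random.
```

Matching on subject characteristics (the second bullet above) would replace the plain shuffle with a shuffle within matched strata, but the round-robin deal stays the same.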

  12. Within subjects designs (also known as repeated measures) • Each person is tested in all conditions • This avoids effects of individual differences! • Order of conditions is randomized or counterbalanced • But you can get unexpected order effects!

  13. Advantages of Within-Ss • fewer subjects needed • statistical tests are more powerful • control for individual differences • best way to study learning or forgetting or the effects of expertise (longitudinal designs)

  14. Disadvantages of Within-Ss • Order effects can ruin the results (practice, fatigue, learning, boredom); counterbalancing is necessary! • More testing materials are required. • Order of presentation of materials must be controlled (counterbalanced). • It can be difficult to get subjects to return for repeated testing. (Between-Ss designs are an alternative.)
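One standard way to counterbalance order is a Latin square, in which each condition appears exactly once in every serial position across the set of orders. A minimal Python sketch (illustrative, not from the source; this is a plain rotation square, and balanced designs that also control immediate carryover effects need more care):

```python
def latin_square(conditions):
    """Rotate the condition list so that each condition occupies
    every serial position exactly once across the generated orders."""
    n = len(conditions)
    return [[conditions[(row + pos) % n] for pos in range(n)] for row in range(n)]

orders = latin_square(["A", "B", "C"])
# Subject 1 runs A B C, subject 2 runs B C A, subject 3 runs C A B;
# assign subjects to rows in rotation (subject 4 gets row 1 again, etc.).
```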

  16. Mixed designs • Combine between- and within-subjects comparisons • One or more comparison is between two groups of different people • One or more comparison is within the same group of people

  17. What is an interaction? Example: Which type of dialog is better, commands or menus? (When the answer is “it depends!”, that suggests an interaction.)

  19. Example of an interaction
                    DIALOG STYLE
                    Command   Menu
  USER   Novice        42       28
         Expert        16       20
  (Here, the DV is the time it takes to do the task. The IVs are Dialogue Style and User’s Expertise.)

  20. Interaction For example: Which type of dialog is better, commands or menus? Answer: Commands are faster for experts & menus are faster for novices. (If the answer is “it depends…”, that is, it depends on the level of another independent variable such as expertise, then there is an interaction.)

  22. Example of an interaction: [line graph of the slide-19 table: Time plotted for Novice and Expert users under the Command and Menu dialog styles; the two lines cross] (Which representation is better, the table or the graph?)
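Numerically, an interaction shows up as a difference of differences. Working through the slide-19 table in Python (times copied from the slide; units as given there):

```python
# Task times from the slide-19 table.
times = {("Novice", "Command"): 42, ("Novice", "Menu"): 28,
         ("Expert", "Command"): 16, ("Expert", "Menu"): 20}

# Simple effect of dialog style at each level of expertise:
novice_effect = times[("Novice", "Command")] - times[("Novice", "Menu")]  # +14: menus faster
expert_effect = times[("Expert", "Command")] - times[("Expert", "Menu")]  # -4: commands faster

# A nonzero difference-of-differences is the numerical signature of an
# interaction; here the simple effects even flip sign (a crossover).
interaction = novice_effect - expert_effect  # 18
```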

  25. Between vs within vs mixed designs
                    DIALOG STYLE
                    Command    Menu
  USER   Novice      10 Ss     10 Ss
         Expert      10 Ss     10 Ss

  26. Between vs within vs mixed designs
                    DIALOG STYLE
                    Command    Menu
  USER   Novice     --- 10 Ss ---
         Expert     --- 10 Ss ---
  (What are the advantages and disadvantages of having Dialog Style as a within-Ss vs. a between-Ss variable?)

  27. What can go wrong - Two types of errors: • Type 1 - Your data show a statistical effect, but it’s not real. • Type 2 - Your data fail to show any statistical effect, but the effect is out there in the world. Avoid Type 1 errors by replicating your effects. Avoid Type 2 errors by increasing your power.
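A quick simulation makes the Type 1 risk concrete: even when no real effect exists, roughly 5% of experiments will clear a conventional .05 criterion by chance. A Python sketch (illustrative; it uses a rough |t| > 2 cutoff rather than an exact test):

```python
import random
import statistics

def false_positive_rate(n_per_group=30, trials=2000, seed=1):
    """Simulate experiments in which the null hypothesis is TRUE
    (both groups drawn from the same distribution) and count how
    often |t| exceeds 2.0 -- roughly a two-tailed .05 criterion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Two-sample t statistic for equal group sizes.
        se = ((statistics.variance(a) + statistics.variance(b)) / n_per_group) ** 0.5
        t = (statistics.mean(a) - statistics.mean(b)) / se
        hits += abs(t) > 2.0
    return hits / trials
```

Running it gives a rate near .05: one spurious "effect" in every twenty null experiments, which is exactly why replication is the defense against Type 1 errors.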

  29. Power • Run more subjects • Include more observations/items/tasks • Eliminate noise (reduce variance) • Try to achieve better control (The slide’s figure showed pairs of distributions with the same difference in means but very different variances.)

  30. Common ways to reduce variance and increase power • Random assignment • Minimize differences in subjects & items • Try to use a within-Ss design • Match subjects’ characteristics - remove confounds by making the comparison groups or items similar • Counterbalance
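The same kind of simulation shows why the list above works: power rises with more subjects and falls with more noise. A Python sketch (illustrative; again using a rough |t| > 2 criterion rather than an exact test):

```python
import random
import statistics

def power_sim(effect, n_per_group, sd=1.0, trials=1000, seed=2):
    """Fraction of simulated experiments that detect a TRUE mean
    difference `effect` between two groups (|t| > 2.0 criterion)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        se = ((statistics.variance(a) + statistics.variance(b)) / n_per_group) ** 0.5
        hits += abs((statistics.mean(a) - statistics.mean(b)) / se) > 2.0
    return hits / trials

# power_sim(0.5, 40) comes out well above power_sim(0.5, 10), and
# doubling sd at a fixed n cuts the detection rate sharply.
```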

  31. Quasi-Experiments • When no random assignment is possible, e.g. in the study of: • Gender effects • Bilingualism • Experts vs. novices (unless expertise can be acquired in the course of the expt.) • You cannot conclude anything about causality in a quasi-expt, due to potential confounds! • Many HCI expts are quasi-expts.

  32. When is it appropriate to do an expt? • when a direct comparison of two or more systems or variables is required • when it’s feasible to achieve some measure of control • when you want to show causality • when you want to test predictions; when precise understanding is needed • when you need data for establishing the parameters of a model

  33. Advantages of experiments • provide comparative data • enable strong statements of causality • a wide variety of designs are available • excellent conceptual match between experimental problems and statistical tests (always know how you’re going to analyze the data before you collect it!!)

  34. Limitations of experiments Experiments often • are time consuming • are expensive • can be ineffective for comparing complex systems (what is causing differences?) • can result in weak generalization (if the task is overly simplified, or if materials and setting aren’t varied enough)

  35. “R&D” (Research vs. Development) • Scientific research is strongly associated with experiments (hypothesis-testing) • Science also includes a descriptive component • Discuss: Applied vs. basic research; development

  36. Summary Designing an expt involves many tradeoffs - There is no perfect expt. Pilot! (Expt design is iterative!) Applied research makes tradeoffs differently than academic research: it’s more timely, more generalizable, more descriptive, less controlled, and more relevant to real-world problems.

  37. A note about questionnaires:

  38. Questionnaire design • Sample questionnaire, CGB, Fig. 14.3 • Overall, the system was easy to use • “strongly disagree” to “strongly agree” • The system was quick and efficient • The system had the capabilities I expected. • What are the top 2 suggestions you could make to improve the system? • What are the top 3 things you liked about the system?

  39. Questionnaire design - critique! • Sample questionnaire, CGB, Fig. 14.3 • Overall, the system was easy to use. • “strongly disagree” to “strongly agree” • The system was quick and efficient. • The system had the capabilities I expected. • What are the top 2 suggestions you could make to improve the system? • What are the top 3 things you liked about the system? Biased!

  41. Review of user studies (Gomoll’s article)

  42. User studies (Gomoll, 1990) • Set up observation (tasks, users, situation) • Describe the evaluation’s purpose • Tell the user she can quit at any time • Introduce the equipment • Explain how to “think aloud” • Explain that you will not provide help • Describe the tasks and system • Ask for questions • Conduct the observations (debrief the subject) • Summarize results
