IEEM 552 - Human-Computer Systems

IEEM 552 - Human-Computer Systems • Dr. Vincent Duffy - IEEM • Week 7 - Hazards in HCI • March 16, 1999 • http://www-ieem.ust.hk/dfaculty/duffy/552 • email: vduffy@uxmail.ust.hk 1

For today • 1. Further discussion/review of the summary of results you submitted • based on the week 3 in-class exercise ‘an example’ • 2. Hazards to conducting and interpreting HCI experiments • 3. Brief discussion - Pictogram, Miller & Stanney • 4. Brief discussion about exam 1 2

A test of 2 interfaces - Which interface is better? • Self rating of Expertise for Library Online Searches • Human Subjects Consent Form • Data sheet for data collector • Data Sheet for Subject • Introduction • subjects will leave the room 3

Is it enough to ask which is better? • What do I expect from the results/data? • What are the hypotheses? • What do I think is true about the system before I start? • What questions am I trying to answer with the data/analyses? • H.1. www search is faster. • H.2. More errors using www search. • H.3. More data can be found by www. 4

Self-rating and consent form 5

Group # _____ Name member _______ • Please rate your experience using the Library search databases at UST or other universities. • Least experience 1 2 3 4 5 Most experience 6

Data sheet for subject 7

Data sheet for subject • Group no./name _________________ • The test administrator will show you which interface to use first. Please do all 3 of the following tasks. Do not stop until all 3 tasks are completed. • Interface 1 • 1. Please write the 'call number' of the book titled 'user interface design' by Eberts. • Call no. ________________ • 2. Please write the call number for a video about the Wright Brothers. We do not know the title. However, it is less than 30 minutes long (duration). • Call no. ______________ • 3. Please locate/find as many Visual C++ books as you can (in less than 5 minutes). • Number of Visual C++ books found _____________ 8

Data/instructions for data collector 9

Before the experiment - subjects out of the room • Group #____ Data sheet for data collector • 1. record name of data collector__________ • 2. the data collector will need to record time (by watch, clock, computer, etc.) • 3. be sure subject signs human subjects consent form • 4. give instruction sheet, allow 1 minute for reading and 1 minute for questions. • 5. Show the subject how to start library online systems. • Count time beginning when the subject double clicks the correct icon • (the two interfaces to be tested are www and telnet/dos) • odd numbered groups (e.g. 1,3,5) should begin with the www interface • even numbered groups (e.g. 2,4,6) should begin with the telnet/dos interface 10
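
The odd/even rule above counterbalances which interface a group sees first. A minimal sketch of that rule, assuming group numbers start at 1; the function name `interface_order` is made up for illustration and is not part of the class materials:

```python
def interface_order(group_number: int) -> tuple[str, str]:
    """Return (first, second) interface for a group, following the odd/even rule."""
    if group_number < 1:
        raise ValueError("group numbers start at 1")
    # Odd-numbered groups begin with www; even-numbered groups with telnet/dos.
    if group_number % 2 == 1:
        return ("www", "telnet/dos")
    return ("telnet/dos", "www")


for group in range(1, 7):
    first, second = interface_order(group)
    print(f"Group {group}: {first} first, then {second}")
```

Alternating the starting interface spreads any practice effect across both interfaces instead of letting it favor one of them.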

During the experiment • collect 14 pieces of data - 7 pieces for each of two interfaces • 1. subject name/group no. _______________ • 2. www interface (1) or telnet/dos interface (2)_______________ • 3. time to find item 1 (call number of 'user interface design' by Eberts). _______________ • 4. time to find item 2 (a video about Wright Brothers - less than 30 minutes video)______________ • (begin counting time immediately after finding item 1) • 5. number of errors in finding item 2 (count error as any back, prev. record, start over, etc.)________ • 6. time to find (how many books can you find on Visual C++ in less than 5 minutes)._____________ • 7. quantity of Visual C++ books ________________ 11

After collecting the data • Use sample experimental data (previously collected) • upload to www, download so accessible to you, run analyses • using SAS (Statistical Analysis System) • how to compare? • Simple test of difference in means - we used a T-test (comparing the means of only 2 groups) • discuss hypotheses • for hw: asked you to interpret the output 6 12

For assumptions of analyses/output • hint : see Chapter 6 Cody and Smith (p.138-149) • How do I determine if Hypothesis 1-3 are supported? • H.1. www search is faster. • H.2. More errors using www search. • H.3. More data can be found by www. 13

Sample SAS program 14

H.1. www search is faster. • For our hypothesis we want to check the difference in means. • First check whether the variances are equal, to help decide which p value to use (find p>F’; here Ho says ‘variances are equal’; if p<.05 reject Ho) • Either way, look at p>|T|: the probability of a difference at least this large if Ho is true, i.e. the risk of rejecting Ho incorrectly. • If p<.05, reject Ho for the T-test, for which Ho says ‘means are same’ • if you reject, then conclude the means are statistically different • For time for task 1, means are not statistically different. 15
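
The two statistics behind p>F’ and p>|T| can be sketched in a few lines. This is an illustration only: the times are invented, not the class data, and the sketch stops at the test statistics because the p values need the F and t distributions, which Python's standard library does not provide. The function names are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical task-1 times in seconds; NOT the class data.
www_times = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0]
telnet_times = [50.0, 58.0, 44.0, 66.0, 49.0, 57.0]


def folded_f(a, b):
    """Larger sample variance divided by the smaller one (always >= 1)."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)


def pooled_t(a, b):
    """Two-sample t statistic and df, assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


f_stat = folded_f(www_times, telnet_times)
t_stat, df = pooled_t(www_times, telnet_times)
print(f"F' = {f_stat:.2f}, t = {t_stat:.2f}, df = {df}")
```

An F’ close to 1 is what the ‘equal variances’ line of the output assumes; a large F’ points to the ‘unequal variances’ line instead.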

How do these results influence conclusions about H1? 16

H.2. More errors using www search. Suppose your results looked like this... • check difference in means • First check whether the variances are equal, to help decide which p value to use (find p>F’; here Ho says ‘variances are equal’; if p<.05 reject Ho) • Either way, look at p>|T|: the probability of a difference at least this large if Ho is true, i.e. the risk of rejecting Ho incorrectly. • If p<.05, reject Ho for the T-test, for which Ho says ‘means are same’ • if you reject, then conclude the means are statistically different • For number of errors, means are not statistically different. 17

However, this was your data... • What do you conclude about H2?

H.3. More data can be found by www. • check difference in means • First check whether the variances are equal: p=.004, so reject Ho (that the variances are equal) • use this to decide which p value to read for the T-test • In this case, look at p>|T| on the ‘unequal variances’ line to decide whether to reject Ho for the T-test, which says ‘means are equal’ • p=.356, so do not reject Ho for the T-test (‘means are same’) • had you rejected, you would conclude the means are statistically different • For quantity of books found, means are not statistically different. 18
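
When the variances differ, as here, the t statistic is computed without pooling them and the degrees of freedom are approximated. A minimal sketch using the Satterthwaite approximation, with invented book counts (not the class data) and a hypothetical function name:

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical counts of books found in 5 minutes; NOT the class data.
www_books = [12, 30, 4, 25, 18, 41]
telnet_books = [14, 16, 13, 17, 15, 15]


def welch_t(a, b):
    """t statistic and Satterthwaite df without assuming equal variances."""
    na, nb = len(a), len(b)
    se2_a = variance(a) / na  # squared standard error of each mean
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (na - 1) + se2_b**2 / (nb - 1))
    return t, df


t_stat, df = welch_t(www_books, telnet_books)
print(f"t = {t_stat:.2f}, approximate df = {df:.1f}")
```

The approximate df always falls between the smaller group's n - 1 and the pooled n1 + n2 - 2, so the unequal-variances test is never more lenient than the pooled one.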

None of our 3 hypotheses were fully supported • Does this mean we were incorrect from the start? • WWW is no better than the dos/telnet based system? • Does it mean www is • not faster, • not less error prone, • not more likely to allow you to find more information? • What might have gone wrong? 19

Hazards to conducting & interpreting HCI experiments • To be avoided • when conducting experiments • To be noticed • when reading experiments of other people • to see if the methodology or interpretation of results invalidates some conclusions • Sheil (1981) found that a large percentage of studies had some methodology problem that made the results suspect 20

What is wrong with the following? (please submit a separate sheet with your answers) • Q1. Hypothesis states/asks: ‘Is this new interface effective?’ • Q2a. New interface is compared to old interface. The subjects tested using the new design also have used the old design. • Q2b. Subjects for the new improved design are treated more enthusiastically (or given a quieter room) • Q3. Software manufacturer tests financial planning software on its employees (mostly programmers) • Q4. Two experiments show the same mean difference between interface measures, but the difference is statistically significant in one experiment and not in the other • Q5. One person administers a test to 10 subjects for one interface test condition (treatment). A different person administers the test to 11 subjects for the other. • Q6. Suppose a significant correlation (R=.55, p<.05) is found between ‘percent correct’ and ‘frequency of use’ of help menus • Also suppose a correlation is found between a Likert scale (1-7 scale) variable & ‘frequency of use’ of help menus. Which are you more likely to use for drawing conclusions? • Q7. An experimenter finds that there is no statistical difference between measured variables of the old and new designs. He concludes that the two are the same. Marketing of the new design is halted. • Q8. A vendor is trying to sell your software company a computer programming tool that was found to reduce programming time by 50%. You are told you should expect a 50% reduction in software development time. The product was previously tested on novices.

Q.1. What is wrong with this? • Q. Is this new interface effective? • what is meant by effective? • should this mean faster or fewer errors? • should this mean people prefer this one? • for whom? expert or novice? • effective compared to what? 2 different designs? some standard? • evaluations must be made for two or more treatment conditions • a better question/hypothesis • can the new interface be used with less assistance? 22

Hazard 1 - Question phrased improperly • What’s the big deal? • experimenter may discover certain measures for which data should have been collected • too late • How to avoid it • planning, behind the scenes work • conduct a pilot test on a small number of subjects • understanding the underlying theories related to the independent variables or the dependent (performance) measures 23

Q.2. What is wrong with this? • Q.2.a. New interface is compared to old interface. The subjects tested using the new design also have used the old design. • important variable not controlled • subjects have prior experience (training) • Q.2.b. Subjects for the new improved design are treated more enthusiastically (or given a quieter room) • treatment of subjects varies with the level of the independent variable 24

Hazard 2 - Important variables not controlled • What is the big deal? • an uncontrolled (confounding) variable can simulate a treatment effect or counteract it (preventing its detection) • How to eliminate or minimize this? • list all the variables that might influence the results • control each variable through • randomization, holding it constant or eliminating it (variability), or manipulating it 25
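
Randomization, the first control listed above, can be as simple as shuffling the subject list and dealing subjects out to conditions. A minimal sketch; the subject IDs, condition names and the function `randomly_assign` are hypothetical:

```python
import random


def randomly_assign(subjects, conditions, seed=None):
    """Shuffle subjects, then deal them out to conditions in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, subject in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(subject)
    return groups


# Hypothetical subject IDs; the seed only makes the sketch repeatable.
subjects = [f"S{n:02d}" for n in range(1, 13)]
assignment = randomly_assign(subjects, ["old design", "new design"], seed=552)
for condition, members in assignment.items():
    print(condition, sorted(members))
```

Because chance decides who lands in which condition, variables the experimenter did not think to list (prior experience, motivation) tend to balance out across conditions rather than line up with one of them.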

Consider ‘an example’ Hyp. 2 & ‘which came first’? • For task 2, ‘find Wright Brothers video’, • time to complete and errors were significantly higher for the first interface used (is it because of improvement from doing the task a 2nd time, or is www easier to use? Do you know?) 26

Q.3 What is wrong with this? • Software manufacturer tests financial planning software on its employees (mostly programmers) • inappropriate sample used • tested mostly experts when software was designed for computer novices • mixed group of novice and expert employees 27

Hazard 3 - Inappropriate sample used • What’s the big deal? • results can be misleading if they are generalized to the wrong group of users • How to avoid? • try to demonstrate that subjects have been stabilized at some performance level • or report honestly that subjects may not have been allowed sufficient time to become proficient at task (so as to stabilize level) 28

Q.4. What is wrong with this? • Two experiments show the same mean difference between interface measures, but the difference is statistically significant in one experiment and not in the other • not enough subjects are used • What is meant by statistically significant? • usually set at p<.05 (the probability of rejecting the null incorrectly; e.g. null: no difference) 29
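
Why the same mean difference can be significant in one experiment and not in another: the t statistic divides the difference by a standard error that shrinks as subjects are added. A small sketch with made-up numbers, assuming two equal-sized groups that share one standard deviation:

```python
from math import sqrt


def t_for_difference(mean_diff, sd, n_per_group):
    """t statistic for two equal-sized groups sharing a common SD."""
    return mean_diff / (sd * sqrt(2 / n_per_group))


# Hypothetical numbers: a 5-second difference with a 10-second SD.
for n in (5, 20, 80):
    print(f"n = {n:2d} per group: t = {t_for_difference(5.0, 10.0, n):.2f}")
```

With these invented values only the largest sample pushes t past the usual two-sided critical value of roughly 2, even though the mean difference never changes.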

Hazard 4 - Not enough subjects used • Why does this commonly happen? • finding appropriate sample, that is large enough, is difficult sometimes • in practice, number of subjects often determined by number available or based on reports of previous studies • How to avoid? • choose larger samples (more expensive) • consider trade-offs: sample size, variability, size of potential effect (to be measured) 30
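
The trade-off among sample size, variability and effect size can be made concrete with the common normal-approximation formula n = 2((z_alpha/2 + z_power) * SD / effect)^2 per group. This is a planning sketch, not something taken from the lecture: the 80% power target and all the numbers are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(effect, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Hypothetical planning values, in seconds.
for effect in (10.0, 5.0, 2.5):
    print(f"effect {effect:4.1f}, SD 10: {subjects_per_group(effect, 10.0)} per group")
```

Halving the effect to be detected roughly quadruples the number of subjects needed, which is why small expected differences make experiments expensive.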

Q.5. What is wrong with this? • One person administers a test to 10 subjects for one interface test condition (treatment). A different person administers the test to 11 subjects for the other. • what if the two experimenters (people administering the test) conduct the experiment differently? • test administered improperly • two different people should not administer the test in this manner 31

Hazard 5 - Test administered improperly - experimental studies • What’s the big deal? • different distractions or test conditions can influence the results, or increase the variability, making actual differences difficult to detect (or differences may be due to sloppy test conditions) • How to avoid? • Test the interface to eliminate bugs; stabilize the experimental room and general test conditions • How is this different for field studies? 32

Q.6. What is wrong with this? • a significant correlation (R=.55, p<.05) is found between ‘percent correct’ and ‘frequency of use’ of help menus • a correlation is found between a Likert scale (1-7 scale) variable & ‘frequency of use’ of help menus • the first example violates the assumptions of the method of analysis used (percent correct is not usually normally distributed; Likert-scale responses are commonly treated as approximately normal). • parametric statistics assume normality and homogeneity of variances 33

Suppose our T-test shows a significant difference between the interfaces in mean time to complete task 2. Can we safely conclude that our hypothesis is supported? • What kind of distribution is shown by this data? • Normal, uniform? • What are the assumptions of the statistics we used (e.g. T-test)? • Normality • If the data is not normally distributed, you cannot rely on statistics that require normality as a basic assumption (correlation, t-test, ANOVA, etc.).

Hazard 6 - Improper analysis used • What’s the big deal? • it can invalidate the results of the experiment • How to avoid? • test the data - distributions of variables should be normal and should have equality of variances - for multi-variate stats like regressions • if necessary, use a different method of analysis (non-parametric - generally not as powerful) or transform the data • be sure that the data meets the assumptions before running the analysis, otherwise you waste your time 35
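
One non-parametric alternative to the two-sample T-test is the Mann-Whitney U test, which compares ranks rather than raw values and so is not thrown off by a single extreme time. The sketch below computes only the U statistic by direct pair counting; the times are invented, and turning U into a p value needs a table or a statistics package.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a beats b, ties counting half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical, right-skewed task times in seconds; NOT the class data.
www_times = [31, 35, 38, 40, 44, 190]
telnet_times = [42, 47, 51, 55, 60, 63]

u_www = mann_whitney_u(www_times, telnet_times)
u_telnet = mann_whitney_u(telnet_times, www_times)
print(f"U(www) = {u_www}, U(telnet) = {u_telnet}")
```

The two U values always add up to the number of pairs (here 6 x 6 = 36); the further the smaller one falls below half of that, the stronger the evidence that one group tends to be faster.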

Q.7. What is wrong with this? • An experimenter finds that there is no statistical difference between measured variables of the old and new designs. He concludes that the two are the same. Marketing of the new design is halted. • if you cannot reject the null hypothesis (no difference), that does not prove it • it only shows that you could not detect a difference • one may still exist. how? 36

Hazard 7 - Null effects interpreted incorrectly • examples of some things that may make a difference more difficult to detect • Hazard 2. a confound may have occurred • Hazard 4. not enough subjects to detect a difference • Hazard 5. treatments administered poorly causing high variability in the conditions • Hazard 6. was wrong statistical test conducted? • your measure may not be sensitive enough to detect a difference 37
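
How easily a real difference can be missed is what statistical power measures. A rough sketch using a normal approximation for a two-sided, two-group comparison; the effect size, SD and sample sizes are hypothetical, and an exact calculation would use the t distribution instead:

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(effect, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    shift = effect / (sd * sqrt(2 / n_per_group))
    # Chance of landing beyond either critical value when the effect is real.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


# Hypothetical values: a true 5-second difference with a 10-second SD.
for n in (5, 20, 80):
    print(f"n = {n:2d} per group: power ~ {approximate_power(5.0, 10.0, n):.2f}")
```

When power is low, ‘no significant difference’ is the most likely outcome even if the new design really is better, so a null result from a small study says little about whether the designs are the same.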

Q.8. What is wrong with this? • a vendor is trying to sell your software company a computer programming tool that was found to reduce programming time by 50%. You are told you should expect 50% reduction in software development time. The product was previously tested on novices. • software developers are likely not novices, so it is difficult to know what to expect. 38

Hazard 8 - Results generalized beyond conditions tested • What’s the big deal? • can mislead readers. • we can be misled if we only read the abstract and conclusion • How to avoid? • be careful not to generalize the results beyond the sample & conditions tested • your results may lend evidence, but further testing may be needed to confirm 39

For week 8 - Exam details • Old exam - on web page • closed book format in class • 100 points • 65% lecture notes, 3 videos, 2 cases & demo • 35% integrating concepts with the research papers & the class trip • week 1-7, lectures 1-5 & demos. • Background reading: Chapter 1,3 Eberts; Cody & Smith, Ch. 6 (p.138-146); 3 journal papers - ‘Thinking Aloud’, ‘Task complexity’ and ‘Pictogram’; 2 cases. 40
