Tutorial 2019 - Evaluation of artifacts in Design science - PART 1

Tutorial 2019 - Evaluation of Artifacts in Design Science - PART 1 • Gondy Leroy, Ph.D. • Management Information Systems, University of Arizona • gondyleroy@email.arizona.edu

Presenter Background • Education • B.S. and M.S. in Cognitive, Experimental Psychology, University of Leuven, Leuven, Belgium • M.S. and Ph.D. in Management Information Systems, University of Arizona, Tucson, AZ • Relevant Experience • Principal Investigator $2.3M of funded research, by NIH, NSF, AHRQ, Microsoft Research • Book “Designing User Studies in Informatics”, Springer (August 2011) • Current Position and Contact Information • Gondy Leroy, PhD • Professor, Management Information Systems • Director - Tomorrow's Leaders Equipped for Diversity • Eller College of Management, University of Arizona • http://nlp.lab.arizona.edu/

Additional Resources • Content based on Designing User Studies in Informatics Gondy Leroy, Ph.D. Springer, 2011 • Freely available in most academic libraries (chapters can be downloaded electronically) • Book information • bridges gap between informatics, design science, and behavioral sciences. • explains what an experimenter should pay attention to and why • Practical, to-the-point, hands-on • Contains a ‘cookbook’ with step-by-step instructions for different types of artifact evaluations

Overview • 1. Design science and study types • 2. Experiment design • Independent, dependent, nuisance, confounding variables • Short exercises • 3. Basic statistics • t-test, ANOVA • 4. Exercise • (Extra Materials – not covered but available)

Design Science context PART 1

Design science • Two paradigms in IS: behavioral and design science (Hevner et al 2004) • Behavioral paradigm • Develop theories, test theories • Theories to understand, predict organizational and human phenomena relevant to information systems • Design science • Roots in engineering and AI (Simon 1996) • Problem solving paradigm: create artifact to solve a problem • Evaluation • Can use many approaches to evaluate artifacts • Simulation, mathematical description, impact studies • Focus on correctness, effectiveness, efficiency … • Studies to evaluate impact • Work with users in some cases • Work with experts in some cases • Work with gold standards in some cases

Development Life Cycle • Cyclical nature of development in informatics • Digital libraries, online communities, mobile apps, … • Education, business, medical informatics, biology, commerce, … • Good fit of cycle (regardless of which one is adopted) with different types of user studies • Example: Importance of testing EARLY and FREQUENTLY: • Requirements analysis: most errors originate here, 60% undiscovered until user acceptance testing (Gartner Group, 2009) • Paper prototyping of interfaces: easy to accept changes • Remodeling, renewing, improving: test new features against previous version • Algorithm development and evaluation: possibility to pinpoint location of strengths and weaknesses • Possibility of batch process evaluations • System evaluation: test additional features, test synergy of all components together

Focusing the Study • Increase your chances of getting published and funded by focusing the study according to these three principles: • Define the goal of the system • Will help choose comparison points (=define independent variable) and pinpoint measures (=dependent variables) • Keep stakeholders in mind • Why is the study conducted (to improve?), who is interested in it (what are they interested in?) • Will help relate to people evaluating your artifact and study • Timeline and Development Cycle • What is available for testing, what can be tested in design phase? • Will help design appropriate (series of) studies • Example: reducing the number of no-shows at a clinic • Goal: have more patients show up for appointment, don’t increase workload of staff • Stakeholders: patients (will need to be easy, no effort), staff at clinic (low training, easy to manage), purchaser at clinic (demonstrated effect on no-shows) • Timeline and Development Cycle: paper prototyping for interface, comparison after implementation with existing system

Different Study Types (1/3) • Naturalistic observation • Individuals in their natural setting, no intrusion, ideally, people not aware • Passive form of research, observation: • Observe in person • Use technology to observe (tracking, video, alerts, …) • Case Studies, Field Studies and Descriptive Studies • Several types of each exist • Help explain and answer difficult questions, e.g., “Why was the system not accepted?” • Can consider characteristics of work environment, culture, lifestyle and personal preferences when searching for explanations (systematically controlled in experiments) • Can be combined with action research • Action Research • Case studies + direct involvement of the researcher • Goal is to solve a problem or improve an existing situation • Less role of observer but iterative and error-correcting approach

Different Study Types (2/3) • Surveys • Surveys are useful to measure opinions, intentions, feelings and beliefs, … • Dangers: 1) often hastily constructed, misconception about easiness of constructing a survey and validity, 2) Not a measure of behaviors or actions 3) Few people in IS properly trained to design surveys • Correlation Studies • About changes in variables, to find where change in one variable coincides with change in another. • Involve many variables and many data points (surveys and large population samples) • Do not attempt to discover what causes a change
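As a small illustration of what a correlation study computes, the sketch below calculates Pearson's r for two variables. The data (hours of system use and a satisfaction score) are invented for the example and are not from the tutorial. A high r only shows that the variables change together; it says nothing about what causes the change.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical survey data: weekly hours of use and satisfaction (1-7 scale)
hours_of_use = [1, 3, 4, 6, 8, 9]
satisfaction = [2, 3, 3, 5, 6, 6]
r = pearson_r(hours_of_use, satisfaction)
print(f"r = {r:.2f}")
```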

Different Study Types (3/3) • Quasi-experiment • No randomization (main difference with experiments) • Often useful when groups pre-exist because • Geographical constraints, e.g., population across a country, in different cities • social constraints, e.g., siblings in families • time constraints, e.g., comparable group already studied in past • Experiments (focus of tutorial) • Goal is to evaluate hypotheses about causal relations between variables • In informatics: • evaluate the impact, benefit, advantages, disadvantages or other effects of information systems, algorithms, or interfaces, ... • New/improved system is compared to other systems or under different conditions and evaluated for its impact

Two different KINDS of Experiments related to stages of development • Early Stages: focusing on algorithm/system development • Indirect involvement of users • Batch-process approach to experiments • Example Advantages: large scale, efficient, quick turn around • Example Dangers: confusing development and evaluation, forgetting design considerations (randomization, double blind evaluation, …) • Later Stages: focusing on human-computer interaction, longitudinal evaluations of impact • Direct involvement of users • Example Advantages: rich data, useful information for product improvement • Example Dangers: users not representative, IRB slows down process

Experimental Design in a Nutshell Part 2

Three steps to design a study • STEP 1: What is the goal • The answer helps define independent variables • Independent variables: what will be manipulated • STEP 2: How will we know that the goal is reached • The answer helps define dependent variables • Dependent variables: what will be measured • STEP 3: What can affect the system, users, use, actions, opinions, … • The answer helps define confounded and nuisance variables • Confounded variables: 2 (or more) variables that change together from one treatment to another, so their effects cannot be separated • Nuisance variables: variables that add errors and variance but are of no interest to the researcher (they should be controlled)

STEP 1: Choose the Independent Variables • Define the goal of the study • Examples: • Evaluate a new system where there was no system before? • Evaluate whether a system is better than another system • Evaluate whether different types of users get better use with system • Independent Variable (IV) • Other terms: treatment, intervention • Manipulated by researcher • Goal of a user study is to compare the results for different treatments • Types of Independent Variable (IV) • Qualitative Independent Variables • Describe different kinds of treatments • Quantitative Independent Variables • Describes different amounts of a given treatment.

STEP 1: Choose the Independent Variables • There can be multiple independent variables • In informatics, often only 1 or 2. • Seldom more. • Critical to make this a TRUE experiment • Assignment to a condition/treatment has to be done randomly
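The random assignment mentioned above can be sketched in a few lines. This is a minimal example, not a prescribed procedure: the participant IDs, the condition names, and the fixed seed are assumptions made for illustration. Shuffling first and then dealing participants out in turn keeps the group sizes (nearly) equal.

```python
import random

def assign_conditions(participants, conditions, seed=None):
    """Randomly assign each participant to one condition, balancing group sizes."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)          # random order decides who lands where
    assignment = {}
    for i, person in enumerate(shuffled):
        assignment[person] = conditions[i % len(conditions)]
    return assignment

# Hypothetical participant IDs and condition names
participants = [f"P{i:02d}" for i in range(1, 21)]
assignment = assign_conditions(participants, ["old_system", "new_system"], seed=42)
for person in sorted(assignment):
    print(person, assignment[person])
```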

STEP 1: Find the Independent Variables EXERCISE • Y. Gu, G. Leroy, D. Kauchak, "When synonyms are not enough: Optimal parenthetical insertion for text simplification," Accepted for the AMIA Fall Symposium, November 2017, Washington DC. • Abstract−As more patients use the Internet to answer health-related queries, simplifying medical information is becoming increasingly important. To simplify medical terms when synonyms are unavailable, we must add multi-word explanations. Following a data-driven approach, we conducted two user studies to determine the best formulation for adding explanatory content as parenthetical expressions. Study 1 focused on text with a single difficult term (N=260). We examined the effects of different types of text, types of content in parentheses, difficulty of the explanatory content, and position of the term in the sentence on actual difficulty, perceived difficulty, and reading time. We found significant support that enclosing the difficult term in parentheses is best for difficult text and enclosing the explanation in parentheses is best for simple text. Study 2 (N=116) focused on lists with multiple difficult terms. The same interaction is present although statistically insignificant, but parenthetical insertion can still significantly simplify text.

STEP 1: Find the Independent Variables EXERCISE • G. Leroy, "Persuading Consumers to Form Precise Search Engine Queries", American Medical Informatics (AMIA) Fall Symposium, San Francisco, November 14-18, 2009. • Abstract−Today’s search engines provide a single textbox for searching. This input method has not changed in decades and, as a result, consumer search behaviour has not changed either: few and imprecise keywords are used. Especially with health information, where incorrect information may lead to unwise decisions, it would be beneficial if consumers could search more precisely. We evaluated a new user interface that supports more precise searching by using query diagrams. In a controlled user study, using paper based prototypes, we compared searching with a Google interface with drawing new or modifying template diagrams. We evaluated consumer willingness and ability to use diagrams and the impact on query formulation. Users had no trouble understanding the new search method. Moreover, they used more keywords and relationships between keywords with search diagrams. In comparison to drawing their own diagrams, modifying existing templates led to more searches being conducted and higher creativity in searching.

STEP 1: Find the Independent Variables EXERCISE • C. H. Ku, A. Iriberri, and G. Leroy, Crime Information Extraction from Police and Witness Narrative Reports, 2008 IEEE International Conference on Technologies for Homeland Security, May 12-13, 2008 • Abstract−To solve crimes, investigators often rely on interviews with witnesses, victims, or criminals themselves. The interviews are transcribed and the pertinent data is contained in narrative form. To solve one crime, investigators may need to interview multiple people and then analyze the narrative reports. There are several difficulties with this process: interviewing people is time consuming, the interviews – sometimes conducted by multiple officers – need to be combined, and the resulting information may still be incomplete. For example, victims or witnesses are often too scared or embarrassed to report or prefer to remain anonymous. We are developing an online reporting system that combines natural language processing with insights from the cognitive interview approach to obtain more information from witnesses and victims. We report here on information extraction from police and witness narratives. We achieved high precision, 94% and 96%, and recall, 85% and 90%, for both narrative types.

STEP 1: Find the Independent Variables EXERCISE • G. Leroy, A. Lally, and H. Chen. “The Use of Dynamic Contexts to Improve Casual Internet Searching,” ACM Transactions on Information Systems (ACM -TOIS), vol. 21 (3), pp 229-253, July 2003. • Abstract−Research has shown that most users’ online information searches are suboptimal. Query optimization based on a relevance feedback or genetic algorithm using dynamic query contexts can help casual users search the Internet. These algorithms can draw on implicit user feedback based on the surrounding links and text in a search engine result set to expand user queries with a variable number of keywords in two manners. Positive expansion adds terms to a user’s keywords with a Boolean “and,” negative expansion adds terms to the user’s keywords with a Boolean “not.” Each algorithm was examined for three user groups, high, middle, and low achievers, who were classified according to their overall performance. The interactions of users with different levels of expertise with different expansion types or algorithms were evaluated. The genetic algorithm with negative expansion tripled recall and doubled precision for low achievers, but high achievers displayed an opposed trend and seemed to be hindered in this condition. The effect of other conditions was less substantial.

STEP 2: Choose the Dependent Variables • What needs to be measured to know if goal was reached • Examples: • Were enough relevant articles found? Did people like using the system (or did they get frustrated)? • Dependent Variable (DV) • Other terms: outcome or response variable • Metrics chosen by researcher • Goal is to use metrics to compare different treatments, preferably use complementary measures to assess the impact of a treatment • Choose a relevant dependent variable • Keep the stakeholders in mind, what do they care about? • Development phase affects choice • Early on: do usability in formative phases of development • Then: relevance, completeness of results, … • Later: cost savings, risk taking, improved decision making, … • Historically used metrics are good to include • Because they are probably already well understood • Because it will be expected by stakeholders • Good starting point: effectiveness, efficiency and satisfaction (aka: outcome measures, performance measures, satisfaction measures)

STEP 2: Choose the Dependent Variables • Commonly used metrics for outcome measures • Precision, recall, F-measure • Accuracy, true positive, true negative, false positive, false negative, specificity, sensitivity • Counts • Commonly used metrics for performance measures • Time (to completion) • Errors • Usability when measured in an objective manner: counting events, errors or measuring task completion. • interesting measures compare different users on their training or task completion times. • E.g., comparison between novice and expert users. • Satisfaction and Acceptance • Usually measured with one survey. • NOTE: users need to be satisfied in short term before they will accept in the long term
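The outcome measures listed above all derive from the same four counts (true/false positives and negatives). The sketch below shows the standard formulas; the function name and the example counts are invented for illustration and do not come from any study in this tutorial.

```python
def outcome_metrics(tp, fp, tn, fn):
    """Common outcome measures computed from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a category is empty
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # same as sensitivity
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }

# Hypothetical counts from comparing system output against a gold standard
m = outcome_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting several of these together matters because they can disagree: a system can reach high precision by returning very little, which shows up as low recall.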

STEP 2: Find the Dependent Variables EXERCISE • D. Kauchak, G. Leroy and A. Hogue, "Measuring Text Difficulty Using Parse-Tree Frequency", Journal of the Association for Information Science and Technology (JASIST), 68, 9, 2088-2100, 2017. • Abstract - Text simplification often relies on dated, unproven readability formulas. As an alternative and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3rd level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and corpus analysis. For the user study, we selected 20 sentences randomly from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N=6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, perceived as easier and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.

STEP 2: Find the Dependent Variables EXERCISE • G. Leroy and T.C. Rindflesch, "Effects of Information and Machine Learning Algorithms on Word Sense Disambiguation with Small Datasets," International Journal of Medical Informatics, 74, 7-8, 573-585, 2005. • Abstract - Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most refer to characteristics of the ambiguous word and surrounding words and are based on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine learning techniques for small datasets. UMLS semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. A naïve Bayes classifier was trained for 15 words with 100 examples for each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. In a follow-up evaluation, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators.

STEP 2: Find the Dependent Variables EXERCISE • G. Leroy, S. Helmreich, and J. Cowie, "The Influence of Text Characteristics on Perceived and Actual Difficulty of Health Information", International Journal of Medical Informatics, 79 (6), 438-449, 2010. • Abstract - Purpose: Willingness and ability to learn from health information in text are crucial for people to be informed and make better medical decisions. These two user characteristics are influenced by the perceived and actual difficulty of text. Our goal is to find text features that are indicative of perceived and actual difficulty so that barriers to reading can be lowered and understanding of information increased. Methods: We systematically manipulated three text characteristics, – overall sentence structure (active, passive, extraposed-subject, or sentential-subject), noun phrases complexity (simple or complex), and function word density (high or low), – which are more fine-grained metrics to evaluate text than the commonly used readability formulas. We measured perceived difficulty with individual sentences by asking consumers to choose the easiest and most difficult version of a sentence. We measured actual difficulty with entire paragraphs by posing multiple-choice questions to measure understanding and retention of information in easy and difficult versions of the paragraphs. Results: Based on a study with 86 participants, we found that low noun phrase complexity and high function words density lead to sentences being perceived as simpler. In the sentences with passive, sentential-subject, or extraposed-subject sentences, both main and interaction effects were significant (all p < .05). In active sentences, only noun phrase complexity mattered (p < .001). For the same group of participants, simplification of entire paragraphs based on these three linguistic features had only a small effect on understanding (p = .99) and no effect on retention of information.
Conclusions: Using grammatical text features, we could measure and improve the perceived difficulty of text. In contrast to expectations based on readability formulas, these grammatical manipulations had limited effects on actual difficulty and so were insufficient to simplify the text and improve understanding. Future work will include semantic measures and overall text composition and their effects on perceived and actual difficulty.

Confounded Variables • Confounding = when the effects of 2 (or more) variables cannot be separated from each other • A variable, other than the independent variable, may have caused the effect • Reduces the internal validity: unsure whether the independent variable caused the effect • Random assignment to experimental condition is essential in avoiding confounding (but not always sufficient) • Example: An experiment with 2 conditions to test a DSS for managers: old vs. new system • Old system – tested with experienced managers (25+ yrs) • New system – tested with inexperienced managers (1 yr)

Nuisance Variables • Nuisance variables add variation to the study outcome • not due to the independent variables • of no interest to the experimenter. • reduce the chance of detecting the systematic impact of the independent variable • Noise = if the variation is unsystematic • E.g., conduct experiment at different times of day. At some times, the environment was noisy (train passes by, class changes and noise in hallway, people partying) → affects performance of subjects • Bias = if the variation is systematic • E.g., conduct experiment and for each level of IV have different graduate student conduct study. One student is constantly chatting with friends during the experiment → affects performance of subjects (true story)

Basic Statistics Part 3

Basic Designs: testing with people • Example study: app vs consultant for weight loss intervention • Between-subjects designs: • Each ‘subject’ participates in only one experimental condition • Participants assigned to the app or consultant • Within-subjects designs: • Each ‘subject’ participates in all experimental conditions • Participants assigned to app and consultant (make sure to reverse order for half of them)

Basic Designs: testing without people • Example study: Google Translate vs New Program for automated email translation • Between-subjects designs: • Each ‘subject’ participates in only one experimental condition • Each email assigned to Google or New Program • Within-subjects designs: • Each ‘subject’ participates in all experimental conditions • Each email assigned to Google and New Program

Stats for basic designs

Between-subjects design: STATS • When every treatment has a group of different subjects • Statistics for 1 variable with 2 treatments: Independent samples t-test • Statistics for 2+ variables or 3+ treatments: ANOVA
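To make the independent samples t-test concrete, here is a minimal sketch of the t statistic for two separate groups. It uses the classic pooled-variance form, which assumes roughly equal variances in both groups. The weight-loss numbers are invented for the app vs. consultant example. Turning t into a p-value requires the t distribution, which Python's standard library does not provide, so a statistics package or a t table would be needed for that last step.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, df

# Hypothetical weight loss (kg); different participants in each condition
app_group = [2.1, 3.4, 1.8, 2.9, 3.0]
consultant_group = [3.9, 4.2, 3.1, 4.8, 4.0]
t, df = independent_t(app_group, consultant_group)
print(f"t({df}) = {t:.2f}")
```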

Within-subjects design: STATS • When subjects participate in multiple treatments • Statistics for 1 variable with 2 treatments: Paired-samples t-test • Statistics for 2+ variables or 3+ treatments: Repeated-Measures ANOVA
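For a within-subjects design the paired-samples t-test works on the difference scores of each subject rather than on two separate groups. The sketch below computes only the t statistic and degrees of freedom. The translation quality scores are invented for the email translation example, where each email is processed by both systems.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired-samples t statistic; position i in both lists is the same subject."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired data need the same number of scores per condition")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical quality ratings (1-10) for the same six emails under both systems
google_scores = [6, 7, 5, 8, 6, 7]
new_program_scores = [7, 9, 5, 9, 8, 8]
t_paired, df_paired = paired_t(google_scores, new_program_scores)
print(f"t({df_paired}) = {t_paired:.2f}")
```

Because each subject serves as its own control, differences between subjects drop out of the comparison, which is why the paired test is often more sensitive than the independent samples version for the same number of observations.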

Summary • T-test • Comparison between 2 treatments • Useful for both between- and within-subjects designs: • Independent samples vs. Paired samples • Bonferroni adjustment needed when many tests are conducted • ANOVA • Comparison between 3 or more treatments • Useful for both between- and within-subjects designs: • ANOVA vs. repeated measures ANOVA • Omnibus test: • Main and Interaction effects • Tests if there is any significant difference between treatments (main effect) • Post-hoc analysis needed to pinpoint which pairs of treatments are different • A note about Multivariate ANOVA (MANOVA) • Multiple dependent variables may tempt researchers to use MANOVA • MANOVA has underlying factor analysis • MANOVA is appropriate when there are many dependent variables for which a simpler structure (fewer factors) is of interest • Example: • 35 measures of intelligence • MANOVA can indicate when there are a few factors that contribute to results, e.g., 7 measures may load on ‘verbal component’, 8 measures may load on ‘abstract component’, …
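Two of the ideas in this summary can be sketched briefly: the omnibus F statistic of a one-way between-subjects ANOVA, and the Bonferroni adjustment applied to follow-up comparisons. All data and p-values below are invented for illustration. As with the t-test, converting F to a p-value needs the F distribution from a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Omnibus F statistic for a one-way between-subjects ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

def bonferroni(p_values, alpha=0.05):
    """Return (p, adjusted p, significant?) for each test after Bonferroni adjustment."""
    m = len(p_values)
    return [(p, min(1.0, p * m), p <= alpha / m) for p in p_values]

# Hypothetical task times (minutes) under three interfaces
groups = [[12, 14, 11, 13], [10, 9, 11, 10], [8, 7, 9, 8]]
f, df1, df2 = one_way_anova_f(groups)
print(f"F({df1}, {df2}) = {f:.2f}")

# Hypothetical p-values from three pairwise post-hoc comparisons
for p, p_adj, significant in bonferroni([0.004, 0.020, 0.030]):
    print(f"p = {p:.3f}, adjusted p = {p_adj:.3f}, significant: {significant}")
```

With three tests the per-test threshold drops from .05 to about .0167, so a result with p = .020 that looks significant on its own no longer counts after adjustment.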

Exercise Part 4

Choose a topic – design a study • You have developed an app to help people diet using principles from psychology (e.g., the new ‘noom’) • You have developed an improved dashboard to track tasks/completion/personnel in a business • You have developed text mining algorithms that can predict outbreaks of asthma from Twitter • You have developed automated translation algorithms to translate legal text into layperson text

Decide • What is ‘new’ • How can you show that it ‘works’ • What can influence that decision? • Independent Variables? (what do you manipulate) • Dependent Variables? (what are the important measures) • Other?

Additional information

errors to avoid

Errors to avoid • Trade-offs! • Know your target population and sample • Randomization is crucial in avoiding bias • Of subjects assigned to conditions • Facilitators assigned to conditions • Order of conditions • Of output for evaluation • …

Avoid Subject-related Bias • “when subjects act in a certain way because they are participating in a study” • Subject related bias, examples: • Good subject effect: different behaviors because observed (want to look good) • Volunteer effect (selection bias): volunteers of experiment are found to have different traits. E.g., may be healthier in medical studies, may need money from participation (drug interactions possible in clinical studies) • How to avoid • Location: clinic? Lab? Hospital? At work? Boss’s office? Next to train station? → try to avoid influences • Explain importance of being honest • Limit interaction with facilitator. Be careful of using only computer-based instructions • Provide anonymity or confidentiality • Single-blind studies = subject does not know experimental condition • Make user task realistic

Avoid Experimenter-related Bias • “Effects that are the result of experimenter/facilitator behaviors” • Experimenter related bias, examples: • Experimenter effects are related to experimenter behaviors. Most famous example: Clever Hans (a horse) • How to avoid • Behaviors • professional and courteous interaction • standardized (practiced) instructions • Working with multiple evaluators/facilitators (but not one per condition!) • Reduce interaction with study subjects • Double-blind study designs = both subject and facilitator are unaware of experimental condition • Easier in IS than expected • E.g., when comparing different algorithms

Avoid Design-related Bias • “Several biases introduced by use of a particular experimental design” • Within-subjects design: subjects participate in multiple conditions • Minimize order-effects • Cross-over design: balance order of conditions for subjects, e.g., A-B for half of subjects and B-A for other half • When many conditions, randomize order of conditions per subject • Leave enough time between conditions • Between-subjects design: subjects participate in only one condition • Minimize effects of time/day • Randomize assignment to conditions • Avoid assignment of one facilitator per condition
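The two ways of handling order effects described above can be sketched as follows: a cross-over design that gives half of the subjects A-B and the other half B-A, and a fully randomized order per subject for designs with many conditions. Subject IDs, condition labels, and seeds are assumptions made for the example.

```python
import random

def crossover_orders(subjects, conditions=("A", "B"), seed=None):
    """Two-condition cross-over: a random half gets A-B, the other half B-A."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)          # randomize who ends up in which half
    half = len(shuffled) // 2
    first, second = conditions
    orders = {}
    for i, subject in enumerate(shuffled):
        orders[subject] = (first, second) if i < half else (second, first)
    return orders

def random_orders(subjects, conditions, seed=None):
    """Many conditions: draw an independent random order for every subject."""
    rng = random.Random(seed)
    orders = {}
    for subject in subjects:
        order = list(conditions)
        rng.shuffle(order)
        orders[subject] = tuple(order)
    return orders

# Hypothetical subject IDs
subjects = [f"S{i}" for i in range(1, 11)]
crossover = crossover_orders(subjects, seed=7)
randomized = random_orders(subjects, ["A", "B", "C"], seed=7)
for subject in subjects:
    print(subject, crossover[subject], randomized[subject])
```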
