
Assessing Students’ Performance Longitudinally: Item Difficulty Parameter vs. Skill Learning Tracking
Mingyu Feng, Worcester Polytechnic Institute; Neil T. Heffernan, Worcester Polytechnic Institute


Presentation Transcript


  1. Assessing Students’ Performance Longitudinally: Item Difficulty Parameter vs. Skill Learning Tracking. Mingyu Feng, Worcester Polytechnic Institute; Neil T. Heffernan, Worcester Polytechnic Institute

  2. The “ASSISTment” System: a web-based tutoring system that assists students in learning mathematics and gives teachers an assessment of their students’ progress.

  3. An ASSISTment (Geometry) • We break multi-step problems into “scaffolding questions” • “Hint messages”: given on demand, they suggest what step to do next • “Buggy messages”: context-sensitive feedback on specific wrong answers • (Feng, Heffernan & Koedinger, 2006a) • Skills: the state reports to teachers on 5 areas; we seek to report on more, finer grain-sized skills (Demo/movie). [Screenshot labels: the original question, tagged with a. Congruence, b. Perimeter, c. Equation-Solving; the 1st scaffolding question (Congruence); the 2nd scaffolding question (Perimeter); a buggy message; a hint message.] A data-structure sketch of such an item follows this slide.
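To make the item structure concrete, here is a minimal sketch of how a multi-step ASSISTment item could be represented. This is an illustration only: the class and field names are assumptions, not the system’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only: names are assumptions, not ASSISTment's schema.

@dataclass
class ScaffoldingQuestion:
    prompt: str
    answer: str
    skill: str                        # e.g. "Congruence" or "Perimeter"
    hints: List[str] = field(default_factory=list)                # shown one at a time, on demand
    buggy_messages: Dict[str, str] = field(default_factory=dict)  # wrong answer -> feedback

@dataclass
class AssistmentItem:
    original_question: str
    answer: str
    skills: List[str]                 # skill tags used for teacher reports
    scaffolds: List[ScaffoldingQuestion]

def respond(item: AssistmentItem, scaffold_idx: int, student_answer: str) -> str:
    """Return context-sensitive feedback for one scaffolding question."""
    sq = item.scaffolds[scaffold_idx]
    if student_answer == sq.answer:
        return "Correct!"
    # A recognized wrong answer triggers its buggy message; otherwise a generic retry.
    return sq.buggy_messages.get(student_answer, "Try again, or ask for a hint.")
```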

  4. The ASSISTment Project: What Level of Tutor Interaction is Best? By Leena Razzaq, Neil Heffernan & Robert Lindeman

Goal: to determine the best level of tutor interaction to help students learn the mathematics required for a state exam, based on their math proficiency.

Experiment Design. 3 levels of interaction: • Scaffolding + hints represents the most interactive experience: students must answer scaffolding questions, i.e., learning by doing. • Hints on demand are less interactive because students do not have to respond to hints, but they can get the same information as in the scaffolding questions by requesting hints. • Delayed feedback is the least interactive condition because students must wait until the end of the assignment to get any feedback; they then see the answers and solutions. 2 levels of math proficiency: • Students in Honors math classes. • Students in Regular math classes.

Background on ASSISTments: • The Assistment System is a web-based assessment system that tutors students on math problems. The system is freely available at www.assistment.org. • As of March 2007, thousands of Worcester middle school students use ASSISTments every two weeks as part of their math class. • Teachers use the fine-grained reporting that the system provides to inform their instruction.

The Interaction Hypothesis: when one-on-one tutoring, either by a human tutor or a computer tutor, is compared to a less interactive control condition that covers the same content, students will learn more in the interactive condition than in the control condition. • Is this hypothesis true? We found evidence to support it in some cases, but not in others. • Based on the results of Razzaq & Heffernan (2006), we believe the difficulty of the material influences how effective interactive tutoring will be.

Our Hypothesis: • More interactive intelligent tutoring will lead to more learning (based on post-test gains) than less interactive tutoring. • Differences in learning will be more significant for less-proficient students than for more-proficient students.

Analysis and Conclusions: • 566 8th grade students participated. • Results showed a significant interaction between condition and math proficiency (p < 0.05), a good case for tailoring tutor interaction to types of students (a sketch of such an interaction test follows this slide). • Regular students learned more with scaffolding + hints (p < 0.05): less-proficient students benefit from more interaction and from being coached through each step of a problem. • Honors students learned more with delayed feedback (p = 0.075): more-proficient students benefit from seeing problems worked out and getting the big picture. • Delayed feedback performed better than hints on demand (p = 0.048) for both more- and less-proficient students: students don’t do as well when we depend on student initiative.

[Screenshot labels: scaffolding questions #1-#4 and hints #1-#7, including hints on scaffolding questions. Students in the scaffolding condition interact with the tutor by answering scaffolding questions; students in the hints condition can get hints when they ask for them by pressing the hint button; students in the delayed-feedback condition get no feedback until the end of the assignment, when they see the solutions.]

This work has been accepted for publication at the 2007 Artificial Intelligence in Education Conference in Los Angeles.
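The reported interaction between condition and proficiency is the kind of result a two-way factorial analysis yields. Below is a hedged sketch of such a test; the data, column names, and effect sizes are invented stand-ins, since the poster does not publish its analysis script:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Illustrative stand-in data: per-student post-test gains by condition and
# proficiency. Values are random placeholders, not the study's data.
rng = np.random.default_rng(0)
conditions = ["scaffolding", "hints", "delayed"]
df = pd.DataFrame({
    "condition": np.repeat(conditions, 40).tolist() * 2,
    "proficiency": ["regular"] * 120 + ["honors"] * 120,
    "gain": rng.normal(0.5, 0.2, size=240),
})

# Two-way factorial model: does the effect of tutoring condition depend on
# math proficiency? The C(condition):C(proficiency) row tests the interaction.
model = smf.ols("gain ~ C(condition) * C(proficiency)", data=df).fit()
print(anova_lm(model, typ=2))
```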

  5. CAREER: Learning about Learning: Using Intelligent Tutoring Systems as a Research Platform to Investigate Human Learning. Free researcher, teacher and student accounts for 7th-10th grade math preparation at www.assistment.org

Goal: 1) To help researchers learn about student learning. 2) To help students learn math and to report valuable information to teachers about their students’ knowledge.

Summary: • The Assistment System is a web-based assessment system that tutors students on items they get wrong. • The system is freely available at www.assistment.org. • Thousands of students in Worcester and surrounding towns use it every two weeks as part of their math class or for homework. • The system tracks 98 skills for 8th grade math and reports on those skills to teachers. • Teachers, schools, and researchers can use our web-based tools to create their own content quickly.

Do students learn from Assistments? • Yes! We compared 19 pairs of items that address the same concept with 681 students and got significant results (p < .05). See Razzaq et al. (2005) and Razzaq & Heffernan (2006).

Do Assistments assess accurately? • Yes, the Assistment System can predict a student’s MCAS score quite reliably and can track different rates of learning for different skills. See Feng, Heffernan & Koedinger (2006).

Funding/People: • PI Neil Heffernan at WPI, with collaborator Kenneth Koedinger at Carnegie Mellon. • Over 50 people have contributed. • Thanks for $3 million in funding from the National Science Foundation (NSF) CAREER program, US Department of Education, Office of Naval Research, Spencer Foundation, and US Army. • Contact: Professor Neil T. Heffernan, (508) 831-5569, nth@wpi.edu

References: * Feng, M., Heffernan, N.T., Koedinger, K.R. (2006). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. The 8th International Conference on Intelligent Tutoring Systems, 2006, Taiwan. * Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K.R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. The 12th Annual Conference on Artificial Intelligence in Education, 2005, Amsterdam. * Razzaq, L., Heffernan, N.T. (2006). Scaffolding vs. Hint in the Assistment System. The 8th International Conference on Intelligent Tutoring Systems, 2006, Taiwan.

[Screenshot captions: what the state MCAS test provides; what the teacher who builds the tutoring sees; what a student sees. The example shows a student who first guessed 16 (the real answer is 24), then got the first scaffolding question correct with “AC”. The student then clicked on “½*8x” and the system gave the “bug” message in red. The student then asked twice in a row for a hint, shown in the green box; the author wrote that hint message by typing it in here. Teacher reports are shown per student, per skill, and per item. A dialog shows the author has tagged the third scaffold in three different grain-sized models; by tagging items with skills, teachers can 1) get reports on which skills students are doing poorly on, and 2) track them over time.]

Recent Results - 2006. This project has 5 main research thrusts: 1) For the designing cognitive models thrust, we report that we can do a better job of modeling students by using finer-grained models (i.e., models that track more knowledge components) than coarser-grained models (Pardos et al., 2006; Feng et al., 2006). 2) For the research thrust of inferring what students know and are learning, we can report two new results. First, we can do a better job of assessing students (as measured by predicting state test scores) by seeing how much tutoring they need to solve a question (Feng et al., 2006a). Second, we have shown that we can do a better job of modeling students’ learning over time by building models that allow us to model different rates of learning for different skills (Feng et al., 2006a). 3) For the optimizing learning thrust, we have new empirical results showing that students learn more with the type of tutoring we provide than with a traditional Computer-Aided Instruction (CAI) control (Razzaq & Heffernan, 2006). 4) For the thrust of informing educators, we have recent publications on the types of feedback we give educators (Feng & Heffernan, 2005 & 2006). Additionally, we have work showing that we can track student motivation and then inform educators in novel ways that increase student motivation (Walonoski & Heffernan, 2006a & 2006b). 5) Finally, for the thrust of allowing user adaptation, we have shown that teachers can use the authoring tools we have built to quickly create content for their classes (Heffernan, Turner et al., 2006). References are at www.assistment.org.

  6. Scaling up a Server-Based Web Tutor. Jozsef Patvarczki & Neil Heffernan

Introduction: Our research team has built a web-based tutor, located at www.ASSISTment.org [1], that is used by hundreds of students a day in Worcester and surrounding towns. The system’s focus is to teach 8th and 10th grade mathematics and MCAS preparation. Because it is easily accessible, it helps lower the entry barrier for teachers and enables both teachers and researchers to collect data and generate reports. Scaling up a server-based intelligent tutoring system requires developers to care about speed and reliability. We present how the Assistment system can improve performance and reliability with a fault-tolerant, scalable architecture.

Results: • Since each public school class has about 20 students, we noticed clusters (shown in ovals in the bottom left of the poster) of intervals where a single class was logged on. • The log-on procedure is the most expensive step in the process, and this data shows that it might be a good place for us to improve. • We noticed a second cluster of around 40 users, which most likely represents instances where two classes of students were using the system simultaneously. • There was no appreciable trend toward slower page creation times with more users. • Three simulated scenarios with a 10s random delay between student actions: in the first scenario we used 50 threads simulating 50 students working without the load balancer, with one application server and one database; the second scenario added the load balancer and two application servers; the third scenario added a web-cache technique to the load balancer. • We seem to be able to get a linear speed-up with the help of the load balancer and an additional application server. • We may be able to reduce the execution time of computation-intensive applications with the help of GRID computing.

Assistment Features: Users begin interacting with our system through the “Portal” that manages all activities. [Screenshot: an example of a state-based pseudo-tutor. This problem uses a pseudo-tutor (a state-based implementation) with pre-made scaffolding and hint questions selected based upon student input. Incorrect responses are in red, and hints are in green.]

Architecture: • Horizontally scaled configuration • Scalable • Fault-tolerant • Dynamically configurable. Clients’ actions represent the system’s load; an HTTP server acts as the load balancer.

System Scalability and Reliability: • Two concerns when running the intelligent tutor on a central server are: 1) building a scalable server architecture; 2) providing reliable service to researchers, teachers, and students. • We will answer several research questions: 1) can we reduce the cost of authoring an ITS; 2) how can we improve performance and reliability with a better server architecture? • In order to serve thousands of users, we must achieve high reliability and scalability at different levels. • Scalability at our first entry point comes through the use of a virtual IP for www.assistment.org, provided by the CARP protocol. • Random and round-robin redirection algorithms can provide very effective load-sharing, and the load balancer distributes load over multiple application servers (a minimal round-robin sketch follows this slide). • This allows us to redirect incoming web requests and build a web portal application in a multiple-server environment. • The monitoring system, which uses Selenium, allows us to send text messages to our administrators when the system goes down. • Multiple database servers provide automatic synchronization, pooling, and fail-over detection.

Additional application servers for load balancing. GRID computing: Bayesian Network Application, Workflow Editor and Manager, Visualization and Resource Information System. WPI P-GRADE GRID Portal: http://pgrade.wpi.edu

Reference: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T. et al. (2005). The Assistment Project: Blending Assessment and Assisting. The 12th Annual Conference on Artificial Intelligence in Education, 2005, Amsterdam. Contact: Neil Heffernan, nth@wpi.edu
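As a concrete illustration of the round-robin redirection described above, here is a minimal sketch; the host names are hypothetical and this is not the Assistment system’s actual load-balancer code:

```python
import itertools

# Minimal round-robin sketch: one HTTP front end distributing requests over
# multiple application servers. Host names are hypothetical.
APP_SERVERS = ["app1.assistment.internal", "app2.assistment.internal"]
_rotation = itertools.cycle(APP_SERVERS)

def pick_server() -> str:
    """Return the next application server in round-robin order."""
    return next(_rotation)

# Each incoming web request is redirected to the server chosen here;
# horizontal scaling is just adding another entry to APP_SERVERS.
for request_id in range(5):
    print(request_id, "->", pick_server())
```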

  7. How Were the Skill Models Created?

  8. Fine-grained skill models in reporting • Teachers get reports that they find credible and useful. (Feng & Heffernan, 2005, 2006, 2007)

  9. Research Question • In the ASSISTment project, which approach works better for assessing students’ performance longitudinally? • Skill learning tracking? • Or using an item difficulty parameter (unidimensional)?

  10. Data Source • 497 students from two middle schools • Students used the ASSISTment system every other week from Sep. 2004 to May 2005 • Real state test scores from May 2005 • Item-level online data: students’ binary responses (1/0) to items that are tagged in different skill models • Some statistics: • Average usage: 7.3 days • Average questions answered: 250 • 138,000 data points. (An illustrative sketch of the item-level data layout follows this slide.)
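For concreteness, the item-level online data can be pictured as long-format rows, one per student-item encounter. The rows below are invented for illustration; the column names and skill tags are assumptions, not the project’s actual schema:

```python
import pandas as pd

# Hypothetical long-format rows: one binary response per student-item
# encounter, with the item's skill tag under a given skill model.
data = pd.DataFrame({
    "student_id": [101, 101, 102],
    "item_id":    [17, 42, 17],
    "month":      [0, 3, 0],        # elapsed time since Sep. 2004, the longitudinal axis
    "correct":    [1, 0, 0],        # binary response (1/0)
    "skill_wpi5": ["Geometry", "Patterns", "Geometry"],  # tag in the 5-skill model
})
print(data)
```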

  11. Data Source

  12. Item Difficulty Parameter • Fit a one-parameter logistic (1PL) IRT model (the Rasch model) to our online data • The dependent variable: the probability of a correct response by student i to item n • The independent variables: the person’s trait score and the item’s difficulty level
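In standard notation, matching the slide’s indices (student i, item n), the Rasch model is:

```latex
% Rasch (1PL) model: the probability that student i answers item n
% correctly depends only on the trait score \theta_i and difficulty b_n.
P(y_{in} = 1 \mid \theta_i, b_n) = \frac{e^{\theta_i - b_n}}{1 + e^{\theta_i - b_n}}
```

One practical way to fit this is as a logistic regression with one dummy variable per student and per item, where the item coefficients recover the difficulties b_n.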

  13. Longitudinal Modeling • Mixed-effects logistic regression models (a sketch follows this slide) • Models we fitted: • Model-beta: time + beta -> item response • Model-WPI5: time + skills in WPI5 -> item response • Model-WPI78: time + skills in WPI78 -> item response • Evaluation: the accuracy of the predicted MCAS test score was used to compare the approaches. Singer & Willett (2003). Applied Longitudinal Data Analysis. Oxford University Press: New York. Hedeker & Gibbons (in preparation). Longitudinal Data Analysis.
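A minimal sketch of the longitudinal model’s shape, in the spirit of Singer & Willett’s notation; the exact specification used in the paper may differ. For Model-beta, each student gets a random intercept and time slope, and the item’s difficulty enters as a fixed effect:

```latex
% Model-beta: student i's response to item j at time t_{ij}.
% (u_{0i}, u_{1i}) are the student-level random intercept and slope;
% \beta_j is item j's Rasch difficulty from the previous step.
\operatorname{logit} P(y_{ij} = 1)
  = (\gamma_{00} + u_{0i}) + (\gamma_{10} + u_{1i})\,t_{ij} + \gamma_{20}\,\beta_j
```

Model-WPI5 and Model-WPI78 replace the single difficulty term with per-skill intercepts and time slopes, which is what lets them track a different learning rate for each skill.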

  14. Results • Prediction accuracy: Model-WPI78 > Model-WPI5 > Model-beta • P-values of both paired t-tests are below 0.05
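A hedged sketch of the paired comparison: since all models predict scores for the same 497 students, the per-student prediction errors are paired and a paired t-test applies. The error values below are invented placeholders, not the study’s data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative stand-in per-student absolute prediction errors for two models
# on the same 497 students (the real errors come from the fitted models).
err_model_beta = rng.normal(11.0, 3.0, size=497)   # coarser model, larger error
err_model_wpi5 = rng.normal(10.0, 3.0, size=497)   # finer model, smaller error

# Paired t-test: same students under both models, so pair the observations.
t_stat, p_value = stats.ttest_rel(err_model_beta, err_model_wpi5)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```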

  15. Conclusion • We have found evidence that skill learning tracking predicts MCAS scores better than simply using an item difficulty parameter, and that fine-grained models did even better than the coarse-grained model • Our skill mapping is good (though maybe not optimal) • We are considering using these skill models to select the next best problem to present to a student • Although we used the Rasch model to train the item difficulty parameter, we were not modeling students’ responses with IRT. Interesting future work would be comparing our results to predictions made with an item response modeling approach.

  16. Modeling Student Knowledge: Using Bayesian Networks to Predict Student Performance. By Zach Pardos; Neil Heffernan, Advisor; Computer Science. Joint work with Brigham Anderson and Cristina Heffernan.

Goal: to evaluate the predictive performance of various fine-grained student skill models in the ASSISTment tutoring system using Bayesian networks.

Bayesian Networks: • A Bayesian network is a probabilistic machine learning method. It is well suited to making predictions about unobserved variables by incorporating prior probabilities with new evidence. • In the Bayesian belief network, arrows represent associations of skills with question items; they also represent conditional dependence.

The Skill Models: • The skill models were created for use in the online tutoring system called ASSISTment, founded at WPI. They consist of skill names and associations (or taggings) of those skill names with math questions in the system. Models with 1, 5, 39 and 106 skills were evaluated to represent varying degrees of concept generality. The skill models’ ability to predict students’ performance on the system as well as on a standardized state test was evaluated. • The skill models used: WPI-106: 106 skill names drafted and tagged to items in the tutoring system and to the questions on the state test by our subject matter expert, Cristina. WPI-5 and WPI-39: 5 and 39 skill names drafted by the Massachusetts Department of Education. WPI-1: represents unidimensional assessment.

Predicting student responses within the ASSISTment tutoring system: • Skill probabilities are inferred from a student’s responses to questions on the system. • The probability of a guess is set to 10% (tutor questions are fill-in-the-blank). • The probability of getting an item wrong even if the student knows it (a slip) is set to 5%. • Result: the finer-grained the model, the better the prediction accuracy. The finest-grained model, WPI-106, performed best, with an average of only 5.5% error in predicting student answers within the system.

Predicting student state test scores: • The inferred skill probabilities from above are used to predict the probability that the student will answer each test question correctly. • Probabilities are summed to generate the total test score. • The probability of a guess is set to 25% (MCAS questions are multiple choice); the probability of a slip remains 5%. • Result: the finest-grained model, the WPI-106, came in 2nd to the WPI-39, which may have performed better than the 106 because 50% of its skills are sampled on the MCAS test vs. only 25% of the WPI-106’s. (A sketch of the guess/slip scoring step follows this slide.)

Background on ASSISTment: • ASSISTment is a web-based assessment system for 8th-10th grade math that tutors students on items they get wrong. There are 1,443 items in the system. • The system is freely available at www.assistment.org. • Question responses from 600 students using the system during the 2004-2005 school year were used; each student completed around 260 items.

Conclusions: • The ASSISTment fine-grained skill models excel at assessment of student skills (see Mingyu Feng’s poster for a mixed-effects approach comparison). • Accurate prediction means teachers can know when their students have attained certain competencies.

This work has been accepted for publication at the 2007 User Modeling Conference in Corfu, Greece.
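The guess/slip step the poster describes reduces to a one-line formula. A minimal sketch with the poster’s MCAS parameters (guess 25%, slip 5%); the per-question mastery values are invented placeholders:

```python
# Sketch of the poster's guess/slip prediction step, assuming per-skill
# mastery probabilities have already been inferred from the Bayesian network.

GUESS = 0.25  # multiple-choice MCAS items
SLIP = 0.05   # wrong answer despite knowing the skill

def p_correct(p_know: float) -> float:
    """P(correct) = P(know)*(1 - slip) + (1 - P(know))*guess."""
    return p_know * (1 - SLIP) + (1 - p_know) * GUESS

# Inferred mastery probability for the skill tagged to each test question
# (illustrative values only).
mastery_per_question = [0.9, 0.4, 0.75, 0.6]

# The expected total test score is the sum of per-question correctness probabilities.
expected_score = sum(p_correct(p) for p in mastery_per_question)
print(f"expected raw score: {expected_score:.2f} out of {len(mastery_per_question)}")
```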

  17. Tracking skill learning longitudinally
