Assessing Students’ Performance Longitudinally: Item Difficulty Parameter vs. Skill Learning Tracking


Feng, M., & Heffernan, N. T. (2007). Assessing Students’ Performance Longitudinally: Item Difficulty Parameter vs. Skill Learning Tracking. Paper presented at the 2007 Annual Meeting of the National Council on Measurement in Education (NCME’2007), Chicago.


Most large standardized tests analyzed with Item Response Theory, such as the math subtest of the Graduate Record Examination (GRE), are “unidimensional”: they are analyzed as if all the questions tap a single underlying knowledge component (i.e., skill). However, cognitive scientists such as Anderson & Lebiere (1998) believe that students learn individual skills, and might learn one skill but not another. One reason psychometricians analyze large-scale tests in a unidimensional manner is that students’ performance on different skills is usually highly correlated, even when there is no necessary prerequisite relationship between the skills. Another reason is that students usually do only a small number of items in a given setting (e.g., 39 items for the 8th grade math Massachusetts Comprehensive Assessment System (MCAS) test). We are investigating whether we can better predict scores on a large-scale test (the MCAS) by modeling individual skills, using skill models of different grain sizes, than by using item difficulty parameters induced from traditional Item Response Theory models, on which computer adaptive testing relies. We consider two skill models: one, which we call the “WPI-5”, has 5 skills, and the other, our most fine-grained model, which we call the “WPI-78”, has 78 skills. In both cases, a skill model is a matrix that relates questions to the skills needed to solve them. The measure of model performance is the accuracy of the predicted MCAS test score based on the assessed skills of the students.
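To make the idea of a skill model as a question-by-skill matrix concrete, here is a minimal sketch in Python. The skill labels, item IDs, and matrix entries are hypothetical placeholders, not taken from the WPI-5 or WPI-78, and the per-skill fraction correct is only a simple stand-in for skill assessment, not the tracking method used in the paper.

```python
# Hypothetical skill labels and item IDs, for illustration only.
SKILLS = ["skill_A", "skill_B", "skill_C"]
ITEMS = ["item_1", "item_2", "item_3", "item_4"]

# Q_MATRIX[i][k] == 1 means item i is tagged as requiring skill k.
Q_MATRIX = [
    [1, 0, 0],  # item_1 needs skill_A
    [1, 1, 0],  # item_2 needs skill_A and skill_B
    [0, 0, 1],  # item_3 needs skill_C
    [0, 1, 1],  # item_4 needs skill_B and skill_C
]


def skills_for_item(item_index):
    """Return the labels of the skills tagged on one item."""
    return [s for s, flag in zip(SKILLS, Q_MATRIX[item_index]) if flag]


def per_skill_accuracy(responses):
    """Fraction correct per skill for one student.

    responses: list of 0/1 values, one per item in ITEMS.
    A skill with no tagged items maps to None.
    """
    result = {}
    for k, skill in enumerate(SKILLS):
        tagged = [responses[i] for i in range(len(ITEMS)) if Q_MATRIX[i][k]]
        result[skill] = sum(tagged) / len(tagged) if tagged else None
    return result
```

A coarser model such as one with 5 skills would have fewer columns and more items per column; a finer one such as one with 78 skills would have many columns, each covering only a few items.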

Given that the WPI-78 is composed of 78 skills, one might worry that we are overfitting our data by fitting a model with so many free parameters. However, we did not evaluate the effectiveness of the skill models on the same online ASSISTment data from which the models were constructed. Instead, we used entirely different data (from the external, paper-and-pencil state test) as the testing set. Hence, we argue that overfitting is not a problem in our approach.
