Mixing up Math: Item Response Theory (part 3)
In the previous post, we investigated the claim by Doug Rohrer and colleagues that mixing up math questions has an effect size of about 0.8 on a math test. In that analysis, we assumed that the points scored on the test could simply be added up. We already questioned this assumption when we explored the data visually and found that for one topic there seemed to be a jump in understanding: students either answered all four items correctly or none at all.
Now we no longer assume that students' understanding of the math can be mapped directly onto the number of points scored. Instead, we model this understanding, or latent ability, with Item Response Theory (IRT). We first explain the basics of this theory and then use it to evaluate the test that was used in the Rohrer study. How to actually use the latent scores is a topic for a later post.
Item Response Theory
The assumption of IRT is that when we administer a test, for example, we are not just measuring the points scored, but really care about an underlying latent variable that causes points to be scored. This variable is the true ability θ, which we want to estimate.
Before we do so, we have to discuss two assumptions. First, we assume that a student's responses to the test items are independent of each other (given their ability). Second, we assume (for now) that the test measures understanding on a single dimension. The Rohrer math test was made up of four different topics, so in that sense this assumption is certainly wrong. On the other hand, we are already restricting ourselves to math questions only, and even within a domain of math such as geometry we can always find subtopics, so by this logic we could never make the assumption at all. Let's go with unidimensionality for now, then, and scrutinize it later.
With respect to the data, we have {0,1} observations (a question is answered correctly or not), so we want to model the probability that a student answers a question correctly (i.e. scores a 1) given their latent ability θ. A simple choice is the logistic function P(X = 1 | θ) = exp(aθ) / (1 + exp(aθ)), where the parameter a determines how steep the curve is.
Let's ignore estimation problems for now and assume that we have somehow conjured up some values for the parameter a. We can then plot the resulting curves:
library(tidyverse)

## create a domain for the latent ability theta
x <- -4:4
df <- data.frame(x)

# a helper that draws a logistic curve with parameter a
map_logit <- function(a, color) {
  stat_function(fun = function(x) exp(a * x) / (1 + exp(a * x)),
                color = color,
                size = 1.05)
}

# plot the curves for three values of a
ggplot(df, aes(x)) +
  map_logit(1, "black") +
  map_logit(0.5, "blue") +
  map_logit(3, "red") +
  labs(
    x = quote(theta),
    y = "P(X=1)"
  )
The practical interpretation that we can give is that a bigger a makes the curve steeper: the item discriminates more sharply between students with abilities just below and just above zero. For this reason a is called the discrimination parameter.
We could also add a parameter b that shifts the curve along the θ-axis. It is interpreted as the difficulty of the item: the value of θ at which the probability of a correct answer is 0.5. With both parameters, the probability of a correct answer becomes P(X = 1 | θ) = exp(a(θ − b)) / (1 + exp(a(θ − b))), the so-called two-parameter logistic (2PL) model.
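A minimal extension of the plotting code above illustrates the effect of b (the helper map_2pl is ours, analogous to map_logit):

# logistic curve with discrimination a and difficulty b:
# P(X = 1 | theta) = exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
map_2pl <- function(a, b, color) {
  stat_function(fun = function(x) exp(a * (x - b)) / (1 + exp(a * (x - b))),
                color = color,
                size = 1.05)
}

# same discrimination, three different difficulties
ggplot(df, aes(x)) +
  map_2pl(1, 0, "black") +   # b = 0: P = 0.5 at theta = 0
  map_2pl(1, -1, "blue") +   # easier item: curve shifted left
  map_2pl(1, 1, "red") +     # harder item: curve shifted right
  labs(
    x = quote(theta),
    y = "P(X=1)"
  )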
Technical note
Estimating latent ability scores is a challenge. Let's have a look at the 2PL model: a logistic model in which we want to estimate the probability of a correct answer given the latent ability θ and the two item parameters a and b.
That is a lot of parameters to estimate: an a and a b for every item, plus a θ for every student! So how to go about this? There are several methods.
One is so-called joint maximum likelihood estimation. In that case we search for the combination of all item parameters and all person abilities θ that jointly maximizes the likelihood of the observed responses. A well-known drawback is that the number of parameters grows with the number of students, which makes the estimates inconsistent.
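To make this concrete, here is a minimal sketch of joint maximum likelihood for the 2PL model on simulated data (the simulation setup and all names are illustrative, not part of the Rohrer analysis):

# joint maximum likelihood for the 2PL model, on simulated data
set.seed(1)
n_persons <- 50
n_items <- 5
theta_true <- rnorm(n_persons)          # person abilities
a_true <- runif(n_items, 0.5, 2)        # discriminations
b_true <- rnorm(n_items)                # difficulties
eta <- sweep(outer(theta_true, b_true, "-"), 2, a_true, "*")
resp <- matrix(rbinom(n_persons * n_items, 1, plogis(eta)), n_persons)

# negative joint log-likelihood: one (a, b) pair per item
# AND one theta per person all enter as free parameters
jml_nll <- function(par) {
  a <- par[1:n_items]
  b <- par[(n_items + 1):(2 * n_items)]
  theta <- par[-(1:(2 * n_items))]
  p <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
  p <- pmin(pmax(p, 1e-9), 1 - 1e-9)   # guard against log(0)
  -sum(resp * log(p) + (1 - resp) * log(1 - p))
}

# the parameter vector grows with the number of persons --
# this is the reason JML estimates are not consistent
fit_jml <- optim(c(rep(1, n_items), rep(0, n_items + n_persons)),
                 jml_nll, method = "BFGS")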
Another solution to estimating parameters of a multivariate distribution is marginal maximum likelihood estimation. Taking the marginal of a multivariate distribution means that you collapse one axis onto the others, so that the probability mass is spread out over a lower-dimensional shape. If we have a density surface over a two-dimensional support, for example, then taking the marginal turns the 3D object with volume 1 (the total probability) into a curve with area 1 underneath it.
This is what we do here with respect to θ: we assume a distribution for the abilities, integrate θ out of the likelihood, and estimate the item parameters from the resulting marginal likelihood. This is the approach taken by mirt, a package dedicated to IRT analyses that is used below.
At this point we will not discuss another alternative: Bayesian estimation techniques. See here for a guide.
Once we have estimated the item parameters, we find for each person the θ that best accounts for their response pattern; by default, mirt's fscores() returns the expected a posteriori (EAP) estimate.
We can also estimate the standard error of such an estimate with the so-called information function. This is the familiar Fisher information adapted to the IRT setting: the test information I(θ) expresses how precisely the test measures at each ability level, and the standard error of the ability estimate is SE(θ) = 1/√I(θ).
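As a small illustration (the function item_info and the parameter values below are ours, purely for demonstration): for a 2PL item the information is I_i(θ) = a_i² P_i(θ)(1 − P_i(θ)), and summing over the items gives the test information.

# 2PL item information: I_i(theta) = a_i^2 * P_i(theta) * (1 - P_i(theta))
item_info <- function(theta, a, b) {
  p <- plogis(a * (theta - b))
  a^2 * p * (1 - p)
}

theta_grid <- seq(-4, 4, by = 0.1)
a <- c(1.5, 1.0, 2.0)   # illustrative discriminations
b <- c(-1, 0, 1)        # illustrative difficulties

# test information is the sum of the item informations;
# the standard error then follows as 1 / sqrt(I(theta))
test_info <- rowSums(mapply(item_info, a, b, MoreArgs = list(theta = theta_grid)))
se <- 1 / sqrt(test_info)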
Analysis of the Test
Let us now apply the IRT model to the responses on the 16 test items in the Rohrer study. After loading the mirt package and the data, we fit a 2PL model. We then plot the curves for each of the 16 items to check whether the estimated parameters lead to models that fit the data. Since the data are {0,1}, the dots are binned data, shown with 95% confidence intervals around them.
library(mirt)

# Fit a 1-dimensional 2PL model to the {0,1} responses in quest
fit1 <- mirt(quest,
             1,
             itemtype = '2PL')

# Plot the empirical fit of the model for each of the 16 items
for (i in seq_along(quest)) {
  ItemPlot <- itemfit(fit1,
                      group.bins = 15,
                      empirical.plot = i,
                      empirical.CI = .95,
                      method = 'ML')
  print(ItemPlot)
}
First of all, the estimated models appear to fit the data reasonably well. Second, we note that questions 1-4 (the G items) give a humongous amount of information for students with θ around 0.33, their estimated difficulty, and almost none elsewhere: their curves are practically vertical. This shows up in the estimated item parameters as well:
# Extract the item parameters in the classical IRT parameterization
fit1_coefs <- coef(fit1,
                   simplify = TRUE,
                   IRTpars = TRUE)

# Latent ability estimate for each student
latent_scores <- as.vector(fscores(fit1))

# The discrimination parameters a
print(fit1_coefs$items[, 1])
## G1 G2 G3 G4 I1 I2 I3 I4
## 19.986196 28.201574 31.570392 30.585809 1.771157 1.521330 1.837990 1.682591
## E1 E2 E3 E4 C1 C2 C3 C4
## 1.416098 1.534383 1.062528 1.333830 1.036982 1.171095 1.032342 1.171904
The first four values of the discrimination parameter a, belonging to the items G1-G4, are extreme: around 20 to 30, against values between 1 and 2 for all other items. These are the near-vertical curves we just saw.
Next, after loading the WrightMap package, we plot a so-called Wright map, which shows the item difficulties as dots alongside a histogram of the estimated latent abilities θ.
library(WrightMap)
wrightMap(latent_scores, fit1_coefs$items[, 2])
## [,1]
## G1 0.326588911
## G2 0.314661307
## G3 0.326976070
## G4 0.328789665
## I1 -0.244611823
## I2 -0.031869313
## I3 -0.267937550
## I4 -0.128697944
## E1 0.555965627
## E2 0.487270267
## E3 0.333455121
## E4 0.526004490
## C1 -0.925990531
## C2 -0.009478124
## C3 -0.817776980
## C4 0.006975076
We see that the average θ lies around zero, and that the item difficulties all fall within the central range of the ability distribution. The C items C1 and C3 are the easiest, while the E items are among the most difficult.
Next, we want to get a feel for the information that each of the items contributes. We therefore plot the item information functions.
plot(fit1, type='infotrace')
We only see the information curves for the first four items G1-G4, because their peaks are so extreme that the other curves are flattened against the x-axis. We therefore adjust the scale of the y-axis to get a better picture of the contribution of the other items.
plot(fit1, type='infotrace',ylim=c(0,1))
As the test information function simply sums the item information functions, it makes sense to plot the latter together in one graph. There we see the contribution of each item at every level of θ.
plot(fit1, type='infotrace', facet_items=F,ylim=c(0,1))
We are now ready to plot the standard error at each level of θ, together with the test information.
plot(fit1, type='infoSE')
This plot comes with the mirt package, like all of the others. I am not sure whether double y-axes are ever a good idea, but here it is at least instructive to see the relation between the information and the standard error. We note that the standard errors are extremely small for students with roughly average ability, where the G items contribute their enormous information, and considerably larger towards the extremes of the θ scale.
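To see where these two curves come from, we can recompute them ourselves from the fitted model; mirt's testinfo() returns the test information at a grid of ability values:

# Recompute the two curves of the infoSE plot by hand
Theta <- matrix(seq(-4, 4, by = 0.1))  # grid of theta values
info <- testinfo(fit1, Theta)          # test information I(theta)
se <- 1 / sqrt(info)                   # SE(theta) = 1 / sqrt(I(theta))
head(data.frame(theta = Theta[, 1], info = info, SE = se))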
So what have we gained by using IRT? To begin to answer this question, we plot the latent ability scores against the standardized sum of points scored on the test.
rohrer_irt <- rohrer %>%
  mutate(irt = latent_scores,
         total = rowSums(.[grep("[A-Z]+[1-4]", names(.))]),
         stand_total = (total - mean(total)) / sd(total[Group == "block"]))

rohrer_irt %>%
  ggplot(aes(x = stand_total, y = irt)) +
  geom_point() +
  labs(x = "standardized point score",
       y = quote(theta))
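As a quick numeric check alongside the plot, we can compute the correlation between the two scores (nothing here beyond the data frame we just built):

# Correlation between the latent ability estimates and the standardized point scores
with(rohrer_irt, cor(stand_total, irt))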
We notice a correlation, but we also observe that the IRT model pulls some students' latent scores away from a smooth relation with the raw point score. The obvious suspects are the first four items, so we refit the model without them.
# Refit the model without the first four items G1-G4
quest2 <- rohrer %>%
  select(I1:C4) %>%
  as.data.frame()

fit2 <- mirt(quest2,
             1,
             itemtype = '2PL')

latent_scores2 <- as.vector(fscores(fit2))

quest2 %>%
  mutate(irt = latent_scores2,
         total = rowSums(.[grep("[A-Z]+[1-4]", names(.))]),
         stand_total = (total - mean(total)) / sd(total)) %>%
  ggplot(aes(x = stand_total, y = irt)) +
  geom_point() +
  labs(x = "standardized point score",
       y = quote(theta))
So the IRT model can't handle the first four questions on the test. Visually, we already noted in the first post that the responses to these items were binary: all wrong or all right. So it is not that surprising that the IRT model has difficulty mapping them onto a continuous scale; they simply don't lie on one.
Leaving out the first four items does mean we lose information, though, as can be seen when we compare the plot of the standard errors below with the plot of the standard errors for the full model with all the items.
plot(fit2, type='infoSE')
A separate issue is whether we are justified in treating the test as unidimensional at all, given that it covers four distinct topics. We will scrutinize this assumption in a future post.
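As a preview, one way to probe this assumption with mirt is to fit a model with more than one latent dimension and compare it to the unidimensional fit; a minimal sketch (the choice of two factors is arbitrary here):

# Fit an exploratory two-dimensional 2PL model on the reduced item set
# and compare it to the unidimensional model fit2
fit_2dim <- mirt(quest2, 2, itemtype = '2PL')
anova(fit2, fit_2dim)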
Another question is how we can use IRT to estimate the difference between the two groups in the Rohrer study, which is what we are ultimately concerned with. This question will also be taken up in a future post.