Question Difficulty and Student Ability in Classroom Assessments
Estimating the difficulty of questions is a crucial step in designing quizzes or tests, as it directly impacts assessment validity, student motivation, and instructional feedback. In educational measurement, question difficulty refers to the proportion of students expected (or observed) to answer an item correctly. A balanced distribution of difficulty levels ensures that the assessment can discriminate effectively between learners of different ability levels while maintaining fairness and engagement.
Valid assessments capture both lower-order and higher-order learning outcomes, ensuring that they measure what they intend to. Tests with questions that are either too easy or too difficult fail to provide meaningful insights into student learning (Haladyna & Rodriguez, 2013). Overly difficult tests discourage learners and reduce self-efficacy, while tests that are too easy will not challenge students enough to demonstrate higher-level thinking. Analyzing question difficulty also helps to identify which topics are well-understood and which require reteaching. According to classical test theory and item response theory, difficulty indices are essential for improving instructional design and for developing adaptive assessments that personalise learning (Embretson & Reise, 2013). Classical Test Theory (CTT) and early models of Item Response Theory (IRT) are less computationally demanding and easier to implement, and they still provide meaningful insights into item quality and test performance.
Understanding Question Difficulty with Classical Test Theory (CTT)
Classical Test Theory (CTT) remains one of the foundational models in educational measurement. It is widely used in classroom and small-scale testing contexts because of its simplicity and applicability to modest sample sizes. The core idea of CTT is that an observed score (X) is made of two parts, the true score (T) and error (E):

$$X = T + E$$
The teacher never observes T directly; instead, CTT provides methods to estimate it by reducing measurement error (Omer, 2017).

- Item Difficulty (p-value) is the proportion of students who answer an item correctly. The scale ranges from 0 (hardest) to 1 (easiest); items with p-values of 0.3–0.7 are moderately difficult and most informative (Haladyna & Rodriguez, 2013).

- Item Discrimination (rpb) describes how well an item differentiates between high and low scorers. It is the point-biserial correlation between the item score and the total test score.

- CTT provides several indices of internal consistency. Cronbach’s Alpha is the most common reliability coefficient. A reliability of 0.70–0.80 is acceptable for classroom tests, whereas >0.90 is desirable for high-stakes testing (Tavakol & Dennick, 2011).
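The first two indices above can be computed directly from a 0/1 response matrix. The following is a minimal sketch on a small made-up matrix (`responses` is hypothetical, not the quiz data used later), with the point-biserial taken as the plain Pearson correlation between each item and the total score, as described above.

```python
import numpy as np

# Hypothetical 6-student x 4-item response matrix (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])

def item_difficulty(x):
    """Proportion of students answering each item correctly (the p-value)."""
    return x.mean(axis=0)

def point_biserial(x):
    """Pearson correlation between each item score and the total test score."""
    total = x.sum(axis=1)
    return np.array([np.corrcoef(x[:, i], total)[0, 1] for i in range(x.shape[1])])

p_values = item_difficulty(responses)
r_pb = point_biserial(responses)
for i, (p, r) in enumerate(zip(p_values, r_pb), start=1):
    print(f"Item {i}: p = {p:.2f}, r_pb = {r:.2f}")
```

An item that every student answers correctly (or incorrectly) has zero variance, so its correlation is undefined; such items would need to be filtered out before calling `point_biserial`.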

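For dichotomously scored items, Cronbach’s alpha reduces to the Kuder–Richardson formula 20 (KR-20):

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)$$

Where:
- k = number of items,
- p_i = difficulty of item i,
- q_i = 1 − p_i,
- σ_X² = variance of total scores.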
Benchmarks (Tavakol & Dennick, 2011):
- 0.70 – 0.79 = acceptable.
- 0.80 – 0.89 = good.
- ≥0.90 = excellent but may indicate redundancy if too high.
The standard error of measurement (SEM) provides a range around each student’s observed score:

$$SEM = SD\sqrt{1 - \alpha}$$

Where:
- SD = standard deviation of the test scores,
- α = reliability of the test (Cronbach’s alpha or KR-20).
Suppose the test mean = 70, SD = 10 and α = 0.84; then SEM = 10 × √(1 − 0.84) = 4. Therefore, a student who scored 70 has a likely “true score” between 66 and 74, that is, the observed score ± 1 SEM.
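As a rough illustration of the reliability and SEM calculations, the sketch below computes KR-20 on a small made-up matrix and reproduces the worked example (SD = 10, α = 0.84). The matrix `responses` is hypothetical, and the population variance (`ddof=0`) is an assumption chosen so that the total-score variance is on the same footing as the item variances p·q.

```python
import numpy as np

def kr20(x):
    """KR-20 reliability for a students x items matrix of 0/1 scores."""
    k = x.shape[1]
    p = x.mean(axis=0)                      # item difficulties
    q = 1.0 - p
    total_var = x.sum(axis=1).var(ddof=0)   # variance of total scores
    return (k / (k - 1)) * (1.0 - np.sum(p * q) / total_var)

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * np.sqrt(1.0 - reliability)

# Hypothetical 6-student x 4-item response matrix.
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])
alpha = kr20(responses)
print(f"KR-20 = {alpha:.2f}")               # 0.62 for this toy matrix

# Worked example from the text: SD = 10, alpha = 0.84, observed score = 70.
sem = standard_error_of_measurement(sd=10, reliability=0.84)
band = (70 - sem, 70 + sem)
print(f"SEM = {sem:.1f}, band = {band[0]:.0f} to {band[1]:.0f}")  # SEM = 4.0, band = 66 to 74
```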
Classical Test Theory (CTT) is an entry-level psychometric framework that offers valuable insights for educators. CTT can be implemented using relatively small groups of students like 100–120 learners. This makes it highly practical for classroom settings.
However, its sample dependence, test dependence, and limited modelling of error restrict its suitability for large-scale, high-stakes, or adaptive testing environments. CTT relies heavily on total test scores, which do not account for the specific interaction between an individual’s ability and an item’s difficulty. This means two individuals with the same score may have very different abilities if they answered different sets of items with varying difficulties (Magno, 2009). Thus, the lack of probabilistic modelling in Classical Test Theory (CTT) leads to skewed judgments about both items and examinees (Virzi et al., 2025). This absence makes judgments less objective and less comparable across different tests or populations.
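The point about equal raw scores can be made concrete with a small sketch. It assumes the logistic (Rasch-type) model introduced in the next section and treats item difficulties as known; the two test forms and the response pattern are made-up values for illustration only.

```python
import numpy as np

def ability_mle(responses, difficulties, iters=50, tol=1e-8):
    """Newton-Raphson MLE of one student's ability, with difficulties held fixed."""
    x = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        grad = np.sum(x - p)            # first derivative of the log-likelihood
        hess = -np.sum(p * (1.0 - p))   # second derivative (always negative)
        step = grad / hess
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Two hypothetical 5-item forms: the hard form is the easy form shifted by +2 logits.
easy_form = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_form = [0.0, 0.5, 1.0, 1.5, 2.0]
pattern = [1, 1, 1, 0, 0]               # both students score 3 out of 5

theta_easy = ability_mle(pattern, easy_form)
theta_hard = ability_mle(pattern, hard_form)
print(f"easy form: theta = {theta_easy:.2f}, hard form: theta = {theta_hard:.2f}")
```

Both students have the same total score, yet the student who earned it on the harder form is placed two logits higher, which is exactly the information a raw total discards.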
Linking Question Difficulty and Student Ability with Probability
Adaptive testing tailors the difficulty of items presented to each test-taker by selecting items near their current estimated ability level. This helps to maximize measurement precision and efficiency (Kostikov et al., 2022). The Rasch model, a type of Item Response Theory (IRT), uses probability to estimate how likely a person with a given ability is to answer an item of specific difficulty correctly. It assumes that a correct response is a logistic function of the difference between a person’s ability and an item’s difficulty (Bond, 2015).

$$P(X_{ni} = 1) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}$$

Where P(Xni = 1) is the probability that person n answers item i correctly, θn is the person’s latent ability, and bi is the item difficulty.
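A short sketch of this function shows how the probability of success depends only on the gap between ability and difficulty. The function name `rasch_probability` and the example gaps are illustrative choices, not part of the original analysis.

```python
import numpy as np

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Probability of success for ability gaps of -2 to +2 logits relative to the item.
for gap in (-2, -1, 0, 1, 2):
    print(f"theta - b = {gap:+d}: P = {rasch_probability(gap, 0.0):.3f}")
# theta - b = -2: P = 0.119
# theta - b = -1: P = 0.269
# theta - b = +0: P = 0.500
# theta - b = +1: P = 0.731
# theta - b = +2: P = 0.881
```

When ability equals difficulty the probability is exactly 0.5, and each additional logit of advantage multiplies the odds of success by e (about 2.72).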
According to Bond (2015), Rasch analysis yields item difficulty estimates that remain consistent regardless of the sample, as well as person ability estimates that are not dependent on the specific test form, assuming the model demonstrates adequate fit. However, the Rasch model assumes that all test items measure the same latent trait, a property known as unidimensionality (Eugenio & Silvia, 2013). This means every item on the test is expected to reflect only one ability or trait, such as mathematical skill or reading comprehension. The model insists that variation in item response is explained solely by differences in the single trait among individuals, not by multiple abilities or secondary factors. Thus, if real data violate the assumptions of unidimensionality (i.e., items measure more than one trait) or equal discrimination (some items are much better at discriminating than others), the model fit will deteriorate (Eugenio & Silvia, 2013). In these situations, item and person measures cannot be trusted, undermining the Rasch model’s fairness and generalizability.
To deepen the understanding, Rasch measurement was applied to student response data collected from two sessions of a classroom quiz. Student responses were coded dichotomously as correct=1, incorrect=0. In total, 83 students attempted the quizzes, producing a response matrix of 83×13. The datasets from both sessions were merged into a unified student × item matrix. Missing responses were treated as incorrect (coded as 0). Student identifiers were retained for mapping ability estimates back to individuals.
import pandas as pd
import numpy as np

# Load the workbook and list its sheets
file_path = "psy_scores.xlsx"
xls = pd.ExcelFile(file_path)
xls.sheet_names
# Output: ['Session 1', 'Session 2', 'Question bank']

# Load the two session sheets
s1 = pd.read_excel(file_path, sheet_name="Session 1")
s2 = pd.read_excel(file_path, sheet_name="Session 2")

# Merge by student ID (outer join in case some students only appear in one session);
# items a student did not attempt become NaN and are then scored 0 (incorrect)
df = pd.merge(s1, s2, on="ID", how="outer").fillna(0)

# Separate IDs and response matrix
student_ids = df["ID"].astype(int).values
item_cols = [c for c in df.columns if c != "ID"]
X = df[item_cols].apply(pd.to_numeric, errors="coerce").fillna(0).clip(0, 1).astype(int).values
N, I = X.shape
N, I
# Output: (93, 13)
The Rasch model (1PL IRT) was implemented using Newton–Raphson updates for both student abilities (θ) and item difficulties (β). As discussed earlier, the Rasch (1PL) model is defined by a sigmoid (logistic) function:

$$P(X_{ni} = 1 \mid \theta_n, \beta_i) = \frac{1}{1 + e^{-(\theta_n - \beta_i)}}$$

This is the probability of a correct response given learner ability (θ) and item difficulty (β).
At θn = βi, P = 0.5. Thus, difficulty is “where the item turns from unlikely to likely”, and ability is “where the learner sits on that same scale”.
We fit θ and β with an alternating Newton (MLE) routine and mild ridge regularization for stability with short tests or extreme patterns.
def sigmoid(x):
    x = np.clip(x, -35, 35)  # numeric stability
    return 1.0 / (1.0 + np.exp(-x))

# Initialize parameters
theta = np.zeros(N)  # abilities
beta = np.zeros(I)   # difficulties
lam = 0.25           # small L2 penalty prevents blow-ups with all-1/all-0 rows/cols

def update_theta(theta0, b_vec, x_row, lam=0.25, iters=30, tol=1e-6):
    # Newton-Raphson on the penalised log-likelihood of one student's ability
    t = float(theta0)
    for _ in range(iters):
        p = sigmoid(t - b_vec)
        grad = np.sum(x_row - p) - lam * t   # first derivative w.r.t. theta
        hess = -np.sum(p * (1 - p)) - lam    # second derivative (always negative)
        if hess == 0: break
        step = grad / hess
        t -= step
        if abs(step) < tol: break
    return float(np.clip(t, -6, 6))

def update_beta(b0, t_vec, x_col, lam=0.25, iters=30, tol=1e-6):
    # Newton-Raphson on the penalised log-likelihood of one item's difficulty
    b = float(b0)
    for _ in range(iters):
        p = sigmoid(t_vec - b)
        grad = np.sum(p - x_col) - lam * b   # first derivative w.r.t. beta
        # The second derivative w.r.t. beta is negative, as for theta; with a
        # positive sign here the step b -= grad / hess moves away from the maximum.
        hess = -np.sum(p * (1 - p)) - lam
        if hess == 0: break
        step = grad / hess
        b -= step
        if abs(step) < tol: break
    return float(np.clip(b, -6, 6))
Parameters were then iteratively updated until convergence, with both sets of estimates centred at mean zero for identifiability.
- Person ability estimates (θ) represent the latent trait of each student.
- Item difficulty estimates (β) represent the location on the ability scale at which a student has a 50% probability of answering correctly.
# Alternating MLE with centering for identifiability
for _ in range(60):
    for n in range(N):
        theta[n] = update_theta(theta[n], beta, X[n, :], lam=lam)
    theta -= theta.mean()
    for i in range(I):
        beta[i] = update_beta(beta[i], theta, X[:, i], lam=lam)
    beta -= beta.mean()
theta -= theta.mean()

df_abilities = pd.DataFrame({"StudentID": student_ids, "Ability (theta)": theta}).sort_values("StudentID")
df_difficulties = pd.DataFrame({"QID": item_cols, "Difficulty (beta)": beta})
df_abilities.head(), df_difficulties
(StudentID Ability (theta)
0 2.530966
0 6.461878
1 -5.538122
2 6.461878
4 -0.057348,
QID Difficulty(beta)
Q1 3.692308
Q2 3.692308
Q3 3.692308
Q4 3.692308
Q5 3.692308
Q6 3.692308
Q7 -8.307692
Q8 3.692308
Q9 3.692308
Q10 3.692308
Q11 -8.307692
Q12 -8.307692
Q13 -8.307692)
While these logits indicate contrasts in ability and difficulty, the spread is unusually wide. Every item in the listing sits at either 3.69 or –8.31, exactly 12 logits apart, which equals the width of the [–6, 6] clipping range in the update functions once the estimates are mean-centred. Values pinned at the bounds in this way mark estimates that ran to the edge of the allowed scale, so they are best read as bound values rather than precise logit locations. A short test with a limited number of responses makes this more likely.
In Rasch modeling, both person ability (θ) and item difficulty (β) are placed on the same logit scale, allowing for a direct probabilistic comparison. The estimates are derived using Maximum Likelihood Estimation (MLE), which iteratively seeks the parameter values under which the observed response patterns are most likely. The ridge penalty used here additionally pulls extreme estimates towards zero, which keeps the routine stable on a limited dataset.
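One way to check an estimation routine of this kind is to simulate responses from known parameters and see whether the estimates track them. The sketch below is self-contained and uses its own compact, vectorised version of the alternating Newton updates; the sample size, seed, penalty and number of sweeps are assumptions for illustration. It also shifts θ and β by the same constant when centring, so that the differences θ − β (and hence the fitted probabilities) are unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -35, 35)))

def fit_rasch(x, lam=0.25, sweeps=40, newton_steps=5, bound=6.0):
    """Alternating Newton updates for a ridge-penalised Rasch model."""
    n_persons, n_items = x.shape
    theta = np.zeros(n_persons)
    beta = np.zeros(n_items)
    for _ in range(sweeps):
        for _ in range(newton_steps):        # abilities, all students at once
            p = sigmoid(theta[:, None] - beta[None, :])
            grad = (x - p).sum(axis=1) - lam * theta
            hess = -(p * (1 - p)).sum(axis=1) - lam
            theta = np.clip(theta - grad / hess, -bound, bound)
        for _ in range(newton_steps):        # difficulties, all items at once
            p = sigmoid(theta[:, None] - beta[None, :])
            grad = (p - x).sum(axis=0) - lam * beta
            hess = -(p * (1 - p)).sum(axis=0) - lam
            beta = np.clip(beta - grad / hess, -bound, bound)
        shift = beta.mean()                  # same shift keeps theta - beta unchanged
        beta -= shift
        theta -= shift
    return theta, beta

# Simulate 300 hypothetical students on 13 items with known difficulties.
rng = np.random.default_rng(0)
true_theta = rng.normal(0.0, 1.0, size=300)
true_beta = np.linspace(-2.0, 2.0, 13)
prob = sigmoid(true_theta[:, None] - true_beta[None, :])
x_sim = (rng.random(prob.shape) < prob).astype(float)

theta_hat, beta_hat = fit_rasch(x_sim)
recovery = np.corrcoef(true_beta, beta_hat)[0, 1]
print(f"correlation between true and estimated difficulty: {recovery:.3f}")
```

If the estimated difficulties correlate strongly with the simulated ones and stay well inside the clipping bounds, the routine is behaving as intended; estimates that pile up at the bounds on simulated data would point to a problem in the updates rather than in the students.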
Typically, ability estimates for educational data cluster within –2.0 to +2.5 logits, while item difficulties are often within –1.5 to +1.8 logits (Bond, 2015). However, in the current dataset a much wider spread is observed.
- Person abilities (θ) range from –5.54 to +6.46 logits (SD = 4.20).
- Item difficulties (β) range from –8.31 to +3.69 logits (SD = 5.76).
This suggests two key phenomena:
- Heterogeneity in the sample: students differ substantially in their mastery levels, with some far below and others far above the expected Rasch range.
- High variance in item difficulties: the test includes both extremely easy items (β ≈ –8.31) and comparatively difficult items (β ≈ +3.69).
The questions with β ≈ –8.31 are so easy that nearly all students, regardless of ability, would have a high probability (>95%) of answering them correctly. These items contribute little to differentiating higher-ability students. The items with β ≈ +3.69, by contrast, fall well above the typical classroom test range. Only the most advanced students (θ > +3) are likely to succeed consistently on these questions. Students with θ between –2 and +2 logits can be considered within the conventional range of classroom performance. Those at θ < –2 require significant scaffolding and remediation, while those at θ > +2 demonstrate advanced competence and may benefit from enrichment tasks. Furthermore, with only a handful of items, students who answer all items correctly (or none) are pushed to extreme θ values. Similarly, items that are always answered correctly (or always missed) are given extreme β values (very easy or very hard), even if they may not be that polarizing.
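Because perfect and zero scores have no finite maximum likelihood estimate, it can help to flag them before interpreting any logits. The sketch below is one possible check on a 0/1 response matrix; the function name `flag_extremes` and the demo matrix are hypothetical.

```python
import numpy as np

def flag_extremes(x):
    """Return row/column indices of response patterns with no finite Rasch MLE."""
    x = np.asarray(x)
    n_persons, n_items = x.shape
    person_scores = x.sum(axis=1)
    item_scores = x.sum(axis=0)
    return {
        "perfect_persons": np.flatnonzero(person_scores == n_items),
        "zero_persons": np.flatnonzero(person_scores == 0),
        "always_correct_items": np.flatnonzero(item_scores == n_persons),
        "never_correct_items": np.flatnonzero(item_scores == 0),
    }

# Hypothetical 5-student x 3-item matrix.
demo = np.array([
    [1, 1, 1],   # perfect score
    [0, 0, 0],   # zero score
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
])
flags = flag_extremes(demo)
for name, idx in flags.items():
    print(name, idx.tolist())
```

Students and items flagged this way are the ones whose estimates are determined mainly by the penalty and the clipping bounds, so their logits deserve the most caution.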
Rasch modeling effectively aligns student ability with item difficulty, but its accuracy depends on having enough items and responses. Adding more items, particularly of mid-range difficulty, would smooth out extremes and provide more reliable inferences about both student ability and item quality. In practice, instructors can construct adaptive assessments by assigning items whose difficulty (β) lies close to each student’s estimated ability (θ). Such targeting increases both measurement precision and instructional relevance (Boone et al., 2014).
However, learner ability was estimated jointly with item difficulty using a penalised maximum likelihood routine rather than marginal maximum likelihood or fully Bayesian methods. Joint estimation of this kind introduces bias at the extremes, particularly for students who answered all items correctly or incorrectly. In this analysis, all questions were also assumed to be equally effective at distinguishing between students. In reality, some questions are better at separating strong from weak students than others (Embretson & Reise, 2013).
Combining Bloom’s Taxonomy with Question Difficulty and Student Ability
While Rasch provides a statistical picture of ability and difficulty, it does not capture the cognitive demand of tasks. This is where Bloom’s Taxonomy can complement psychometric calibration. Items mapped to lower levels such as Remember and Understand are expected to cluster at lower logits, while those requiring higher-order skills such as Apply, Analyse, Evaluate and Create should align with higher logits. Integrating Bloom’s taxonomy into item design ensures that difficulty is not reduced merely to statistical rarity of correct responses but also reflects intended learning outcomes (Krathwohl, 2002).

Bloom’s Taxonomy provides a structured and hierarchical framework for defining, planning, and assessing educational goals, learning activities, and outcomes. It ensures that both teaching and assessment progress from basic recall to higher-order thinking skills like analysis, evaluation, and creation.
Step 1: Linking Cognitive Demand to Question Difficulty
Classify quiz questions into one of Bloom’s six levels based on the mental process required.
- Q1 (β = -1.83) → simple recall → Remember.
- Q2 (β = -1.11) → recognition with basic comprehension → Understand.
- Q3 (β = 0.52) → applying a rule in a new situation → Apply.
- Q4 (β = 0.91) → comparing alternatives or identifying patterns → Analyse.
- Q5 (β = 1.51) → making a judgment with justification → Evaluate.
Step 2: Align Items with Rasch Difficulty (β)
Once items are tagged with Bloom’s levels, compare their placement on the Rasch difficulty scale.
- Lower-order skills (Remember, Understand) generally appear at negative logits, confirming their accessibility.
- Higher-order skills (Apply, Analyse, Evaluate) appear at positive logits, showing increased challenge.
This dual calibration ensures that item difficulty is understood both statistically (probability of success) and cognitively (type of thinking required).
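The comparison in Step 2 can be sketched numerically using the five example items from Step 1. The code below checks whether Bloom level and Rasch difficulty rise together, using a rank correlation written with NumPy only; it assumes no tied values, and the `items` dictionary simply restates the example β values listed above.

```python
import numpy as np

BLOOM_ORDER = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]

# Example items from Step 1: (Bloom level, Rasch difficulty in logits).
items = {
    "Q1": ("Remember", -1.83),
    "Q2": ("Understand", -1.11),
    "Q3": ("Apply", 0.52),
    "Q4": ("Analyse", 0.91),
    "Q5": ("Evaluate", 1.51),
}

levels = np.array([BLOOM_ORDER.index(tag) for tag, _ in items.values()])
betas = np.array([beta for _, beta in items.values()])

def spearman(a, b):
    """Spearman rank correlation for data without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

rho = spearman(levels, betas)
lower_mean = betas[levels <= 1].mean()    # Remember, Understand
higher_mean = betas[levels >= 2].mean()   # Apply and above
print(f"rank correlation = {rho:.2f}")
print(f"mean beta, lower-order = {lower_mean:.2f}, higher-order = {higher_mean:.2f}")
```

For these five items the ordering agrees perfectly; in a real item bank, items whose β sits far from others at the same Bloom level are the natural candidates for review in Step 5.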
Step 3: Interpreting Student Ability with Bloom’s Levels
Each student’s Rasch ability estimate (θ) can now be interpreted in terms of Bloom’s levels:
- Learners at θ ≈ -0.3 logits → suitable for Remember and Understand items (Q1, Q2).
- Learners at θ ≈ 0.3–0.9 logits → appropriate for Apply and Analyse items (Q3, Q4).
- Learners at θ > 1.5 logits → capable of tackling Evaluate-level items (Q5).
This framework enables instructors to tailor assessments to individual needs, allowing students who require additional support to focus on foundational skills while providing more advanced learners with opportunities to engage in higher-order cognitive tasks.
Step 4: Adapting Questions to Student Ability
Using both β (difficulty) and Bloom’s tags:
- Assign items that are within ±0.5 logits of a student’s θ for statistical fairness.
- Within this range, ensure a balanced selection of Bloom’s levels to effectively scaffold learning.
- A weaker learner might receive 2 “Remember” items and 1 “Understand” item.
- A stronger student might receive 1 “Analyse” and 2 “Evaluate” items.
This approach avoids overloading learners with items either too easy or too cognitively demanding and promotes stepwise growth through Bloom’s hierarchy.
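The selection rule in Step 4 could be sketched as follows. The item bank, its β values and the helper `select_items` are all hypothetical; the ±0.5 logit window comes from the text, while the tie-breaking rule (closest items first, one per Bloom level before repeats) is an assumption about what "balanced" might mean in practice.

```python
def select_items(theta, item_bank, window=0.5, n_items=3):
    """Pick up to n_items within +/- window logits of theta, spread across Bloom levels."""
    candidates = [it for it in item_bank if abs(it["beta"] - theta) <= window]
    candidates.sort(key=lambda it: abs(it["beta"] - theta))   # closest first

    chosen, seen_levels = [], set()
    # First pass: at most one item per Bloom level.
    for it in candidates:
        if len(chosen) < n_items and it["bloom"] not in seen_levels:
            chosen.append(it)
            seen_levels.add(it["bloom"])
    # Second pass: fill any remaining slots with the next-closest items.
    for it in candidates:
        if len(chosen) >= n_items:
            break
        if it not in chosen:
            chosen.append(it)
    return chosen

# Hypothetical calibrated item bank.
item_bank = [
    {"id": "A", "bloom": "Remember", "beta": -1.8},
    {"id": "B", "bloom": "Remember", "beta": -1.5},
    {"id": "C", "bloom": "Understand", "beta": -1.2},
    {"id": "D", "bloom": "Apply", "beta": 0.4},
    {"id": "E", "bloom": "Analyse", "beta": 0.9},
    {"id": "F", "bloom": "Evaluate", "beta": 1.4},
    {"id": "G", "bloom": "Evaluate", "beta": 1.7},
]

weaker = select_items(theta=-1.5, item_bank=item_bank)
stronger = select_items(theta=1.5, item_bank=item_bank)
print("weaker learner:", sorted(it["id"] for it in weaker))
print("stronger learner:", sorted(it["id"] for it in stronger))
```

With this bank the weaker learner receives two Remember items and one Understand item, while the stronger learner receives only the two Evaluate items because nothing else falls inside the window, which illustrates why a larger item bank is needed for fine-grained targeting.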
Step 5: Reflect and Revise
After administering Bloom-tagged, Rasch-calibrated tests:
- Review item fit and student performance.
- Adjust Bloom’s classification if items do not function as expected. For example, a supposedly “Apply” item may prove so easy that it is actually functioning as an “Understand” item.
- Expand the item bank.
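For the item-fit review, two statistics commonly reported in Rasch analysis are the infit and outfit mean squares, both built from the residuals between observed responses and model probabilities. The sketch below computes them on simulated data, so the abilities, difficulties and seed are assumptions; with real data, `theta` and `beta` would be the calibrated estimates.

```python
import numpy as np

def item_fit(x, theta, beta):
    """Infit and outfit mean squares for each item under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    variance = p * (1.0 - p)               # model variance of each response
    sq_resid = (x - p) ** 2
    outfit = (sq_resid / variance).mean(axis=0)             # unweighted mean square
    infit = sq_resid.sum(axis=0) / variance.sum(axis=0)     # information-weighted
    return infit, outfit

# Simulate 1000 hypothetical students on 5 items that follow the model exactly.
rng = np.random.default_rng(1)
theta_demo = rng.normal(0.0, 1.0, size=1000)
beta_demo = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
p_demo = 1.0 / (1.0 + np.exp(-(theta_demo[:, None] - beta_demo[None, :])))
x_demo = (rng.random(p_demo.shape) < p_demo).astype(float)

infit, outfit = item_fit(x_demo, theta_demo, beta_demo)
for i, (a, b) in enumerate(zip(infit, outfit), start=1):
    print(f"Item {i}: infit = {a:.2f}, outfit = {b:.2f}")
```

Values near 1 mean the responses are about as noisy as the model expects. Values well above 1 suggest unexpected responses (for example, guessing or a second trait), and values well below 1 suggest responses that are more predictable than the model allows.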
When combined with Bloom’s Taxonomy, MLE-based Rasch estimates can be used not only for statistical scaling but also for pedagogical mapping. Instructors can link estimated item difficulties to specific cognitive levels, for example easier items with “Remembering” and harder items with “Analyzing” or “Creating”. This ensures that adaptive assessments support both fair measurement and deeper learning progression. In other words, coupling Rasch with Bloom’s Taxonomy ensures that difficulty is not treated as a purely statistical construct, but as one rooted in cognitive demand and learning progression.
References
- Anderson, L. W. (Ed.). (2009). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives (Abridged ed.). Longman.
- Bond, T. (2015). Applying the Rasch Model. Routledge. https://doi.org/10.4324/9781315814698
- Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch Analysis in the Human Sciences. Springer Netherlands. https://doi.org/10.1007/978-94-007-6857-4
- Embretson, S. E., & Reise, S. P. (2013). Item Response Theory. Psychology Press. https://doi.org/10.4324/9781410605269
- Eugenio, B., & Silvia, G. (2013). Unidimensionality in the Rasch model: How to detect and interpret. Statistica, 67(3) (2007), 253–261. https://doi.org/10.6092/ISSN.1973-2201/3508
- Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and Validating Test Items. Routledge. https://doi.org/10.4324/9780203850381
- Kostikov, A. A., Vlasenko, K. V., Lovianova, I. V., Volkov, S. V., & Avramov, E. O. (2022). Rusch model-based knowledge assessment algorithm. Educational Dimension, 6, 40–54. https://doi.org/10.31812/educdim.4482
- Krathwohl, D. R. (2002). A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2
- Magno, C. (2009). Demonstrating the Difference between Classical Test Theory and Item Response Theory Using Derived Test Data. CSN: General Cognitive Social Science (Topic), 1.
- Omer, H. (2017). Parental Vigilant Care: A Guide for Clinicians and Caretakers (1st ed.). Routledge. https://doi.org/10.4324/9781315624976
- Tavakol, M., & Dennick, R. (2011). Making Sense of Cronbach’s Alpha. International Journal of Medical Education, 2, 53–55. https://doi.org/10.5116/ijme.4dfb.8dfd
- Virzi, R., Bozzi, M., Costigliolo, M., & Zani, M. (2025, June 17). Comparison between Rasch analysis and Classical Test Theory for a Physics questionnaire validation. 11th International Conference on Higher Education Advances (HEAd’25). https://doi.org/10.4995/HEAd25.2025.20062
I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments and data visualization. My training approach emphasizes real-world application, clear interpretation of results and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.