Question Difficulty and Student Ability in Classroom Assessments

By Abhinash Jena on October 2, 2025

Estimating the difficulty of questions is a crucial step in designing quizzes or tests, as it directly impacts assessment validity, student motivation, and instructional feedback. In educational measurement, question difficulty refers to the proportion of students expected (or observed) to answer an item correctly. A balanced distribution of difficulty levels ensures that the assessment can discriminate effectively between learners of different ability levels while maintaining fairness and engagement.

Valid assessments capture both lower-order and higher-order learning outcomes, ensuring that they measure what they intend to. Tests with questions that are either too easy or too difficult fail to provide meaningful insights into student learning (Haladyna & Rodriguez, 2013). Overly difficult tests discourage learners and reduce self-efficacy, while tests that are too easy will not challenge students enough to demonstrate higher-level thinking. Analyzing question difficulty also helps to identify which topics are well understood and which require reteaching. According to classical test theory and item response theory, difficulty indices are essential for improving instructional design and for developing adaptive assessments that personalise learning (Embretson & Reise, 2013). Classical Test Theory (CTT) and early models of Item Response Theory (IRT) are less computationally demanding, easier to implement, and still provide meaningful insights into item quality and test performance.

Understanding Question Difficulty with Classical Test Theory (CTT)

Classical Test Theory (CTT) remains one of the foundational models in educational measurement. It is widely used in classroom and small-scale testing contexts because of its simplicity and applicability to modest sample sizes. The core idea of CTT is that an observed score (X) is the sum of two parts, the true score (T) and error (E):

$$X = T + E$$

The teacher never observes T directly; instead, CTT provides methods to estimate it by reducing measurement error (Omer, 2017).

Item Difficulty (p-value) is the proportion of students who answer an item correctly. The scale ranges from 0 (hardest) to 1 (easiest); items with p-values of 0.3–0.7 are moderately difficult and most informative (Haladyna & Rodriguez, 2013).

Item Discrimination (r_pb) indicates how well an item differentiates between high and low scorers. It is the point-biserial correlation between the item score and the total test score.
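Both indices can be computed directly from a scored response matrix. The following is a minimal sketch, assuming a small hypothetical 0/1 matrix (rows are students, columns are items); the function names `item_difficulty` and `point_biserial` are illustrative and not taken from any library.

```python
import numpy as np

# Hypothetical 6-student x 4-item response matrix (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])

def item_difficulty(resp):
    """CTT p-value: proportion of students answering each item correctly."""
    return resp.mean(axis=0)

def point_biserial(resp):
    """Pearson correlation of each 0/1 item score with the total test score.

    Returns NaN for an item that every student answered the same way,
    because its variance is zero.
    """
    total = resp.sum(axis=1)
    return np.array([np.corrcoef(resp[:, i], total)[0, 1]
                     for i in range(resp.shape[1])])

p_values = item_difficulty(responses)
r_pb = point_biserial(responses)

for i, (p, r) in enumerate(zip(p_values, r_pb), start=1):
    print(f"Item {i}: p = {p:.2f}, r_pb = {r:.2f}")
```

Because each item also contributes to the total score, this version slightly inflates r_pb on short tests; correlating the item with the total of the remaining items is a common alternative.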

CTT provides several indices of internal consistency. Cronbach’s Alpha is the most common reliability coefficient. A reliability of 0.70–0.80 is acceptable for classroom tests, whereas >0.90 is desirable for high-stakes testing (Tavakol & Dennick, 2011).

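For dichotomously scored items, Cronbach’s alpha takes the KR-20 form:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)$$

Where:

k = number of items,
p_i = difficulty of item i,
q_i = 1 − p_i,
σ²_X = variance of total scores.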
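The reliability coefficient can be computed from the quantities just listed. The sketch below assumes dichotomous (0/1) scoring, for which Cronbach’s alpha coincides with KR-20, and uses a small hypothetical response matrix; `kr20` is an illustrative name.

```python
import numpy as np

def kr20(resp):
    """KR-20 reliability for a students x items matrix of 0/1 scores."""
    resp = np.asarray(resp, dtype=float)
    k = resp.shape[1]                      # number of items
    p = resp.mean(axis=0)                  # item difficulties
    q = 1.0 - p
    var_total = resp.sum(axis=1).var()     # population variance of total scores
    return (k / (k - 1)) * (1.0 - np.sum(p * q) / var_total)

# Hypothetical 6-student x 4-item response matrix
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])

alpha = kr20(responses)
print(f"KR-20 = {alpha:.3f}")
```

The population variance (NumPy's default, `ddof=0`) is used so that the total-score variance is on the same footing as the item variances p·q.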

Benchmarks (Tavakol & Dennick, 2011):

0.70 – 0.79 = acceptable.
0.80 – 0.89 = good.
≥0.90 = excellent but may indicate redundancy if too high.

The standard error of measurement (SEM) provides a range around each student’s observed score within which the true score is likely to lie.

$$SEM = SD\sqrt{1 - \alpha}$$

Where:

SD = standard deviation of the test scores,
α = reliability of the test (Cronbach’s alpha or KR-20).

EXAMPLE

Suppose the test mean = 70, SD = 10 and α = 0.84; then SEM = 10 × √(1 – 0.84) = 4. Therefore, a student who scored 70 has a likely “true score” between 66 and 74 (the observed score ± 1 SEM).
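The arithmetic in this example can be reproduced in a few lines. This is only a sketch; the function name `standard_error_of_measurement` is illustrative.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

observed = 70
sem = standard_error_of_measurement(sd=10, reliability=0.84)
band = (observed - sem, observed + sem)   # observed score +/- 1 SEM

print(f"SEM = {sem:.1f}, likely true-score band = {band[0]:.0f} to {band[1]:.0f}")
```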

Classical Test Theory (CTT) is an entry-level psychometric framework that offers valuable insights for educators. CTT can be implemented with relatively small groups of students, such as 100–120 learners, which makes it highly practical for classroom settings.

However, its sample dependence, test dependence, and limited modelling of error restrict its suitability for large-scale, high-stakes, or adaptive testing environments. CTT relies heavily on total test scores, which do not account for the specific interaction between an individual’s ability and an item’s difficulty. This means two individuals with the same score may have very different abilities if they answered different sets of items with varying difficulties (Magno, 2009). Thus, the lack of probabilistic modelling in Classical Test Theory (CTT) leads to skewed judgments about both items and examinees (Virzi et al., 2025). This absence makes judgments less objective and less comparable across different tests or populations.

Linking Question Difficulty and Student Ability with Probability

Adaptive testing tailors the difficulty of items presented to each test-taker by selecting items near their current estimated ability level. This helps to maximize measurement precision and efficiency (Kostikov et al., 2022). The Rasch model, a type of Item Response Theory (IRT), uses probability to estimate how likely a person with a given ability is to answer an item of specific difficulty correctly. It assumes that a correct response is a logistic function of the difference between a person’s ability and an item’s difficulty (Bond, 2015).

$$P(X_{ni} = 1) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}$$

Where P(X_ni = 1) is the probability that person n answers item i correctly, θ_n is the person’s latent ability, and b_i is the item difficulty.
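To make the logistic relationship concrete, the short sketch below evaluates the probability of a correct response for a few ability-minus-difficulty gaps. The function name `rasch_probability` is illustrative.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Probability of success as the learner moves from 2 logits below
# to 2 logits above the item's difficulty
for gap in (-2, -1, 0, 1, 2):
    print(f"theta - b = {gap:+d}: P(correct) = {rasch_probability(gap, 0.0):.3f}")
```

Only the difference between ability and difficulty matters: a learner one logit above an item has the same chance of success whether the item is easy or hard in absolute terms.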

According to Bond (2015), Rasch analysis yields item difficulty estimates that remain consistent regardless of the sample, as well as person ability estimates that are not dependent on the specific test form, assuming the model demonstrates adequate fit. However, the Rasch model assumes that all test items measure the same latent trait, a property known as unidimensionality (EUGENIO & SILVIA, 2013). This means every item on the test is expected to reflect only one ability or trait, such as mathematical skill or reading comprehension. The model insists that variation in item response is explained solely by differences in the single trait among individuals, not by multiple abilities or secondary factors. Thus, if real data violate the assumptions of unidimensionality (i.e., items are measuring more than one trait) or equal discrimination (some items are much better at discriminating than others), the model fit will deteriorate (EUGENIO & SILVIA, 2013). In these situations, item and person measures cannot be trusted, undermining the Rasch model’s fairness and generalizability.

To make this concrete, Rasch measurement was applied to student response data collected from two sessions of a classroom quiz. Student responses were coded dichotomously (correct = 1, incorrect = 0). In total, 83 students attempted the quizzes, producing a response matrix of 83×13. The datasets from both sessions were merged into a unified student × item matrix. Missing responses were treated as incorrect (coded as 0). Student identifiers were retained for mapping ability estimates back to individuals.

import pandas as pd
import numpy as np

# Load the file
file_path = "psy_scores.xlsx"
xls = pd.ExcelFile(file_path)
xls.sheet_names

['Session 1', 'Session 2', 'Question bank']

# Load the data

s1 = pd.read_excel(file_path, sheet_name="Session 1")
s2 = pd.read_excel(file_path, sheet_name="Session 2")

# Merge by student ID (outer join in case some students only appear in one session)
# Items a student never saw become 0, i.e. they are scored as incorrect

df = pd.merge(s1, s2, on="ID", how="outer").fillna(0)

# Separate IDs and response matrix

student_ids = df["ID"].astype(int).values

item_cols = [c for c in df.columns if c != "ID"]

# Coerce to numeric, score unparseable cells as incorrect, and force 0/1 coding
X = df[item_cols].apply(pd.to_numeric, errors="coerce").fillna(0).clip(0, 1).astype(int).values

N, I = X.shape

N, I

(93, 13)

The Rasch model (1PL IRT) was implemented using Newton–Raphson updates for both student abilities (θ) and item difficulties (β). As discussed earlier, the Rasch (1PL) model is defined by a sigmoid (logistic) function, $P(X_{ni} = 1) = 1 / (1 + e^{-(\theta_n - \beta_i)})$. This is the probability of a correct response given learner ability (θ) and item difficulty (β).

EXAMPLE

At θ_n = β_i, P = 0.5. Thus, difficulty is “where the item turns from unlikely to likely”, and ability is “where the learner sits on that same scale”.

Thus, we fit θ and β with an alternating Newton (MLE) routine and mild ridge regularization for stability with short tests or extreme patterns.

def sigmoid(x):
  x = np.clip(x, -35, 35)  # numeric stability
  return 1.0 / (1.0 + np.exp(-x))

# Initialize parameters

theta = np.zeros(N)  # abilities
beta = np.zeros(I)   # difficulties
lam = 0.25  # small L2 penalty prevents blow-ups with all-1/all-0 rows/cols

def update_theta(theta0, b_vec, x_row, lam=0.25, iters=30, tol=1e-6):
  # Newton-Raphson on the penalised log-likelihood of one student's ability
  t = float(theta0)
  for _ in range(iters):
    p = sigmoid(t - b_vec)
    grad = np.sum(x_row - p) - lam * t    # d logL / d theta
    hess = -np.sum(p * (1 - p)) - lam     # second derivative, always negative
    if hess == 0: break
    step = grad / hess
    t -= step
    if abs(step) < tol: break
  return float(np.clip(t, -6, 6))

def update_beta(b0, t_vec, x_col, lam=0.25, iters=30, tol=1e-6):
  # Newton-Raphson on the penalised log-likelihood of one item's difficulty
  b = float(b0)
  for _ in range(iters):
    p = sigmoid(t_vec - b)
    grad = np.sum(p - x_col) - lam * b    # d logL / d beta
    hess = -np.sum(p * (1 - p)) - lam     # negative, same convention as update_theta;
                                          # a positive sign here would push b the wrong way
    if hess == 0: break
    step = grad / hess
    b -= step
    if abs(step) < tol: break
  return float(np.clip(b, -6, 6))

Parameters were then iteratively updated until convergence, with both sets of estimates centred at mean zero for identifiability.

Person ability estimates (θ) represent the latent trait of each student.
Item difficulty estimates (β) represent the location on the ability scale at which a student has a 50% probability of answering correctly.

# Alternating MLE with centering for identifiability

for _ in range(60):
  for n in range(N):
    theta[n] = update_theta(theta[n], beta, X[n, :], lam=lam)
  theta -= theta.mean()
  for i in range(I):
    beta[i] = update_beta(beta[i], theta, X[:, i], lam=lam)

beta -= beta.mean()
theta -= theta.mean()

df_abilities = pd.DataFrame({"StudentID": student_ids, "Ability (theta)": theta}).sort_values("StudentID")

df_difficulties = pd.DataFrame({"QID": item_cols, "Difficulty (beta)": beta})

df_abilities.head(), df_difficulties

(StudentID Ability (theta) 
     0     2.530966 
     0     6.461878 
     1    -5.538122 
     2     6.461878 
     4    -0.057348, 

QID  Difficulty(beta) 
 Q1   3.692308 
 Q2   3.692308 
 Q3   3.692308 
 Q4   3.692308 
 Q5   3.692308 
 Q6   3.692308 
 Q7  -8.307692 
 Q8   3.692308 
 Q9   3.692308 
Q10   3.692308 
Q11  -8.307692 
Q12  -8.307692 
Q13  -8.307692)

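While these logits indicate contrasts in ability and difficulty, the unusually wide spread should be read with caution. The two item values that appear, 3.692308 and –8.307692, are exactly 12 logits apart, which is the width of the [–6, 6] clipping window used in the update functions. The item estimates therefore sit at the clipping bounds (shifted afterwards by mean-centring) rather than at freely converged values. Constraints such as the limited number of items and responses contribute to this behaviour.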

In Rasch modeling, both person ability (θ) and item difficulty (β) are placed on the same logit scale, allowing for a direct probabilistic comparison. The estimates are derived using Maximum Likelihood Estimation (MLE), which iteratively seeks the parameter values that maximize the likelihood of the observed response patterns. This approach ensures that the model provides the “most probable” estimates of ability and difficulty given the limited dataset.

Typically, ability estimates for educational data cluster within –2.0 to +2.5 logits, while item difficulties are often within –1.5 to +1.8 logits (Bond, 2015). However, in the current dataset a much wider spread is observed.

Person abilities (θ) range from –5.54 to +6.46 logits (SD = 4.20).
Item difficulties (β) range from –8.31 to +3.69 logits (SD = 5.76).

This suggests two key phenomena:

Heterogeneity in the sample: students differ substantially in their mastery levels, with some far below and others far above the expected Rasch range.
High variance in item difficulties: the test includes both extremely easy items (β ≈ –8.31) and comparatively difficult items (β ≈ +3.69).

The questions with β ≈ –8.31 are so easy that nearly all students, regardless of ability, would have a high probability (>95%) of answering them correctly. These items contribute little to differentiating higher-ability students. In contrast, the items with β ≈ +3.69 fall well above the typical classroom test range. Only the most advanced students (θ > +3) are likely to succeed consistently on these questions. Students with θ between –2 and +2 logits can be considered within the conventional range of classroom performance. Those at θ < –2 require significant scaffolding and remediation, while those at θ > +2 demonstrate advanced competence and may benefit from enrichment tasks. Furthermore, with only a handful of items, students who answer all items correctly (or none) are pushed to extreme θ values. Similarly, items that are always answered correctly (or always missed) are given extreme β values (very easy or very hard), even if they may not be that polarizing.

Rasch modeling effectively aligns student ability with item difficulty, but its accuracy depends on having enough items and responses. Adding more items, particularly of mid-range difficulty, would smooth out extremes and provide more reliable inferences about both student ability and item quality. In practice, instructors can construct adaptive assessments by assigning items that cluster around the mean item difficulty (β). Such targeting increases both measurement precision and instructional relevance (Boone et al., 2014).

However, learner ability was estimated with a simple alternating (joint) maximum likelihood routine with a ridge penalty rather than with marginal maximum likelihood or fully Bayesian methods. This introduces bias at the extremes, particularly for students who answered all items correctly or incorrectly. In this analysis, all questions were also assumed to be equally effective at distinguishing between students. In reality, some questions are better at separating strong from weak students than others (Embretson & Reise, 2013).
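One way to judge how far an estimation routine of this kind can be trusted is to run it on simulated responses whose true abilities and difficulties are known, and then check how well they are recovered. The sketch below is a compact, vectorised variant of alternating Newton updates with a ridge penalty, not the exact routine used on the classroom data. The simulated sample (300 learners, 20 items), the seed, and the name `fit_rasch` are all assumptions made for illustration, and the scale is anchored by shifting both parameter sets by the mean item difficulty so that θ – β is preserved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -35, 35)))

def fit_rasch(X, lam=0.25, sweeps=50):
    """Alternating ridge-penalised Newton steps for abilities and difficulties."""
    theta = np.zeros(X.shape[0])
    beta = np.zeros(X.shape[1])
    for _ in range(sweeps):
        # One Newton step for every ability, difficulties held fixed
        p = sigmoid(theta[:, None] - beta[None, :])
        grad = (X - p).sum(axis=1) - lam * theta
        hess = -(p * (1 - p)).sum(axis=1) - lam   # negative: log-likelihood is concave
        theta = np.clip(theta - grad / hess, -6, 6)
        # One Newton step for every difficulty, abilities held fixed
        p = sigmoid(theta[:, None] - beta[None, :])
        grad = (p - X).sum(axis=0) - lam * beta
        hess = -(p * (1 - p)).sum(axis=0) - lam   # also negative
        beta = np.clip(beta - grad / hess, -6, 6)
    # Anchor the scale at mean item difficulty 0; shifting both keeps theta - beta intact
    shift = beta.mean()
    return theta - shift, beta - shift

# Simulated data (hypothetical): known abilities and difficulties
rng = np.random.default_rng(0)
true_theta = rng.normal(0.0, 1.0, size=300)
true_beta = np.linspace(-1.5, 1.5, 20)
prob = sigmoid(true_theta[:, None] - true_beta[None, :])
X_sim = (rng.random(prob.shape) < prob).astype(int)

theta_hat, beta_hat = fit_rasch(X_sim)
r_beta = np.corrcoef(true_beta, beta_hat)[0, 1]
r_theta = np.corrcoef(true_theta, theta_hat)[0, 1]
print(f"difficulty recovery r = {r_beta:.2f}, ability recovery r = {r_theta:.2f}")
```

If the recovered difficulties track the true ones closely on simulated data but pile up at the clipping bounds on real data, that points to a problem with the data or the routine rather than to genuinely extreme items.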

Combining Bloom’s Taxonomy with Question Difficulty and Student Ability

While Rasch provides a statistical picture of ability and difficulty, it does not capture the cognitive demand of tasks. This is where Bloom’s Taxonomy can complement psychometric calibration. Items mapped to lower levels such as Remember and Understand are expected to cluster at lower logits, while those requiring higher-order skills such as Apply, Analyse, Evaluate and Create should align with higher logits. Integrating Bloom’s taxonomy into item design ensures that difficulty is not reduced merely to statistical rarity of correct responses but also reflects intended learning outcomes (Krathwohl, 2002).

Bloom’s Taxonomy Levels Mapped to Question Difficulty (Anderson, 2009)

Bloom’s Taxonomy provides a structured and hierarchical framework for defining, planning, and assessing educational goals, learning activities, and outcomes. It ensures that both teaching and assessment progress from basic recall to higher-order thinking skills like analysis, evaluation, and creation.

Step 1: Linking Cognitive Demand to Question Difficulty

Classify quiz questions into one of Bloom’s six levels based on the mental process required.

EXAMPLE

Q1 (β = -1.83) → simple recall → Remember.
Q2 (β = -1.11) → recognition with basic comprehension → Understand.
Q3 (β = 0.52) → applying a rule in a new situation → Apply.
Q4 (β = 0.91) → comparing alternatives or identifying patterns → Analyse.
Q5 (β = 1.51) → making a judgment with justification → Evaluate.

Step 2: Align Items with Rasch Difficulty (β)

Once items are tagged with Bloom’s levels, compare their placement on the Rasch difficulty scale.

Lower-order skills (Remember, Understand) generally appear at negative logits, confirming their accessibility.
Higher-order skills (Apply, Analyse, Evaluate) appear at positive logits, showing increased challenge.

This dual calibration ensures that item difficulty is understood both statistically (probability of success) and cognitively (type of thinking required).
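The comparison in this step can be sketched as a rank correlation between Bloom level and Rasch difficulty. The example below reuses the illustrative β values from Step 1; the helper `ranks` and the flagging rule (lower-order items should have negative β, higher-order items positive β) are assumptions drawn from the description above rather than an established procedure.

```python
import numpy as np

BLOOM_ORDER = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]
LOWER_ORDER = {"Remember", "Understand"}

# Illustrative items from Step 1: (Bloom level, Rasch difficulty in logits)
items = {
    "Q1": ("Remember", -1.83),
    "Q2": ("Understand", -1.11),
    "Q3": ("Apply", 0.52),
    "Q4": ("Analyse", 0.91),
    "Q5": ("Evaluate", 1.51),
}

def ranks(values):
    """Ranks 1..n; assumes no tied values, which holds for this toy example."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

levels = np.array([BLOOM_ORDER.index(level) for level, _ in items.values()])
betas = np.array([beta for _, beta in items.values()])

# Spearman rank correlation = Pearson correlation of the ranks
spearman_rho = np.corrcoef(ranks(levels), ranks(betas))[0, 1]

# Items whose difficulty sign disagrees with their Bloom band
misaligned = [qid for qid, (level, beta) in items.items()
              if (level in LOWER_ORDER) != (beta < 0)]

print(f"Spearman rho = {spearman_rho:.2f}, misaligned items: {misaligned}")
```

With real item banks, several items usually share a Bloom level, so a tie-aware rank correlation would be needed in place of the simple `ranks` helper.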

Step 3: Interpreting Student Ability with Bloom’s Levels

Each student’s Rasch ability estimate (θ) can now be interpreted in terms of Bloom’s levels:

Learners at θ ≈ -0.3 logits → suitable for Remember and Understand items (Q1, Q2).
Learners at θ ≈ 0.3–0.9 logits → appropriate for Apply and Analyse items (Q3, Q4).
Learners at θ > 1.5 logits → capable of tackling Evaluate level items (Q5).

This framework enables instructors to tailor assessments to individual needs, allowing students who require additional support to focus on foundational skills while providing more advanced learners with opportunities to engage in higher-order cognitive tasks.

Step 4: Adapting Questions to Student Ability

Using both β (difficulty) and Bloom’s tags:

Assign items that are within ±0.5 logits of a student’s θ for statistical fairness.
Within this range, ensure a balanced selection of Bloom’s levels to effectively scaffold learning.
A weaker learner might receive 2 “Remember” items and 1 “Understand” item.
A stronger student might receive 1 “Analyse” and 2 “Evaluate” items.

This approach avoids overloading learners with items either too easy or too cognitively demanding and promotes stepwise growth through Bloom’s hierarchy.
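The ±0.5 logit targeting rule can be sketched as a simple filter over a Bloom-tagged item bank. The bank below reuses the illustrative β values from Step 1, and `select_items` is a hypothetical helper, not part of any library.

```python
# Toy item bank: item id -> (Bloom level, Rasch difficulty in logits)
ITEM_BANK = {
    "Q1": ("Remember", -1.83),
    "Q2": ("Understand", -1.11),
    "Q3": ("Apply", 0.52),
    "Q4": ("Analyse", 0.91),
    "Q5": ("Evaluate", 1.51),
}

def select_items(theta, item_bank, window=0.5):
    """Return (item id, Bloom level) pairs within +/- window logits of theta."""
    return [(qid, level) for qid, (level, beta) in item_bank.items()
            if abs(beta - theta) <= window]

weaker = select_items(-1.5, ITEM_BANK)    # learner below the test average
stronger = select_items(1.2, ITEM_BANK)   # learner above the test average
mid_range = select_items(0.0, ITEM_BANK)  # learner at the centre of the scale

print("theta = -1.5:", weaker)
print("theta = +1.2:", stronger)
print("theta =  0.0:", mid_range)
```

In this toy bank no item lies within half a logit of θ = 0, which is the kind of gap that expanding the item bank with mid-range difficulties would close.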

Step 5: Reflect and Revise

After administering Bloom-tagged, Rasch-calibrated tests:

Review item fit and student performance.
Adjust Bloom’s classification if items do not function as expected. For example, a supposedly “Apply” item may prove so easy that it actually functions as an “Understand” item.
Expand the item bank.
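Reviewing item fit is often done with infit and outfit mean-square statistics, which compare observed responses with the probabilities the Rasch model predicts. The sketch below computes both on simulated data (an assumption for illustration: 2,000 learners, three well-behaved items, and a fourth item answered by pure guessing); the name `item_fit` is hypothetical. Values close to 1 are what the model expects, while clearly larger values signal noisy, poorly fitting items.

```python
import numpy as np

def item_fit(X, theta, beta):
    """Infit and outfit mean-squares per item under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    var = p * (1.0 - p)                              # model variance of each response
    resid_sq = (X - p) ** 2                          # squared residuals
    outfit = (resid_sq / var).mean(axis=0)           # unweighted mean of squared z-scores
    infit = resid_sq.sum(axis=0) / var.sum(axis=0)   # information-weighted version
    return infit, outfit

# Simulated responses (hypothetical parameters)
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=2000)
beta = np.array([-1.0, 0.0, 1.0, 0.0])
p_true = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
X = (rng.random(p_true.shape) < p_true).astype(int)
X[:, 3] = rng.integers(0, 2, size=2000)   # item 4: guessing, unrelated to ability

infit, outfit = item_fit(X, theta, beta)
for i, (a, b) in enumerate(zip(infit, outfit), start=1):
    print(f"Item {i}: infit = {a:.2f}, outfit = {b:.2f}")
```

Here the true parameters are plugged in directly; in practice the estimated θ and β would be used, and an item flagged this way is a candidate for rewording or reclassification.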

When combined with Bloom’s Taxonomy, MLE-based Rasch estimates can be used not only for statistical scaling but also for pedagogical mapping. Instructors can link estimated item difficulties to specific cognitive levels, pairing easier items with “Remembering” and harder items with “Analyzing” or “Creating”. This ensures that adaptive assessments support both fair measurement and deeper learning progression. In other words, coupling Rasch with Bloom’s Taxonomy ensures that difficulty is not treated as a purely statistical construct, but as one rooted in cognitive demand and learning progression.

References

Anderson, L. W. (Ed.). (2009). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives (Abridged ed., reprint). Longman.
Bond, T. (2015). Applying the Rasch Model. Routledge. https://doi.org/10.4324/9781315814698
Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch Analysis in the Human Sciences. Springer Netherlands. https://doi.org/10.1007/978-94-007-6857-4
Embretson, S. E., & Reise, S. P. (2013). Item Response Theory. Psychology Press. https://doi.org/10.4324/9781410605269
EUGENIO, B., & SILVIA, G. (2013). Unidimensionality in the Rasch model: How to detect and interpret. Statistica, 67(3) (2007), 253–261. https://doi.org/10.6092/ISSN.1973-2201/3508
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and Validating Test Items. Routledge. https://doi.org/10.4324/9780203850381
Kostikov, A. A., Vlasenko, K. V., Lovianova, I. V., Volkov, S. V., & Avramov, E. O. (2022). Rusch model-based knowledge assessment algorithm. Educational Dimension, 6, 40–54. https://doi.org/10.31812/educdim.4482
Krathwohl, D. R. (2002). A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2
Magno, C. (2009). Demonstrating the Difference between Classical Test Theory and Item Response Theory Using Derived Test Data. CSN: General Cognitive Social Science (Topic), 1.
Omer, H. (2017). Parental Vigilant Care: A Guide for Clinicians and Caretakers (1st ed.). Routledge. https://doi.org/10.4324/9781315624976
Tavakol, M., & Dennick, R. (2011). Making Sense of Cronbach’s Alpha. International Journal of Medical Education, 2, 53–55. https://doi.org/10.5116/ijme.4dfb.8dfd
Virzi, R., Bozzi, M., Costigliolo, M., & Zani, M. (2025, June 17). Comparison between Rasch analysis and Classical Test Theory for a Physics questionnaire validation. 11th International Conference on Higher Education Advances (HEAd’25). Eleventh International Conference on Higher Education Advances. https://doi.org/10.4995/HEAd25.2025.20062

Abhinash Jena

I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments and data visualization. My training approach emphasizes real-world application, clear interpretation of results and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.
