Ph.D. Thesis Colloquium: 102: CDS: 05 June 2026: “Towards Reliable Language Model Systems for Educational Assessment and Adaptive Learning”

When

5 Jun 26
10:00 AM - 11:00 AM

Event Type

DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES
Ph.D. Thesis Colloquium


Speaker: Ms. Nicy Scaria
S.R. Number: 06-18-01-10-12-22-1-21645
Title: “Towards Reliable Language Model Systems for Educational Assessment and Adaptive Learning”
Research Supervisor: Dr. Deepak Subramani
Date & Time : June 05, 2026 (Friday), 10:00 AM
Venue : #102, CDS Seminar Hall


ABSTRACT
Deploying language model systems in education requires more than fluent text generation or answer correctness. Such systems must support pedagogical alignment, curriculum relevance, reliable evaluation, sound reasoning, diagnosis of learner misconceptions and skill gaps, and transparent mechanisms for learner progression. These requirements become especially important in resource-constrained educational contexts, where teachers may have limited time to create high-quality assessments and learners may lack continuous expert feedback. This thesis investigates how language model systems can be designed, evaluated, and integrated into pedagogically grounded workflows for educational assessment and adaptive learning. The thesis is structured into three parts. The first part studies automated educational question generation using Large Language Models (LLMs), including curriculum-aligned question generation, Bloom’s taxonomy-based prompting, and structured knowledge-guided MCQ generation. The second part examines the reliability and educational utility of Small Language Models (SLMs) for reasoning and learner assessment. The final part develops and positions Learning in Blocks as a structured framework for personalized adaptive learning, integrating rubric-aligned assessment, diagnostic recommendation, spaced review, and mastery-based progression.

PART-I
Automated Educational Question Generation: The first part of the thesis focuses on Automated Educational Question Generation (AEQG) using LLMs. In many school systems, including Indian high-school social science education, assessments often emphasize rote memorization rather than higher-order cognitive skills. To address this limitation, we examine whether modern LLMs can generate curriculum-relevant and pedagogically sound questions across Bloom’s taxonomy. The work first studies question generation for the social science curriculum of an Indian state educational board and then extends the investigation to prompting strategies for generating questions across cognitive levels more broadly. Expert evaluation shows that LLMs can generate high-quality questions when provided with adequate context and instructions. However, the results also reveal variation across models of different sizes and show that automated evaluation is not yet on par with human expert judgment. These findings demonstrate that LLMs can support scalable assessment creation, but their use in educational assessment requires careful prompt design, pedagogical validation, and human-grounded evaluation.

Structured Knowledge-Guided MCQ Generation with Effective Distractors: The thesis further extends AEQG from general question generation to structured assessment design. High-quality MCQs must assess conceptual understanding, target different cognitive levels, and include plausible distractors that reflect common learner misconceptions. Existing automated approaches often struggle to incorporate such domain-specific misconceptions. To address this limitation, we develop a hierarchical concept map-based framework for generating MCQs in high-school physics. The framework represents major physics topics and their interconnections through a structured concept map, retrieves topic-relevant sections, and provides this context to an LLM for question and distractor generation. Expert and student evaluations show that the concept map-guided approach outperforms baseline methods and generates questions that more effectively assess conceptual understanding. These results demonstrate that reliable AEQG requires not only powerful language models, but also structured representations of domain knowledge and learner misconceptions.

PART-II
Reliability of Small Language Models for Educational Reasoning: The second part of the thesis focuses on the reliability and educational utility of SLMs. SLMs are attractive for education because they offer advantages in efficiency, privacy, cost, and deployability, but their usefulness depends on whether they can reason and evaluate learner performance reliably. In learning contexts, models that produce correct final answers through incorrect procedures may reinforce misconceptions and provide misleading feedback. To study this issue, we introduce PhysBench, a benchmark of high-school and AP-level physics questions with structured reference solutions, Bloom’s taxonomy annotations, and culturally contextualized variants. Using a stage-wise evaluation rubric, we assess SLM responses to examine reasoning reliability, failure modes, and robustness under contextual variations. The results show that many solutions with correct final answers still contain reasoning errors, demonstrating that answer accuracy alone is insufficient for evaluating educational AI systems.
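
The distinction between answer accuracy and reasoning reliability can be illustrated with a minimal Python sketch. It is not the PhysBench rubric: the stage names, the GradedResponse type, and the summarize function are hypothetical, and the per-stage judgments are assumed to come from some grader. It only shows how final-answer accuracy can overstate how often a solution is correct at every stage.

```python
from dataclasses import dataclass

# Hypothetical stage names; the abstract does not list the actual rubric stages.
STAGES = ("problem_setup", "principle_selection", "computation", "final_answer")


@dataclass
class GradedResponse:
    question_id: str
    stage_correct: dict  # stage name -> bool, assumed to be supplied by a grader


def summarize(responses):
    """Compare final-answer accuracy with correctness across all stages."""
    n = len(responses)
    answer_ok = [r for r in responses if r.stage_correct["final_answer"]]
    fully_ok = [r for r in answer_ok if all(r.stage_correct[s] for s in STAGES)]
    return {
        "answer_accuracy": len(answer_ok) / n,
        "fully_correct_rate": len(fully_ok) / n,
        # Correct final answer reached through at least one flawed stage.
        "correct_answer_flawed_reasoning": len(answer_ok) - len(fully_ok),
    }


def make(qid, setup, principle, computation, answer):
    """Helper for building illustrative graded responses."""
    values = (setup, principle, computation, answer)
    return GradedResponse(qid, dict(zip(STAGES, values)))


# Illustrative data only, not results from the thesis.
graded = [
    make("q1", True, True, True, True),
    make("q2", True, False, True, True),   # right answer, wrong principle
    make("q3", True, True, True, True),
    make("q4", True, True, False, False),
]
summary = summarize(graded)
```

In this toy data three of four final answers are correct, but only two solutions are correct at every stage, which is the kind of gap a stage-wise rubric is meant to expose.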

SLM-Based CEFR Speaking Assessment: The thesis further examines the potential of SLMs for automated learner assessment when adapted using high-quality, criterion-aligned data. In language learning, human evaluation of CEFR speaking assessments creates scalability challenges in e-learning environments. To address this problem, we develop EvalYaks, a family of instruction-tuned models for automated evaluation of CEFR B2 English speaking assessment transcripts. The work evaluates open-source and commercial language models for CEFR-aligned scoring, creates expert-validated synthetic conversational datasets, and uses parameter-efficient instruction tuning to adapt Mistral for speaking assessment, vocabulary-level identification and generation, and text-level identification and generation. EvalYaks achieves performance competitive with frontier models, and pilot validation on real-world learner transcripts verifies its transferability to practical assessment contexts. This work demonstrates that carefully adapted SLMs can support scalable language proficiency evaluation when trained with expert-validated, criterion-aligned data.

PART-III
Learning in Blocks for Personalized Adaptive Language Learning: The final part of the thesis develops Learning in Blocks, an adaptive learning framework that connects learner assessment with targeted review and mastery-based progression. In digital language learning, learners can often advance through quiz-based curricula despite persistent gaps in using grammar and vocabulary during interaction. To address this limitation, Learning in Blocks grounds progression in demonstrated conversational competence evaluated through CEFR-aligned rubrics. The framework uses heterogeneous multi-agent debate to evaluate Grammar, Vocabulary, and Interactive Communication, resolve conflicting judgments, and identify specific grammar skills and vocabulary topics for targeted review. Learners progress only after demonstrating mastery, while spaced review targets identified weaknesses to counter skill decay. Expert-annotated conversation benchmarks and a learner study show that combining rubric-aligned scoring, diagnostic recommendation, spaced review, and mastery-based progression improves learning outcomes.
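
As a rough illustration of mastery-based progression combined with spaced review, the following Python sketch gates advancement on rubric scores and schedules weak skills for review. It is not the Learning in Blocks implementation: the threshold value, the interval-doubling rule, and all names (LearnerState, record_assessment, complete_review) are assumptions, and the rubric scores are taken as given rather than produced by multi-agent debate.

```python
from dataclasses import dataclass, field

MASTERY_THRESHOLD = 0.8      # hypothetical cut-off on a 0-1 rubric scale
INITIAL_INTERVAL_DAYS = 1    # hypothetical first review interval


@dataclass
class LearnerState:
    block: int = 1
    review_due: dict = field(default_factory=dict)       # skill -> day index
    review_interval: dict = field(default_factory=dict)  # skill -> days


def record_assessment(state, scores, weak_skills, today):
    """Advance only if every rubric criterion meets the threshold.

    scores maps a criterion (e.g. "grammar") to a score in [0, 1].
    weak_skills lists diagnosed skills to schedule for review.
    """
    for skill in weak_skills:
        state.review_interval[skill] = INITIAL_INTERVAL_DAYS
        state.review_due[skill] = today + INITIAL_INTERVAL_DAYS
    mastered = all(score >= MASTERY_THRESHOLD for score in scores.values())
    if mastered:
        state.block += 1
    return mastered


def complete_review(state, skill, passed, today):
    """Double the interval after a successful review, reset it otherwise."""
    interval = state.review_interval.get(skill, INITIAL_INTERVAL_DAYS)
    interval = interval * 2 if passed else INITIAL_INTERVAL_DAYS
    state.review_interval[skill] = interval
    state.review_due[skill] = today + interval


# Illustrative usage with made-up scores.
state = LearnerState()
first = record_assessment(
    state,
    {"grammar": 0.9, "vocabulary": 0.6, "interactive_communication": 0.85},
    weak_skills=["past perfect"],
    today=0,
)
complete_review(state, "past perfect", passed=True, today=1)
second = record_assessment(
    state,
    {"grammar": 0.9, "vocabulary": 0.8, "interactive_communication": 0.85},
    weak_skills=[],
    today=3,
)
```

Here the learner stays in the first block after the initial assessment because one criterion falls below the threshold, and advances only after the second assessment meets it on every criterion.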

Learning in Blocks as a Design Pattern for Adaptive Learning Systems: The thesis concludes by positioning Learning in Blocks as a broader design pattern for responsible language-model-supported learning. Open-ended chatbot interfaces are flexible, but they can make it difficult to constrain system behavior, align interactions with curriculum goals, and connect learner activity to demonstrated skill performance. In contrast, Learning in Blocks organizes learning into blocks of target and prerequisite skills, where bounded pedagogical agents support assessment generation, assessment evaluation, diagnostic recommendation, spaced review, and mastery-based progression. Together, these components define Learning in Blocks as a transparent, auditable, and pedagogically aligned adaptive learning framework supported by language model systems.


ALL ARE WELCOME