LLM-Assisted Automatic Question Generation
Core Generation Techniques, Prompt Engineering, and Model Fine-Tuning
Focuses on the underlying implementation of LLM-based question generation, including the optimization of few-shot and chain-of-thought (CoT) prompting strategies, as well as fine-tuning (e.g., T5, Llama) and pipeline design to improve the structural quality and instruction-following ability of generated content. A minimal prompting sketch follows the reference list below.
- Leveraging Large Language Model for Automatic Translation of Educational Content: Exploring the Effectiveness of Curriculum-Aware Prompt Engineering(Euigyum Kim, Hyo Jeong Shin, 2025, Korean Educational Research Association)
- Exploring prompt pattern for generative artificial intelligence in automatic question generation(Lili Wang, Ruiyuan Song, Weitong Guo, Hongwu Yang, 2024, Interactive Learning Environments)
- Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models(M. Amini, Babak Ahmadi, Xi Xiong, Yilin Zhang, Christopher Qiao, 2025, arXiv.org)
- Large Language Model-based Pipeline for Item Difficulty and Response Time Estimation for Educational Assessments(Hariram Veeramani, Surendrabikram Thapa, Natarajan Balaji Shankar, Abeer Alwan, 2024, Workshop on Innovative Use of NLP for Building Educational Applications)
- Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education(Unggi Lee, Haewon Jung, Younghoon Jeon, Y.K. Sohn, Wonhee Hwang, Jewoong Moon, Hyeoncheol Kim, 2023, Education and Information Technologies)
- Automatic Large Language Models Creation of Interactive Learning Lessons(Jionghao Lin, Jiarui Rao, Yiyang Zhao, Yuting Wang, Ashish Gurung, Amanda Barany, Jaclyn L. Ocumpaugh, Ryan S. Baker, Ken Koedinger, 2025, ArXiv)
- Towards automatic question generation using pre-trained model in academic field for Bahasa Indonesia(Derwin Suhartono, Muhammad Rizki Nur Majiid, Renaldy Fredyan, 2024, Education and Information Technologies)
- Fine-Tuned T5 Transformer with LSTM and Spider Monkey Optimizer for Redundancy Reduction in Automatic Question Generation(R. Tharaniya sairaj, S. R. Balasundaram, 2024, SN Computer Science)
- Fine-Tuning a Large Language Model with Reinforcement Learning for Educational Question Generation(Salima Lamsiyah, Abdelkader El Mahdaouy, A. Nourbakhsh, Christoph Schommer, 2024, Lecture Notes in Computer Science)
- Optimizing Automated Question Generation for Educational Assessments(Sumayyah Alamoudi, Lama A. Al Khuzayem, Amani Jamal, 2025, Engineering, Technology & Applied Science Research)
- Hybrid NLP–Deep Learning Framework for Automatic MCQ Generation(V. Raju, Madri, Dr. V. Lokeswara Reddy, 2026, 2026 International Conference on AI-Driven Smart Systems and Ubiquitous Computing (ICAUC))
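To make the prompting strategies described above concrete, here is a minimal sketch of few-shot MCQ generation against an OpenAI-compatible chat API. It is an illustration under stated assumptions (the model name, prompt wording, and the single exemplar item are invented), not a reconstruction of any method in the papers above.

```python
# Minimal few-shot MCQ generation sketch. The model name, prompt wording,
# and the exemplar item are illustrative assumptions, not taken from the
# papers listed above.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single worked exemplar steers the model toward the desired item format.
EXEMPLAR = {
    "stem": "Which gas do plants primarily absorb during photosynthesis?",
    "options": ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
    "answer": "Carbon dioxide",
}

def generate_mcq(source_text: str, model: str = "gpt-4o-mini") -> dict:
    """Generate one multiple-choice item grounded in `source_text`."""
    messages = [
        {"role": "system",
         "content": ("You write one multiple-choice question per request. "
                     "Reply with JSON only: keys stem, options (4 strings), answer.")},
        {"role": "user",
         "content": "Source: Photosynthesis converts carbon dioxide and water into glucose."},
        {"role": "assistant", "content": json.dumps(EXEMPLAR)},
        {"role": "user", "content": f"Source: {source_text}"},
    ]
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0.7
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    item = generate_mcq("Newton's second law states that force equals mass times acceleration.")
    print(json.dumps(item, indent=2))
```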
Knowledge Enhancement, RAG, and Multimodal Question Generation Frameworks
Examines how retrieval-augmented generation (RAG), knowledge graphs, and external corpora are used to mitigate hallucination and ensure the factual accuracy of generated items, extending to multimodal settings such as video and images. A minimal retrieval sketch follows the reference list below.
- Automatic Question Generation for Intuitive Learning Utilizing Causal Graph Guided Chain of Thought Reasoning(Nicholas X. Wang, Neel V. Parpia, Aaryan D. Parikh, A. Katsaggelos, 2025, 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR))
- Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains(Subhankar Maity, Aniket Deroy, Sudeshna Sarkar, 2024, Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation)
- A Transformer-Based Framework for Automated Content Retrieval and Dynamic Response Generation: A Pedagogical Advancement(Aaditya K. Singh, Mehul Lamba, Maadhav Lal, Rudresh Dwivedi, Ruchika Sharma, 2025, IETE Journal of Education)
- Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI(Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su, 2025, arXiv.org)
- An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education(Ramteja Sajja, Y. Sermet, Ibrahim Demir, 2025, arXiv.org)
- GISedu-GPT: a large language model framework with prior knowledge for GIS education question bank generation(Zhiyun Wang, Yifan Zhang, Wen Min, Qingfeng Guan, Wenhao Yu, 2025, Journal of Geography in Higher Education)
- Development and evaluation of a retrieval-augmented large language model framework for enhancing endodontic education(Xiaowei Xu, Siyi Liu, Lin Zhu, Yunzi Long, Yin Zeng, Xudong Lu, Jiao Li, Yanmei Dong, 2025, International Journal of Medical Informatics)
- Automatic Question Generation with Knowledge Graph for Panoramic Learning(Fumika Okuhara, S. Egami, Y. Sei, Yasuyuki Tahara, Akihiko Ohsuga, 2024, 2024 21st International Conference on Information Technology Based Higher Education and Training (ITHET))
- Multimodal Quiz Generation via RAG with LLM-as-Judge Evaluation(M. T. Kunuku, N. Dehbozorgi, 2025, 2025 IEEE Frontiers in Education Conference (FIE))
- Beyond Static Question Banks: Dynamic Knowledge Expansion via LLM-Automated Graph Construction and Adaptive Generation(Yingquan Wang, Tianyu Wei, Qinsi Li, Li Zeng, 2026, arXiv.org)
- Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents(Erick Tyndall, Colleen Gayheart, Alexandre Some, Joseph Genz, Torrey Wagner, Brent Langhals, 2025, Data & Policy)
- Automatic Question Generation from Youtube Lectures using Deep Learning(Himanshu Jasuja, Ujjwal Negi, Vibhav, Gull Kaur, 2024, 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT))
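As a concrete illustration of the retrieval step that the RAG-based systems above share, the sketch below embeds a small corpus, retrieves the passages most similar to a topic query by cosine similarity, and assembles them into a generation prompt. The sentence-transformers model and toy passages are assumptions for illustration only.

```python
# Minimal retrieval step for RAG-style question generation. The library
# choice (sentence-transformers), the embedding model, and the toy corpus
# are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

# Toy course corpus; in practice these would be textbook or lecture chunks.
corpus = [
    "The mitochondrion is the site of aerobic respiration and ATP synthesis.",
    "Photosynthesis in chloroplasts converts light energy into chemical energy.",
    "Osmosis is the diffusion of water across a semi-permeable membrane.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(corpus, convert_to_tensor=True)

def build_prompt(topic: str, k: int = 2) -> str:
    """Retrieve the k passages most similar to `topic` and wrap them in a QG prompt."""
    query_emb = model.encode(topic, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    top_idx = scores.argsort(descending=True)[:k]
    context = "\n".join(corpus[int(i)] for i in top_idx)
    return (
        "Using ONLY the context below, write one exam question and its answer.\n"
        f"Context:\n{context}\n"
        f"Topic: {topic}"
    )

print(build_prompt("cellular energy production"))
```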
Quality Verification, Psychometric Evaluation, and Difficulty Prediction
Applies item response theory (IRT), Rasch models, and simulated-student techniques to automatically analyze and validate the reliability, validity, difficulty, item-writing flaws, and discrimination of generated items. A worked simulation example follows the reference list below.
- Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus(Thiago Brant, Julien Kühn, Jun Pang, 2026, arXiv.org)
- Synthetic Student Responses: LLM-Extracted Features for IRT Difficulty Parameter Estimation(Matias Hoyl, 2026, arXiv.org)
- Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation: A Feasibility Study(Guher Gorgun, Okan Bulut, 2024, Educational Measurement: Issues and Practice)
- ExamQ-Gen: Instructor-in-the-Loop Generation of Self-Contained Exam Questions from Course Materials and Decision-Support Grading(Catalin Anghel, Emilia Pecheanu, A. Anghel, M. Craciun, A. Cocu, 2026, Computers)
- Evaluating and Validating Large Language Models for Health Education on Developmental Dysplasia of the Hip: 2-Phase Study With Expert Ratings and a Pilot Randomized Controlled Trial(Ouyang Hui, Gan Lin, Yiyuan Li, Zhixin Yao, Yating Li, Han Yan, Fang Qin, Jinghui Yao, Yun Chen, 2026, Journal of Medical Internet Research)
- Exploring Large Language Models for Evaluating Automatically Generated Questions(Jeffrey S. Dittel, Michelle W. Clark, R. V. Campenhout, Benny G. Johnson, 2024)
- Automatic Multiple-Choice Question Generation and Evaluation Systems Based on LLM: A Study Case With University Resolutions(S. S. Mucciaccia, T. M. Paixão, F. Mutz, C. Badue, A. F. D. Souza, Thiago Oliveira-Santos, 2025, International Conference on Computational Linguistics)
- LLM-Simulated Nonequivalent Groups With Anchor Test: A Novel Approach for Test Equating in the Absence of Traditional Anchor Items(Junlei Du, Yishen Song, Qinhua Zheng, 2026, IEEE Transactions on Learning Technologies)
- The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory(Robin Schmucker, Steven Moore, 2025, arXiv.org)
- Docimological Quality Analysis of LLM-Generated Multiple Choice Questions in Computer Science and Medicine(Christian Grévisse, M. A. S. Pavlou, Jochen G Schneider, 2024, SN Computer Science)
- Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education(Maram Elzayyat, Janatul Naeim Mohammad, S. Zaqout, 2025, Medical Education Online)
- SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction(Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan, 2025, Conference on Empirical Methods in Natural Language Processing)
- Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?(Andreas Säuberli, Diego Frassinelli, Barbara Plank, 2025, Workshop on Innovative Use of NLP for Building Educational Applications)
- QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation(Bang Nguyen, TingTing Du, Mengxia Yu, Lawrence Angrave, Meng Jiang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction(Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou, 2025, arXiv.org)
- STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation(Aniket Deroy, Subhankar Maity, 2025, arXiv.org)
- Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations(Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger, 2026, arXiv.org)
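Much of the psychometric work above rests on the two-parameter logistic (2PL) IRT model, where the probability of a correct response is P = 1 / (1 + exp(-a(theta - b))) for ability theta, discrimination a, and difficulty b. The sketch below simulates responses under assumed item parameters and recomputes classical difficulty and discrimination, loosely mirroring how simulated-student pipelines sanity-check generated items; all numbers are illustrative.

```python
# 2PL response simulation and classical item statistics (illustrative;
# all item parameters and sample sizes are assumed values).
import numpy as np

rng = np.random.default_rng(0)

def p_correct(theta, a, b):
    """2PL probability of a correct response: 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Assumed item bank: (discrimination a, difficulty b) per generated item.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
theta = rng.normal(0.0, 1.0, size=2000)  # simulated student abilities

# Simulate dichotomous responses for every student on every item.
responses = np.column_stack([
    (rng.random(theta.size) < p_correct(theta, a, b)).astype(int) for a, b in items
])

total = responses.sum(axis=1)
for j, (a, b) in enumerate(items):
    p_value = responses[:, j].mean()          # classical difficulty (proportion correct)
    rest_score = total - responses[:, j]      # rest score avoids self-correlation
    r_pb = np.corrcoef(responses[:, j], rest_score)[0, 1]  # discrimination proxy
    print(f"item {j}: a={a}, b={b}, difficulty={p_value:.2f}, discrimination={r_pb:.2f}")
```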
Domain-Specific Customization and Cross-Lingual Applications
Focuses on vertical domains such as medicine, STEM, programming, and language teaching, studying the accuracy of domain knowledge and localized item-generation techniques for different language environments. A template-based toy example follows the reference list below.
- Generative AI Use in Dental Education: Efficient Exam Item Writing.(Margeaux C. Johnson, A. P. Ribeiro, Tiffany M Drew, P. R. Pereira, 2023, Journal of Dental Education)
- Language Assessment Using Word Family-Based Automated Item Generation: Evaluating Item Quality Using Teacher Ratings(S. Marandi, S. Hosseini, 2024, WorldCALL Official Conference Proceedings)
- AI-powered automated item generation for language testing(Dongkwang Shin, Jang Ho Lee, 2024, ELT Journal)
- Automatic Question Generation for Spanish Textbooks: Evaluating Spanish Questions Generated with the Parallel Construction Method(Benny G. Johnson, Rachel Van Campenhout, Bill Jerome, Maria Fernanda Castro, ro Bistolfi, Jeffrey S. Dittel, 2024, International Journal of Artificial Intelligence in Education)
- Automatic item generation in various STEM subjects using large language model prompting(K. W. Chan, Farhan Ali, Joonhyeong Park, Kah Shen Brandon Sham, Erdalyn Yeh Thong Tan, Francis Woon Chien Chong, Kun Qian, Guan Kheng Sze, 2024, Computers and Education: Artificial Intelligence)
- Math Multiple Choice Question Generation via Human-Large Language Model Collaboration(Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan, 2024, Educational Data Mining)
- Automated Multilingual Translation of Exam Question Papers Using Generative AI(S. Venkatraman, Sumneet Kaur Bamrah, D. Pushgara Rani, 2025, 2025 International Conference on Computing and Communication Technologies (ICCCT))
- Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI's gpt2 Transformer Model(M. Davier, 2019, arXiv.org)
- Artificial intelligence in radiology examinations: a psychometric comparison of question generation methods.(E. Emekli, B. N. Karahan, 2025, Diagnostic and Interventional Radiology)
- Research on the Construction of Medical Critical Thinking Assessment Gauge Driven by Generative AI(Liang Ying, Zixun Dai, Xiaoqing Qiu, Z. Ouyang, Yuzhu Pan, Xin Fang, Jiahe Li, Yutong Sun, Xiaona Guan, 2025, Academic Journal of Management and Social Sciences)
- Transforming Children's Python Turtle Graphics Learning with LLM Technology: A Design Proposal(Mondheera Pituxcoosuvarn, Yohei Murakami, 2024, 2024 9th International STEM Education Conference (iSTEM-Ed))
- Programming Assessment in E-Learning through Rule-Based Automatic Question Generation with Large Language Models(Halim Teguh Saputro, U. Nurhasan, Vivi Nur Wijayaningrum, 2025, Journal of Applied Informatics and Computing)
- Evaluation of automated vocabulary quiz generation with VocQGen(Qiao Wang, Ralph L. Rose, Ayaka Sugawara, Naho Orita, 2025, Vocabulary Learning and Instruction)
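Several entries in this group (for example, the word-family AIG and VocQGen studies) rely on template-driven generation for language testing. The sketch below fills a cloze template and draws distractors from the same word family; it is a toy example under assumed word lists, not a reproduction of those tools.

```python
# Template-based vocabulary cloze item with word-family distractors.
# Word lists, templates, and the helper are invented for illustration and
# do not reproduce any published tool.
import random

random.seed(42)

# Toy word-family data: root plus derived forms usable as distractors.
WORD_FAMILIES = {
    "decide": ["decision", "decisive", "decidedly"],
    "create": ["creation", "creative", "creatively"],
}

SENTENCE_TEMPLATES = {
    "decision": "After weeks of discussion, the committee finally reached a ____.",
    "creation": "The exhibit traces the ____ of the first printing press.",
}

def make_cloze_item(target: str, family_root: str) -> dict:
    """Build one cloze MCQ whose distractors come from the same word family."""
    distractors = [family_root] + [w for w in WORD_FAMILIES[family_root] if w != target]
    options = distractors[:3] + [target]
    random.shuffle(options)
    return {"stem": SENTENCE_TEMPLATES[target], "options": options, "answer": target}

print(make_cloze_item("decision", "decide"))
print(make_cloze_item("creation", "create"))
```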
Automated Scoring, Personalized Feedback, and Instructional Scaffolding
Studies downstream applications of item-generation technology, including scoring open-ended responses, generating immediate feedback hints and reflective questions, and acting as pedagogical agents that support self-directed learning. A minimal scoring sketch follows the reference list below.
- Designing Answer-Aware LLM Hints to Scaffold Deeper Learning in K–12 Programming Education(Sahana Bhaskar, Sally Hamouda, 2025, Proceedings of the 2025 ACM Conference on International Computing Education Research V.2)
- EXPLANATION-BASED AUTOMATED ASSESSMENT OF OPEN ENDED LEARNER RESPONSES(V. Rus, 2018, eLearning and Software for Education)
- AI-Powered Narrative Generation for Personalized Learning in Primary Schools(Oualid Ali, Karrar Abbas Yousif, Gulsanam Tillayeva, Mustafa M. Abd Zaid, Bhaskaruni Harini, Priyanka Priyadarshini, Zafar Allayev, 2025, 2025 International Conference on AI-Driven STEM Education and Learning Technologies (AISTEMEDU))
- Semantic analysis of test responses using synthetic data generation(B. Polyakov, 2025, Modelling and Data Analysis)
- GPT-3-Driven Pedagogical Agents to Train Children’s Curious Question-Asking Skills(Rania Abdelghani, Yen-Hsiang Wang, Xingdi Yuan, Tong Wang, H'elene Sauz'eon, Pierre-Yves Oudeyer, 2022, International Journal of Artificial Intelligence in Education)
- Bringing Interactive Learning to Industrial IDEs: Kotlin Notebook and LLM-Generated Exercises(Daniil Karol, Ksenia Shneyveys, Roman Belov, Anastasiia Birillo, 2026, Proceedings of the 57th ACM Technical Symposium on Computer Science Education V.2)
- Integrating LLM Usage in Gamified Systems(Carlos J. Costa, 2025, WSEAS TRANSACTIONS ON MATHEMATICS)
- LLM-Driven Learner Modeling and Personalized Learning Pathways: A Closed-Loop Framework and Engineering Design for Virtual Laboratories(Ruijie Wang, Guangtao Xu, 2025, 2025 International Conference on Educational Technology Management (ICETM))
- Filling the Gap: LLMs as Scaffolds for Competency Question Instantiation(Clare McNamara, Lucy Hederman, Declan O'Sullivan, 2026, Proceedings of the 31st International Conference on Intelligent User Interfaces)
- Semi-automatic Construction of Bidirectional Dialogue Dataset for Dialogue-Based Reading Comprehension Tutoring System Using Generative AI(Sung-Kwon Choi, Jin-Xia Huang, Oh-Woog Kwon, 2024, Lecture Notes in Computer Science)
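A recurring mechanism in this group is scoring an open-ended answer by its semantic similarity to a reference answer and returning a hint when the score falls below a threshold. The sketch below is a minimal version of that idea; the embedding model, threshold, and hint text are assumptions, not taken from any cited system.

```python
# Similarity-based scoring of an open-ended response with a fallback hint.
# The embedding model, threshold, and hint text are arbitrary illustrations.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_response(student_answer: str, reference_answer: str,
                   hint: str, threshold: float = 0.6) -> dict:
    """Score by cosine similarity to the reference; return a hint if it falls short."""
    emb = model.encode([student_answer, reference_answer], convert_to_tensor=True)
    similarity = float(util.cos_sim(emb[0], emb[1]))
    feedback = "Looks good." if similarity >= threshold else hint
    return {"score": round(similarity, 2), "feedback": feedback}

print(score_response(
    student_answer="Plants make food using sunlight.",
    reference_answer="Photosynthesis converts light energy into chemical energy stored in glucose.",
    hint="Say what form the stored energy takes and where the process happens.",
))
```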
Curriculum Alignment, Human-AI Collaboration, and Educational Ethics Frameworks
Examines how to align AI-generated content with curriculum standards (e.g., Bloom's taxonomy), and analyzes teachers' perceptions of AI tools, human-AI collaboration patterns, and ethical challenges such as algorithmic bias and cheating prevention. A coarse alignment heuristic is sketched after the reference list below.
- Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions(Farzan Karimi-Malekabadi, Pooya Razavi, Sonya J. Powers, 2025, arXiv.org)
- Scaling Up Mastery Learning with Generative AI: Exploring How Generative AI Can Assist in the Generation and Evaluation of Mastery Quiz Questions(Stephen Hutt, Grayson Hieb, 2024, Proceedings of the Eleventh ACM Conference on Learning @ Scale)
- Educational engineering in light of perceptual invariance theory: Semantic noise elimination and universal mathematical language construction(N. Demirkuş, 2026, World Journal of Advanced Engineering Technology and Sciences)
- Towards More Effective Automatic Question Generation: A Hybrid Approach for Extracting Informative Sentences(Engy Yehia, N. Hassan, Sayed AbdelGaber, 2025, International Journal of Advanced Computer Science and Applications)
- Enhancing AI-Driven Education: Integrating Cognitive Frameworks, Linguistic Feedback Analysis, and Ethical Considerations for Improved Content Generation(Antoun Yaacoub, Sansiri Tarnpradab, Phattara Khumprom, Z. Assaghir, Lionel Prevost, Jérôme Da Rugna, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- PromptHive: Bringing Subject Matter Experts Back to the Forefront with Collaborative Prompt Engineering for Educational Content Creation(Mohi Reza, Ioannis Anastasopoulos, Shreya Bhandari, Zach A. Pardos, 2024, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- TutorCraftEase: Enhancing Pedagogical Question Creation with Large Language Models(Wenhui Kang, Lin Zhang, Xiaolan Peng, Hao Zhang, An Li, Mengyao Wang, Jin Huang, Feng Tian, Guozhong Dai, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- A survey study on pre-service teachers’ perceptions of AI generated texts(Hyekyung Jung, Yongsang Lee, Dongkwang Shin, 2022, The Korean Society of Bilingualism)
- LLM Cheat Prevention via Adversarial Question Paraphrasing(B. Balaji, M. D. Reddy, P. Pavankumar, A. Munna, Shaik Riyaz, R.Vijay Sai, 2026, 2026 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE))
- Revolutionizing Assessment: Leveraging ChatGPT for Automated Item Generation: An AI Driven Exploratory Study with EFL Teachers(Ahmad A. Alsagoafi, Hanan S. Alomran, 2025, World Journal of English Language)
- The Impact of ChatGPT on Language Assessment in ELT(Gülden Tüm, 2026, Sınırsız Eğitim ve Araştırma Dergisi)
- Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG(Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri, 2025, arXiv.org)
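Curriculum-alignment work in this group often begins with a coarse check of which Bloom level a generated question targets. The sketch below uses a simple verb-to-level lookup as a first-pass filter; the verb lists are rough illustrative assumptions, and the systems cited above typically rely on classifiers or expert review instead.

```python
# Coarse Bloom's-taxonomy tagging by cue verb (illustrative heuristic only;
# the verb lists are rough and real systems use classifiers or expert review).
BLOOM_VERBS = {
    "remember":   {"define", "list", "name", "state", "identify"},
    "understand": {"explain", "summarize", "describe", "classify"},
    "apply":      {"solve", "use", "calculate", "demonstrate"},
    "analyze":    {"compare", "contrast", "differentiate", "examine"},
    "evaluate":   {"justify", "critique", "assess", "argue"},
    "create":     {"design", "construct", "propose", "compose"},
}

def tag_bloom_level(question: str) -> str:
    """Return the first Bloom level whose cue verb appears in the question stem."""
    words = {w.strip("?.,'\"").lower() for w in question.split()}
    for level, verbs in BLOOM_VERBS.items():
        if words & verbs:
            return level
    return "unclassified"

for q in [
    "Define the term 'osmosis'.",
    "Compare aerobic and anaerobic respiration.",
    "Design an experiment on the effect of light intensity on photosynthesis.",
]:
    print(f"{tag_bloom_level(q):>12} | {q}")
```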
Taken together, the groupings outline the complete ecosystem of LLM-assisted automatic question generation: starting from foundational prompt engineering and fine-tuning techniques; ensuring content accuracy and diversity through retrieval augmentation (RAG) and multimodal methods; moving into psychometrics-centered quality verification so that items have scientifically sound difficulty and discrimination; extending, at the application layer, to domain-specific customization, automated scoring, and personalized scaffolding; and finally grounding the technology in broader educational governance and ethical oversight through curriculum alignment and human-AI collaboration frameworks.
A total of 136 related references. Abstracts of selected papers are reproduced below.
Intuitive learning plays a vital role in building deep conceptual understanding, particularly in STEM education, where students often grapple with abstract and interdependent ideas. Automatic question generation has emerged as an effective strategy to support personalized and adaptive learning. However, its effectiveness is limited by hallucinations in large language models (LLMs), which can produce factually incorrect, ambiguous, or pedagogically inconsistent questions. To address this challenge, we propose a novel framework that combines causal-graph-guided Chain-of-Thought (CoT) reasoning with a multi-agent LLM architecture to ensure the generation of accurate, meaningful, and curriculum-aligned questions. In this approach, causal graphs offer an explicit representation of domain knowledge, while CoT reasoning enables structured, step-by-step traversal through related concepts. Dedicated LLM agents handle specific tasks such as graph pathfinding, reasoning, validation, and output, all operating under domain constraints. A dual validation mechanism-at both the conceptual and output stages-substantially reduces hallucinations. Experimental results show up to a 70% improvement in quality over reference methods and yielded highly favorable outcomes in subjective evaluations.
This study develops an evaluation instrument for Python programming using a Rule-Based Automatic Question Generation (AQG) system integrated with Large Language Models (LLMs), designed based on the Revised Bloom’s Taxonomy. The urgency of this research stems from the limitations of conventional programming evaluations, which are often time-consuming, less objective, and insufficiently aligned with cognitive learning levels. The proposed method applies assessment terms as rule-based constraints to guide LLM-generated questions, ensuring both pedagogical validity and structural consistency in JSON format. A total of 91 questions were produced, consisting of multiple-choice and coding items, which were then validated by three programming experts and tested on 32 vocational students. The findings indicate that the instrument achieved an overall validity of 77.66% (valid category), with the highest accuracy at the Apply (96.30%) and Create (100%) levels. The reliability test using Cronbach’s Alpha yielded 0.721, showing acceptable internal consistency. Item difficulty analysis revealed a strong dominance of easy questions (97.78%), with only 2.22% classified as moderate and none as difficult. Student performance also showed a fluctuating pattern: high in Remember (94.79%), Understand (95.83%), and Create (95.60%), but lower in Apply (86.11%), Analyze (90.97%), and Evaluate (87.15%). These results confirm that integrating Rule-Based AQG with LLMs can produce valid, reliable, and adaptive evaluation instruments that not only capture basic programming competencies but also partially address higher-order cognitive skills. This research contributes both practically by providing educators with an efficient tool for generating evaluation items and academically by enriching the growing body of literature on AI-assisted assessment in programming education.
High-quality question generation is crucial for ensuring the fairness and validity of examinations. To address the challenges of data scarcity and semantic complexity in automatic question generation (AQG) for niche subjects, including the arts, this study develops a domain-specific large language model (LLM) with a three-tiered optimization mechanism, incorporating prompt tuning, knowledge enhancement, and data augmentation. The model's effectiveness was validated through a case study conducted on a calligraphy course. The results showed that the generated questions achieved a usability rate of 91%, whereas the proposed data augmentation strategy expanded the question bank by 132.56%. This work provides both technical solutions and practical reference for automatic question generation methods targeting niche disciplines. The key contributions of this study encompass the creation of an innovative three-tiered optimization framework, the effective integration of external domain knowledge, and an iterative data augmentation approach that enhances question generation for niche subjects. This research offers a technological pathway and serves as a valuable reference for AQG in niche disciplines.
Large language models (LLMs) have significantly advanced smart education in the artificial general intelligence era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: 1) customized generation: generating niche-targeted teaching content based on students' varying learning abilities and states and 2) intelligent optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single large LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM-based multiagent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill-Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students' knowledge levels and learning abilities. In addition, we introduce the CIDDP, an LLM-based 5-D evaluation module encompassing Clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework.
Assessment is an essential part of education, both for teachers who assess their students as well as learners who may evaluate themselves. Multiple-choice questions (MCQ) are one of the most popular types of knowledge assessment, e.g., in medical education, as they can be automatically graded and can cover a wide range of learning items. However, the creation of high-quality MCQ items is a time-consuming task. The recent advent of Large Language Models (LLM), such as Generative Pre-trained Transformer (GPT), caused a new momentum for automatic question generation solutions. Still, evaluating generated questions according to the best practices for MCQ item writing is needed to ensure docimological quality. In this article, we propose an analysis of the quality of LLM-generated MCQs. We employ zero-shot approaches in two domains, namely computer science and medicine. In the former, we make use of 3 GPT-based services to generate MCQs. In the latter, we developed a plugin for the Moodle learning management system that generates MCQs based on learning material. We compare the generated MCQs against common multiple-choice item writing guidelines. Among the major challenges, we determined that while LLMs are certainly useful in generating MCQs more efficiently, they sometimes create broad items with ambiguous keys or implausible distractors. Human oversight is also necessary to ensure instructional alignment between generated items and course contents. Finally, we propose solutions for AQG developers.
The paper introduces an intelligent system for educational enhancement that integrates two key modules: a Question and Answer (QnA) module and a novel Feedback Generation module. We create a robust automatic content retrieval and response generation framework using Retrieval Augmented Generation (RAG) and transformer-based models, specifically OpenAI GPT-3.5. The QnA module dynamically retrieves relevant content from documents through cosine similarity and produces answers aligned with educational material. Meanwhile, the Feedback Generation module is designed to handle subjective responses and evaluate students' answers by comparing them with LLM-generated responses via cosine similarity. This comparison yields a performance score for the student's response, supplemented by specific feedback that highlights strengths and areas for improvement. Our approach bridges the gap in current automated grading systems providing a scalable and adaptable solution for personalized learning in diverse educational contexts. This system is particularly beneficial for institutions managing extensive student cohorts, offering real-time, individualized feedback to enhance student engagement and learning outcomes. Results demonstrate the effectiveness of our system with the QnA module achieving high cosine similarity scores of 0.87 for theory questions and 0.81 for numerical when compared with a solution manual. The Feedback Generation module exhibited a strong correlation (r = 0.92) with professor-assigned marks validating its alignment with human evaluations, this empirical validation involved 150 student responses across diverse problem types in the Computer Architecture course. These results highlight the robustness and accuracy of our approach in real-world educational scenarios.
Reliable evaluation of large language models (LLMs) for educational use requires benchmarks that reflect exam constraints, instructor grading practices, and the operational consequences of thresholded decisions. This paper introduces ExamQ-Gen, an instructor-in-the-loop benchmark that couples two tasks: (i) an LLM answering university-style exam questions and (ii) decision-support grading aligned with an instructor reference. Automatic grading is used for triage and feedback; in practice, ExamQ-Gen supports instructor-led exam authoring and provides grading recommendations, while the instructor issues the final grade and pass/fail decision. ExamQ-Gen is constructed from the course content by using an LLM to generate exam-style questions directly from the lecture materials, producing a course-derived question set suitable for controlled experimentation. The benchmark then instantiates contrasting exam conditions, including instructor-authored (HUMAN) versus pipeline-generated (PIPELINE) artifacts, to evaluate robustness under distribution shifts that can occur when exam questions and answers are produced through different generation workflows. Using two LLM “students” (Llama3-8B-Instruct and Mistral-7B-Instruct) and an LLM-based grader, we compare automatic grading against an instructor reference on a 1–10 score scale and at the decision level induced by the operational pass policy (pass if score ≥ 9). Accordingly, our conclusions are conditioned on the two evaluated student models. Score-level agreement is strong under HUMAN conditions but degrades substantially under PIPELINE conditions, indicating condition-dependent stability. At the pass threshold, decision errors are highly asymmetric, with false fails dominating false passes, meaning that conservative grading may appear safe while producing credit denial. A severity-focused analysis isolates a high-stakes failure mode—denial of instructor-perfect answers—and shows that, in the most affected PIPELINE condition, the perfect-pass miss rate reaches 0.926 (50/54), consistent with systematic conservatism rather than borderline noise. Overall, the results highlight that aggregate score agreement and accuracy are insufficient for instructor-controlled exam deployment and motivate reporting practices that combine disaggregated score agreement, threshold-based error asymmetry with uncertainty, and severity-aware diagnostics under exam-relevant condition shifts.
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using LLMs to author Intelligent Tutoring Systems. A common pitfall of using LLMs as tutors is their straying from desired pedagogical strategies such as leaking the answer to the student, and in general, providing no guarantees on the validity or appropriateness of the tutor assistance. We argue that while LLMs with certain guardrails can take the place of subject experts, the overall pedagogical design still needs to be handcrafted for the best learning results. Based on this principle, we create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a predefined finite state transducer. This approach retains the structure and the pedagogy of traditional tutoring systems that has been developed over the years by learning scientists but brings in additional flexibility of LLM-based approaches. Through a human evaluation study on two datasets with math word problems, we show that our hybrid approach achieves a better overall tutoring score than an instructed, but otherwise free-form, GPT-4. MWPTutor is completely modular and opens up the scope for the community to improve its performance by refining its individual modules or using different teaching strategies that it can follow.
Informative Sentence Extraction (ISE) is one of the crucial components in Automatic Question Generation (AQG) and directly influences the quality and relevancy of the generated questions. Instructional texts often contain not only informative but also irrelevant sentences. This results in the creation of poor-quality or distorted questions when irrelevant, non-informative sentences have been used as input. Therefore, the basic problem discussed in this paper is how to provide a systematic method for filtering out such sentences and retaining those that are pedagogically valuable. The purpose of ISE is to filter out irrelevant, low-quality information and retain only sentences that are factually dense, express key concepts, and are contextually significant. This paper proposes a hybrid approach for extracting informative sentences that combines lexical, statistical, and semantic criteria to identify informative sentences suitable for generating educational questions. The proposed approach consists of two modules: the first module employs four techniques to evaluate the informativeness of sentences, namely keyword-based scoring, Named Entity Recognition (NER), information gain (IG), and Sentence-BERT (SBERT). The second module utilizes multiple fusion strategies to integrate the results derived from the informative sentence extraction techniques. The preprocessed sentences extracted from educational materials were ranked and filtered based on their informativeness coverage. The evaluation results indicate that the hybrid approach improves the extraction of informative sentences compared with using individual methods. Such a contribution is important for enhancing the performance of downstream tasks in AQG systems, such as distractor generation and question formulation.
This project introduces an innovative approach to advance the field of automatic question generation using natural language processing (NLP), with a specific focus on Bloom's Taxonomy. With the increasing availability of resources and online learning platforms, there is a need for efficient methods to create diverse and contextually relevant questions. The main goal of this project is to develop a system that can automatically generate questions using NLP techniques aligned with the first three cognitive levels of Bloom's Taxonomy: remembering, understanding, and applying. This project will contribute to the field of NLP by providing a framework for automatic question generation. The project follows several stages: preprocessing the input text, identifying concepts and information, creating question rules, and generating different versions of questions based on these rules. It utilizes NLP techniques such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, syntactic analysis, and discourse analysis. The overarching goal is to provide educators, content creators, and learners with an efficient and intelligent tool for generating questions that enhance comprehension and critical thinking. By automating this process, the project seeks to save time and effort while improving the overall learning and assessment experience.
The increasing workload of educators, particularly in manual question creation, poses a significant challenge in modern education. Manual question creation demands time, effort, and a deep understanding of the material to ensure contextual and curriculum-aligned questions. To address this, an Automatic Question Generation (AQG) system was developed using extractive summarization combined with the PEGASUS and TextRank methods. The system leverages Natural Language Processing (NLP) and transformer-based large language models (LLMs) to efficiently generate relevant questions. The primary data source for this system was digital social studies (IPS) books from the Indonesian Ministry of Education. The evaluation was conducted using ROUGE Score metrics and human assessments. ROUGE analysis yielded an average F1 score of 0.87 (ROUGE-1), 0.83 (ROUGE-2), and 0.84 (ROUGE-L), demonstrating the system's effectiveness in capturing essential information. Human evaluations involving educators and students highlighted the relevance and contextual accuracy of generated questions, particularly for structured materials. The system generated questions within 2 to 6 minutes, showcasing its efficiency in reducing educators' workload. However, challenges remain in handling materials with implicit semantic relationships or nonlinear narratives, as PEGASUS struggles to maintain contextual relevance. This limitation may lead to irrelevant questions and answers, indicating a need for improved semantic understanding. This study concludes that the PEGASUS+TextRank AQG system is a promising tool to streamline question generation. Future improvements in semantic algorithms and broader training data are crucial to enhancing the system's reliability and adaptability to diverse educational contexts.
Automated question generation in practical classroom usage saves teachers' time to develop various and individual questions, as well as the time they use in interactive learning. It also comes with immediate feedback and personalised results for students, hence improving their learning and comprehension. Also, it facilitates the implementation of differentiated learning by providing questions with levels of difficulty in order to address the needs of all learners. Automatic question generation and categorisation focus on the generation and classification of questions from text; it is used in developing education assistants, improving client support solutions, and creating other forms of learning and interaction aids. This technology is used in personalised tutoring agents and intelligent FAQ (Frequently Asked Question) databases to increase efficiency and effectiveness in knowledge management and acquisition. It can be time- and cost-effective for the organisations to implement and offer user-specific services. The rate of generated questions using the rule-based method was quite high, with 84.5 % accuracy. This development means the creation of solutions that are stronger and more suitable for different applications.
ABSTRACT The construction of questions is an essential component in educational assessment and student learning processes. However, manually constructing questions is a complex task that requires not only professional training, substantial experience, and extensive resources from teachers but is also time-consuming. This article introduces an Automatic Question Generation (AQG) technology based on a prompt pattern to alleviate this burden and address the ongoing need for new questions in education. The essence of this method lies in constructing a prompt pattern grounded on a collective knowledge base derived from teachers, thereby enhancing the quality of the questions produced. Practical applications and expert evaluations demonstrate that integrating a prompt pattern with a collective knowledge base into Large Language Models (LLMs) results in high-quality questions with statistically significant results. These questions not only meet educational standards but also approach the quality of manually constructed questions by teachers in certain aspects. Our research further emphasizes the feasibility of AI-teacher collaboration in education.
The current research aimed at creating a rule-based method for forming Wh- and Yes/No-type questions based on textual input. The study uses the rule-based method to automatically create Wh- and Yes/No-questions. The approach is based on the syntactic analysis of input sentences to identify the corresponding question forms to be used and apply certain rules for each type. For the proposed method, the achieved accuracy is 82.20% in generating Wh- and Yes/No-type questions. The findings suggest that the rule-based approach produces appropriate questions corresponding to intervention aims, including checking for understanding and encouraging critical thinking. Such a method may be more effective than many other approaches that are currently used in practice in terms of ease, efficiency, and relevance to education environments. This provides a rich solution that can meet the needs of educators and students alike.
Indonesia is facing a significant shortage of teachers, particularly in remote areas, due to various contributing factors. This shortage exacerbates disparities in teaching quality and underscores the need for innovative solutions. This study proposes the use of Artificial Intelligence (AI), specifically Generative AI, to automate the creation of diverse test items. The proposed AI-powered tool focuses on generating questions aligned with Indonesia's Minimum Competency Assessment (MCA) in reading literacy and mathematics. By leveraging large language models, natural language processing techniques, and image generation for visual stimuli, the tool aims to support teachers in developing engaging and customized assessments tailored to students' needs. The outcome is expected to be an AI-based tool that not only reduces teacher workload but also improves the quality and effectiveness of student assessments in Indonesia.
Question generation in education is a time-consuming and cognitively demanding task, as it requires creating questions that are both contextually relevant and pedagogically sound. Current automated question generation methods often generate questions that are out of context. In this work, we explore advanced techniques for automated question generation in educational contexts, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model that merges both methods. We implement GPT-4 for ICL using few-shot examples and BART with a retrieval module for RAG. The Hybrid Model combines RAG and ICL to address these issues and improve question quality. Evaluation is conducted using automated metrics, followed by human evaluation metrics. Our results show that both the ICL approach and the Hybrid Model consistently outperform other methods, including baseline models, by generating more contextually accurate and relevant questions.
This study investigates the application effectiveness of the Large Language Model (LLMs) ChatGLM in the automated generation of high school information technology exam questions. Through meticulously designed prompt engineering strategies, the model is guided to generate diverse questions, which are then comprehensively evaluated by domain experts. The evaluation dimensions include the Hitting(the degree of alignment with teaching content), Fitting (the degree of embodiment of core competencies), Clarity (the explicitness of question descriptions), and Willing to use (the teacher's willingness to use the question in teaching). The results indicate that ChatGLM outperforms human-generated questions in terms of clarity and teachers' willingness to use, although there is no significant difference in hit rate and fit. This finding suggests that ChatGLM has the potential to enhance the efficiency of question generation and alleviate the burden on teachers, providing a new perspective for the future development of educational assessment systems. Future research could explore further optimizations to the ChatGLM model to maintain high fit and hit rates while improving the clarity of questions and teachers' willingness to use them.
The development of Automatic Question Generation (QG) models has the potential to significantly improve educational practices by reducing the teacher workload associated with creating educational content. This paper introduces a novel approach to educational question generation that controls the topical focus of questions. The proposed Topic-Controlled Question Generation (T-CQG) method enhances the relevance and effectiveness of the generated content for educational purposes. Our approach uses fine-tuning on a pre-trained T5-small model, employing specially created datasets tailored to educational needs. The research further explores the impacts of pre-training strategies, quantisation, and data augmentation on the model’s performance. We specifically address the challenge of generating semantically aligned questions with paragraph-level contexts, thereby improving the topic specificity of the generated questions. In addition, we introduce and explore novel evaluation methods to assess the topical relatedness of the generated questions. Our results, validated through rigorous offline and human-backed evaluations, demonstrate that the proposed models effectively generate high-quality, topic-focused questions. These models have the potential to reduce teacher workload and support personalised tutoring systems by serving as bespoke question generators. With its relatively small number of parameters, the proposals not only advance the capabilities of question generation models for handling specific educational topics but also offer a scalable solution that reduces infrastructure costs. This scalability makes them feasible for widespread use in education without reliance on proprietary large language models like ChatGPT.
In this contemporary world full of information, online lecture videos are a big fountain of knowledge. Nevertheless, quizzes have to be developed based on these videos to make evaluation of knowledge acquisition much easier. The research describes a method for generating quizzes from online teaching videos that enhances self-learning through continuous assessment. Unlike existing approaches which are resource intensive and computationally demanding, we aim at providing a Video Question Generation model that is light weight and effective. We take advantage of state-of-the-art Natural Language Processing (NLP) technology to improve our model’s flexibility and allow it to be fine-tuned using T5 transformers. Our system also generates various forms of “Wh” questions such as who, when, where, what, which, why and how as well as Multiple Choice Questions (MCQs). Through this study we hope to give teachers and students alike a tool that can facilitate knowledge assessment and create an active learning environment.
In recent years, the global social landscape has become increasingly complex, requiring the ability to think from a wide range of diverse perspectives for effective problem-solving. In the field of education, panoramic learning, which implements interdisciplinary and comprehensive education, has become essential. Also, there has been recent research on various aspects of automatic question generation (AGQ), with some studies focusing on generating panoramic questions, which provide a comprehensive understanding, across different genres using knowledge graph (KG). KG is a knowledge base that uses a graph-structured data model and consists of entities and relationships between entities. On the other hand, research on generating panoramic questions for specific subjects with educational purposes has been limited, and this study aims to address that. In this work, we specifically targeted the field of history for question generation and used complemented entities to enhance the inclusion of panoramic knowledge in the field of history. The approach involves enhancing subgraphs with link prediction, which complements missing relationships in KGs, particularly in historical contexts requiring temporal and spatial insights. Through evaluation, it was validated that the proposed method could generate questions containing more panoramic knowledge compared to existing methods.
In recent years, the educational system has seen numerous improvements, including the addition of assessment criteria to evaluate educational outcomes. However, the manual creation of test questions often fails to accurately assess students’ competency levels and is time-consuming. This paper addresses the need for automated question generation (AQG) in the context of outcome-based education (OBE), a student-centric approach that has yet to incorporate AQG techniques. OBE, grounded in Bloom’s taxonomy, encompasses three domains: cognitive, psychomotor, and affective. Research focuses on the cognitive domain and its six levels of question generation. OBE algorithms for AQG are based on accuracy, time consumption, and question quality. The DistilBERT question-answering model and transformers for error correction are used in our AQG model. The model is trained on QGSTEC and assessed using performance measures. Comparatively, the model has higher accuracy, precision, and F1-score.
This study explores automatic generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
Automatic item generation may supply many items instantly and efficiently to assessment and learning environments. Yet, the evaluation of item quality persists to be a bottleneck for deploying generated items in learning and assessment settings. In this study, we investigated the utility of using large‐language models, specifically Llama 3‐8B, for evaluating automatically generated cloze items. The trained large‐language model was able to filter out majority of good and bad items accurately. Evaluating items automatically with instruction‐tuned LLMs may aid educators and test developers in understanding the quality of items created in an efficient and scalable manner. The item evaluation process with LLMs may also act as an intermediate step between item creation and field testing to reduce the cost and time associated with multiple rounds of revision.
The widespread usage of computer-based assessments and individualized learning platforms has resulted in an increased demand for the rapid production of high-quality items. Automated item generation (AIG), the process of using item models to generate new items with the help of computer technology, was proposed to reduce reliance on human subject experts at each step of the process. AIG has been used in test development for some time. Still, the use of machine learning algorithms has introduced the potential to improve the efficiency and effectiveness of the process greatly. The approach presented in this paper utilizes OpenAI's latest transformer-based language model, GPT-3, to generate reading passages. Existing reading passages were used in carefully engineered prompts to ensure the AI-generated text has similar content and structure to a fourth-grade reading passage. For each prompt, we generated multiple passages, the final passage was selected according to the Lexile score agreement with the original passage. In the final round, the selected passage went through a simple revision by a human editor to ensure the text was free of any grammatical and factual errors. All AI-generated passages, along with original passages were evaluated by human judges according to their coherence, appropriateness to fourth graders, and readability.
This study explores the use of retrieval-augmented generation (RAG) combined with one-shot prompting to automatically generate reading comprehension questions aligned with the Portuguese secondary-school literature curriculum. Focusing on inference-type questions based on Padre António Vieira’s Sermão de Santo António aos Peixes, the system generated 50 open-ended items evaluated by two experts in literary education. The results show strong curricular alignment (92%) and moderate usability (64%), indicating that the model can reproduce exam-style formulations anchored in authentic textual material. These findings suggest that RAG effectively constrains generation to curricular content while maintaining linguistic and pedagogical coherence. Future work will expand the evaluation to additional literary texts, question types, and expert raters, as well as compare alternative models, chunking strategies, and prompting configurations to enhance the generalization of results.
Automatic item generation (AIG) has the potential to greatly expand the number of items for educational assessments, while simultaneously allowing for a more construct-driven approach to item development. However, the traditional item modeling approach in AIG is limited in scope to content areas that are relatively easy to model (such as math problems), and depends on highly skilled content experts to create each model. In this paper we describe the interactive reading task, a transformer-based deep language modeling approach for creating reading comprehension assessments. This approach allows a fully automated process for the creation of source passages together with a wide range of comprehension questions about the passages. The format of the questions allows automatic scoring of responses with high fidelity (e.g., selected response questions). We present the results of a large-scale pilot of the interactive reading task, with hundreds of passages and thousands of questions. These passages were administered as part of the practice test of the Duolingo English Test. Human review of the materials and psychometric analyses of test taker results demonstrate the feasibility of this approach for automatic creation of complex educational assessments.
A college English test generation model was constructed based on a corpus in this study. By combining common linguistic datasets, the automatic item generation method was adopted for large-scale testing. The corpus-based approach was applied for English language instruction. Corpus construction, preprocessing, vocabulary analysis, and other relevant components were integrated for effective test item generation. A methodology using word lists with word ratios and other new metrics was derived from preference words and levels of difficulty to calculate sentence difficulty and its text complexity index. To address the challenges of previous systems, challenges in multiple-choice tests were addressed. The developed model uses corpus processing and machine learning algorithms to generate test questions at all levels of difficulty. The developed system solves problems of the current college English systems.
Multiple choice questions (MCQs) are frequently used in medical education for assessment. Automated generation of MCQs in board-exam format could potentially save significant effort for faculty and generate a wider set of practice materials for student use. The goal of this study was to explore the feasibility of using ChatGPT by OpenAI to generate USMLE/COMLEX-USA-style practice quiz items as study aids. Researchers gave second year medical students studying renal physiology access to a set of practice quizzes with ChatGPT generated questions. The exam items generated were evaluated by independent experts for quality and adherence to NBME/NBOME guidelines. Forty-nine percent of questions contained item writing flaws, and 22% contained factual or conceptual errors. However, 59/65 (91%) were categorized as a reasonable starting point for revision. These results demonstrate the feasibility of large language model (LLM)-generated practice questions in medical education, but only when supervised by a subject matter expert with training in exam item writing.
While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
PURPOSE This study aimed to evaluate the usability of artificial intelligence (AI)-based question generation methods, Chat Generative Pre-trained Transformer (ChatGPT)-4o (a non-template-based large language model) and a template-based automatic item generation (AIG) method, in the context of radiology education. The primary objective was to compare the psychometric properties, perceived quality, and educational applicability of generated multiple-choice questions (MCQs) with those written by a faculty member. METHODS Fifth-year medical students who participated in the radiology clerkship at Eskişehir Osmangazi University were invited to take a voluntary 15-question examination covering musculoskeletal and rheumatologic imaging. The examination included five MCQs from each of three sources: a radiologist educator, ChatGPT-4o, and the template-based AIG method. Student responses were evaluated in terms of difficulty and discrimination indices. Following the examination, students rated each question using a Likert scale based on clarity, difficulty, plausibility of distractors, and alignment with learning goals. Correlations between students' examination performance and their theoretical/practical radiology grades were analyzed using Pearson's correlation method. RESULTS A total of 115 students participated. Faculty-written questions had the highest mean correct response rate (2.91 ± 1.34), followed by template-based AIG (2.32 ± 1.66) and ChatGPT-4o (2.3 ± 1.14) questions (P < 0.001). The mean difficulty index was 0.58 for faculty, and 0.46 for both template-based AIG and ChatGPT-4o. Discrimination indices were acceptable (≥0.2) or very good (≥0.4) for template-based AIG questions. In contrast, four of the ChatGPT-generated questions were acceptable, and three were very good. Student evaluations of questions and the overall examination were favorable, particularly regarding question clarity and content alignment. Examination scores showed a weak correlation with practical examination performance (P = 0.041), but not with theoretical grades (P = 0.652). CONCLUSION Both the ChatGPT-4o and template-based AIG methods produced MCQs with acceptable psychometric properties. While faculty-written questions were most effective overall, AI-generated questions, especially those from the template-based AIG method, showed strong potential for use in radiology education. However, the small number of items per method and the single-institution context limit the robustness and generalizability of the findings. These results should be regarded as exploratory, and further validation in larger, multicenter studies is required. CLINICAL SIGNIFICANCE AI-based question generation may potentially support educators by enhancing efficiency and consistency in assessment item creation. These methods may complement traditional approaches to help scale up high-quality MCQ development in medical education, particularly in resource-limited settings; however, they should be applied with caution and expert oversight until further evidence is available, especially given the preliminary nature of the current findings.
Aims: The aim of this study is to systematically evaluate the performances of large language model-based generative Artificial Intelligence (Gen-AI) tools, Gemini and Copilot, in the generation and assessment of multiple-choice questions (MCQs) for use in medical education. Methods: A total of 335 MCQs were generated from two virtual patient cases using standardized prompts. Gen-AI tools selected the 56 best-quality items based on criteria encompassing the intended distributions regarding acceptable level of performance (ALP), Miller's competency pyramid (Miller) and Bloom's revised taxonomy (Bloom) levels, as well as alignment with learning objectives (LOs). Expert medical educators and current Gen-AI tools assessed these items based on the identification of misleading/confusing distractor(s) for borderline candidates (minimally competent examinees) to calculate ALP values, and the identification of key(s), as well as Miller and Bloom levels, LO alignment, stem appropriateness, and technical item flaws. "AI-extended consensus" served as the intersubjective consensus model (the gold standard). Generation performance was quantified by alignment with this consensus, and assessment performance by the degree to which Gen-AIs shifted or preserved Expert assessments. Analyses included ICC for reliability, Po/Cohen's/Fleiss' Kappa for categorical agreement, and inferential tests (Exact McNemar and Wilcoxon signed-rank) for detecting systematic bias and directional shifts. Results: Gen-AIs demonstrated markedly different performance patterns in assigning cognitive levels. For Miller, Gemini-generated MCQs exhibited superior consistency with the intersubjective consensus (ICC(2,k)=0.82), whereas for Bloom, Copilot-generated MCQs demonstrated this superiority (ICC(2,k)=0.97). Both tools performed well in LO alignment and key identification, but their approaches to stem structure diverged substantially. Experts perceived the MCQs to be easier than the Gen-AIs claimed, and the current Gen-AI versions found them even easier than both the generating versions and the Experts did. In terms of assessment behaviour, Gen-AIs showed a systematic stringency tendency in Miller classifications, statistically significantly shifting Expert consensus from 'knows' to 'knows how' (p
The paper examines methodological foundations for integrating generative artificial intelligence in education in Ukraine amid digital transformation. It clarifies the notions of generative AI and large language models and delineates their didactic affordances and limits. The absence of coherent institution-level risk management and unified policies for data handling, academic integrity, and responsible deployment is noted. Opportunities are mapped across four domains. In teaching, GenAI enables personalization of content and pace, rapid formative feedback, writing support, and generation of lesson plans, tasks, and rubrics. In assessment, it supports criterion-referenced rubrics, item generation, and faster feedback cycles that free time for dialogue. In administration, GenAI assists with routine automation and document flows, including drafting official templates and validating consistency of program materials. In addition, accessibility services (text-to-speech, speech recognition, image analysis, and content adaptation) expand participation for learners with diverse needs and multilingual backgrounds. Alongside benefits, the study highlights challenges: protection of personal data and privacy, algorithmic bias, model hallucinations and the need for fact checking, risks to academic integrity, unequal access, and total cost of ownership. To address these, the article proposes a practical framework that combines clear institutional policies and procedures with transparent consent and logging; development of digital and information literacy for teachers and students, including task formulation, verification of claims, and correct citation of AI interactions; a human-in-the-loop didactic design emphasizing pedagogical appropriateness, gradual adoption, and balance with traditional methods; and evidence-based monitoring using pilots, measurable outcomes, and peer review. The novelty lies in consolidating fragmented guidance into a context-sensitive roadmap connecting governance, pedagogy, and infrastructure. Practical significance includes adaptable templates for course and policy design, recommendations for professional development, and scenarios for responsible classroom use. Boundary conditions are outlined, including reliable connectivity, secure platforms that meet data protection requirements, sustained support for educators through mentoring and micro-learning, and equity mechanisms that ensure meaningful access across regions.
No abstract available
This study aims to introduce AI text generation using HyperCLOVA, a Korean-based super-large language model, and to examine whether AI text generation is applicable to the educational field. In detail, an example of text generation using HyperCLOVA was presented. Then, survey data were collected from university students of education to examine the face validity of AI-generated texts compared with human-written texts. We also investigated opinions on the feasibility of AI text generation in teaching and learning environments. The survey results show no statistically significant difference between the AI-generated texts and the original human-written text. In addition, relatively few respondents felt that additional corrections would be needed before the AI-generated texts could be used in educational practice, whereas many agreed that AI text generation would help reduce the burden on Korean language teachers.
The transformative capabilities of large language models (LLMs) are reshaping educational assessment and question design in higher education. This study proposes a systematic framework for leveraging LLMs to enhance question-centric tasks: aligning exam questions with course objectives, improving clarity and difficulty, and generating new items guided by learning goals. The research spans four university courses—two theory-focused and two application-focused—covering diverse cognitive levels according to Bloom’s taxonomy. A balanced dataset ensures representation of question categories and structures. Three LLM-based agents—VectorRAG, VectorGraphRAG, and a fine-tuned LLM—are developed and evaluated against a meta-evaluator, supervised by human experts, to assess alignment accuracy and explanation quality. Robust analytical methods, including mixed-effects modeling, yield actionable insights for integrating generative AI into university assessment processes. Beyond exam-specific applications, this methodology provides a foundational approach for the broader adoption of AI in post-secondary education, emphasizing fairness, contextual relevance, and collaboration. The findings offer a comprehensive framework for aligning AI-generated content with learning objectives, detailing effective integration strategies, and addressing challenges such as bias and contextual limitations. Overall, this work underscores the potential of generative AI to enhance educational assessment while identifying pathways for responsible implementation.
ABSTRACT The authors investigated using a large language model (LLM) for writing test questions for a real estate licensing exam. In Study 1, items were generated by GPT-4 and rated by subject matter experts (SMEs). These items were on-topic, relevant, and generally appropriate. Item difficulty manipulation was ineffective. Cognitive level matching was harder as cognitive level increased. Study 2 compared human and LLM items using SME and content developer ratings. Human and LLM items were similar in blueprint alignment, relevance, factual errors, and key quality. LLM items had better stem quality and cognitive level matching. Human distractors had an edge in quality. Study 3 investigated content overlap and breadth of coverage. Similar prompts frequently generated overlapping content. The range of content represented in large sets of generated items did not cover the breadth of the generating content areas. Results suggest LLMs are as good as SMEs at generating first-draft items.
Abstract Introduction The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. Methods Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. Results Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28–0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. Discussion These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM’s response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
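As a rough illustration of the psychometrics involved, the sketch below computes the difficulty index (proportion correct) and point-biserial index (item-rest correlation) from 0/1-scored responses and compares two sets of difficulty indices with Spearman's rho. The data, group sizes, and scoring are illustrative assumptions, not the study's.

```python
# Difficulty and point-biserial indices, plus Spearman comparison of two groups.
import numpy as np
from scipy import stats

def difficulty_and_point_biserial(responses: np.ndarray):
    """responses: (n_examinees, n_items) matrix of 0/1 item scores."""
    difficulty = responses.mean(axis=0)
    totals = responses.sum(axis=1)
    pbis = []
    for j in range(responses.shape[1]):
        rest = totals - responses[:, j]              # rest score excludes the item itself
        r, _ = stats.pointbiserialr(responses[:, j], rest)
        pbis.append(r)
    return difficulty, np.array(pbis)

# Compare LLM-derived and human-derived difficulty indices with Spearman's rho
rng = np.random.default_rng(1)
llm_runs = (rng.random((100, 60)) < 0.70).astype(int)    # 100 trials per item, 60 items
fellows = (rng.random((25, 60)) < 0.57).astype(int)
d_llm, _ = difficulty_and_point_biserial(llm_runs)
d_fel, _ = difficulty_and_point_biserial(fellows)
rho, p = stats.spearmanr(d_llm, d_fel)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```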
ABSTRACT Intelligent education relies on the generation of multi-level, comprehensive, and diverse question banks to assess student learning effectiveness and teaching efficacy. However, the development of professional question banks often presents challenges such as reliance on expert knowledge and experience, limited transferability, high workload, and subjective biases. In Geographical Information Systems (GIS), personalized question settings could be impacted by diverse knowledge sources and varying student orientations. To address this issue, we propose a novel large language model (LLM) framework guided by GIS prior knowledge for generating professional GIS question banks. Specifically, we tackle three major challenges in intelligent GIS question bank generation: incomplete knowledge coverage, skewed difficulty distribution, and limited adaptability of question types. This framework is founded upon the autonomous understanding, planning, and reasoning capabilities of LLMs, augmented by an elaborate retrieval strategy. It comprises three key modules: subtask matching and partitioning, subtask importance evaluation and quantity allocation, as well as adaptive scenario question generation. Together, these components enable the generation of personalized GIS question banks for learning and teaching tasks. Extensive experiments demonstrate its effectiveness across various metrics. Furthermore, our method with specialized knowledge organization can serve as a valuable resource for advancing research and applications in GIS education.
In the field of education, the traditional practice of writing examination questions by hand suffers from low efficiency and uneven quality. To address this difficulty, an examination question generation system based on soft knowledge prompts and a large language model is proposed. The large language model is used to generate examination questions from review materials: a feature representation and knowledge mining module extracts key knowledge points and produces a soft knowledge prompt. The soft knowledge prompt computes similarity scores against external domain knowledge, selects the most relevant knowledge segment, and guides the large language model through an adaptive fusion mechanism, realizing automatic generation from review materials to examination questions and improving the efficiency and quality of educational resource generation.
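A minimal sketch of the retrieval step this abstract describes: score candidate knowledge segments against the extracted key points, keep the most similar segment, and splice it into the generation prompt. TF-IDF cosine similarity stands in for the paper's soft knowledge prompt mechanism, and all texts and prompt wording are illustrative.

```python
# Select the most relevant external knowledge segment and build a guided prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_segments = [
    "Dijkstra's algorithm finds shortest paths in graphs with non-negative weights.",
    "A binary search tree keeps keys ordered to allow O(log n) lookup.",
    "Dynamic programming solves problems by combining overlapping subproblems.",
]
key_points = "shortest path, weighted graph, greedy relaxation"

vec = TfidfVectorizer().fit(knowledge_segments + [key_points])
sims = cosine_similarity(vec.transform([key_points]),
                         vec.transform(knowledge_segments))[0]
best = knowledge_segments[sims.argmax()]          # most relevant segment

prompt = (
    "Using the reference material below, write one exam question "
    "with four options and mark the correct answer.\n"
    f"Reference: {best}\nKey points: {key_points}"
)
print(prompt)
```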
Abstract The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. 14 exams were created with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs in automating exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias mitigation strategies, and human oversight measures. The results of this study contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity. The empirical results suggest not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies.
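The text-similarity scoring used in this kind of evaluation can be approximated with simple unigram statistics. The sketch below computes ROUGE-1 F1 from unigram overlap and a bag-of-words cosine similarity between a model answer and a reference answer; the example sentences are placeholders, not items from the study, and an embedding-based similarity would normally complement these counts as the abstract notes.

```python
# Unigram ROUGE-1 F1 and bag-of-words cosine similarity for answer scoring.
from collections import Counter
import math

def rouge1_f1(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())               # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def cosine(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    dot = sum(c[w] * r[w] for w in c)
    norm = math.sqrt(sum(v * v for v in c.values())) * math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

reference = "Pacific navigation relied on star paths, swells, and bird behaviour."
rag_answer = "Navigators used star paths, ocean swells, and bird behaviour to find islands."
print(rouge1_f1(rag_answer, reference), cosine(rag_answer, reference))
```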
No abstract available
No abstract available
Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.
No abstract available
Primary mathematics education faces systemic challenges in translating curriculum reforms into classroom practice, exacerbated by teachers’ cognitive overload and limited support for pedagogical innovation. This study develops an Intelligent Teaching Design Assistant grounded in socio-constructivist and cognitive load theories to address these challenges. Thirty-four primary mathematics teachers participated in a quasi-experimental study. The Intelligent Teaching Design Assistant integrates Large Language Models with multi-dimensional knowledge bases (curriculum standards, teaching strategies, student profiles) and a multi-agent architecture (process planner, student simulator). The Intelligent Teaching Design Assistant significantly outperformed generic Large Language Models, improving overall lesson plan quality. This work pioneers a replicable pathway for AI to empower teacher agency and advance 21st-century educational transformation.
BACKGROUND Integrating domain-specific knowledge into large language models (LLMs) remains a critical challenge in medical education. In dental specialties such as endodontics, effective learning requires access to both textual clinical evidence and visual procedural demonstrations. However, generic LLMs often produce content that lacks clinical accuracy, contextual grounding, or pedagogical clarity, thereby limiting their applicability in specialized training environments. OBJECTIVE To develop and evaluate a Retrieval-Augmented Generation (RAG)-enhanced LLMs framework that addresses the challenge of integrating domain-specific knowledge in AI-driven endodontic education. METHOD We present Endodontics-KB, a multimodal knowledge integration platform that combines evidence-based dental literature (e.g., textbooks, clinical guidelines) with visual instructional materials (e.g., procedural videos) through a hierarchical RAG architecture. The system's core component, the EndoQ chatbot, utilizes LLMs augmented with multimodal dental datasets to enable context-aware clinical reasoning. Benchmarking was conducted against three general-purpose LLMs: GPT-4, Qwen2.5, and DeepSeek R1, using a structured question bank comprising 11 expert-validated endodontic questions. Two domain experts performed a blinded evaluation across five performance dimensions: clinical accuracy, contextual relevance, completeness, decision-making professionalism, and communication fluency. RESULTS The framework integrated 2,200 multimodal knowledge units through dynamic semantic indexing. EndoQ demonstrated statistically significant improvements across all evaluation metrics compared to general purpose LLMs: accuracy (4.45 ± 0.96), clinical relevance (4.59 ± 0.8), completeness (4.27 ± 0.83), professionalism judgment (4.45 ± 1.06), and language fluency (4.86 ± 0.47), as measured on a 5-point Likert scale. CONCLUSION This proposed framework improves educational outcomes through precise and context-aware knowledge delivery. Furthermore, it represents a scalable and transferable model for AI-enhanced clinical training across medical specialties, significantly advancing competency-based pedagogy in dental education.
The research article undertakes an experimental analysis of utilizing conversational/generative AI tools for translating question papers from English to other Indian languages, as frequently seen in the question papers of many Indian universities/colleges and competitive recruitment examinations. Automating question paper translation can offload a portion of the workload of academic teachers who prepare question papers for various types of examinations. A GUI-based desktop application was developed, leveraging ChatGPT and Claude AI, as a ready-to-use, zero-cost tool.
With the in-depth advancement of China's national strategy for the development of a new generation of artificial intelligence, generative artificial intelligence, as a key technology, is profoundly reshaping the educational ecosystem. This study focuses on the emerging interdisciplinary field of Big Data Management and Application, exploring the innovative challenges faced by talent cultivation in the digital-intelligent era. The research aims to analyze the intrinsic mechanisms of generative artificial intelligence (taking "ERNIE Bot" as an example) in promoting learners' innovative thinking and innovative skills, and further construct a "generative AI-empowered, innovation-oriented project-based curriculum model". This model integrates the entire process of "pre-class preparation - teaching implementation - project conclusion", covering core links such as learner profile construction, intelligent scenario creation, personalized task distribution, dynamic feedback, and intelligent evaluation. Finally, the paper analyzes the potential challenges in implementing the model and proposes corresponding strategies centered on the "teacher-AI-student" tripartite collaboration, aiming to provide an operable and iterable digital path for cultivating innovative talents in the Big Data Management and Application major.
With the rapid development of artificial intelligence (AI) technology, major changes have taken place in the field of medical education in China. In recent years, in order to respond to the "new medicine" requirement of training compound talents, the demand for systematic evaluation of medical students' critical thinking ability in China has been increasing. Based on the SOAP clinical reasoning framework and integrating existing critical thinking theory, this study established a medical critical thinking assessment scale covering six dimensions: interpretation, analysis, evaluation, inference, self-adjustment, and clinical adaptation. Each dimension has five levels, presenting a path from information processing to clinical decision-making ability. The scale introduces evidence-based medicine tools (such as AGREE II), cognitive bias, and other professional concepts to enhance the professionalism and consistency of evaluation, so that it can serve as the core quantitative basis of a generative AI-driven critical thinking education system. Meanwhile, the scale realizes a paradigm shift from static evaluation to dynamic diagnosis and from general scoring to personalized intervention, providing a reliable path for the cultivation of higher-order thinking ability in medicine.
This study first investigates whether a "writing evaluator persona" modeling a professor's writing-assessment perspective can be developed using ChatGPT's customization and prompting strategies and then examines its potential and limits. Based on the background and publications of Emeritus Professor J, we employed iterative input, summarization, and Q&A to design the "Professor J" GPT persona. Across three experiments, comparisons with Professor J's actual ratings showed that underspecified score-allocation instructions can distort score distributions and may elicit hallucinations. Although the approach increases procedural transparency, full score-level agreement is constrained because authentic grading incorporates contextual factors. Overall, the study frames generative AI not as a value-neutral automated grading tool but as a hybrid tool for locally instantiating instructor-specific evaluative norms and supporting reflective calibration of assessment practices.
Since the introduction of generative artificial intelligence (GAI) technology in the context of large language models (LLMs), it has been widely used for information extraction and/or extrapolation from different sources. In computer science education, a potential application of such technology is automatic code review, i.e., shifting the burden of debugging non-compilable code, detecting overlooked optimization concerns such as poor memory management in code that otherwise passes automated tests, and other advanced tasks from a human grader to LLMs. However, LLMs are currently not capable of evaluating code or mathematical expressions with 100% reliability, i.e., beyond token pattern recognition and subsequent probabilistic answer generation. With that in mind, in this paper, we explore the risk of incorrect LLM code evaluation, both descriptive and numerical, as well as begin research on its mitigation and propose further work directions.
No abstract available
No abstract available
No abstract available
Supporting learning and teaching at scale requires access to large and high‐quality content and datasets for analysis and innovation. With rapid advances in artificial intelligence (AI) and the growing demand for data, synthetic data has emerged as a potential solution for addressing these challenges. This editorial introduces the contributions of five accepted articles to the special section AI for Synthetic Data Generation in Education: Scaling Teaching and Learning. These articles explore key themes in leveraging AI‐generated synthetic data to support learning and teaching as well as enhance educational practices at scale. The editorial emphasizes that hybrid strategies that leverage AI alongside human judgment are essential for scaling support for learning and teaching through synthetic data generation.
Abstract Background Template-based automatic item generation (AIG) is more efficient than traditional item writing but it still heavily relies on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective We aimed to propose a hybrid AIG to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model. After a final expert review (Step 6), the content in the template can be used for item generation through a traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI in template development made it less challenging. Conclusions The hybrid AIG method transcends the traditional template-based approach by marrying the “art” that comes from AI as a “black box” with the “science” of algorithmic generation under the oversight of expert as a “marriage registrar”. It does not only capitalize on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration to increase efficiency in medical education.
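For context on the non-AI generation step (Step 7), the sketch below shows how a finished item model, meaning a stem template plus manipulable elements and their allowed values, can be expanded into candidate stems by systematic substitution. The clinical content is invented for illustration, and a real cognitive model would additionally constrain which element combinations are clinically valid.

```python
# Expand an expert-approved item model into candidate stems by substitution.
from itertools import product

stem_template = ("A {age}-year-old patient presents with {symptom} for "
                 "{duration}. What is the most appropriate next step?")
elements = {
    "age": ["25", "68"],
    "symptom": ["productive cough and fever", "wheezing after a bee sting"],
    "duration": ["3 days", "30 minutes"],
}

items = []
for combo in product(*elements.values()):
    values = dict(zip(elements.keys(), combo))
    items.append(stem_template.format(**values))

print(f"{len(items)} candidate stems generated")
print(items[0])
```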
Artificial intelligence (AI) is rapidly transforming education, presenting unprecedented opportunities for personalized learning and streamlined content creation. However, realizing the full potential of AI in educational settings necessitates careful consideration of the quality, cognitive depth, and ethical implications of AI-generated materials. This paper synthesizes insights from four related studies to propose a comprehensive framework for enhancing AI-driven educational tools. We integrate cognitive assessment frameworks (Bloom’s Taxonomy and SOLO Taxonomy), linguistic analysis of AI-generated feedback, and ethical design principles to guide the development of effective and responsible AI tools. We outline a structured three-phase approach encompassing cognitive alignment, linguistic feedback integration, and ethical safeguards. The practical application of this framework is demonstrated through its integration into OneClickQuiz, an AI-powered Moodle plugin for quiz generation. This work contributes a comprehensive and actionable guide for educators, researchers, and developers aiming to harness AI’s potential while upholding pedagogical and ethical standards in educational content generation.
This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.
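A minimal sketch of the Semantic Textual Similarity check described above, assuming the sentence-transformers library: each generated question is embedded alongside the relevant RPT learning standard and low-similarity items are flagged for review. The model name, threshold, and Malay texts are assumptions for illustration, not the paper's configuration.

```python
# Curriculum-alignment screening via sentence-embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice

rpt_standard = "Murid dapat menyelesaikan persamaan linear dalam satu pemboleh ubah."
generated_question = "Selesaikan persamaan 3x + 5 = 20. Apakah nilai x?"

emb = model.encode([rpt_standard, generated_question], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()

print(f"STS score: {score:.2f}")
if score < 0.5:  # threshold chosen for illustration only
    print("Flag question for manual curriculum-alignment review")
```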
No abstract available
Personalized storytelling in elementary school increases participation and retention by tailoring stories to each student's individual interests and learning style. However, today's schools are rarely flexible enough to tailor lessons to each student. This study's neural text generation model is based on an improved GPT-2 architecture. It uses learner profiles that include an interest vector, reading level, and emotional tone. The model uses Byte Pair Encoding for input formatting and token-level conditioning, ensuring that the narrative it generates is relevant and coherent. When BLEU, METEOR, and human-rated engagement metrics are used to measure performance, the results are better than the baselines for general storytelling. Specifically, personalized outputs boosted participation by 24% and understanding by 18% in experimental classroom environments. The results show that AI-powered personalized stories work well in preschool and kindergarten. This method enables adaptive learning systems to adapt to each student's needs.
This paper explores the development and integration of a system combining Augmented Reality (AR), Virtual Reality (VR), and gamification within a museum setting to enhance the presentation and interaction with cultural heritage. The technological framework employs AR for dynamic artifact interaction and in situ navigation, while VR capabilities facilitate virtual tours, broadening access for individuals with disabilities or those from distant geographies or socioeconomically disadvantaged backgrounds. Gamification transforms educational content into interactive experiences, fostering deeper engagement and learning. Moreover, aligning with the mission of museum institutions for cultural heritage preservation, a module for digital conservation and reconstruction was developed resorting to photogrammetry-based approaches. This module aims to create a virtual catalog accessible to both experts and the general public. Artificial Intelligence (AI) tools automate tasks such as generating thematic quizzes for gamification and cataloging scanned artifacts. The system aims to improve the interpretative and educational potential of museum exhibits, modernizing visitor engagement while preserving the integrity of physical artifacts and spaces. Its continuous evolution aims to bridge traditional forms of cultural preservation and promotion with contemporary digital interaction techniques, leveraged from cost-effective publicly accessible edge technologies.
By constructing a knowledge supply chain model with both theoretical and practical value, this study proposes a novel approach to integrating multimodal data—such as text, financial reports, video cases, and business models—to generate teaching cases. The experiment employs a privatized Deepseek32b system, utilizing multimodal knowledge embedding technology, cognitive logic injection mechanisms, and systematic design of a teaching logic enhancer to significantly improve interdisciplinary knowledge integration and extraction efficiency. The experimental results show that generative artificial intelligence consistently produces an excess of teaching cases, with a significantly higher coverage of knowledge points compared to traditional NLP and manual methods. While generative AI exhibits stable logical coherence, its content logic is slightly inferior to that of high-quality human-generated works. This study verifies the effectiveness of the cross-modal knowledge extraction training method and provides valuable reference insights.
In recent years, large language models (LLMs) and generative AI have revolutionized natural language processing (NLP), offering unprecedented capabilities in education. This chapter explores the transformative potential of LLMs in automated question generation and answer assessment. It begins by examining the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text. The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-thought prompting, are evaluated for their effectiveness in generating high-quality questions, including open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning and prompt-tuning are explored for their role in generating task-specific questions, despite associated costs. The chapter also covers the human evaluation of generated questions, highlighting quality variations across different methods and areas for improvement. Furthermore, it delves into automated answer assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback, and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments and areas needing improvement. The discussion underscores the potential of LLMs to replace costly, time-consuming human assessments when appropriately guided, showcasing their advanced understanding and reasoning capabilities in streamlining educational processes.
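To make the contrast between the prompting techniques concrete, the sketch below builds a zero-shot prompt and a chain-of-thought-style prompt for MCQ generation and sends both through the OpenAI Python SDK (v1+). The model name, passage, and prompt wording are illustrative assumptions rather than the chapter's exact setup.

```python
# Zero-shot vs. chain-of-thought prompting for question generation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
passage = "Photosynthesis converts light energy into chemical energy stored in glucose."

zero_shot = f"Write one multiple-choice question with four options about:\n{passage}"

chain_of_thought = (
    f"Read the passage:\n{passage}\n"
    "Step 1: List the key concepts a student must understand.\n"
    "Step 2: Pick the concept most likely to be misunderstood.\n"
    "Step 3: Write one multiple-choice question targeting it, with one correct "
    "answer and three distractors based on plausible misconceptions."
)

for name, prompt in [("zero-shot", zero_shot), ("chain-of-thought", chain_of_thought)]:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, "->", reply.choices[0].message.content[:120])
```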
Generative AI has the potential to scale a number of educational practices, previously limited by resources. One such instructional approach is mastery learning, a pedagogy emphasizing proficiency before progression that is highly resource (teacher time, materials) intensive. The rise of computer-based instruction offered partial solutions, tailoring student progression and automating some facets of the mastery learning process. This work in progress considers the application of large language models for content generation tailored to mastery learning. We present a paired framework for analyzing and evaluating the generated content relative to rubrics designed by the teacher. Recognizing the potential of large language models, we critically assess the potential of improving mastery-based instruction. We close our discussion by considering the applications and limitations of this approach.
Although automated item generation has gained a considerable amount of attention in a variety of fields, it is still a relatively new technology in ELT contexts. Therefore, the present article aims to provide an accessible introduction to this powerful resource for language teachers based on a review of the available research. Particularly, it will give a brief introduction to different types of automated item generation approaches, provide a summary of previous ELT studies on this technology, and introduce three different AI-powered tools, along with practical tips for ELT practitioners. We conclude by calling for more empirical research on automated item generation from the ELT community and encouraging language teachers to take an interest in this technology themselves.
The AI-Enhanced Learning Assistant Platform is a revolutionary system designed to enhance learning, with cutting-edge features like question-and-answer generation, answer evaluation, identification of weak areas, recursive testing, an integrated query forum, and expert chat support. This platform makes use of artificial intelligence (AI) technology to try to satisfy the many needs that students and teachers have. Using natural language processing and machine learning, the platform's question and answer generating feature generates relevant questions on its own from the provided content. This encourages participation and in-depth subject understanding. The answer evaluation section provides quick feedback for improvement by utilizing AI algorithms to assess the accuracy and caliber of student responses. One of this platform's key advantages is its capacity to identify students' areas of weakness. Through the analysis of performance patterns and root causes, the system can generate customized recommendations and learning materials to help overcome those constraints. The property of recurring testing facilitates continuous assessment and reinforcement of knowledge. Through repeated practice, the program gradually pushes students to increase their understanding of the material by creating adaptive exams. Through the integrated query forum, students can collaborate and ask for assistance from others by asking questions and receiving answers from teachers and their peers. Furthermore, by enabling real-time communication between users and subject matter experts, the expert chat support tool fosters an engaging and motivating learning environment. To sum up, the AI-Enhanced Learning Assistant Platform offers a wide range of features designed to maximize learning. With AI technology, it helps students learn more effectively and retain what they have learned, promotes active learning, and provides the support they need for a good educational experience.
Natural and idiomatic expressions are essential for fluent, everyday communication, yet many second-language learners struggle to acquire and spontaneously use casual slang despite strong formal proficiency. To address this gap, we designed and evaluated an LLM-powered, task-based role-playing game in which a GPT-4o-based Game Master guides learners through an immersive, three-phase spoken narrative. After selecting five unfamiliar slang phrases to practice, participants engage in open-ended dialogue with non-player characters; the Game Master naturally incorporates the target phrases in rich semantic contexts (implicit input enhancement) while a dedicated Practice Box provides real-time explicit tracking and encouragement. Post-session, learners receive multi-level formative feedback analyzing the entire interaction. We evaluated the system in a between-subjects study with 14 international graduate students, randomly assigned to either the RPG condition or a control condition consisting of a traditional AI-led virtual classroom. Results from an immediate post-test show that the RPG group achieved greater gains in both comprehension of the target phrases and their accurate, contextual use in sentences. Quantitative analysis of in-activity word-usage frequency, combined with qualitative survey responses, further indicates that the game-based approach provided more practice opportunities and higher perceived engagement, resulting in a more natural learning experience. These findings highlight the potential of narrative-driven LLM interactions in vocabulary acquisition.
No abstract available
The application of social cognitive theory has expanded to the boundaries of human-computer interaction research. However, existing research has scarcely addressed mutual cognitive facilitation between humans and personalized educational large language model (LLM) agents. This study explored how educational LLM agents influence teachers’ curriculum design and content creation, based on a sample of 464 teachers from coastal regions of China, along with semi-structured interviews with 23 participants. Quantitative analysis of the survey data revealed that the involvement of educational LLM agents positively predicts teachers’ ability to create content in curriculum design. Additionally, teachers’ self-efficacy mediated this relationship, while both school support and self-efficacy together created a chain mediation effect. Qualitative findings from the interviews supported the quantitative results and further highlighted individual differences and contextual nuances in teachers’ use of educational LLM agents. In summary, the findings indicated that educational LLM agents positively impact teachers’ curriculum design and content creation, with school support and teachers’ self-efficacy acting as a chain mediator in this process.
In this work, a thorough mathematical framework for incorporating Large Language Models (LLMs) into gamified systems is presented, with an emphasis on improving task dynamics, increasing user engagement, and improving reward systems. Personalized feedback, adaptive learning, and dynamic content creation are all made possible by the integration of LLMs and are crucial for improving user engagement and system performance. A simulated environment is used to test the framework's adaptability and demonstrate its potential for real-world applications in a variety of industries, including business, healthcare, and education. The findings demonstrate how LLMs can offer customized experiences that raise system effectiveness and user retention. This study also examines the difficulties this framework aims to solve, highlighting its importance in maximizing involvement and encouraging sustained behavioral change in a range of sectors.
We explore the automatic generation of interactive, scenario-based lessons designed to train novice human tutors who teach middle school mathematics online. Employing prompt engineering through a Retrieval-Augmented Generation approach with GPT-4o, we developed a system capable of creating structured tutor training lessons. Our study generated lessons in English for three key topics: Encouraging Students' Independence, Encouraging Help-Seeking Behavior, and Turning on Cameras, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks. The generated lessons were evaluated by two human evaluators, who provided both quantitative and qualitative evaluations using a comprehensive rubric informed by lesson design research. Results demonstrate that the task decomposition strategy led to higher-rated lessons compared to single-step generation. Human evaluators identified several strengths in the LLM-generated lessons, including well-structured content and time-saving potential, while also noting limitations such as generic feedback and a lack of clarity in some instructional sections. These findings underscore the potential of hybrid human-AI approaches for generating effective lessons in tutor training.
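A minimal sketch of the task decomposition idea: lesson generation is broken into ordered sub-tasks, with each prompt consuming the previous step's output. The `generate` function is a stub standing in for a GPT-4o call with retrieved training material, and the sub-task breakdown and wording are illustrative, not the authors' exact pipeline.

```python
# Task decomposition prompting: chain sub-task prompts instead of one big prompt.
def generate(prompt: str) -> str:
    """Stub for an LLM call (e.g., GPT-4o with retrieved tutor-training context)."""
    return f"[LLM output for: {prompt.splitlines()[0][:60]}...]"

topic = "Encouraging Students' Independence"

# Sub-task 1: objective and outline
outline = generate(f"Draft a learning objective and outline for a tutor-training lesson on '{topic}'.")
# Sub-task 2: scenario grounded in the outline
scenario = generate(f"Using this outline, write a realistic tutoring scenario with a student message:\n{outline}")
# Sub-task 3: assessment items about the tutor's response
questions = generate(f"Write two open-response questions and one MCQ about how the tutor should respond:\n{scenario}")
# Sub-task 4: feedback for correct and incorrect answers
feedback = generate(f"For each question, write research-based feedback for correct and incorrect answers:\n{questions}")

lesson = "\n\n".join([outline, scenario, questions, feedback])
print(lesson)
```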
STEM education, particularly programming and coding, is of great importance in today's technological landscape. Turtle graphics, an effective tool for teaching programming concepts to children, is widely used in languages such as Python, known for its simplicity and readability. However, coding can be challenging for young learners, necessitating individualized support from teachers. Large language models (LLMs), which are already employed in debugging, present an opportunity to enhance educational support systems by providing personalized hints without revealing answers, thus preserving the educational value. This proposal aims to explore the use of LLMs to generate tailored hints and explanations for different age groups and skill levels, creating a dynamic and responsive learning environment. Additionally, the proposed system includes task creation that adapts to the student's previous performance and completed tasks, ensuring continuous and appropriately challenging learning experiences. The goal of our research is to design a support system that leverages LLM technology to improve children and young students' learning in Python Turtle graphics. This system promises personalized educational support and adaptive task generation, enhancing the overall learning experience for young programmers. Future studies are necessary to test this system with real users, evaluate its effectiveness, and refine its design based on practical feedback.
Involving subject matter experts in prompt engineering can guide LLM outputs toward more helpful, accurate, and tailored content that meets the diverse needs of different domains. However, iterating towards effective prompts can be challenging without adequate interface support for systematic experimentation within specific task contexts. In this work, we introduce PromptHive, a collaborative interface for prompt authoring designed to better connect domain knowledge with prompt engineering through features that encourage rapid iteration on prompt variations. We conducted an evaluation study with ten subject matter experts in math and validated our design through two collaborative prompt writing sessions and a learning gain study with 358 learners. Our results elucidate the prompt iteration process and validate the tool’s usability, enabling non-AI experts to craft prompts that generate content comparable to human-authored materials while reducing perceived cognitive load by half and shortening the authoring process from several months to just a few hours.
With the rapid development of artificial intelligence, Large Language Models (LLMs) such as ChatGPT have demonstrated strong capabilities in natural language understanding and generation, providing new possibilities for innovative teaching in higher education. This study explores the integration of LLMs into task-based English teaching to enhance students' language competence through interactive, meaningful, and contextualized learning activities. After analyzing the theoretical foundation of Task-Based Language Teaching (TBLT) and the pedagogical affordances of LLMs, the system design incorporates a modular pipeline consisting of a prompt pre-processor, an LLM-based task response engine, and an adaptive feedback module, allowing for seamless integration into existing teaching platforms. The system was deployed experimentally in two undergraduate English courses, with one group using the LLM-enhanced system and a control group relying on conventional task-based instruction. Quantitative results show that the experimental group outperformed the control group in Task Performance Score and Learner Engagement Index with statistical significance. Furthermore, qualitative feedback from learners and instructors indicates increased engagement and confidence in linguistic creativity. The results suggest that LLMs support learner autonomy and engagement and improve linguistic accuracy and fluency; the study also discusses over-reliance on AI and the teacher's evolving role, and offers suggestions for the future integration of LLMs in higher education English pedagogy.
No abstract available
Large Language Models (LLMs) have revolutionized the way natural language tasks are handled, with big potential applications in the context of education. LLMs can save educators time and effort, for instance, in content creation and exam generation. Although promising, LLMs' integration into educational products brings some risks that companies must mitigate. In the context of an industrial project, we investigate the effectiveness of LLMs to generate educational multiple-choice questions. The experiments include 16 commercial and open-source LLMs, rely on standard metrics to assess the accuracy (F1 and BLEU) and linguistic quality (perplexity and diversity) of the generated questions, and compare with five specialized models. The results suggest that recent LLMs can outperform the fine-tuned models for question generation, that open-source LLMs are very competitive with the commercial ones, with Meta Llama models being the best performing, and that DeepSeek performs on par with recent GPT-4 models. This promising empirical evidence encourages us to focus on advanced prompting strategies, for which we report relevant open challenges we aim to address in the short term.
This paper addresses the challenge of improving interaction quality in dialogue-based learning by detecting and recommending effective pedagogical strategies in tutor-student conversations. We introduce PedagoSense, a pedagogy-grounded system that combines a two-stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine-grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human-annotated tutor-student dialogues, augmented with additional non-pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine-grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM-based response generation for more adaptive educational technologies.
One key challenge for instructors is creating high-quality educational content, such as programming practice questions for introductory programming courses. While Large Language Models (LLMs) show promise for this task, their output quality can be inconsistent, and it is often unclear how to systematically improve their performance. In this experience report, we present the development process for ContentGen, an open-source tool that generates programming questions within the context of data science instructional materials. We describe our process of designing the tool and iteratively improving the tool through prompt engineering. To evaluate our changes, we designed and open-sourced a dataset of 91 test cases based on our course materials and developed three metrics to assess the generated questions: Correctness, Contextual Fit, and Coherence. We compare three prompting strategies and find that providing detailed instructions and an automatically generated summary of recently covered instructional materials to the LLM substantially improves the quality of the generated questions across our metrics. A usability study with six data science instructors further suggests that our final prototype is perceived as usable and effective. Our work contributes a case study of evidence-based prompt engineering for an educational tool and offers a practical approach for instructors and tool designers to evaluate and enhance LLM-based content generation.
Recent studies [48, 72] have demonstrated that Large Language Models (LLMs), like ChatGPT [3, 46] and LLAMA [59], can assist with routine teaching tasks and have the potential to revolutionize traditional education. However, other studies [35] highlight that LLMs often contain inaccuracies and demonstrate limited effectiveness in educational contexts. To address this issue, we propose a unified Education LLM Framework that integrates LLM into classroom teaching practice to enrich high-quality dialogical content and teacher-student interactions. Unlike complex data-driven models that require vast amounts of data, our framework can quickly enhance educational engagement and teaching strategies by utilizing a few carefully selected teaching examples from master teachers with our prompting techniques. We focus on two typical classroom teaching scenarios that require AI-generated content: Dialogue Completion and Expertise Transfer Learning. The former scenario requires generating contextually appropriate dialogues, while the latter scenario requires migrating the instructional styles and organization to new teaching topics. We demonstrate the effectiveness of our data quality-centered approach in generating semantically clear and factually accurate content as organized instructions for teaching materials. We comprehensively evaluate these materials by utilizing Perplexity-based Statistical Evaluation, Human Evaluation with Questionnaires, BertScore, Rouge, and BLEU. Experiments on two self-collected datasets show that our method significantly improves various metrics in Dialogue Completion and Expertise Transfer Learning tasks, enhancing the overall utility of AI for educational purposes.
In-IDE learning became a popular approach, integrating programming education with professional development tools in a seamless environment. Kotlin Notebook extends this concept by enabling highly interactive lessons within an industrial IDE while leveraging its capabilities, such as code quality inspections or refactorings. Kotlin Notebook structures programming content into interactive sections, enhancing both engagement and comprehension. This talk explores the combination of in-IDE learning and Kotlin Notebook with the integration of LLMs to create a powerful tool for interactive learning within an industrial-grade IDE. We propose a method for automatically generating exercises, visual materials, and contextual explanations directly within Kotlin Notebook. This approach not only streamlines lesson creation but also allows students to stay within the IDE and interact with its professional features. Additionally, previous research has shown that integrating LLMs with IDE functionality can enhance the quality and control of LLM outputs through static analysis and validation. This combination represents a novel and scalable approach to improving programming education and interactive learning experiences.
This paper presents a novel multimodal quiz generation framework that integrates audio, visual, and textual data using a Retrieval-Augmented Generation (RAG) architecture. The system leverages LLaVA for vision-language understanding and LLaMA 3.1 for text generation to produce contextually relevant and pedagogically meaningful multiple-choice questions (MCQs) from lecture videos. This approach addresses key limitations of traditional text-only quiz generation models by capturing richer, multimodal information. The system was tested on a real-world use case, generating 15 MCQs from the first lecture in an introductory computer science course. To evaluate the effectiveness of the generated quizzes, we designed a two-stage evaluation framework. In the first stage, we assessed retrieval and generation performance using standard metrics such as Hit Rate, Mean Reciprocal Rank (MRR), Correctness, Relevance, and Faithfulness. In the second stage, we examined how closely AI evaluations align with human expert judgments. We involved four human raters and three LLM-as-Judge models—Claude 3 Sonnet, GPT-4, and LLaMA 3.1—to evaluate each question. To analyze agreement, we used Percentage Agreement, Cohen's Kappa, Spearman's Rho, and Krippendorff's Alpha, capturing both exact matches and ordinal consistency. Our results show high retrieval accuracy and reasonable alignment between LLM based and human assessments, particularly in factual and procedural questions. However, discrepancies emerged in questions requiring deeper reasoning or visual interpretation, where human raters exhibited stronger consistency. These findings highlight the strengths of LLMs in scalable content generation, while reinforcing the need for human oversight in evaluating complex educational tasks. This work takes a significant step toward more human-aligned and effective AI-driven assessment systems.
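The second evaluation stage boils down to standard inter-rater agreement statistics. The sketch below compares one human rater's quality ratings with one LLM-as-Judge's ratings on 15 generated MCQs using percentage agreement, Cohen's kappa, and Spearman's rho; the ratings are fabricated for illustration only.

```python
# Agreement between a human rater and an LLM-as-Judge on per-question ratings.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

human = np.array([5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 5, 4, 3, 5, 4])       # 1-5 ratings, one per MCQ
llm_judge = np.array([5, 4, 5, 3, 4, 2, 4, 5, 3, 4, 4, 4, 3, 5, 5])

agreement = (human == llm_judge).mean()          # percentage agreement (exact matches)
kappa = cohen_kappa_score(human, llm_judge)      # chance-corrected categorical agreement
rho, _ = spearmanr(human, llm_judge)             # ordinal consistency

print(f"Agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}, Spearman's rho: {rho:.2f}")
```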
This article describes new results of an application using transformer-based language models to automated item generation (AIG), an area of ongoing interest in the domain of certification testing as well as in educational measurement and psychological testing. OpenAI's gpt2 pre-trained 345M parameter language model was retrained using the public domain text mining set of PubMed articles and subsequently used to generate item stems (case vignettes) as well as distractor proposals for multiple-choice items. This case study shows promise and produces draft text that can be used by human item writers as input for authoring. Future experiments with more recent transformer models (such as Grover, TransformerXL) using existing item pools are expected to improve results further and to facilitate the development of assessment materials.
ChatGPT is gaining widespread acceptance in many disciplines since its launch at the end of 2022. The impact of ChatGPT on education is evident, but there is a dearth of knowledge on how English as a Foreign Language (EFL) teachers benefit from this technology. Therefore, this study investigates the use of ChatGPT to generate exam questions among EFL educators in Saudi Arabia. Through a mixed-methods approach that included an online questionnaire and an experimental design, the study attempted to gain insights from educators on using artificial intelligence (AI) technology for assessment. An online questionnaire was shared with 200 public school EFL teachers at various grade levels in the Eastern Province of Saudi Arabia. The findings revealed a varied landscape of perspectives, with some educators approving ChatGPT’s efficiency in generating exam questions, whereas others expressed concerns about its limited application. A further examination of the instructor-designed and ChatGPT-generated test items revealed that ChatGPT has the potential to stimulate critical thinking and expand assessment formats. The results indicate that educators require professional development to leverage AI technology responsibly. Furthermore, this study highlights the importance of navigating the emerging ChatGPT in EFL classrooms to ensure reliability and consistency of the evaluation process.
The integration of Artificial Intelligence (AI) technologies has initiated a new era in language assessment practices, revolutionizing the field with its innovative approaches. This study introduces an advanced Automated Item Generation (AIG) system that utilizes word families as a foundation to automatically generate test items. The primary objective of this research is to investigate the effectiveness of the AIG system in producing high-quality questions through a comprehensive evaluation that combines both quantitative and qualitative measures. The AIG system is developed using cutting-edge machine learning and deep learning techniques, enabling it to enhance and facilitate the language assessment process by generating a substantial number of items. To assess the quality of the generated questions, a group of 30 experienced English teachers participated in the evaluation process. The participants assessed the quality of multiple-choice and fill-in-the-blank questions generated by the AIG system using a 4-point scale. To supplement the quantitative analysis, interviews were conducted to capture the perspectives of the teachers concerning the integration of AIG in language assessment. The findings demonstrate highly promising outcomes in terms of question quality, validating the efficacy of employing word families as a linguistic basis for generating test items. By shedding light on the advantages and effectiveness of utilizing word families as a fundamental lexical unit for AIG, this study contributes to the field of automated item generation in language assessment.
Given the increasing interest in automated item generation in the second language assessment field, this study investigated the potential of two automated item generators for L2 reading assessment. The first generator, KR-Item-Generator, was developed by the authors, who used a free chatbot builder. The second, Q-Craft, was developed using GPT-4 API and employs an all-in-one method to generate questions and passages. A total of 83 pre-service teachers at a college of education in South Korea were asked to generate English reading passages and test items using both generators. They were then given a post-task survey on varying aspects of the two generators. The results of the study demonstrated that both generators were positively perceived regarding the naturalness of the sentences in the passages and the level of completion of the test items, although Q-Craft was rated significantly more positively in terms of the latter. Given these findings, we discuss the pedagogical implications and offer key directives for further L2 AIG research.
The use of generative AI, specifically large language models (LLMs), in test development presents an innovative approach to efficiently creating technical, knowledge‐based assessment items. This study evaluates the efficacy of AI‐generated items compared to human‐authored counterparts within the context of employee selection testing, focusing on data science knowledge areas. Through a paired comparison approach, subject matter experts (SMEs) were asked to evaluate items produced by both LLMs and human item writers. Findings revealed a significant preference for LLM‐generated items, particularly in specific knowledge domains such as Statistical Foundations and Scientific Data Analysis. However, despite the promise of generative AI in accelerating item development, human review remains critical. Issues such as multiple correct answers or ineffective distractors in AI‐generated items necessitate thorough SME review and revision to ensure quality and validity. The study highlights the potential of integrating AI with human expertise to enhance the efficiency of item generation while maintaining psychometric standards in high‐stakes environments. The implications for psychometric practice and the necessity of domain‐specific validation are discussed, offering a framework for future research and application of AI in test development.
Open-ended assessment items require students to freely articulate their thinking as opposed to, for instance, multiple choice questions. Such free generation of answers by students enables what we may call true assessment because these answers offer a direct view of learners' mental models. Nevertheless, assessing open-ended learner responses is extremely challenging, e.g., if done manually by experts it becomes prohibitively expensive to scale up to millions of learners. To address this scalability challenge, automated methods to assess students' free responses are being explored. To this end, we present a novel solution to automatically assess open-ended learner responses based on recent advances in computational linguistics and optimization algorithms. Our proposed solution accounts for linguistic phenomena such as anaphora resolution and negation in order to reach a deeper level of semantic interpretation of student answers. This is a key advantage compared to previous methods that focus primarily on distributional semantic representations of texts. Furthermore, our method provides both a holistic score as well as a detailed explanation of the score by performing a concept-level analysis of student responses. We present results obtained with the proposed method on a dataset that is widely used to evaluate automated methods for assessing open-ended learner responses. The results indicate that our method is extremely competitive or surpasses the performance of previously proposed methods. Furthermore, by being able to pick up on concepts students have yet to articulate, it enables the development of more personalized and dynamic generation of feedback in intelligent tutoring systems.
Educational assessment is an essential task within the educational process, and generating correct, well-formed assessment content is a decisive step within it. Creating an automated generation method that performs comparably to an experienced human operator (teacher) involves a complex series of issues. This paper presents a compiled set of methods and tools used to generate educational assessment content in the form of assessment tests. The methods use various structures (e.g., trees, chromosomes and genes, and genetic operators) and algorithms (graph-based, evolutionary, and genetic) for the automated generation of educational assessment tests. The research is developed in the context of several simultaneous requirements (e.g., degree of difficulty, item topic), which adds further complexity to the problem. The paper presents a short literature review, followed by a description of the models developed in the authors' previous research. The final part reports the implementations of these models, along with results and performance. Several conclusions were drawn from this compilation, the most important being that tree- and genetic-based approaches yield promising results for both performance and assessment content generation.
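The following is a minimal, self-contained sketch of the chromosome-and-genetic-operator idea the abstract describes, assuming a hypothetical item pool, test length, target difficulty, and fitness weighting; the authors' actual structures and operators are more elaborate.

```python
# Toy genetic algorithm for test assembly: a chromosome is a set of item indices,
# and fitness rewards matching a target mean difficulty while covering topics.
import random

ITEM_POOL = [{"difficulty": random.uniform(0.2, 0.9), "topic": random.choice("ABC")} for _ in range(60)]
TEST_LEN, TARGET_DIFFICULTY, REQUIRED_TOPICS = 10, 0.6, {"A", "B", "C"}

def fitness(chromosome):
    items = [ITEM_POOL[i] for i in chromosome]
    mean_diff = sum(it["difficulty"] for it in items) / len(items)
    topic_coverage = len({it["topic"] for it in items} & REQUIRED_TOPICS) / len(REQUIRED_TOPICS)
    return -abs(mean_diff - TARGET_DIFFICULTY) + topic_coverage  # higher is better

def crossover(a, b):
    cut = random.randrange(1, TEST_LEN)
    return list(dict.fromkeys(a[:cut] + b))[:TEST_LEN]  # keep unique items, fixed length

def mutate(chromosome, rate=0.1):
    chromosome = list(chromosome)
    for pos in range(len(chromosome)):
        if random.random() < rate:
            unused = [j for j in range(len(ITEM_POOL)) if j not in chromosome]
            chromosome[pos] = random.choice(unused)
    return chromosome

population = [random.sample(range(len(ITEM_POOL)), TEST_LEN) for _ in range(40)]
for _ in range(100):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(crossover(*random.sample(parents, 2))) for _ in range(30)]

best = max(population, key=fitness)
print("Best test items:", best, "fitness:", round(fitness(best), 3))
```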
High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. This method offers a scalable, pre-deployment evaluation without requiring student data, but its predictive validity concerning empirical IRT parameters is underexplored. To address this gap, we conducted a study involving 7,126 multiple-choice questions across various STEM subjects (physical science, mathematics, and life/earth sciences). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life/earth and physical science domains. We further observed how specific IWF criteria can impact item quality more and less severely (e.g., negative wording vs. implausible distractors) and how they might make a question more or less challenging. Overall, our findings establish automated IWF analysis as a valuable supplement to traditional validation, providing an efficient method for initial item screening, particularly for flagging low-difficulty MCQs. Our findings show the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.
Practice tests for high-stakes assessment are intended to build test familiarity and reduce construct-irrelevant variance, which can interfere with valid score interpretation. Generative AI-driven, automated item generation (AIG) scales the creation of large item banks and multiple practice tests, enabling repeated practice opportunities. We conducted a large-scale observational study (N = 25,969) using the Duolingo English Test (DET) -- a digital, high-stakes, computer-adaptive English language proficiency test -- to examine how increased access to repeated test practice relates to official DET scores, test-taker affect (e.g., confidence), and score-sharing for university admissions. To our knowledge, this is the first large-scale study exploring the use of AIG-enabled practice tests in high-stakes language assessment. Results showed that taking 1-3 practice tests was associated with better performance (scores), positive affect (e.g., confidence) toward the official DET, and increased likelihood of sharing scores for university admissions for those who also expressed positive affect. Taking more than 3 practice tests was related to lower performance, potentially reflecting washback -- i.e., using the practice test for purposes other than test familiarity, such as language learning or developing test-taking strategies. Findings can inform best practices regarding AI-supported test readiness. Study findings also raise new questions about test-taker preparation behaviors and relationships to test-taker performance, affect, and behavioral outcomes.
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
VocQGen is an automated tool designed to generate multiple-choice cloze (MCC) questions for vocabulary assessment in second language learning contexts. It leverages several natural language processing (NLP) tools and OpenAI’s GPT-4 model to produce MCC items quickly from user-specified word lists. To evaluate its effectiveness, we used the first sublist in the Academic Word List (AWL) to generate 60 questions with VocQGen. Then we compared the quality of 60 autogenerated questions with 40 manually created ones through expert reviews and through pilot testing with 68 students. Expert review results indicate that automatically generated questions exhibit higher grammatical accuracy and clearer contexts in question stems. However, the tool occasionally produces distractors that are acceptable as correct responses. Pilot testing results show that in general the number of correct responses is higher in autogenerated questions, indicating the less challenging nature of these questions. The study concludes that manual check is still required for questions generated by VocQGen and future work should focus on improving distractor effectiveness.
This descriptive study scrutinizes the impact of ChatGPT on English Language Teaching (ELT) assessment and examines the extent to which it presents both opportunities and threats. A systematic review covering 963 academic publications published between December 2023 and December 2024 was carried out to identify these opportunities and threats. Of the 963 publications, the 150 most relevant articles were selected to address the use of ChatGPT in online language assessment in ELT. Document analysis and thematic coding were used, and 12 recurring themes were identified: six opportunities and six threats. The six opportunities were automated grading, personalized feedback, practice partner simulation in speaking and writing, assessment item generation, engagement and motivation, and multimodal & inclusive assessment; the six threats were academic dishonesty, validity & reliability concerns, algorithmic bias, overdependence & de-skilling, data privacy & institutional gaps, and adaptability. The study concludes that ChatGPT's utilization in ELT assessment is both promising and problematic. The findings suggest this duality could be addressed through pedagogical guidelines, interdisciplinary collaboration, curriculum calibration, and ethical frameworks that harness its potential while safeguarding educational integrity.
We present an empirical study evaluating the quality of multiple-choice questions (MCQs) generated by Large Language Models (LLMs) from a corpus of video transcripts of course lectures in an online data science degree program. With our database of thousands of generated questions, we conducted both human and automated judging of question quality on a representative sample using a broad set of criteria, including well-established Item Writing Flaw (IWF) categories. We found the number of average IWFs per MCQ ranged from 1.6 (rule-based verification) to 2.18 (LLM-based). Among the most frequently identified MCQ flaws were lack of enough context (17%) or answer choices with at least one implausible distractor (57%). Both human and automated assessment identified implausible distractors as one of the most frequent flaw categories. Results from our human annotation study were generally more positive (51--65% good items) compared to our automated assessment study results, which tended toward greater flaw identification (15--25% good items), depending on evaluation method.
Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best solution generated by the LLM. The process is then refined through iterative review and response with another LLM until the evaluation metric values converge, yielding more reliable quality estimates while fully automating the evaluation task. Correlation scores show that the proposed method improves correlation with human judgments compared to the baseline method. Error analysis shows that metrics such as relevance and appropriateness improve significantly relative to human judgments when using STRIVE.
Personalized education systems increasingly rely on structured knowledge representations to support adaptive learning and question generation. However, existing approaches face two fundamental limitations. First, constructing and maintaining knowledge graphs for educational content largely depends on manual curation, resulting in high cost and poor scalability. Second, most personalized education systems lack effective support for state-aware and systematic reasoning over learners' knowledge, and therefore rely on static question banks with limited adaptability. To address these challenges, this paper proposes a Generative GraphRAG framework for automated knowledge modeling and personalized exercise generation. It consists of two core modules. The first module, Automated Hierarchical Knowledge Graph Constructor (Auto-HKG), leverages LLMs to automatically construct hierarchical knowledge graphs that capture structured concepts and their semantic relations from educational resources. The second module, Cognitive GraphRAG (CG-RAG), performs graph-based reasoning over a learner mastery graph and combines it with retrieval-augmented generation to produce personalized exercises that adapt to individual learning states. The proposed framework has been deployed in real-world educational scenarios, where it receives favorable user feedback, suggesting its potential to support practical personalized education systems.
As Large Language Model (LLM) chatbots have become increasingly accessible, their misuse for academic dishonesty has raised growing concern. Current methods that attempt to detect LLM-generated text are unreliable and risk producing false positives, which can unfairly harm genuine students. This paper offers an alternative by developing an “inoculation” process that generates paraphrased questions to find semantically similar ones that LLMs answer incorrectly. We use Llama 3.2 3B to create and evaluate paraphrases for MMLU questions, then test GPT-4o mini on them to identify effective inoculated questions. The approach successfully finds inoculations for 26.7% of correctly answered questions, requiring review of no more than 20 paraphrases per question, and exposes weaknesses in the target LLM's responses at low cost.
Focusing on virtual experiment teaching, this paper proposes a personalized learning closed-loop with an LLM as the core. A simulation engine provides a verifiable factual baseline, while the LLM undertakes semantic interpretation, two-phase pathway generation (skeleton-verification-refinement), fact-grounded judgement and feedback, and explanatory summarization. To enhance robustness and compliance, the framework employs retrieval-augmented generation (RAG), structured outputs, and a second-pass verifier as guardrails. At the learner-modeling layer, we fuse LLM semantic increments with BKT/IRT steady estimates to obtain a fine-grained yet stable representation that drives adaptive replanning. The engineering design covers windowed reporting and fact checks, an orchestration service with template interfaces, result caching and tiered inference (small model first), minimal-necessary data collection with anonymization, and classroom-oriented batching and rate limiting. Although large-scale evaluation remains for future work, the framework connects the key chain "interpretation → modeling → path → judgement → explanation," demonstrating interpretability, controllability, and deployment feasibility.
Activities that engage learners to articulate their answers often make them reflect. However, evaluating such activities and providing feedback is time-consuming for teachers. For text analysis, various data-driven indicators, such as cohesion and coherence, evaluate linguistic measures and the semantic understanding of artefacts created. However, for drawing-based activities, defining such indicators is still underexplored. In this research, we conducted a draw-and-write activity that engaged students to express their understanding of a concept through writing and drawing. The question was, “What is data science?”. The human raters analyzed the artefacts generated (n=40), and then a learning analytics approach was taken to define data-driven indicators. The study proposes a data processing pipeline involving a large language model (LLM) and defines indicators to understand the coherence of written text and drawn diagrams. Further, a clustering analysis of the collected artefacts highlighted differences in the participants' expressions of data science (task context). The discussion compares automated and human classification and its implications for assessment and feedback. Future work aims to integrate the pipeline in an online learning environment that affords drawing and text input from the learners.
The IEEE P2807.6 Education Knowledge Graph (EduKG) standard defines a semantic infrastructure to represent educational knowledge, resources, and pedagogy in a unified graph format. This paper expands on the core EduKG architecture, detailing its ontology design and key entities (Learning Points, Resource Items, and Pedagogical Rules) that collectively model the domain, content, and instructional strategies of learning systems. We further explore how EduKG can be integrated with advanced AI technologies, including large language models (LLMs) and retrieval-augmented generation (Graph-RAG) via embedding databases, to enable intelligent behavior such as semantic search, question answering, and dynamic content generation. These integrations position EduKG as a central component in next-generation smart education systems, wherein knowledge graphs work in concert with intelligent agents and adaptive instructional systems to deliver fully automated, personalized, and interactive learning experiences. By leveraging the standardized graph-structured representation and semantic reasoning capabilities of EduKG, such systems can achieve interoperability across platforms and support complex AI-driven tutoring and training scenarios. This work provides a comprehensive overview of the EduKG framework and highlights its role in empowering adaptive, cognitive, and collaborative learning solutions for the future of digital education.
Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
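A hedged sketch of the dual-loss training strategy described above, written against the classic sentence-transformers fit API; the base model, example pairs, labels, and output path are assumptions, not the study's data or configuration.

```python
# Sketch: one objective with MultipleNegativesRankingLoss on (question, syllabus sentence)
# pairs and one with CosineSimilarityLoss on labelled similarity pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical training pairs (the study used 3,197 curated/LLM-assisted pairs).
mnrl_examples = [
    InputExample(texts=["Who teaches the course?", "Instructor: Dr. Smith, office ENG-210"]),
    InputExample(texts=["When is the midterm?", "The midterm exam is held in week 8"]),
]
cos_examples = [
    InputExample(texts=["Who is the TA?", "Teaching assistant: J. Doe"], label=0.9),
    InputExample(texts=["Who is the TA?", "Late submissions lose 10% per day"], label=0.1),
]

mnrl_loader = DataLoader(mnrl_examples, shuffle=True, batch_size=2)
cos_loader = DataLoader(cos_examples, shuffle=True, batch_size=2)

model.fit(
    train_objectives=[
        (mnrl_loader, losses.MultipleNegativesRankingLoss(model)),  # ranking objective
        (cos_loader, losses.CosineSimilarityLoss(model)),           # similarity calibration
    ],
    epochs=1,
    warmup_steps=10,
)
model.save("edu-syllabus-embedder")  # hypothetical output path
```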
Pedagogical questions are crucial for fostering student engagement and learning. In daily teaching, teachers pose hundreds of questions to assess understanding, enhance learning outcomes, and facilitate the transfer of theory-rich content. However, even experienced teachers often struggle to generate a large volume of effective pedagogical questions. To address this, we introduce TutorCraftEase, an interactive generation system that leverages large language models (LLMs) to assist teachers in creating pedagogical questions. TutorCraftEase enables the rapid generation of questions at varying difficulty levels with a single click, while also allowing for manual review and refinement. In a comparative user study with 39 participants, we evaluated TutorCraftEase against a traditional manual authoring tool and a basic LLM tool. The results show that TutorCraftEase can generate pedagogical questions comparable in quality to those created by experienced teachers, while significantly reducing their workload and time.
This study explores the optimization of Automated Question Generation (AQG) for educational assessments using Large Language Models (LLMs) and ontologies. Three approaches are evaluated: template-based structured ontology question generation, LLM-based structured ontology question generation, and LLM-based flat concept list question generation, using BERT Precision, Recall, F1-score, and Semantic Similarity as performance metrics. The results show that: i) the template-based structured ontology approach achieved a BERT Precision of 0.833, Recall of 0.844, and F1-score of 0.838, with a Semantic Similarity of 0.563, ii) the LLM-based structured ontology method showed improvements with a BERT Precision of 0.856, Recall of 0.863, and F1-score of 0.859, but a lower Semantic Similarity of 0.534, and iii) the LLM-based flat concept list approach provided the best results, achieving BERT Precision, Recall, and F1-score of 0.859, along with the highest Semantic Similarity of 0.567. Despite the higher semantic similarity of the LLM-based flat concept list, qualitative analysis revealed that the unstructured ontology sometimes produced hallucinated or unrelated questions. These findings suggest that LLM-based methods provide a balance of relevance and diversity in question generation, with the LLM-based flat concept list offering the best results for question generation, while the LLM-based structured ontology approach strikes a balance between Precision and Recall.
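For readers who want to reproduce the metrics used above, this is a minimal sketch of computing BERT Precision/Recall/F1 with the bert-score package and an embedding-based Semantic Similarity with sentence-transformers; the generated and reference questions are hypothetical, and the study's exact evaluation setup may differ.

```python
# Sketch: BERTScore and embedding similarity on hypothetical question pairs.
from bert_score import score
from sentence_transformers import SentenceTransformer, util

generated = ["What is the role of the mitochondria in a cell?"]
reference = ["Explain the function of mitochondria within eukaryotic cells."]

P, R, F1 = score(generated, reference, lang="en")
print(f"BERT Precision={P.mean().item():.3f} "
      f"Recall={R.mean().item():.3f} F1={F1.mean().item():.3f}")

embedder = SentenceTransformer("all-MiniLM-L6-v2")
sim = util.cos_sim(embedder.encode(generated), embedder.encode(reference))
print(f"Semantic similarity={sim.item():.3f}")
```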
Motivation and Background. Many K–12 students struggle with programming concepts. While LLMs offer scalable, timely support, overly direct answers can reduce reasoning and engagement [8], prompting the question: How can LLMs support learning without encouraging overreliance? In our study with 105 students, 31.4% showed misconceptions about variable assignment and data types, and in another survey, only 20% correctly solved conditional problems. This highlights the need for scaffolding to address conceptual gaps in K–12 programming. To address these gaps, we designed an answer-aware hint generation system using LLMs to support learning without reducing cognitive demand. We developed the system for CodeKids—an open-source, curriculum-aligned platform built with Virginia Tech and local public schools. It helps students practice grade-level programming through interactive activities, using LLM-generated hints to guide thinking without revealing answers [1, 11]. Based on Vygotsky’s Zone of Proximal Development [12], our approach balances support and autonomy through structured prompting that preserves productive struggle. Methodology. Building on research showing that machine learning supports K–12 learners without compromising cognitive development [15], we implemented a mindful answer-aware prompting approach [5, 7] grounded in two principles. The first principle, cognitive scaffolding, draws from ZPD and ITS research [10, 12], and ensures hints progress from general to specific while preserving learner autonomy. The second principle, technical safeguards, applies semantic similarity thresholds and constraint-based prompting to prevent answer leakage [13]. The system is deployed across 12 advanced CodeKids books covering core topics like variables, data types, conditionals, loops, and logical operators. Hints are concise, pedagogically sound, and generated by GPT-4 when students request help or load a page. Each request includes the topic, question, answer choices, and correct answer sent to the LLM, enabling context-aware adaptation to the activity and content. Our prompt design constrains hints to one sentence, emphasizes conceptual clarity, and gradually increases specificity to preserve student agency. This aligns with research on scaffold types—such as sense-making, elaboration, and motivational cues—that support self-regulated learning [9]. To support diverse learners, the system includes text-to-speech for reading hints aloud. Our approach combines learning sciences and prompt engineering to foster scalable support, student agency, and conceptual understanding. Evaluation. We evaluated semantic hint alignment using sentence embeddings: 98.1% of hints scored ≥ 0.30 in content alignment and 44.2% ≥ 0.20 in answer alignment, indicating strong relevance with minimal over-reliance. GPT-4, used as an LLM-as-a-judge due to its > 85% agreement with human ratings [14], gave an average score of 0.958 for hints on convergence, pedagogical value, and context. Combining LLM and cosine scores (0.7/0.3), we computed a Hint Quality Score of 0.749 [3]. To assess real-world impact, we developed surveys to collect feedback on clarity, usefulness, and learning [4]. Ongoing Work and Vision. We are investigating hint convergence across LLMs (e.g., Claude 3, Gemini 1.5 Pro) and exploring alternative prompting strategies to improve diversity. 
Future work includes personalizing hints through difficulty adaptation and embedding-based models for curriculum-aligned scaffolding [6], reducing reliance on proprietary LLMs, and incorporating retrieval-augmented generation (RAG) for contextualization [2].
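A minimal sketch of the alignment checks and the 0.7/0.3 weighted Hint Quality Score described in the evaluation above; the hint, activity content, answer string, leakage threshold, and placeholder judge score are all assumptions for illustration.

```python
# Sketch: cosine-based alignment checks plus a weighted Hint Quality Score.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

hint = "Think about which value the variable holds after the second assignment."
content = "This activity practices variable assignment and data types."
answer = "x = 7"

content_alignment = util.cos_sim(embedder.encode(hint), embedder.encode(content)).item()
answer_alignment = util.cos_sim(embedder.encode(hint), embedder.encode(answer)).item()

# Guardrail: flag hints that look too similar to the answer (threshold is an assumption).
leaks_answer = answer_alignment > 0.8

llm_judge_score = 0.95  # placeholder for a 0-1 rating returned by an LLM-as-a-judge
hint_quality = 0.7 * llm_judge_score + 0.3 * content_alignment
print(f"content={content_alignment:.2f} answer={answer_alignment:.2f} "
      f"leak={leaks_answer} HQS={hint_quality:.2f}")
```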
Background. Crafting quality assessment questions in medical education is a crucial yet time-consuming, expertise-driven undertaking that calls for innovative solutions. Large language models (LLMs), such as ChatGPT (Chat Generative Pre-Trained Transformer), present a promising yet underexplored avenue for such innovations. Aims. This study explores the utility of ChatGPT to generate diverse, high-quality medical questions, focusing on multiple-choice questions (MCQs) as an illustrative example, to increase educators' productivity and enable self-directed learning for students. Description. Leveraging 12 strategies, we demonstrate how ChatGPT can be effectively used to generate assessment questions aligned with Bloom's taxonomy and core knowledge domains while promoting best practices in assessment design. Conclusion. Integrating LLM tools like ChatGPT into generating medical assessment questions like MCQs augments but does not replace human expertise. With continual instruction refinement, AI can produce high-standard questions. Yet, the onus of ensuring ultimate quality and accuracy remains with subject matter experts, affirming the irreplaceable value of human involvement in the artificial intelligence-driven education paradigm.
Purpose. To evaluate the feasibility of using synthetic data generated by large language models for training automated classifiers of text responses in educational and professional testing. Methods. The experiment involved generating 100 response examples using LLMs, followed by text preprocessing (tokenization, stemming, TF-IDF) and training two classification models - logistic regression and RBF network, with subsequent evaluation on a test dataset. Results. The models achieved accuracy of 80% and 65-90% respectively. Systematic limitations were identified: high keywords dependency, insensitivity to semantic inversions, and contextual blindness in classification. Conclusions. The approach shows promise for developing auxiliary assessment tools, though current limitations prevent complete replacement of human evaluators. Further refinement is needed for practical implementation.
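A small sketch of the first classifier described above (TF-IDF features followed by logistic regression) trained on a few hypothetical synthetic responses; the study's actual 100 generated examples, preprocessing steps, and RBF network are not reproduced here.

```python
# Sketch: TF-IDF + logistic regression on toy LLM-generated response examples.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic responses labelled correct (1) or incorrect (0).
texts = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Photosynthesis is when plants breathe in oxygen at night.",
    "Chlorophyll absorbs light to drive the synthesis of glucose from CO2 and water.",
    "Plants eat soil to make their food during photosynthesis.",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Classify a new (toy) response.
print(clf.predict(["Light energy is converted into glucose by the plant."]))
```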
The ability of children to ask curiosity-driven questions is an important skill that helps improve their learning. For this reason, previous research has explored designing specific exercises to train this skill. Several of these studies relied on providing semantic and linguistic cues to train them to ask more of such questions (also called divergent questions). But despite showing pedagogical efficiency, this method is still limited as it relies on generating the said cues by hand, which can be a very long and costly process. In this context, we propose to leverage advances in the natural language processing field (NLP) and investigate the efficiency of using a large language model (LLM) for automating the production of key parts of pedagogical content within a curious question-asking (QA) training. We study generating the said content using the "prompt-based" method that consists of explaining the task to the LLM in natural text. We evaluate the output using human expert annotations and comparisons with hand-generated content. Results indeed suggested the relevance and usefulness of this content. We then conduct a field study in primary school (75 children aged 9–10), where we evaluate children's QA performance when having this training. We compare 3 types of content: 1) hand-generated content that proposes "closed" cues leading to predefined questions; 2) GPT-3-generated content that proposes the same type of cues; 3) GPT-3-generated content that proposes "open" cues leading to several possible questions. Children were assigned to either one of these groups. Based on human annotations of the questions generated, we see a similar QA performance between the two "closed" trainings (showing the scalability of the approach using GPT-3), and a better one for participants with the "open" training. These results suggest the efficiency of using LLMs to support children in generating more curious questions, using a natural language prompting approach that affords usability by teachers and other users who are not specialists in AI techniques. Furthermore, results also show that open-ended content may be more suitable for training curious question-asking skills.
We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system-provided reflective prompts adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing their potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.
Schema Study: A Large Language Model (LLM) Application for Asynchronous Student Learning and Inquiry
Undergraduate biology educators face a critical challenge: providing immediate, personalized formative feedback to increasingly large, diverse classes. Large Language Models (LLMs) offer potential solutions, but open-ended chat interfaces pose challenges including curricular misalignment and equity gaps. We developed Schema Study, a free, no-code, open-source web application where instructors upload course terms and context via a single spreadsheet to create an AI-powered chatbot. Our LLM tutor uses evidence-based teaching practices and Socratic questioning to deepen understanding, correct misconceptions, and encourage students to find connections among course concepts. During Winter 2025, we integrated Schema Study into an introductory biology course, embedding it within structured assignments and updating content weekly. Pre- and post-surveys (N=225) indicated strong student satisfaction; 72% would reuse Schema Study in future biology courses. Each additional day per week students used Schema Study more than doubled the likelihood they would recommend it. Schema Study enhanced students' AI self-efficacy and their belief that AI is relevant to their education and careers. Through iterative, classroom-based refinement, we updated the application based on student feedback, highlighting best practices for integrating LLM chatbots: clear structured messaging, AI literacy training, curricular alignment, and scaffolded active learning opportunities. The tool provides formative practice through question-led dialogue; independent performance is evaluated in secure assessments outside the app. Schema Study offers a scalable, accessible strategy for biology educators to leverage generative AI's benefits while mitigating its risks.
Knowledge graphs (KGs) are a powerful way of representing information for digital humanities. However, non-technical users often struggle at the outset of exploration, a challenge defined as the Initial Exploration Problem. The Tús Maith framework addresses this issue through curated natural language questions and answers (CuQAs) created from Competency Questions (CQs) that aim to convey the scope of a KG and provide meaningful entry points into it. While prior work has explored using large language models (LLMs) for CQ template generation, the template-filling step, where questions and answers are instantiated with entity information, remains a key challenge. In this paper, we evaluate whether LLMs have the capacity to support domain experts in this stage, focusing on the Virtual Record Treasury of Ireland (VRTI) KG, where accuracy, provenance, and robustness are crucial for practical use. Using structured JSON inputs derived from popular search terms and expert-authored templates, we generated and assessed 24,900 question-answer pairs across four LLMs (GPT-5, DeepSeek-V3.1, Gemini 2.0 Flash, Qwen-2.5-72B) under two provenance conditions (basic vs. full). Our evaluation considers slot fidelity, semantic similarity, completeness, hallucination rates, and runtime efficiency, with statistical tests conducted per run per LLM, and additional batch-level analysis (n = 68) to isolate provenance requirement effects. We further show that a lightweight JSON validation check is an effective proxy for ground truth semantic evaluation of factual question-answer pairs. These LLM-generated, validated questions form an intermediate step in the lifecycle from abstract CQ templates to filled-in questions and answers intended to be reviewed and refined by the VRTI KG’s domain experts (historians) to produce the final user-facing questions (CuQAs). To demonstrate the practical impact, we present a prototype (TMv1) of the Tús Maith framework and highlight the design implications for curator-facing interfaces: provenance-transparent interaction, validation-integrated workflows, and performance-transparent model selection.
Background Developmental dysplasia of the hip (DDH) is a common pediatric orthopedic disease, and health education is vital to disease management and rehabilitation. The emergence of large language models (LLMs) has provided new opportunities for health education. However, the effectiveness and applicability of LLMs in education with DDH have not been systematically evaluated. Objective This study conducted an integrated 2-phase evaluation to assess the quality and educational effectiveness of LLM-generated educational materials. Methods This study comprised 2 phases. Based on Bloom's taxonomy, a 16-item DDH question bank was created through literature analysis and collaboration. Four LLMs (ChatGPT-4 [OpenAI], DeepSeek-V3, Gemini 2.0 Flash [Google], and Copilot [Microsoft Corp]) were questioned using standardized prompts. All responses were independently evaluated by 5 pediatric orthopedic experts using 5-point Likert measures of accuracy, fluency, and richness, the scales of Patient Education Materials Assessment Tool for Printable Materials, and DISCERN. The readability was measured by a formula. The data were examined using Kruskal-Wallis tests, ANOVA, and post hoc comparisons. In phase 2, an assessor-blinded, 2-arm pilot randomized controlled trial was conducted. A total of 127 caregivers were randomized into an LLM-assisted education group or a web search control group. The intervention included structured LLM training, supervised practice, and 2 weeks of reinforcement training. Measured at baseline, postintervention, and 2 weeks following, the outcomes were eHealth literacy (primary), DDH knowledge, health risk perception, perceived usefulness, information self-efficacy, and health information-seeking behavior. Cohen d effect sizes and linear mixed-effects models were used in an intention-to-treat manner. Results There were significant differences among the 4 LLMs concerning accuracy, richness, fluency, Patient Education Materials Assessment Tool for Printable Materials Understandability, and DISCERN (P<.05). ChatGPT-4 (median 63.67, IQR 63.67-64.67) and DeepSeek-V3 (median 63.67, IQR 63.33-64.67) generated more accurate text than Copilot (median 59.00, IQR 58.67-59.67). DeepSeek-V3 (median 64.00, IQR 64.00-64.00) was richer in language than Copilot (median 52.33, IQR 51.33-52.67). Gemini 2.0 Flash (median 72.67, IQR 72.33-73.00) was more fluent than Copilot (median 65.67, IQR 63.33-65.67). In phase 2, the intervention group showed higher eHealth literacy at T1 (33.62, 95% CI 32.76-34.49; d=0.20, 95% CI 0.13-0.56) and T2 (33.27, 95% CI 32.38-34.17; d=0.36, 95% CI 0.01-0.80), greater DDH knowledge at T1 (7.87, 95% CI 7.48-8.25, d=0.71, 95% CI 0.33-1.11) and T2 (7.12, 95% CI 6.72-7.51; d=0.54, 95% CI 0.17-0.96), and slight improvements in health risk prediction and perceived usefulness. Conclusions Mainstream LLMs demonstrate varying capacities in generating educational content for DDH. They generated DDH caregiver education materials that were associated with modest improvements in eHealth literacy and knowledge. Although LLMs can address general informational needs, they cannot completely substitute clinical evaluation. Future research should focus on optimizing plain language, refining dialogue design, and enhancing audience personalization to improve the quality of LLMs' materials. Trial Registration Chinese Clinical Trial Registry ChiCTR2500108410; https://www.chictr.org.cn/showproj.html?proj=271987
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
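Not the SMART implementation, but an illustrative sketch of the final step the abstract describes: fitting a Rasch-style IRT model to a simulated student-by-item response matrix and recovering item difficulties. The simulation parameters and the logistic-regression formulation are assumptions.

```python
# Sketch: Rasch (1PL) difficulty recovery from a simulated response matrix,
# expressed as a logistic regression over one-hot student and item indicators.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_students, n_items = 200, 8
true_ability = rng.normal(0, 1, n_students)
true_difficulty = rng.normal(0, 1, n_items)

# Simulate correct/incorrect responses under the Rasch model.
logits = true_ability[:, None] - true_difficulty[None, :]
responses = (rng.random((n_students, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)

# One row per (student, item) response; one-hot blocks for students and items.
X = np.zeros((n_students * n_items, n_students + n_items))
y = responses.reshape(-1)
for s in range(n_students):
    for i in range(n_items):
        row = s * n_items + i
        X[row, s] = 1.0               # student ability indicator
        X[row, n_students + i] = 1.0  # item indicator

# Light regularization for numerical stability; item difficulty = -item weight.
model = LogisticRegression(C=100.0, max_iter=2000, fit_intercept=False).fit(X, y)
est_difficulty = -model.coef_[0][n_students:]
est_difficulty -= est_difficulty.mean()  # center to fix the scale

print("recovery correlation:",
      round(float(np.corrcoef(est_difficulty, true_difficulty - true_difficulty.mean())[0, 1]), 3))
```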
This study aims to compare the quality assessment of Thai reading comprehension diagnostic tests created by Claude AI versus those developed by humans. The sample consisted of 735 seventh-grade students from Secondary educational service areas in the central northeastern region of Thailand. The methodology applied Rasch Model Analysis integrated with Turing Test procedures. The findings revealed that diagnostic tests created by both Claude AI and humans demonstrated comparable measurement quality in terms of validity, reliability, and item-model fit. Both tests exhibited low measurement error, allowing for accurate estimation of students' Thai reading abilities close to their true proficiency levels. Furthermore, both test versions showed good distribution of difficulty levels, covering nearly the full spectrum of student ability levels. These characteristics make the tests particularly suitable for students with slightly above-average proficiency. Nevertheless, certain test items require refinement to enhance their assessment efficiency according to established criteria. While some educators may remain hesitant about implementing AI-generated tests for formal student evaluation, Claude AI-created tests can effectively serve as practice exercises for student development.
In this article, we explore the transformative impact of advanced, parameter-rich Large Language Models (LLMs) on the production of instructional materials in higher education, with a focus on the automated generation of both formative and summative assessments for learners in the field of mathematics. We introduce a novel LLM-driven process and application, called ItemForge, tailored specifically for the automatic generation of e-assessment items in mathematics. The approach is thoroughly aligned with the levels and hierarchy of cognitive learning objectives as developed by Anderson and Krathwohl, and takes specific mathematical concepts from the considered courses into consideration. The quality of the generated free-text items, along with their corresponding answers (sample solutions), as well as their appropriateness to the designated cognitive level and subject matter, were evaluated in a small-scale study. In this study, three mathematical experts reviewed a total of 240 generated items, providing a comprehensive analysis of their effectiveness and relevance. Our findings demonstrate that the tool is proficient in producing high-quality items that align with the chosen concepts and targeted cognitive levels, indicating its potential suitability for educational purposes. However, it was observed that the provided answers (sample solutions) occasionally exhibited inaccuracies or were not entirely complete, signalling a necessity for additional refinement of the tool's processes.
Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-plays with named students improve predictions (compared to student ids), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
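The following sketch illustrates the role-play prompting setup the abstract describes, with hypothetical names, proficiency labels, and prompt wording; the resulting letter responses would be scored and fit with an IRT model as in the difficulty-estimation sketches above.

```python
# Sketch: build role-play prompts for a simulated classroom of students.
import random

NAMES = ["Maria Garcia", "Jamal Washington", "Emily Chen", "Liam O'Brien", "Aisha Patel", "Noah Kim"]
PROFICIENCY = ["struggling", "below average", "average", "above average", "advanced"]

def student_prompt(grade, question, choices):
    name = random.choice(NAMES)
    level = random.choice(PROFICIENCY)
    option_text = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    return (
        f"You are {name}, a {grade}th-grade student whose math proficiency is {level}. "
        "Answer the question exactly as this student would, including possible mistakes.\n\n"
        f"Question: {question}\n{option_text}\n\n"
        "Reply with a single letter (A, B, C, or D)."
    )

prompt = student_prompt(8, "What is 3/4 + 1/8?", ["7/8", "4/12", "1/2", "4/8"])
print(prompt)
# Each prompt would be sent to the LLM; the letter responses across the simulated
# classroom are then scored and fit with an IRT model.
```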
Nonanchor equating presents a significant challenge in educational assessment when test forms lack common items, requiring innovative solutions to ensure score comparability across different test administrations. This study proposes a novel large language model-simulated nonequivalent groups with anchor test (LLM-SNGAT) method that leverages large language models (LLMs) to simulate test-taking samples and generate common item sets for equating purposes. The approach eliminates traditional dependencies on specialized test design and extensive demographic data collection by utilizing the inherent capabilities of LLMs to simulate diverse response patterns. We evaluated the method using Tucker and Levine equating approaches across multiple LLMs, including generative pre-trained transformer 4o (GPT-4o), O1-preview, and DeepSeek-R1. Results demonstrated the feasibility of the proposed approach, with the Tucker method showing superior performance and consistent improvements as common item coverage increased. Sensitivity analysis confirmed that model performance rankings remained consistent across varying prompt formulations. The study revealed the characteristic pattern that standard errors were smallest near the mean and grew larger farther from it, and identified optimal common item proportions of 30%–50% for stable equating performance. While current limitations include the capacity of LLMs to accurately simulate human cognitive and behavioral diversity, this proof-of-concept study provides preliminary evidence for the feasibility of the LLM-SNGAT methodology. The approach represents a paradigm shift from resource-intensive traditional methods to computationally driven solutions, offering promising prospects for addressing nonanchor equating challenges in the digital age.
Educational assessment relies heavily on knowing question difficulty, traditionally determined through resource-intensive pre-testing with students. This creates significant barriers for both classroom teachers and assessment developers. We investigate whether Item Response Theory (IRT) difficulty parameters can be accurately estimated without student testing by modeling the response process and explore the relative contribution of different feature types to prediction accuracy. Our approach combines traditional linguistic features with pedagogical insights extracted using Large Language Models (LLMs), including solution step count, cognitive complexity, and potential misconceptions. We implement a two-stage process: first training a neural network to predict how students would respond to questions, then deriving difficulty parameters from these simulated response patterns. Using a dataset of over 250,000 student responses to mathematics questions, our model achieves a Pearson correlation of approximately 0.78 between predicted and actual difficulty parameters on completely unseen questions.
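A toy, self-contained sketch of the two-stage idea described above: a response model is trained on simulated (ability, item-feature) data, then an unseen item's difficulty is derived from the simulated success rate; the feature set, data-generating process, and logit conversion are assumptions rather than the paper's model.

```python
# Sketch: stage 1 trains p(correct | ability, item features); stage 2 simulates a
# population for a new item and converts its success rate into a Rasch-style difficulty.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n = 5000
ability = rng.normal(0, 1, n)
step_count = rng.integers(1, 6, n)       # e.g. LLM-estimated solution steps
cognitive_level = rng.integers(1, 4, n)  # e.g. LLM-rated cognitive complexity
true_difficulty = 0.5 * step_count + 0.4 * cognitive_level - 2.0
p_correct = 1 / (1 + np.exp(-(ability - true_difficulty)))
y = (rng.random(n) < p_correct).astype(int)

X = np.column_stack([ability, step_count, cognitive_level])
response_model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

# Stage 2: simulate a student population for an unseen question and derive difficulty.
new_question = {"step_count": 4, "cognitive_level": 3}
sim_ability = rng.normal(0, 1, 2000)
X_new = np.column_stack([
    sim_ability,
    np.full(2000, new_question["step_count"]),
    np.full(2000, new_question["cognitive_level"]),
])
p_hat = response_model.predict_proba(X_new)[:, 1].mean()
difficulty = float(np.log((1 - p_hat) / p_hat))  # logit of failure rate ~ Rasch difficulty
print(f"Predicted success rate {p_hat:.2f} -> difficulty {difficulty:.2f}")
```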
This theoretical framework addresses the chronic educational crisis of "semantic entropy"—the systematic degradation of meaning in knowledge transmission. Drawing on Cognitive Load Theory (Sweller, 2024) and Schema Theory (Anderson, 2020), PIT (Perceptual Invariance Theory) proposes that educational failure stems not from student deficits but from "semantic noise" in instructional materials and assessments. The paper introduces three engineering solutions: (1) Generalization and Uniqueness principles for material design to achieve ≥99% comprehension fidelity; (2) Clarity-Indexed Scoring System that replaces difficulty-based assessment with clarity-based metrics; and (3) Edu Code Protocol—a universal mathematical language to eliminate natural language ambiguity. Analysis of PISA 2022 and World Bank 2024 data reveals that 40% global reading failure and 70% learning poverty in Turkey correlate more strongly with item ambiguity (r = -0.67) than student SES (r = -0.42), supporting semantic noise as the primary pathogen. The proposed Randomized Controlled Trial (N=500) framework predicts 15-20% comprehension improvement and 30% cognitive load reduction with PI-engineered materials. PIT reframes education from a probabilistic selection mechanism to a deterministic engineering discipline where 99% success becomes a design target. Implementation requires systematic material redesign, teacher training, and digital infrastructure (QR codes, AI-powered ontologies) to realize Leibniz's vision of universal language.
As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and labor-intensive, especially across large item banks. This study examines whether Large Language Models (LLMs) can accelerate this process without sacrificing accuracy. Using over 12,000 item-skill pairs in grades K-5, we tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) across three tasks that mirror real-world challenges: identifying misaligned items, selecting the correct skill from the full set of standards, and narrowing candidate lists prior to classification. In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments. In Study 2, performance remained strong in mathematics but was lower for reading, where standards are more semantically overlapping. Study 3 demonstrated that pre-filtering candidate skills substantially improved results, with the correct skill appearing among the top five suggestions more than 95% of the time. These findings suggest that LLMs, particularly when paired with candidate filtering strategies, can significantly reduce the manual burden of item review while preserving alignment accuracy. We recommend the development of hybrid pipelines that combine LLM-based screening with human review in ambiguous cases, offering a scalable solution for ongoing item validation and instructional alignment.
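A minimal sketch of the candidate-filtering strategy described above, using embedding similarity to shortlist standards before an LLM classification step; the skill list, item text, embedding model, and prompt wording are assumptions.

```python
# Sketch: shortlist the most similar standards, then hand only those to the LLM.
from sentence_transformers import SentenceTransformer, util

skills = [
    "Add and subtract fractions with unlike denominators",
    "Multiply multi-digit whole numbers",
    "Identify the main idea of a text",
    "Interpret data in a bar graph",
    "Round decimals to any place",
]
item = "Jamie ate 1/3 of a pizza and Alex ate 1/4. How much of the pizza did they eat together?"

embedder = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(embedder.encode(item), embedder.encode(skills))[0]
top5 = sorted(range(len(skills)), key=lambda i: float(scores[i]), reverse=True)[:5]

candidates = "\n".join(f"{rank + 1}. {skills[i]}" for rank, i in enumerate(top5))
prompt = (
    "Which of the following skills does this assessment item measure? "
    "Answer with the number of the best match or 'none'.\n\n"
    f"Item: {item}\n\nCandidate skills:\n{candidates}"
)
print(prompt)  # the prompt would then be sent to an LLM (e.g., GPT-4o) for classification
```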
Knowing how test takers answer items in educational assessments is essential for test development, for evaluating item quality, and for improving test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two psychometric frameworks commonly used in educational assessment: classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can become more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans on reading comprehension items than on the other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
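The temperature-scaling calibration mentioned above can be illustrated on a single multiple-choice item; in the sketch below the per-option logits and human choice rates are placeholder values, and KL divergence is used as one possible measure of how human-like the calibrated distribution is.

```python
# Temperature scaling flattens an overconfident option distribution; higher T
# moves the model's choice distribution closer to the (placeholder) human one.
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical model logits for options A-D and observed human choice rates.
llm_logits = np.array([6.0, 1.0, 0.5, 0.2])
human_props = np.array([0.55, 0.20, 0.15, 0.10])

for T in (1.0, 2.0, 4.0):
    p = softmax(llm_logits, temperature=T)
    kl = np.sum(human_props * np.log(human_props / p))   # KL(human || model)
    print(f"T={T}: model dist={np.round(p, 2)}  KL to human={kl:.3f}")
```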
Despite the globalization of educational content, language remains a significant barrier. Multilingual translation has become crucial to meeting this challenge, with an emphasis on incorporating the cultural context of the target country and the educational context of the learners. However, existing machine translation systems often fail to adequately account for these contextual factors. This study explores the potential of Large Language Models (LLMs) to improve the translation of assessment items through In-context Learning. Two prompt engineering strategies are compared: the 'assessment-aware prompt', which includes only the specifications of the assessment, and the 'curriculum-aware prompt', which includes the educational and cultural context of the target country in addition to the assessment specifications. From a comparison of linguistic features and expert reviews, we found that the curriculum-aware translation produced more valid and feasible results, highlighting the effectiveness of LLM-based automatic translation methods that integrate curriculum context.
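A minimal sketch of how the two prompt designs might differ in practice; the field names, the Korean target setting, and the wording are illustrative assumptions rather than the study's actual templates.

```python
# The curriculum-aware prompt simply extends the assessment-aware one with
# curriculum and cultural context for the target learners.
def assessment_aware_prompt(item, specs):
    return (
        "Translate the following assessment item into Korean.\n"
        f"Assessment specifications: construct={specs['construct']}, "
        f"grade={specs['grade']}, item_format={specs['format']}.\n"
        f"Item: {item}"
    )

def curriculum_aware_prompt(item, specs, curriculum):
    return (
        assessment_aware_prompt(item, specs) + "\n"
        f"Target curriculum context: {curriculum['standard']}.\n"
        f"Cultural context: adapt names, units, and scenarios for "
        f"{curriculum['country']} learners while preserving the construct."
    )

# Illustrative values only.
specs = {"construct": "proportional reasoning", "grade": 7, "format": "MCQ"}
curriculum = {"standard": "national mathematics curriculum (illustrative)",
              "country": "Korean"}
item = "A recipe uses 3 cups of flour for 4 servings. How many cups for 10 servings?"
print(curriculum_aware_prompt(item, specs, curriculum))
```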
Large language models (LLMs) such as ChatGPT and Gemini are increasingly used to generate educational content in medical education, including multiple-choice questions (MCQs), but their effectiveness compared to expert-written questions remains underexplored, particularly in anatomy. We conducted a cross-sectional, mixed-methods study involving Year 2–4 medical students at Qatar University, where participants completed and evaluated three anonymized MCQ sets (authored by ChatGPT, Google-Gemini, and a clinical anatomist) across 17 quality criteria. Descriptive and chi-square analyses were performed, and optional feedback was reviewed thematically. Among 48 participants, most rated the three MCQ sources as equally effective, although ChatGPT was more often preferred for helping students identify and confront their knowledge gaps through challenging distractors and diagnostic insight, while expert-written questions were rated highest for deeper analytical thinking. A significant variation in preferences was observed across sources (χ² (64) = 688.79, p < .001). Qualitative feedback emphasized the need for better difficulty calibration and clearer distractors in some AI-generated items. Overall, LLM-generated anatomy MCQs can closely match expert-authored ones in learner-perceived value and may support deeper engagement, but expert review remains critical to ensure clarity and alignment with curricular goals. A hybrid AI-human workflow may provide a promising path for scalable, high-quality assessment design in medical education.
As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM exam as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a significant trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade significantly on multimodal items. Crucially, we find that models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an "evaluation-before-generation" pipeline for responsible assessment design.
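Two of the three evaluation axes named above can be illustrated with a few lines of code; the sketch below uses mean signed error for absolute calibration and Spearman's rho for rank fidelity, on placeholder values rather than ENEM data.

```python
# Compare LLM-predicted difficulties against reference IRT parameters.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical IRT difficulty parameters and LLM-predicted difficulties.
irt_difficulty = np.array([-1.2, -0.4, 0.1, 0.8, 1.5, 2.1])
llm_predicted  = np.array([-1.5, -0.9, -0.3, 0.2, 0.9, 1.1])

signed_error = np.mean(llm_predicted - irt_difficulty)   # < 0 => underestimation
rho, _ = spearmanr(llm_predicted, irt_difficulty)         # rank fidelity

print(f"mean signed error: {signed_error:+.2f} (negative = difficulty underestimated)")
print(f"Spearman rank correlation: {rho:.2f}")
```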
With the growing integration of artificial intelligence in medical education, this study compares the quality and educational robustness of content generated by two large language models (LLMs), DeepSeek-V3 and ChatGPT 4.0, on the emerging, non-conventional topic of gender-affirming hormone therapy (GAHT), which is not yet covered in textbooks, across three educational phases: the preclerkship and clerkship phases of the undergraduate medical curriculum, and the master's level in pharmacology. A total of 23 prompts were designed to generate Specific Learning Objectives (SLOs), reading materials, assessment items (MCQs, SAQs, and OSPEs), and case-based learning (CBL) scenarios across the three learner stages. The outputs from both LLMs were evaluated independently using rubric-based frameworks assessing content appropriateness, pedagogical structure, assessment alignment, and inclusivity. Both LLMs produced pedagogically sound outputs; however, DeepSeek consistently demonstrated superior adherence to rubric criteria. For SLOs, DeepSeek maintained a clear hierarchical progression across phases and showed greater precision, contextual alignment, and time-bound formulation. Its objectives were more assessable and reflective of increasing cognitive complexity. ChatGPT's SLOs were inclusive and coherent but occasionally lacked time-specificity and structural clarity. In reading materials, DeepSeek outperformed by integrating clinical relevance, scaffolded structure, and interactive learning tools across all phases. It included visual aids, case vignettes, and phase-specific assessments, while ChatGPT's content was accurate and readable but leaned toward text-heavy exposition with fewer embedded learning activities. MCQs from both models adhered to core psychometric principles. DeepSeek avoided testwiseness cues more consistently and offered better stratification of difficulty and realism, especially at the master's level. ChatGPT demonstrated strong pharmacological accuracy but occasionally showed testwiseness cues and illogical distractor sequencing. In CBL and OSPE outputs, DeepSeek showed stronger alignment with instructional and assessment criteria through modular formatting, diverse patient representation, and integration of formative tools. ChatGPT's cases and OSPEs were realistic and engaging but more narrative and occasionally less standardized. While both LLMs demonstrated educational utility, DeepSeek produced more rubric-aligned, contextually rich, and assessment-ready content across all learner stages. This study supports the integration of advanced LLMs like DeepSeek and ChatGPT in curriculum design, provided there is oversight to ensure alignment with pedagogical goals and learner needs.
The growing reliance on digital learning platforms has increased the need for automated, scalable and pedagogically aligned assessment systems. Current approaches to automated question generation (QG) and grading remain fragmented, focusing on either objective items or short-answer evaluation, with limited attention to difficulty calibration and educator supervision. This paper introduces an AI-driven assessment framework that unifies question generation, automated grading and performance analytics into a single workflow. The framework accepts two input modes: (i) structured content extracted from PDF-based learning resources, with optional optical character recognition (OCR) for scanned or image-based materials, and (ii) teacher-specified topics for targeted assessments. Large language models (LLMs) produce a variety of question formats, including option-selection, text-completion, and case-based questions, while a Difficulty Index (DI) ensures alignment with the intended cognitive levels. Objective responses are graded instantly, and AI-assisted evaluation of subjective answers is proposed as a future enhancement with teacher verification. All generated assessments and student outcomes are stored in a Supabase-backed repository, enabling real-time analytics such as difficulty-wise performance, progress tracking and cohort comparisons. By integrating content parsing, difficulty-aware QG, automated grading and analytics, the proposed system reduces manual workload, supports adaptive learning and provides educators with actionable insights for classroom and online environments.
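A minimal sketch of how such a pipeline might be orchestrated, assuming pypdf for the structured-PDF input mode; `generate_questions`, `store_assessment`, and the 1-5 Difficulty Index scale are hypothetical stand-ins for the LLM step and the Supabase-backed repository described in the abstract.

```python
# Orchestration sketch: content extraction -> difficulty-aware QG -> storage.
from dataclasses import dataclass
from pypdf import PdfReader   # used for the structured-PDF input mode

@dataclass
class Question:
    stem: str
    answer: str
    difficulty_index: int      # assumed 1-5 cognitive-level scale (illustrative)

def extract_content(pdf_path: str) -> str:
    """Input mode (i): pull raw text from a PDF learning resource.
    (OCR for scanned material would be a separate branch, omitted here.)"""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def generate_questions(source_text: str, target_di: int, n: int) -> list[Question]:
    """Stub for the LLM question-generation step, constrained to a target
    Difficulty Index so items match the intended cognitive level."""
    # In practice this would prompt an LLM with the source text and target DI.
    return [Question(stem=f"[generated item {i+1} at DI {target_di}]",
                     answer="[key]", difficulty_index=target_di)
            for i in range(n)]

def store_assessment(questions: list[Question]) -> None:
    """Stub for persisting items to the analytics repository."""
    for q in questions:
        print(f"stored: DI={q.difficulty_index} | {q.stem}")

if __name__ == "__main__":
    # Input mode (ii): teacher-specified topic, bypassing PDF extraction.
    sample_text = "Photosynthesis converts light energy into chemical energy."
    items = generate_questions(sample_text, target_di=3, n=3)
    store_assessment(items)
```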
Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on prior knowledge to fill in details that are not explicitly stated in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, with inter-rater agreement above 0.90. Our results show that 93.8% of the questions GPT-4o produced were of good quality and suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.
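A minimal sketch of the few-shot prompting setup, showing how the chain-of-thought condition might differ from the plain condition; the exemplar, passage, and instructions are illustrative, and the GPT-4o call itself is omitted.

```python
# Build a few-shot prompt for bridging-inference items, optionally exposing
# the exemplar's reasoning step (the chain-of-thought condition).
FEW_SHOT_EXEMPLAR = {
    "passage": ("Maya left her umbrella at home. By the time she reached "
                "school, her notebook was soaked."),
    "reasoning": ("The item must require linking 'left her umbrella' with "
                  "'notebook was soaked' to infer it rained on the way."),
    "question": "Why was Maya's notebook wet when she arrived at school?",
}

def build_prompt(passage: str, use_cot: bool) -> str:
    parts = [
        "You write bridging-inference reading comprehension questions: the",
        "answer must require connecting information across sentences.",
        f"Example passage: {FEW_SHOT_EXEMPLAR['passage']}",
    ]
    if use_cot:  # chain-of-thought condition: show the reasoning step
        parts.append(f"Reasoning: {FEW_SHOT_EXEMPLAR['reasoning']}")
    parts += [
        f"Example question: {FEW_SHOT_EXEMPLAR['question']}",
        f"New passage: {passage}",
        "Now write one bridging-inference question"
        + (" and your reasoning first." if use_cot else "."),
    ]
    return "\n".join(parts)

passage = ("The bakery ran out of flour on Friday. On Saturday morning, "
           "customers found the shelves empty.")
print(build_prompt(passage, use_cot=True))
```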
The final grouping outlines the complete ecosystem of LLM-assisted automatic item generation: starting from foundational prompt engineering and fine-tuning techniques, it draws on retrieval-augmented generation (RAG) and multimodal methods to ensure content accuracy and diversity; it then moves into a psychometrics-centered quality-validation stage, ensuring that items exhibit sound difficulty and discrimination; at the application layer, research has deepened into subject-specific customization and extended to automated scoring and personalized scaffold generation; finally, curriculum-alignment and human-AI collaboration frameworks ground the technology in macro-level educational governance and ethical oversight.