LLM-Assisted Automatic Question Generation
Core Generation Techniques, Prompt Engineering, and Model Fine-Tuning
Focuses on the underlying implementation of LLM-based question generation, including the optimization of few-shot and chain-of-thought (CoT) prompting strategies, as well as fine-tuning (e.g., T5, Llama) and pipeline design to improve the structural quality and instruction-following ability of generated content. A minimal prompting sketch follows the reference list below.
- Leveraging Large Language Model for Automatic Translation of Educational Content: Exploring the Effectiveness of Curriculum-Aware Prompt Engineering(Euigyum Kim, Hyo Jeong Shin, 2025, Korean Educational Research Association)
- Exploring prompt pattern for generative artificial intelligence in automatic question generation(Lili Wang, Ruiyuan Song, Weitong Guo, Hongwu Yang, 2024, Interactive Learning Environments)
- Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models(M. Amini, Babak Ahmadi, Xi Xiong, Yilin Zhang, Christopher Qiao, 2025, arXiv.org)
- Large Language Model-based Pipeline for Item Difficulty and Response Time Estimation for Educational Assessments(Hariram Veeramani, Surendrabikram Thapa, Natarajan Balaji Shankar, Abeer Alwan, 2024, Workshop on Innovative Use of NLP for Building Educational Applications)
- Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education(Unggi Lee, Haewon Jung, Younghoon Jeon, Y.K. Sohn, Wonhee Hwang, Jewoong Moon, Hyeoncheol Kim, 2023, Education and Information Technologies)
- Automatic Large Language Models Creation of Interactive Learning Lessons(Jionghao Lin, Jiarui Rao, Yiyang Zhao, Yuting Wang, Ashish Gurung, Amanda Barany, Jaclyn L. Ocumpaugh, Ryan S. Baker, Ken Koedinger, 2025, ArXiv)
- Towards automatic question generation using pre-trained model in academic field for Bahasa Indonesia(Derwin Suhartono, Muhammad Rizki Nur Majiid, Renaldy Fredyan, 2024, Education and Information Technologies)
- Fine-Tuned T5 Transformer with LSTM and Spider Monkey Optimizer for Redundancy Reduction in Automatic Question Generation(R. Tharaniya sairaj, S. R. Balasundaram, 2024, SN Computer Science)
- Fine-Tuning a Large Language Model with Reinforcement Learning for Educational Question Generation(Salima Lamsiyah, Abdelkader El Mahdaouy, A. Nourbakhsh, Christoph Schommer, 2024, Lecture Notes in Computer Science)
- Optimizing Automated Question Generation for Educational Assessments(Sumayyah Alamoudi, Lama A. Al Khuzayem, Amani Jamal, 2025, Engineering, Technology & Applied Science Research)
- Hybrid NLP–Deep Learning Framework for Automatic MCQ Generation(V. Raju, Madri, Dr. V. Lokeswara Reddy, 2026, 2026 International Conference on AI-Driven Smart Systems and Ubiquitous Computing (ICAUC))
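To make the prompting strategies described above concrete, here is a minimal sketch of few-shot MCQ generation against an OpenAI-compatible chat API. It is an illustration under stated assumptions (the model name, prompt wording, and the single exemplar item are invented), not a reconstruction of any method in the papers above.

```python
# Minimal few-shot MCQ generation sketch. The model name, prompt wording,
# and the exemplar item are illustrative assumptions, not taken from the
# papers listed above.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single worked exemplar steers the model toward the desired item format.
EXEMPLAR = {
    "stem": "Which gas do plants primarily absorb during photosynthesis?",
    "options": ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
    "answer": "Carbon dioxide",
}

def generate_mcq(source_text: str, model: str = "gpt-4o-mini") -> dict:
    """Generate one multiple-choice item grounded in `source_text`."""
    messages = [
        {"role": "system",
         "content": ("You write one multiple-choice question per request. "
                     "Reply with JSON only: keys stem, options (4 strings), answer.")},
        {"role": "user",
         "content": "Source: Photosynthesis converts carbon dioxide and water into glucose."},
        {"role": "assistant", "content": json.dumps(EXEMPLAR)},
        {"role": "user", "content": f"Source: {source_text}"},
    ]
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0.7
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    item = generate_mcq("Newton's second law states that force equals mass times acceleration.")
    print(json.dumps(item, indent=2))
```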
Knowledge Enhancement, RAG, and Multimodal Question Generation Frameworks
Examines how retrieval-augmented generation (RAG), knowledge graphs, and external corpora are used to mitigate hallucination and ensure the factual accuracy of generated items, extending to multimodal settings such as video and images. A minimal retrieval sketch follows the reference list below.
- Automatic Question Generation for Intuitive Learning Utilizing Causal Graph Guided Chain of Thought Reasoning(Nicholas X. Wang, Neel V. Parpia, Aaryan D. Parikh, A. Katsaggelos, 2025, 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR))
- Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains(Subhankar Maity, Aniket Deroy, Sudeshna Sarkar, 2024, Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation)
- A Transformer-Based Framework for Automated Content Retrieval and Dynamic Response Generation: A Pedagogical Advancement(Aaditya K. Singh, Mehul Lamba, Maadhav Lal, Rudresh Dwivedi, Ruchika Sharma, 2025, IETE Journal of Education)
- Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI(Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su, 2025, arXiv.org)
- An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education(Ramteja Sajja, Y. Sermet, Ibrahim Demir, 2025, arXiv.org)
- GISedu-GPT: a large language model framework with prior knowledge for GIS education question bank generation(Zhiyun Wang, Yifan Zhang, Wen Min, Qingfeng Guan, Wenhao Yu, 2025, Journal of Geography in Higher Education)
- Development and evaluation of a retrieval-augmented large language model framework for enhancing endodontic education(Xiaowei Xu, Siyi Liu, Lin Zhu, Yunzi Long, Yin Zeng, Xudong Lu, Jiao Li, Yanmei Dong, 2025, International Journal of Medical Informatics)
- Automatic Question Generation with Knowledge Graph for Panoramic Learning(Fumika Okuhara, S. Egami, Y. Sei, Yasuyuki Tahara, Akihiko Ohsuga, 2024, 2024 21st International Conference on Information Technology Based Higher Education and Training (ITHET))
- Multimodal Quiz Generation via RAG with LLM-as-Judge Evaluation(M. T. Kunuku, N. Dehbozorgi, 2025, 2025 IEEE Frontiers in Education Conference (FIE))
- Beyond Static Question Banks: Dynamic Knowledge Expansion via LLM-Automated Graph Construction and Adaptive Generation(Yingquan Wang, Tianyu Wei, Qinsi Li, Li Zeng, 2026, arXiv.org)
- Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents(Erick Tyndall, Colleen Gayheart, Alexandre Some, Joseph Genz, Torrey Wagner, Brent Langhals, 2025, Data & Policy)
- Automatic Question Generation from Youtube Lectures using Deep Learning(Himanshu Jasuja, Ujjwal Negi, Vibhav, Gull Kaur, 2024, 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT))
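As a concrete illustration of the retrieval step that the RAG-based systems above share, the sketch below embeds a small corpus, retrieves the passages most similar to a topic query by cosine similarity, and assembles them into a generation prompt. The sentence-transformers model and toy passages are assumptions for illustration only.

```python
# Minimal retrieval step for RAG-style question generation. The library
# choice (sentence-transformers), the embedding model, and the toy corpus
# are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

# Toy course corpus; in practice these would be textbook or lecture chunks.
corpus = [
    "The mitochondrion is the site of aerobic respiration and ATP synthesis.",
    "Photosynthesis in chloroplasts converts light energy into chemical energy.",
    "Osmosis is the diffusion of water across a semi-permeable membrane.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(corpus, convert_to_tensor=True)

def build_prompt(topic: str, k: int = 2) -> str:
    """Retrieve the k passages most similar to `topic` and wrap them in a QG prompt."""
    query_emb = model.encode(topic, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    top_idx = scores.argsort(descending=True)[:k]
    context = "\n".join(corpus[int(i)] for i in top_idx)
    return (
        "Using ONLY the context below, write one exam question and its answer.\n"
        f"Context:\n{context}\n"
        f"Topic: {topic}"
    )

print(build_prompt("cellular energy production"))
```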
Quality Verification, Psychometric Evaluation, and Difficulty Prediction
Applies item response theory (IRT), Rasch models, and simulated-student techniques to automatically analyze and validate the reliability, validity, difficulty, item-writing flaws, and discrimination of generated items. A worked simulation example follows the reference list below.
- Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus(Thiago Brant, Julien Kühn, Jun Pang, 2026, arXiv.org)
- Synthetic Student Responses: LLM-Extracted Features for IRT Difficulty Parameter Estimation(Matias Hoyl, 2026, arXiv.org)
- Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation: A Feasibility Study(Guher Gorgun, Okan Bulut, 2024, Educational Measurement: Issues and Practice)
- ExamQ-Gen: Instructor-in-the-Loop Generation of Self-Contained Exam Questions from Course Materials and Decision-Support Grading(Catalin Anghel, Emilia Pecheanu, A. Anghel, M. Craciun, A. Cocu, 2026, Computers)
- Evaluating and Validating Large Language Models for Health Education on Developmental Dysplasia of the Hip: 2-Phase Study With Expert Ratings and a Pilot Randomized Controlled Trial(Ouyang Hui, Gan Lin, Yiyuan Li, Zhixin Yao, Yating Li, Han Yan, Fang Qin, Jinghui Yao, Yun Chen, 2026, Journal of Medical Internet Research)
- Exploring Large Language Models for Evaluating Automatically Generated Questions(Jeffrey S. Dittel, Michelle W. Clark, R. V. Campenhout, Benny G. Johnson, 2024)
- Automatic Multiple-Choice Question Generation and Evaluation Systems Based on LLM: A Study Case With University Resolutions(S. S. Mucciaccia, T. M. Paixão, F. Mutz, C. Badue, A. F. D. Souza, Thiago Oliveira-Santos, 2025, International Conference on Computational Linguistics)
- LLM-Simulated Nonequivalent Groups With Anchor Test: A Novel Approach for Test Equating in the Absence of Traditional Anchor Items(Junlei Du, Yishen Song, Qinhua Zheng, 2026, IEEE Transactions on Learning Technologies)
- The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory(Robin Schmucker, Steven Moore, 2025, arXiv.org)
- Docimological Quality Analysis of LLM-Generated Multiple Choice Questions in Computer Science and Medicine(Christian Grévisse, M. A. S. Pavlou, Jochen G Schneider, 2024, SN Computer Science)
- Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education(Maram Elzayyat, Janatul Naeim Mohammad, S. Zaqout, 2025, Medical Education Online)
- SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction(Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan, 2025, Conference on Empirical Methods in Natural Language Processing)
- Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?(Andreas Säuberli, Diego Frassinelli, Barbara Plank, 2025, Workshop on Innovative Use of NLP for Building Educational Applications)
- QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation(Bang Nguyen, TingTing Du, Mengxia Yu, Lawrence Angrave, Meng Jiang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction(Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou, 2025, arXiv.org)
- STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation(Aniket Deroy, Subhankar Maity, 2025, arXiv.org)
- Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations(Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger, 2026, arXiv.org)
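Much of the psychometric work above rests on the two-parameter logistic (2PL) IRT model, where the probability of a correct response is P = 1 / (1 + exp(-a(theta - b))) for ability theta, discrimination a, and difficulty b. The sketch below simulates responses under assumed item parameters and recomputes classical difficulty and discrimination, loosely mirroring how simulated-student pipelines sanity-check generated items; all numbers are illustrative.

```python
# 2PL response simulation and classical item statistics (illustrative;
# all item parameters and sample sizes are assumed values).
import numpy as np

rng = np.random.default_rng(0)

def p_correct(theta, a, b):
    """2PL probability of a correct response: 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Assumed item bank: (discrimination a, difficulty b) per generated item.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
theta = rng.normal(0.0, 1.0, size=2000)  # simulated student abilities

# Simulate dichotomous responses for every student on every item.
responses = np.column_stack([
    (rng.random(theta.size) < p_correct(theta, a, b)).astype(int) for a, b in items
])

total = responses.sum(axis=1)
for j, (a, b) in enumerate(items):
    p_value = responses[:, j].mean()          # classical difficulty (proportion correct)
    rest_score = total - responses[:, j]      # rest score avoids self-correlation
    r_pb = np.corrcoef(responses[:, j], rest_score)[0, 1]  # discrimination proxy
    print(f"item {j}: a={a}, b={b}, difficulty={p_value:.2f}, discrimination={r_pb:.2f}")
```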
Domain-Specific Customization and Cross-Lingual Applications
Focuses on vertical domains such as medicine, STEM, programming, and language teaching, studying the accuracy of domain knowledge and localized item-generation techniques for different language environments. A template-based toy example follows the reference list below.
- Generative AI Use in Dental Education: Efficient Exam Item Writing.(Margeaux C. Johnson, A. P. Ribeiro, Tiffany M Drew, P. R. Pereira, 2023, Journal of Dental Education)
- Language Assessment Using Word Family-Based Automated Item Generation: Evaluating Item Quality Using Teacher Ratings(S. Marandi, S. Hosseini, 2024, WorldCALL Official Conference Proceedings)
- AI-powered automated item generation for language testing(Dongkwang Shin, Jang Ho Lee, 2024, ELT Journal)
- Automatic Question Generation for Spanish Textbooks: Evaluating Spanish Questions Generated with the Parallel Construction Method(Benny G. Johnson, Rachel Van Campenhout, Bill Jerome, Maria Fernanda Castro, ro Bistolfi, Jeffrey S. Dittel, 2024, International Journal of Artificial Intelligence in Education)
- Automatic item generation in various STEM subjects using large language model prompting(K. W. Chan, Farhan Ali, Joonhyeong Park, Kah Shen Brandon Sham, Erdalyn Yeh Thong Tan, Francis Woon Chien Chong, Kun Qian, Guan Kheng Sze, 2024, Computers and Education: Artificial Intelligence)
- Math Multiple Choice Question Generation via Human-Large Language Model Collaboration(Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan, 2024, Educational Data Mining)
- Automated Multilingual Translation of Exam Question Papers Using Generative AI(S. Venkatraman, Sumneet Kaur Bamrah, D. Pushgara Rani, 2025, 2025 International Conference on Computing and Communication Technologies (ICCCT))
- Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI's gpt2 Transformer Model(M. Davier, 2019, arXiv.org)
- Artificial intelligence in radiology examinations: a psychometric comparison of question generation methods.(E. Emekli, B. N. Karahan, 2025, Diagnostic and Interventional Radiology)
- Research on the Construction of Medical Critical Thinking Assessment Gauge Driven by Generative AI(Liang Ying, Zixun Dai, Xiaoqing Qiu, Z. Ouyang, Yuzhu Pan, Xin Fang, Jiahe Li, Yutong Sun, Xiaona Guan, 2025, Academic Journal of Management and Social Sciences)
- Transforming Children's Python Turtle Graphics Learning with LLM Technology: A Design Proposal(Mondheera Pituxcoosuvarn, Yohei Murakami, 2024, 2024 9th International STEM Education Conference (iSTEM-Ed))
- Programming Assessment in E-Learning through Rule-Based Automatic Question Generation with Large Language Models(Halim Teguh Saputro, U. Nurhasan, Vivi Nur Wijayaningrum, 2025, Journal of Applied Informatics and Computing)
- Evaluation of automated vocabulary quiz generation with VocQGen(Qiao Wang, Ralph L. Rose, Ayaka Sugawara, Naho Orita, 2025, Vocabulary Learning and Instruction)
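Several entries in this group (for example, the word-family AIG and VocQGen studies) rely on template-driven generation for language testing. The sketch below fills a cloze template and draws distractors from the same word family; it is a toy example under assumed word lists, not a reproduction of those tools.

```python
# Template-based vocabulary cloze item with word-family distractors.
# Word lists, templates, and the helper are invented for illustration and
# do not reproduce any published tool.
import random

random.seed(42)

# Toy word-family data: root plus derived forms usable as distractors.
WORD_FAMILIES = {
    "decide": ["decision", "decisive", "decidedly"],
    "create": ["creation", "creative", "creatively"],
}

SENTENCE_TEMPLATES = {
    "decision": "After weeks of discussion, the committee finally reached a ____.",
    "creation": "The exhibit traces the ____ of the first printing press.",
}

def make_cloze_item(target: str, family_root: str) -> dict:
    """Build one cloze MCQ whose distractors come from the same word family."""
    distractors = [family_root] + [w for w in WORD_FAMILIES[family_root] if w != target]
    options = distractors[:3] + [target]
    random.shuffle(options)
    return {"stem": SENTENCE_TEMPLATES[target], "options": options, "answer": target}

print(make_cloze_item("decision", "decide"))
print(make_cloze_item("creation", "create"))
```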
Automated Scoring, Personalized Feedback, and Instructional Scaffolding
Studies downstream applications of item-generation technology, including scoring open-ended responses, generating immediate feedback hints and reflective questions, and acting as pedagogical agents that support self-directed learning. A minimal scoring sketch follows the reference list below.
- Designing Answer-Aware LLM Hints to Scaffold Deeper Learning in K–12 Programming Education(Sahana Bhaskar, Sally Hamouda, 2025, Proceedings of the 2025 ACM Conference on International Computing Education Research V.2)
- EXPLANATION-BASED AUTOMATED ASSESSMENT OF OPEN ENDED LEARNER RESPONSES(V. Rus, 2018, eLearning and Software for Education)
- AI-Powered Narrative Generation for Personalized Learning in Primary Schools(Oualid Ali, Karrar Abbas Yousif, Gulsanam Tillayeva, Mustafa M. Abd Zaid, Bhaskaruni Harini, Priyanka Priyadarshini, Zafar Allayev, 2025, 2025 International Conference on AI-Driven STEM Education and Learning Technologies (AISTEMEDU))
- Semantic analysis of test responses using synthetic data generation(B. Polyakov, 2025, Modelling and Data Analysis)
- GPT-3-Driven Pedagogical Agents to Train Children’s Curious Question-Asking Skills(Rania Abdelghani, Yen-Hsiang Wang, Xingdi Yuan, Tong Wang, H'elene Sauz'eon, Pierre-Yves Oudeyer, 2022, International Journal of Artificial Intelligence in Education)
- Bringing Interactive Learning to Industrial IDEs: Kotlin Notebook and LLM-Generated Exercises(Daniil Karol, Ksenia Shneyveys, Roman Belov, Anastasiia Birillo, 2026, Proceedings of the 57th ACM Technical Symposium on Computer Science Education V.2)
- Integrating LLM Usage in Gamified Systems(Carlos J. Costa, 2025, WSEAS TRANSACTIONS ON MATHEMATICS)
- LLM-Driven Learner Modeling and Personalized Learning Pathways: A Closed-Loop Framework and Engineering Design for Virtual Laboratories(Ruijie Wang, Guangtao Xu, 2025, 2025 International Conference on Educational Technology Management (ICETM))
- Filling the Gap: LLMs as Scaffolds for Competency Question Instantiation(Clare McNamara, Lucy Hederman, Declan O'Sullivan, 2026, Proceedings of the 31st International Conference on Intelligent User Interfaces)
- Semi-automatic Construction of Bidirectional Dialogue Dataset for Dialogue-Based Reading Comprehension Tutoring System Using Generative AI(Sung-Kwon Choi, Jin-Xia Huang, Oh-Woog Kwon, 2024, Lecture Notes in Computer Science)
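A recurring mechanism in this group is scoring an open-ended answer by its semantic similarity to a reference answer and returning a hint when the score falls below a threshold. The sketch below is a minimal version of that idea; the embedding model, threshold, and hint text are assumptions, not taken from any cited system.

```python
# Similarity-based scoring of an open-ended response with a fallback hint.
# The embedding model, threshold, and hint text are arbitrary illustrations.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_response(student_answer: str, reference_answer: str,
                   hint: str, threshold: float = 0.6) -> dict:
    """Score by cosine similarity to the reference; return a hint if it falls short."""
    emb = model.encode([student_answer, reference_answer], convert_to_tensor=True)
    similarity = float(util.cos_sim(emb[0], emb[1]))
    feedback = "Looks good." if similarity >= threshold else hint
    return {"score": round(similarity, 2), "feedback": feedback}

print(score_response(
    student_answer="Plants make food using sunlight.",
    reference_answer="Photosynthesis converts light energy into chemical energy stored in glucose.",
    hint="Say what form the stored energy takes and where the process happens.",
))
```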
Curriculum Alignment, Human-AI Collaboration, and Educational Ethics Frameworks
Examines how to align AI-generated content with curriculum standards (e.g., Bloom's taxonomy), and analyzes teachers' perceptions of AI tools, human-AI collaboration patterns, and ethical challenges such as algorithmic bias and cheating prevention. A coarse alignment heuristic is sketched after the reference list below.
- Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions(Farzan Karimi-Malekabadi, Pooya Razavi, Sonya J. Powers, 2025, arXiv.org)
- Scaling Up Mastery Learning with Generative AI: Exploring How Generative AI Can Assist in the Generation and Evaluation of Mastery Quiz Questions(Stephen Hutt, Grayson Hieb, 2024, Proceedings of the Eleventh ACM Conference on Learning @ Scale)
- Educational engineering in light of perceptual invariance theory: Semantic noise elimination and universal mathematical language construction(N. Demirkuş, 2026, World Journal of Advanced Engineering Technology and Sciences)
- Towards More Effective Automatic Question Generation: A Hybrid Approach for Extracting Informative Sentences(Engy Yehia, N. Hassan, Sayed AbdelGaber, 2025, International Journal of Advanced Computer Science and Applications)
- Enhancing AI-Driven Education: Integrating Cognitive Frameworks, Linguistic Feedback Analysis, and Ethical Considerations for Improved Content Generation(Antoun Yaacoub, Sansiri Tarnpradab, Phattara Khumprom, Z. Assaghir, Lionel Prevost, Jérôme Da Rugna, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- PromptHive: Bringing Subject Matter Experts Back to the Forefront with Collaborative Prompt Engineering for Educational Content Creation(Mohi Reza, Ioannis Anastasopoulos, Shreya Bhandari, Zach A. Pardos, 2024, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- TutorCraftEase: Enhancing Pedagogical Question Creation with Large Language Models(Wenhui Kang, Lin Zhang, Xiaolan Peng, Hao Zhang, An Li, Mengyao Wang, Jin Huang, Feng Tian, Guozhong Dai, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- A survey study on pre-service teachers’ perceptions of AI generated texts(Hyekyung Jung, Yongsang Lee, Dongkwang Shin, 2022, The Korean Society of Bilingualism)
- LLM Cheat Prevention via Adversarial Question Paraphrasing(B. Balaji, M. D. Reddy, P. Pavankumar, A. Munna, Shaik Riyaz, R.Vijay Sai, 2026, 2026 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE))
- Revolutionizing Assessment: Leveraging ChatGPT for Automated Item Generation: An AI Driven Exploratory Study with EFL Teachers(Ahmad A. Alsagoafi, Hanan S. Alomran, 2025, World Journal of English Language)
- The Impact of ChatGPT on Language Assessment in ELT(Gülden Tüm, 2026, Sınırsız Eğitim ve Araştırma Dergisi)
- Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG(Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri, 2025, arXiv.org)
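Curriculum-alignment work in this group often begins with a coarse check of which Bloom level a generated question targets. The sketch below uses a simple verb-to-level lookup as a first-pass filter; the verb lists are rough illustrative assumptions, and the systems cited above typically rely on classifiers or expert review instead.

```python
# Coarse Bloom's-taxonomy tagging by cue verb (illustrative heuristic only;
# the verb lists are rough and real systems use classifiers or expert review).
BLOOM_VERBS = {
    "remember":   {"define", "list", "name", "state", "identify"},
    "understand": {"explain", "summarize", "describe", "classify"},
    "apply":      {"solve", "use", "calculate", "demonstrate"},
    "analyze":    {"compare", "contrast", "differentiate", "examine"},
    "evaluate":   {"justify", "critique", "assess", "argue"},
    "create":     {"design", "construct", "propose", "compose"},
}

def tag_bloom_level(question: str) -> str:
    """Return the first Bloom level whose cue verb appears in the question stem."""
    words = {w.strip("?.,'\"").lower() for w in question.split()}
    for level, verbs in BLOOM_VERBS.items():
        if words & verbs:
            return level
    return "unclassified"

for q in [
    "Define the term 'osmosis'.",
    "Compare aerobic and anaerobic respiration.",
    "Design an experiment on the effect of light intensity on photosynthesis.",
]:
    print(f"{tag_bloom_level(q):>12} | {q}")
```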
Taken together, the groupings outline the complete ecosystem of LLM-assisted automatic question generation: starting from foundational prompt engineering and fine-tuning techniques; ensuring content accuracy and diversity through retrieval augmentation (RAG) and multimodal methods; moving into psychometrics-centered quality verification so that items have scientifically sound difficulty and discrimination; extending, at the application layer, to domain-specific customization, automated scoring, and personalized scaffolding; and finally grounding the technology in broader educational governance and ethical oversight through curriculum alignment and human-AI collaboration frameworks.
A total of 136 related references. Abstracts of selected papers are reproduced below.
Intuitive learning plays a vital role in building deep conceptual understanding, particularly in STEM education, where students often grapple with abstract and interdependent ideas. Automatic question generation has emerged as an effective strategy to support personalized and adaptive learning. However, its effectiveness is limited by hallucinations in large language models (LLMs), which can produce factually incorrect, ambiguous, or pedagogically inconsistent questions. To address this challenge, we propose a novel framework that combines causal-graph-guided Chain-of-Thought (CoT) reasoning with a multi-agent LLM architecture to ensure the generation of accurate, meaningful, and curriculum-aligned questions. In this approach, causal graphs offer an explicit representation of domain knowledge, while CoT reasoning enables structured, step-by-step traversal through related concepts. Dedicated LLM agents handle specific tasks such as graph pathfinding, reasoning, validation, and output, all operating under domain constraints. A dual validation mechanism-at both the conceptual and output stages-substantially reduces hallucinations. Experimental results show up to a 70% improvement in quality over reference methods and yielded highly favorable outcomes in subjective evaluations.
This study develops an evaluation instrument for Python programming using a Rule-Based Automatic Question Generation (AQG) system integrated with Large Language Models (LLMs), designed based on the Revised Bloom’s Taxonomy. The urgency of this research stems from the limitations of conventional programming evaluations, which are often time-consuming, less objective, and insufficiently aligned with cognitive learning levels. The proposed method applies assessment terms as rule-based constraints to guide LLM-generated questions, ensuring both pedagogical validity and structural consistency in JSON format. A total of 91 questions were produced, consisting of multiple-choice and coding items, which were then validated by three programming experts and tested on 32 vocational students. The findings indicate that the instrument achieved an overall validity of 77.66% (valid category), with the highest accuracy at the Apply (96.30%) and Create (100%) levels. The reliability test using Cronbach’s Alpha yielded 0.721, showing acceptable internal consistency. Item difficulty analysis revealed a strong dominance of easy questions (97.78%), with only 2.22% classified as moderate and none as difficult. Student performance also showed a fluctuating pattern: high in Remember (94.79%), Understand (95.83%), and Create (95.60%), but lower in Apply (86.11%), Analyze (90.97%), and Evaluate (87.15%). These results confirm that integrating Rule-Based AQG with LLMs can produce valid, reliable, and adaptive evaluation instruments that not only capture basic programming competencies but also partially address higher-order cognitive skills. This research contributes both practically by providing educators with an efficient tool for generating evaluation items and academically by enriching the growing body of literature on AI-assisted assessment in programming education.
High-quality question generation is crucial for ensuring the fairness and validity of examinations. To address the challenges of data scarcity and semantic complexity in automatic question generation (AQG) for niche subjects, including the arts, this study develops a domain-specific large language model (LLM) with a three-tiered optimization mechanism, incorporating prompt tuning, knowledge enhancement, and data augmentation. The model's effectiveness was validated through a case study conducted on a calligraphy course. The results showed that the generated questions achieved a usability rate of 91%, whereas the proposed data augmentation strategy expanded the question bank by 132.56%. This work provides both technical solutions and practical reference for automatic question generation methods targeting niche disciplines. The key contributions of this study encompass the creation of an innovative three-tiered optimization framework, the effective integration of external domain knowledge, and an iterative data augmentation approach that enhances question generation for niche subjects. This research offers a technological pathway and serves as a valuable reference for AQG in niche disciplines.
Large language models (LLMs) have significantly advanced smart education in the artificial general intelligence era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: 1) customized generation: generating niche-targeted teaching content based on students' varying learning abilities and states and 2) intelligent optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single large LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM-based multiagent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill-Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students' knowledge levels and learning abilities. In addition, we introduce the CIDDP, an LLM-based 5-D evaluation module encompassing Clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework.
Assessment is an essential part of education, both for teachers who assess their students as well as learners who may evaluate themselves. Multiple-choice questions (MCQ) are one of the most popular types of knowledge assessment, e.g., in medical education, as they can be automatically graded and can cover a wide range of learning items. However, the creation of high-quality MCQ items is a time-consuming task. The recent advent of Large Language Models (LLM), such as Generative Pre-trained Transformer (GPT), caused a new momentum for automatic question generation solutions. Still, evaluating generated questions according to the best practices for MCQ item writing is needed to ensure docimological quality. In this article, we propose an analysis of the quality of LLM-generated MCQs. We employ zero-shot approaches in two domains, namely computer science and medicine. In the former, we make use of 3 GPT-based services to generate MCQs. In the latter, we developed a plugin for the Moodle learning management system that generates MCQs based on learning material. We compare the generated MCQs against common multiple-choice item writing guidelines. Among the major challenges, we determined that while LLMs are certainly useful in generating MCQs more efficiently, they sometimes create broad items with ambiguous keys or implausible distractors. Human oversight is also necessary to ensure instructional alignment between generated items and course contents. Finally, we propose solutions for AQG developers.
The paper introduces an intelligent system for educational enhancement that integrates two key modules: a Question and Answer (QnA) module and a novel Feedback Generation module. We create a robust automatic content retrieval and response generation framework using Retrieval Augmented Generation (RAG) and transformer-based models, specifically OpenAI GPT-3.5. The QnA module dynamically retrieves relevant content from documents through cosine similarity and produces answers aligned with educational material. Meanwhile, the Feedback Generation module is designed to handle subjective responses and evaluate students' answers by comparing them with LLM-generated responses via cosine similarity. This comparison yields a performance score for the student's response, supplemented by specific feedback that highlights strengths and areas for improvement. Our approach bridges the gap in current automated grading systems providing a scalable and adaptable solution for personalized learning in diverse educational contexts. This system is particularly beneficial for institutions managing extensive student cohorts, offering real-time, individualized feedback to enhance student engagement and learning outcomes. Results demonstrate the effectiveness of our system with the QnA module achieving high cosine similarity scores of 0.87 for theory questions and 0.81 for numerical when compared with a solution manual. The Feedback Generation module exhibited a strong correlation (r = 0.92) with professor-assigned marks validating its alignment with human evaluations, this empirical validation involved 150 student responses across diverse problem types in the Computer Architecture course. These results highlight the robustness and accuracy of our approach in real-world educational scenarios.
Reliable evaluation of large language models (LLMs) for educational use requires benchmarks that reflect exam constraints, instructor grading practices, and the operational consequences of thresholded decisions. This paper introduces ExamQ-Gen, an instructor-in-the-loop benchmark that couples two tasks: (i) an LLM answering university-style exam questions and (ii) decision-support grading aligned with an instructor reference. Automatic grading is used for triage and feedback; in practice, ExamQ-Gen supports instructor-led exam authoring and provides grading recommendations, while the instructor issues the final grade and pass/fail decision. ExamQ-Gen is constructed from the course content by using an LLM to generate exam-style questions directly from the lecture materials, producing a course-derived question set suitable for controlled experimentation. The benchmark then instantiates contrasting exam conditions, including instructor-authored (HUMAN) versus pipeline-generated (PIPELINE) artifacts, to evaluate robustness under distribution shifts that can occur when exam questions and answers are produced through different generation workflows. Using two LLM “students” (Llama3-8B-Instruct and Mistral-7B-Instruct) and an LLM-based grader, we compare automatic grading against an instructor reference on a 1–10 score scale and at the decision level induced by the operational pass policy (pass if score ≥ 9). Accordingly, our conclusions are conditioned on the two evaluated student models. Score-level agreement is strong under HUMAN conditions but degrades substantially under PIPELINE conditions, indicating condition-dependent stability. At the pass threshold, decision errors are highly asymmetric, with false fails dominating false passes, meaning that conservative grading may appear safe while producing credit denial. A severity-focused analysis isolates a high-stakes failure mode—denial of instructor-perfect answers—and shows that, in the most affected PIPELINE condition, the perfect-pass miss rate reaches 0.926 (50/54), consistent with systematic conservatism rather than borderline noise. Overall, the results highlight that aggregate score agreement and accuracy are insufficient for instructor-controlled exam deployment and motivate reporting practices that combine disaggregated score agreement, threshold-based error asymmetry with uncertainty, and severity-aware diagnostics under exam-relevant condition shifts.
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using LLMs to author Intelligent Tutoring Systems. A common pitfall of using LLMs as tutors is their straying from desired pedagogical strategies such as leaking the answer to the student, and in general, providing no guarantees on the validity or appropriateness of the tutor assistance. We argue that while LLMs with certain guardrails can take the place of subject experts, the overall pedagogical design still needs to be handcrafted for the best learning results. Based on this principle, we create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a predefined finite state transducer. This approach retains the structure and the pedagogy of traditional tutoring systems that has been developed over the years by learning scientists but brings in additional flexibility of LLM-based approaches. Through a human evaluation study on two datasets with math word problems, we show that our hybrid approach achieves a better overall tutoring score than an instructed, but otherwise free-form, GPT-4. MWPTutor is completely modular and opens up the scope for the community to improve its performance by refining its individual modules or using different teaching strategies that it can follow.
Informative Sentence Extraction (ISE) is one of the crucial components in Automatic Question Generation (AQG) and directly influences the quality and relevancy of the generated questions. Instructional texts often contain not only informative but also irrelevant sentences. This results in the creation of poor-quality or distorted questions when irrelevant, non-informative sentences have been used as input. Therefore, the basic problem discussed in this paper is how to provide a systematic method for filtering out such sentences and retaining those that are pedagogically valuable. The purpose of ISE is to filter out irrelevant, low-quality information and retain only sentences that are factually dense, express key concepts, and are contextually significant. This paper proposes a hybrid approach for extracting informative sentences that combines lexical, statistical, and semantic criteria to identify informative sentences suitable for generating educational questions. The proposed approach consists of two modules: the first module employs four techniques to evaluate the informativeness of sentences, namely keyword-based scoring, Named Entity Recognition (NER), information gain (IG), and Sentence-BERT (SBERT). The second module utilizes multiple fusion strategies to integrate the results derived from the informative sentence extraction techniques. The preprocessed sentences extracted from educational materials were ranked and filtered based on their informativeness coverage. The evaluation results indicate that the hybrid approach improves the extraction of informative sentences compared with using individual methods. Such a contribution is important for enhancing the performance of downstream tasks in AQG systems, such as distractor generation and question formulation.
This project introduces an innovative approach to advance the field of automatic question generation using natural language processing (NLP), with a specific focus on Bloom's Taxonomy. With the increasing availability of resources and online learning platforms, there is a need for efficient methods to create diverse and contextually relevant questions. The main goal of this project is to develop a system that can automatically generate questions using NLP techniques aligned with the first three cognitive levels of Bloom's Taxonomy: remembering, understanding, and applying. This project will contribute to the field of NLP by providing a framework for automatic question generation. The project follows several stages: preprocessing the input text, identifying concepts and information, creating question rules, and generating different versions of questions based on these rules. It utilizes NLP techniques such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, syntactic analysis, and discourse analysis. The overarching goal is to provide educators, content creators, and learners with an efficient and intelligent tool for generating questions that enhance comprehension and critical thinking. By automating this process, the project seeks to save time and effort while improving the overall learning and assessment experience.
The increasing workload of educators, particularly in manual question creation, poses a significant challenge in modern education. Manual question creation demands time, effort, and a deep understanding of the material to ensure contextual and curriculum-aligned questions. To address this, an Automatic Question Generation (AQG) system was developed using extractive summarization combined with the PEGASUS and TextRank methods. The system leverages Natural Language Processing (NLP) and transformer-based large language models (LLMs) to efficiently generate relevant questions. The primary data source for this system was digital social studies (IPS) books from the Indonesian Ministry of Education. The evaluation was conducted using ROUGE Score metrics and human assessments. ROUGE analysis yielded an average F1 score of 0.87 (ROUGE-1), 0.83 (ROUGE-2), and 0.84 (ROUGE-L), demonstrating the system's effectiveness in capturing essential information. Human evaluations involving educators and students highlighted the relevance and contextual accuracy of generated questions, particularly for structured materials. The system generated questions within 2 to 6 minutes, showcasing its efficiency in reducing educators' workload. However, challenges remain in handling materials with implicit semantic relationships or nonlinear narratives, as PEGASUS struggles to maintain contextual relevance. This limitation may lead to irrelevant questions and answers, indicating a need for improved semantic understanding. This study concludes that the PEGASUS+TextRank AQG system is a promising tool to streamline question generation. Future improvements in semantic algorithms and broader training data are crucial to enhancing the system's reliability and adaptability to diverse educational contexts.
Automated question generation in practical classroom usage saves teachers' time to develop various and individual questions, as well as the time they use in interactive learning. It also comes with immediate feedback and personalised results for students, hence improving their learning and comprehension. Also, it facilitates the implementation of differentiated learning by providing questions with levels of difficulty in order to address the needs of all learners. Automatic question generation and categorisation focus on the generation and classification of questions from text; it is used in developing education assistants, improving client support solutions, and creating other forms of learning and interaction aids. This technology is used in personalised tutoring agents and intelligent FAQ (Frequently Asked Question) databases to increase efficiency and effectiveness in knowledge management and acquisition. It can be time- and cost-effective for the organisations to implement and offer user-specific services. The rate of generated questions using the rule-based method was quite high, with 84.5 % accuracy. This development means the creation of solutions that are stronger and more suitable for different applications.
ABSTRACT The construction of questions is an essential component in educational assessment and student learning processes. However, manually constructing questions is a complex task that requires not only professional training, substantial experience, and extensive resources from teachers but is also time-consuming. This article introduces an Automatic Question Generation (AQG) technology based on a prompt pattern to alleviate this burden and address the ongoing need for new questions in education. The essence of this method lies in constructing a prompt pattern grounded on a collective knowledge base derived from teachers, thereby enhancing the quality of the questions produced. Practical applications and expert evaluations demonstrate that integrating a prompt pattern with a collective knowledge base into Large Language Models (LLMs) results in high-quality questions with statistically significant results. These questions not only meet educational standards but also approach the quality of manually constructed questions by teachers in certain aspects. Our research further emphasizes the feasibility of AI-teacher collaboration in education.
The current research aimed at creating a rule-based method for forming Wh- and Yes/No-type questions based on textual input. The study uses the rule-based method to automatically create Wh- and Yes/No-questions. The approach is based on the syntactic analysis of input sentences to identify the corresponding question forms to be used and apply certain rules for each type. For the proposed method, the achieved accuracy is 82.20% in generating Wh- and Yes/No-type questions. The findings suggest that the rule-based approach produces appropriate questions corresponding to intervention aims, including checking for understanding and encouraging critical thinking. Such a method may be more effective than many other approaches that are currently used in practice in terms of ease, efficiency, and relevance to education environments. This provides a rich solution that can meet the needs of educators and students alike.
Indonesia is facing a significant shortage of teachers, particularly in remote areas, due to various contributing factors. This shortage exacerbates disparities in teaching quality and underscores the need for innovative solutions. This study proposes the use of Artificial Intelligence (AI), specifically Generative AI, to automate the creation of diverse test items. The proposed AI-powered tool focuses on generating questions aligned with Indonesia's Minimum Competency Assessment (MCA) in reading literacy and mathematics. By leveraging large language models, natural language processing techniques, and image generation for visual stimuli, the tool aims to support teachers in developing engaging and customized assessments tailored to students' needs. The outcome is expected to be an AI-based tool that not only reduces teacher workload but also improves the quality and effectiveness of student assessments in Indonesia.
Question generation in education is a time-consuming and cognitively demanding task, as it requires creating questions that are both contextually relevant and pedagogically sound. Current automated question generation methods often generate questions that are out of context. In this work, we explore advanced techniques for automated question generation in educational contexts, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model that merges both methods. We implement GPT-4 for ICL using few-shot examples and BART with a retrieval module for RAG. The Hybrid Model combines RAG and ICL to address these issues and improve question quality. Evaluation is conducted using automated metrics, followed by human evaluation metrics. Our results show that both the ICL approach and the Hybrid Model consistently outperform other methods, including baseline models, by generating more contextually accurate and relevant questions.
This study investigates the application effectiveness of the Large Language Model (LLMs) ChatGLM in the automated generation of high school information technology exam questions. Through meticulously designed prompt engineering strategies, the model is guided to generate diverse questions, which are then comprehensively evaluated by domain experts. The evaluation dimensions include the Hitting(the degree of alignment with teaching content), Fitting (the degree of embodiment of core competencies), Clarity (the explicitness of question descriptions), and Willing to use (the teacher's willingness to use the question in teaching). The results indicate that ChatGLM outperforms human-generated questions in terms of clarity and teachers' willingness to use, although there is no significant difference in hit rate and fit. This finding suggests that ChatGLM has the potential to enhance the efficiency of question generation and alleviate the burden on teachers, providing a new perspective for the future development of educational assessment systems. Future research could explore further optimizations to the ChatGLM model to maintain high fit and hit rates while improving the clarity of questions and teachers' willingness to use them.
The development of Automatic Question Generation (QG) models has the potential to significantly improve educational practices by reducing the teacher workload associated with creating educational content. This paper introduces a novel approach to educational question generation that controls the topical focus of questions. The proposed Topic-Controlled Question Generation (T-CQG) method enhances the relevance and effectiveness of the generated content for educational purposes. Our approach uses fine-tuning on a pre-trained T5-small model, employing specially created datasets tailored to educational needs. The research further explores the impacts of pre-training strategies, quantisation, and data augmentation on the model’s performance. We specifically address the challenge of generating semantically aligned questions with paragraph-level contexts, thereby improving the topic specificity of the generated questions. In addition, we introduce and explore novel evaluation methods to assess the topical relatedness of the generated questions. Our results, validated through rigorous offline and human-backed evaluations, demonstrate that the proposed models effectively generate high-quality, topic-focused questions. These models have the potential to reduce teacher workload and support personalised tutoring systems by serving as bespoke question generators. With its relatively small number of parameters, the proposals not only advance the capabilities of question generation models for handling specific educational topics but also offer a scalable solution that reduces infrastructure costs. This scalability makes them feasible for widespread use in education without reliance on proprietary large language models like ChatGPT.
In this contemporary world full of information, online lecture videos are a big fountain of knowledge. Nevertheless, quizzes have to be developed based on these videos to make evaluation of knowledge acquisition much easier. The research describes a method for generating quizzes from online teaching videos that enhances self-learning through continuous assessment. Unlike existing approaches which are resource intensive and computationally demanding, we aim at providing a Video Question Generation model that is light weight and effective. We take advantage of state-of-the-art Natural Language Processing (NLP) technology to improve our model’s flexibility and allow it to be fine-tuned using T5 transformers. Our system also generates various forms of “Wh” questions such as who, when, where, what, which, why and how as well as Multiple Choice Questions (MCQs). Through this study we hope to give teachers and students alike a tool that can facilitate knowledge assessment and create an active learning environment.
In recent years, the global social landscape has become increasingly complex, requiring the ability to think from a wide range of diverse perspectives for effective problem-solving. In the field of education, panoramic learning, which implements interdisciplinary and comprehensive education, has become essential. Also, there has been recent research on various aspects of automatic question generation (AGQ), with some studies focusing on generating panoramic questions, which provide a comprehensive understanding, across different genres using knowledge graph (KG). KG is a knowledge base that uses a graph-structured data model and consists of entities and relationships between entities. On the other hand, research on generating panoramic questions for specific subjects with educational purposes has been limited, and this study aims to address that. In this work, we specifically targeted the field of history for question generation and used complemented entities to enhance the inclusion of panoramic knowledge in the field of history. The approach involves enhancing subgraphs with link prediction, which complements missing relationships in KGs, particularly in historical contexts requiring temporal and spatial insights. Through evaluation, it was validated that the proposed method could generate questions containing more panoramic knowledge compared to existing methods.
In recent years, the educational system has seen numerous improvements, including the addition of assessment criteria to evaluate educational outcomes. However, the manual creation of test questions often fails to accurately assess students’ competency levels and is time-consuming. This paper addresses the need for automated question generation (AQG) in the context of outcome-based education (OBE), a student-centric approach that has yet to incorporate AQG techniques. OBE, grounded in Bloom’s taxonomy, encompasses three domains: cognitive, psychomotor, and affective. Research focuses on the cognitive domain and its six levels of question generation. OBE algorithms for AQG are based on accuracy, time consumption, and question quality. The DistilBERT question-answering model and transformers for error correction are used in our AQG model. The model is trained on QGSTEC and assessed using performance measures. Comparatively, the model has higher accuracy, precision, and F1-score.
This study explores automatic generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
Automatic item generation may supply many items instantly and efficiently to assessment and learning environments. Yet, the evaluation of item quality persists to be a bottleneck for deploying generated items in learning and assessment settings. In this study, we investigated the utility of using large‐language models, specifically Llama 3‐8B, for evaluating automatically generated cloze items. The trained large‐language model was able to filter out majority of good and bad items accurately. Evaluating items automatically with instruction‐tuned LLMs may aid educators and test developers in understanding the quality of items created in an efficient and scalable manner. The item evaluation process with LLMs may also act as an intermediate step between item creation and field testing to reduce the cost and time associated with multiple rounds of revision.
The widespread usage of computer-based assessments and individualized learning platforms has resulted in an increased demand for the rapid production of high-quality items. Automated item generation (AIG), the process of using item models to generate new items with the help of computer technology, was proposed to reduce reliance on human subject experts at each step of the process. AIG has been used in test development for some time. Still, the use of machine learning algorithms has introduced the potential to improve the efficiency and effectiveness of the process greatly. The approach presented in this paper utilizes OpenAI's latest transformer-based language model, GPT-3, to generate reading passages. Existing reading passages were used in carefully engineered prompts to ensure the AI-generated text has similar content and structure to a fourth-grade reading passage. For each prompt, we generated multiple passages, the final passage was selected according to the Lexile score agreement with the original passage. In the final round, the selected passage went through a simple revision by a human editor to ensure the text was free of any grammatical and factual errors. All AI-generated passages, along with original passages were evaluated by human judges according to their coherence, appropriateness to fourth graders, and readability.
This study explores the use of retrieval-augmented generation (RAG) combined with one-shot prompting to automatically generate reading comprehension questions aligned with the Portuguese secondary-school literature curriculum. Focusing on inference-type questions based on Padre António Vieira’s Sermão de Santo António aos Peixes, the system generated 50 open-ended items evaluated by two experts in literary education. The results show strong curricular alignment (92%) and moderate usability (64%), indicating that the model can reproduce exam-style formulations anchored in authentic textual material. These findings suggest that RAG effectively constrains generation to curricular content while maintaining linguistic and pedagogical coherence. Future work will expand the evaluation to additional literary texts, question types, and expert raters, as well as compare alternative models, chunking strategies, and prompting configurations to enhance the generalization of results.
Automatic item generation (AIG) has the potential to greatly expand the number of items for educational assessments, while simultaneously allowing for a more construct-driven approach to item development. However, the traditional item modeling approach in AIG is limited in scope to content areas that are relatively easy to model (such as math problems), and depends on highly skilled content experts to create each model. In this paper we describe the interactive reading task, a transformer-based deep language modeling approach for creating reading comprehension assessments. This approach allows a fully automated process for the creation of source passages together with a wide range of comprehension questions about the passages. The format of the questions allows automatic scoring of responses with high fidelity (e.g., selected response questions). We present the results of a large-scale pilot of the interactive reading task, with hundreds of passages and thousands of questions. These passages were administered as part of the practice test of the Duolingo English Test. Human review of the materials and psychometric analyses of test taker results demonstrate the feasibility of this approach for automatic creation of complex educational assessments.
A college English test generation model was constructed based on a corpus in this study. By combining common linguistic datasets, the automatic item generation method was adopted for large-scale testing. The corpus-based approach was applied for English language instruction. Corpus construction, preprocessing, vocabulary analysis, and other relevant components were integrated for effective test item generation. A methodology using word lists with word ratios and other new metrics was derived from preference words and levels of difficulty to calculate sentence difficulty and its text complexity index. To address the challenges of previous systems, challenges in multiple-choice tests were addressed. The developed model uses corpus processing and machine learning algorithms to generate test questions at all levels of difficulty. The developed system solves problems of the current college English systems.
Multiple choice questions (MCQs) are frequently used in medical education for assessment. Automated generation of MCQs in board-exam format could potentially save significant effort for faculty and generate a wider set of practice materials for student use. The goal of this study was to explore the feasibility of using ChatGPT by OpenAI to generate USMLE/COMLEX-USA-style practice quiz items as study aids. Researchers gave second year medical students studying renal physiology access to a set of practice quizzes with ChatGPT generated questions. The exam items generated were evaluated by independent experts for quality and adherence to NBME/NBOME guidelines. Forty-nine percent of questions contained item writing flaws, and 22% contained factual or conceptual errors. However, 59/65 (91%) were categorized as a reasonable starting point for revision. These results demonstrate the feasibility of large language model (LLM)-generated practice questions in medical education, but only when supervised by a subject matter expert with training in exam item writing.
While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
PURPOSE This study aimed to evaluate the usability of artificial intelligence (AI)-based question generation methods, Chat Generative Pre-trained Transformer (ChatGPT)-4o (a non-template-based large language model) and a template-based automatic item generation (AIG) method, in the context of radiology education. The primary objective was to compare the psychometric properties, perceived quality, and educational applicability of generated multiple-choice questions (MCQs) with those written by a faculty member. METHODS Fifth-year medical students who participated in the radiology clerkship at Eskişehir Osmangazi University were invited to take a voluntary 15-question examination covering musculoskeletal and rheumatologic imaging. The examination included five MCQs from each of three sources: a radiologist educator, ChatGPT-4o, and the template-based AIG method. Student responses were evaluated in terms of difficulty and discrimination indices. Following the examination, students rated each question using a Likert scale based on clarity, difficulty, plausibility of distractors, and alignment with learning goals. Correlations between students' examination performance and their theoretical/practical radiology grades were analyzed using Pearson's correlation method. RESULTS A total of 115 students participated. Faculty-written questions had the highest mean correct response rate (2.91 ± 1.34), followed by template-based AIG (2.32 ± 1.66) and ChatGPT-4o (2.3 ± 1.14) questions (P < 0.001). The mean difficulty index was 0.58 for faculty, and 0.46 for both template-based AIG and ChatGPT-4o. Discrimination indices were acceptable (≥0.2) or very good (≥0.4) for template-based AIG questions. In contrast, four of the ChatGPT-generated questions were acceptable, and three were very good. Student evaluations of questions and the overall examination were favorable, particularly regarding question clarity and content alignment. Examination scores showed a weak correlation with practical examination performance (P = 0.041), but not with theoretical grades (P = 0.652). CONCLUSION Both the ChatGPT-4o and template-based AIG methods produced MCQs with acceptable psychometric properties. While faculty-written questions were most effective overall, AI-generated questions, especially those from the template-based AIG method, showed strong potential for use in radiology education. However, the small number of items per method and the single-institution context limit the robustness and generalizability of the findings. These results should be regarded as exploratory, and further validation in larger, multicenter studies is required. CLINICAL SIGNIFICANCE AI-based question generation may potentially support educators by enhancing efficiency and consistency in assessment item creation. These methods may complement traditional approaches to help scale up high-quality MCQ development in medical education, particularly in resource-limited settings; however, they should be applied with caution and expert oversight until further evidence is available, especially given the preliminary nature of the current findings.
Aims: The aim of this study is to systematically evaluate the performances of large language model-based generative Artificial Intelligence (Gen-AI) tools, Gemini and Copilot, in the generation and assessment of multiple-choice questions (MCQs) for use in medical education. Methods: A total of 335 MCQs were generated from two virtual patient cases using standardized prompts. Gen-AI tools selected the 56 best-quality items based on criteria encompassing the intended distributions regarding acceptable level of performance (ALP), Miller's competency pyramid (Miller) and Bloom's revised taxonomy (Bloom) levels, as well as alignment with learning objectives (LOs). Expert medical educators and current Gen-AI tools assessed these items based on the identification of misleading/confusing distractor(s) for borderline candidates (minimally competent examinees) to calculate ALP values, and the identification of key(s), as well as Miller and Bloom levels, LO alignment, stem appropriateness, and technical item flaws. "AI-extended consensus" served as the intersubjective consensus model (the gold standard). Generation performance was quantified by alignment with this consensus, and assessment performance by the degree to which Gen-AIs shifted or preserved Expert assessments. Analyses included ICC for reliability, Po/Cohen's/Fleiss' Kappa for categorical agreement, and inferential tests (Exact McNemar and Wilcoxon signed-rank) for detecting systematic bias and directional shifts. Results: Gen-AIs demonstrated markedly different performance patterns in assigning cognitive levels. For Miller, Gemini-generated MCQs exhibited superior consistency with the intersubjective consensus (ICC(2,k)=0.82), whereas for Bloom, Copilot-generated MCQs demonstrated this superiority (ICC(2,k)=0.97). Both tools performed well in LO alignment and key identification, but their approaches to stem structure diverged substantially. Experts perceived the MCQs to be easier than the Gen-AIs claimed, and the current Gen-AI versions found them even easier than both the generating versions and the Experts did. In terms of assessment behaviour, Gen-AIs showed a systematic stringency tendency in Miller classifications, statistically significantly shifting Expert consensus from 'knows' to 'knows how' (p
The paper examines methodological foundations for integrating generative artificial intelligence in education in Ukraine amid digital transformation. It clarifies the notions of generative AI and large language models and delineates their didactic affordances and limits. The absence of coherent institution-level risk management and unified policies for data handling, academic integrity, and responsible deployment is noted. Opportunities are mapped across four domains. In teaching, GenAI enables personalization of content and pace, rapid formative feedback, writing support, and generation of lesson plans, tasks, and rubrics. In assessment, it supports criterion-referenced rubrics, item generation, and faster feedback cycles that free time for dialogue. In administration, GenAI assists with routine automation and document flows, including drafting official templates and validating consistency of program materials. In addition, accessibility services (text-to-speech, speech recognition, image analysis, and content adaptation) expand participation for learners with diverse needs and multilingual backgrounds. Alongside benefits, the study highlights challenges: protection of personal data and privacy, algorithmic bias, model hallucinations and the need for fact checking, risks to academic integrity, unequal access, and total cost of ownership. To address these, the article proposes a practical framework that combines clear institutional policies and procedures with transparent consent and logging; development of digital and information literacy for teachers and students, including task formulation, verification of claims, and correct citation of AI interactions; a human-in-the-loop didactic design emphasizing pedagogical appropriateness, gradual adoption, and balance with traditional methods; and evidence-based monitoring using pilots, measurable outcomes, and peer review. The novelty lies in consolidating fragmented guidance into a context-sensitive roadmap connecting governance, pedagogy, and infrastructure. Practical significance includes adaptable templates for course and policy design, recommendations for professional development, and scenarios for responsible classroom use. Boundary conditions are outlined, including reliable connectivity, secure platforms that meet data protection requirements, sustained support for educators through mentoring and micro-learning, and equity mechanisms that ensure meaningful access across regions.
No abstract available
This study aims to introduce AI text generation using HyperCLOVA, a Korean-based super-large language model, and to examine whether AI text generation is applicable to the educational field. In detail, an example of text generation using HyperCLOVA was presented. Then, survey data were collected from university students of education to examine the face validity of AI-generated texts compared with human-written texts. We also investigated opinions on the feasibility of AI text generation in teaching and learning environments. The survey results show no statistically significant difference between the AI-generated texts and the original human-written text. In addition, relatively few respondents felt that additional corrections would be needed before the AI-generated texts could be used in educational practice, whereas many agreed that AI text generation would help reduce the burden on Korean language teachers.
The transformative capabilities of large language models (LLMs) are reshaping educational assessment and question design in higher education. This study proposes a systematic framework for leveraging LLMs to enhance question-centric tasks: aligning exam questions with course objectives, improving clarity and difficulty, and generating new items guided by learning goals. The research spans four university courses—two theory-focused and two application-focused—covering diverse cognitive levels according to Bloom’s taxonomy. A balanced dataset ensures representation of question categories and structures. Three LLM-based agents—VectorRAG, VectorGraphRAG, and a fine-tuned LLM—are developed and evaluated against a meta-evaluator, supervised by human experts, to assess alignment accuracy and explanation quality. Robust analytical methods, including mixed-effects modeling, yield actionable insights for integrating generative AI into university assessment processes. Beyond exam-specific applications, this methodology provides a foundational approach for the broader adoption of AI in post-secondary education, emphasizing fairness, contextual relevance, and collaboration. The findings offer a comprehensive framework for aligning AI-generated content with learning objectives, detailing effective integration strategies, and addressing challenges such as bias and contextual limitations. Overall, this work underscores the potential of generative AI to enhance educational assessment while identifying pathways for responsible implementation.
ABSTRACT The authors investigated using a large language model (LLM) for writing test questions for a real estate licensing exam. In Study 1, items were generated by GPT-4 and rated by subject matter experts (SMEs). These items were on-topic, relevant, and generally appropriate. Item difficulty manipulation was ineffective. Cognitive level matching was harder as cognitive level increased. Study 2 compared human and LLM items using SME and content developer ratings. Human and LLM items were similar in blueprint alignment, relevance, factual errors, and key quality. LLM items had better stem quality and cognitive level matching. Human distractors had an edge in quality. Study 3 investigated content overlap and breadth of coverage. Similar prompts frequently generated overlapping content. The range of content represented in large sets of generated items did not cover the breadth of the generating content areas. Results suggest LLMs are as good as SMEs at generating first-draft items.
Abstract Introduction The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. Methods Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. Results Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28–0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. Discussion These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM’s response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
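As a rough illustration of the psychometrics involved, the sketch below computes the difficulty index (proportion correct) and point-biserial index (item-rest correlation) from 0/1-scored responses and compares two sets of difficulty indices with Spearman's rho. The data, group sizes, and scoring are illustrative assumptions, not the study's.

```python
# Difficulty and point-biserial indices, plus Spearman comparison of two groups.
import numpy as np
from scipy import stats

def difficulty_and_point_biserial(responses: np.ndarray):
    """responses: (n_examinees, n_items) matrix of 0/1 item scores."""
    difficulty = responses.mean(axis=0)
    totals = responses.sum(axis=1)
    pbis = []
    for j in range(responses.shape[1]):
        rest = totals - responses[:, j]              # rest score excludes the item itself
        r, _ = stats.pointbiserialr(responses[:, j], rest)
        pbis.append(r)
    return difficulty, np.array(pbis)

# Compare LLM-derived and human-derived difficulty indices with Spearman's rho
rng = np.random.default_rng(1)
llm_runs = (rng.random((100, 60)) < 0.70).astype(int)    # 100 trials per item, 60 items
fellows = (rng.random((25, 60)) < 0.57).astype(int)
d_llm, _ = difficulty_and_point_biserial(llm_runs)
d_fel, _ = difficulty_and_point_biserial(fellows)
rho, p = stats.spearmanr(d_llm, d_fel)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```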
ABSTRACT Intelligent education relies on the generation of multi-level, comprehensive, and diverse question banks to assess student learning effectiveness and teaching efficacy. However, the development of professional question banks often presents challenges such as reliance on expert knowledge and experience, limited transferability, high workload, and subjective biases. In Geographical Information Systems (GIS), personalized question settings could be impacted by diverse knowledge sources and varying student orientations. To address this issue, we propose a novel large language model (LLM) framework guided by GIS prior knowledge for generating professional GIS question banks. Specifically, we tackle three major challenges in intelligent GIS question bank generation: incomplete knowledge coverage, skewed difficulty distribution, and limited adaptability of question types. This framework is founded upon the autonomous understanding, planning, and reasoning capabilities of LLMs, augmented by an elaborate retrieval strategy. It comprises three key modules: subtask matching and partitioning, subtask importance evaluation and quantity allocation, as well as adaptive scenario question generation. Together, these components enable the generation of personalized GIS question banks for learning and teaching tasks. Extensive experiments demonstrate its effectiveness across various metrics. Furthermore, our method with specialized knowledge organization can serve as a valuable resource for advancing research and applications in GIS education.
In the field of education, the traditional practice of writing examination questions by hand suffers from low efficiency and uneven quality. To address this difficulty, an examination question generation system based on soft knowledge prompts and a large language model is proposed. The large language model is used to generate examination questions from review materials: a feature representation and knowledge mining module extracts key knowledge points and produces a soft knowledge prompt. The soft knowledge prompt computes similarity scores against external domain knowledge, selects the most relevant knowledge segment, and guides the large language model through an adaptive fusion mechanism, realizing automatic generation from review materials to examination questions and improving the efficiency and quality of educational resource generation.
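A minimal sketch of the retrieval step this abstract describes: score candidate knowledge segments against the extracted key points, keep the most similar segment, and splice it into the generation prompt. TF-IDF cosine similarity stands in for the paper's soft knowledge prompt mechanism, and all texts and prompt wording are illustrative.

```python
# Select the most relevant external knowledge segment and build a guided prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_segments = [
    "Dijkstra's algorithm finds shortest paths in graphs with non-negative weights.",
    "A binary search tree keeps keys ordered to allow O(log n) lookup.",
    "Dynamic programming solves problems by combining overlapping subproblems.",
]
key_points = "shortest path, weighted graph, greedy relaxation"

vec = TfidfVectorizer().fit(knowledge_segments + [key_points])
sims = cosine_similarity(vec.transform([key_points]),
                         vec.transform(knowledge_segments))[0]
best = knowledge_segments[sims.argmax()]          # most relevant segment

prompt = (
    "Using the reference material below, write one exam question "
    "with four options and mark the correct answer.\n"
    f"Reference: {best}\nKey points: {key_points}"
)
print(prompt)
```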
Abstract The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. 14 exams were created with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs in automating exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias mitigation strategies, and human oversight measures. The results of this study contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity. The empirical results suggest not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies.
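The text-similarity scoring used in this kind of evaluation can be approximated with simple unigram statistics. The sketch below computes ROUGE-1 F1 from unigram overlap and a bag-of-words cosine similarity between a model answer and a reference answer; the example sentences are placeholders, not items from the study, and an embedding-based similarity would normally complement these counts as the abstract notes.

```python
# Unigram ROUGE-1 F1 and bag-of-words cosine similarity for answer scoring.
from collections import Counter
import math

def rouge1_f1(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())               # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def cosine(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    dot = sum(c[w] * r[w] for w in c)
    norm = math.sqrt(sum(v * v for v in c.values())) * math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

reference = "Pacific navigation relied on star paths, swells, and bird behaviour."
rag_answer = "Navigators used star paths, ocean swells, and bird behaviour to find islands."
print(rouge1_f1(rag_answer, reference), cosine(rag_answer, reference))
```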
No abstract available
No abstract available
Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.
No abstract available
Primary mathematics education faces systemic challenges in translating curriculum reforms into classroom practice, exacerbated by teachers’ cognitive overload and limited support for pedagogical innovation. This study develops an Intelligent Teaching Design Assistant grounded in socio-constructivist and cognitive load theories to address these challenges. Thirty-four primary mathematics teachers participated in a quasi-experimental study. The Intelligent Teaching Design Assistant integrates Large Language Models with multi-dimensional knowledge bases (curriculum standards, teaching strategies, student profiles) and a multi-agent architecture (process planner, student simulator). The Intelligent Teaching Design Assistant significantly outperformed generic Large Language Models, improving overall lesson plan quality. This work pioneers a replicable pathway for AI to empower teacher agency and advance 21st-century educational transformation.
BACKGROUND Integrating domain-specific knowledge into large language models (LLMs) remains a critical challenge in medical education. In dental specialties such as endodontics, effective learning requires access to both textual clinical evidence and visual procedural demonstrations. However, generic LLMs often produce content that lacks clinical accuracy, contextual grounding, or pedagogical clarity, thereby limiting their applicability in specialized training environments. OBJECTIVE To develop and evaluate a Retrieval-Augmented Generation (RAG)-enhanced LLMs framework that addresses the challenge of integrating domain-specific knowledge in AI-driven endodontic education. METHOD We present Endodontics-KB, a multimodal knowledge integration platform that combines evidence-based dental literature (e.g., textbooks, clinical guidelines) with visual instructional materials (e.g., procedural videos) through a hierarchical RAG architecture. The system's core component, the EndoQ chatbot, utilizes LLMs augmented with multimodal dental datasets to enable context-aware clinical reasoning. Benchmarking was conducted against three general-purpose LLMs: GPT-4, Qwen2.5, and DeepSeek R1, using a structured question bank comprising 11 expert-validated endodontic questions. Two domain experts performed a blinded evaluation across five performance dimensions: clinical accuracy, contextual relevance, completeness, decision-making professionalism, and communication fluency. RESULTS The framework integrated 2,200 multimodal knowledge units through dynamic semantic indexing. EndoQ demonstrated statistically significant improvements across all evaluation metrics compared to general purpose LLMs: accuracy (4.45 ± 0.96), clinical relevance (4.59 ± 0.8), completeness (4.27 ± 0.83), professionalism judgment (4.45 ± 1.06), and language fluency (4.86 ± 0.47), as measured on a 5-point Likert scale. CONCLUSION This proposed framework improves educational outcomes through precise and context-aware knowledge delivery. Furthermore, it represents a scalable and transferable model for AI-enhanced clinical training across medical specialties, significantly advancing competency-based pedagogy in dental education.
The research article undertakes an experimental analysis of utilizing conversational/generative AI tools for translating question papers from English to other Indian languages, as frequently seen in the question papers of many Indian universities/colleges and competitive recruitment examinations. Automating question paper translation can offload a portion of the workload of academic teachers who prepare question papers for various types of examinations. A GUI-based desktop application was developed, leveraging ChatGPT and Claude AI, as a ready-to-use, zero-cost tool.
With the in-depth advancement of China's national strategy for the development of a new generation of artificial intelligence, generative artificial intelligence, as a key technology, is profoundly reshaping the educational ecosystem. This study focuses on the emerging interdisciplinary field of Big Data Management and Application, exploring the innovative challenges faced by talent cultivation in the digital-intelligent era. The research aims to analyze the intrinsic mechanisms of generative artificial intelligence (taking "ERNIE Bot" as an example) in promoting learners' innovative thinking and innovative skills, and further construct a "generative AI-empowered, innovation-oriented project-based curriculum model". This model integrates the entire process of "pre-class preparation - teaching implementation - project conclusion", covering core links such as learner profile construction, intelligent scenario creation, personalized task distribution, dynamic feedback, and intelligent evaluation. Finally, the paper analyzes the potential challenges in implementing the model and proposes corresponding strategies centered on the "teacher-AI-student" tripartite collaboration, aiming to provide an operable and iterable digital path for cultivating innovative talents in the Big Data Management and Application major.
With the rapid development of artificial intelligence (AI) technology, major changes have taken place in the field of medical education in China. In recent years, in order to respond to the "new medicine" requirement of training compound talents, the demand for systematic evaluation of medical students' critical thinking ability in China has been increasing. Based on the SOAP clinical reasoning framework and integrating existing critical thinking theory, this study established a medical critical thinking assessment scale covering six dimensions: interpretation, analysis, evaluation, inference, self-adjustment, and clinical adaptation. Each dimension has five levels, presenting a path from information processing to clinical decision-making ability. The scale introduces evidence-based medicine tools (such as AGREE II), cognitive bias, and other professional concepts to enhance the professionalism and consistency of evaluation, so that it can serve as the core quantitative basis of a generative AI-driven critical thinking education system. Meanwhile, the scale realizes a paradigm shift from static evaluation to dynamic diagnosis and from general scoring to personalized intervention, providing a reliable path for the cultivation of higher-order thinking ability in medicine.
This study first investigates whether a "writing evaluator persona" modeling a professor's writing-assessment perspective can be developed using ChatGPT's customization and prompting strategies and then examines its potential and limits. Based on the background and publications of Emeritus Professor J, we employed iterative input, summarization, and Q&A to design the "Professor J" GPT persona. Across three experiments, comparisons with Professor J's actual ratings showed that underspecified score-allocation instructions can distort score distributions and may elicit hallucinations. Although the approach increases procedural transparency, full score-level agreement is constrained because authentic grading incorporates contextual factors. Overall, the study frames generative AI not as a value-neutral automated grading tool but as a hybrid tool for locally instantiating instructor-specific evaluative norms and supporting reflective calibration of assessment practices.
Since the introduction of generative artificial intelligence (GAI) technology in the context of large language models (LLMs), it has been widely used for information extraction and/or extrapolation from different sources. In computer science education, a potential application of such technology is automatic code review, i.e., shifting the burden of debugging non-compilable code, detecting overlooked optimization concerns such as poor memory management in code that otherwise passes automated tests, and other advanced tasks from a human grader to LLMs. However, LLMs are currently not capable of evaluating code or mathematical expressions with 100% reliability, i.e., beyond token pattern recognition and subsequent probabilistic answer generation. With that in mind, in this paper, we explore the risk of incorrect LLM code evaluation, both descriptive and numerical, as well as begin research on its mitigation and propose further work directions.
No abstract available
No abstract available
No abstract available
Supporting learning and teaching at scale requires access to large and high‐quality content and datasets for analysis and innovation. With rapid advances in artificial intelligence (AI) and the growing demand for data, synthetic data has emerged as a potential solution for addressing these challenges. This editorial introduces the contributions of five accepted articles to the special section AI for Synthetic Data Generation in Education: Scaling Teaching and Learning. These articles explore key themes in leveraging AI‐generated synthetic data to support learning and teaching as well as enhance educational practices at scale. The editorial emphasizes that hybrid strategies that leverage AI alongside human judgment are essential for scaling support for learning and teaching through synthetic data generation.
Abstract Background Template-based automatic item generation (AIG) is more efficient than traditional item writing but it still heavily relies on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective We aimed to propose a hybrid AIG to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model. After a final expert review (Step 6), the content in the template can be used for item generation through a traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI in template development made it less challenging. Conclusions The hybrid AIG method transcends the traditional template-based approach by marrying the “art” that comes from AI as a “black box” with the “science” of algorithmic generation under the oversight of expert as a “marriage registrar”. It does not only capitalize on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration to increase efficiency in medical education.
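For context on the non-AI generation step (Step 7), the sketch below shows how a finished item model, meaning a stem template plus manipulable elements and their allowed values, can be expanded into candidate stems by systematic substitution. The clinical content is invented for illustration, and a real cognitive model would additionally constrain which element combinations are clinically valid.

```python
# Expand an expert-approved item model into candidate stems by substitution.
from itertools import product

stem_template = ("A {age}-year-old patient presents with {symptom} for "
                 "{duration}. What is the most appropriate next step?")
elements = {
    "age": ["25", "68"],
    "symptom": ["productive cough and fever", "wheezing after a bee sting"],
    "duration": ["3 days", "30 minutes"],
}

items = []
for combo in product(*elements.values()):
    values = dict(zip(elements.keys(), combo))
    items.append(stem_template.format(**values))

print(f"{len(items)} candidate stems generated")
print(items[0])
```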
Artificial intelligence (AI) is rapidly transforming education, presenting unprecedented opportunities for personalized learning and streamlined content creation. However, realizing the full potential of AI in educational settings necessitates careful consideration of the quality, cognitive depth, and ethical implications of AI-generated materials. This paper synthesizes insights from four related studies to propose a comprehensive framework for enhancing AI-driven educational tools. We integrate cognitive assessment frameworks (Bloom’s Taxonomy and SOLO Taxonomy), linguistic analysis of AI-generated feedback, and ethical design principles to guide the development of effective and responsible AI tools. We outline a structured three-phase approach encompassing cognitive alignment, linguistic feedback integration, and ethical safeguards. The practical application of this framework is demonstrated through its integration into OneClickQuiz, an AI-powered Moodle plugin for quiz generation. This work contributes a comprehensive and actionable guide for educators, researchers, and developers aiming to harness AI’s potential while upholding pedagogical and ethical standards in educational content generation.
This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.
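A minimal sketch of the Semantic Textual Similarity check described above, assuming the sentence-transformers library: each generated question is embedded alongside the relevant RPT learning standard and low-similarity items are flagged for review. The model name, threshold, and Malay texts are assumptions for illustration, not the paper's configuration.

```python
# Curriculum-alignment screening via sentence-embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice

rpt_standard = "Murid dapat menyelesaikan persamaan linear dalam satu pemboleh ubah."
generated_question = "Selesaikan persamaan 3x + 5 = 20. Apakah nilai x?"

emb = model.encode([rpt_standard, generated_question], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()

print(f"STS score: {score:.2f}")
if score < 0.5:  # threshold chosen for illustration only
    print("Flag question for manual curriculum-alignment review")
```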
No abstract available
Personalized storytelling in elementary school increases participation and retention by tailoring stories to each student's individual interests and learning style. However, today's schools are rarely flexible enough to tailor lessons to each student. This study's neural text generation model is based on an improved GPT-2 architecture. It uses learner profiles that include an interest vector, reading level, and emotional tone. The model uses Byte Pair Encoding for input formatting and token-level conditioning, ensuring that the narrative it generates is relevant and coherent. When BLEU, METEOR, and human-rated engagement metrics are used to measure performance, the results are better than the baselines for general storytelling. Specifically, personalized outputs boosted participation by 24% and understanding by 18% in experimental classroom environments. The results show that AI-powered personalized stories work well in preschool and kindergarten. This method enables adaptive learning systems to adapt to each student's needs.
This paper explores the development and integration of a system combining Augmented Reality (AR), Virtual Reality (VR), and gamification within a museum setting to enhance the presentation and interaction with cultural heritage. The technological framework employs AR for dynamic artifact interaction and in situ navigation, while VR capabilities facilitate virtual tours, broadening access for individuals with disabilities or those from distant geographies or socioeconomically disadvantaged backgrounds. Gamification transforms educational content into interactive experiences, fostering deeper engagement and learning. Moreover, aligning with the mission of museum institutions for cultural heritage preservation, a module for digital conservation and reconstruction was developed resorting to photogrammetry-based approaches. This module aims to create a virtual catalog accessible to both experts and the general public. Artificial Intelligence (AI) tools automate tasks such as generating thematic quizzes for gamification and cataloging scanned artifacts. The system aims to improve the interpretative and educational potential of museum exhibits, modernizing visitor engagement while preserving the integrity of physical artifacts and spaces. Its continuous evolution aims to bridge traditional forms of cultural preservation and promotion with contemporary digital interaction techniques, leveraged from cost-effective publicly accessible edge technologies.
By constructing a knowledge supply chain model with both theoretical and practical value, this study proposes a novel approach to integrating multimodal data—such as text, financial reports, video cases, and business models—to generate teaching cases. The experiment employs a privatized Deepseek32b system, utilizing multimodal knowledge embedding technology, cognitive logic injection mechanisms, and systematic design of a teaching logic enhancer to significantly improve interdisciplinary knowledge integration and extraction efficiency. The experimental results show that generative artificial intelligence consistently produces an excess of teaching cases, with a significantly higher coverage of knowledge points compared to traditional NLP and manual methods. While generative AI exhibits stable logical coherence, its content logic is slightly inferior to that of high-quality human-generated works. This study verifies the effectiveness of the cross-modal knowledge extraction training method and provides valuable reference insights.
In recent years, large language models (LLMs) and generative AI have revolutionized natural language processing (NLP), offering unprecedented capabilities in education. This chapter explores the transformative potential of LLMs in automated question generation and answer assessment. It begins by examining the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text. The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-thought prompting, are evaluated for their effectiveness in generating high-quality questions, including open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning and prompt-tuning are explored for their role in generating task-specific questions, despite associated costs. The chapter also covers the human evaluation of generated questions, highlighting quality variations across different methods and areas for improvement. Furthermore, it delves into automated answer assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback, and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments and areas needing improvement. The discussion underscores the potential of LLMs to replace costly, time-consuming human assessments when appropriately guided, showcasing their advanced understanding and reasoning capabilities in streamlining educational processes.
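To make the contrast between the prompting techniques concrete, the sketch below builds a zero-shot prompt and a chain-of-thought-style prompt for MCQ generation and sends both through the OpenAI Python SDK (v1+). The model name, passage, and prompt wording are illustrative assumptions rather than the chapter's exact setup.

```python
# Zero-shot vs. chain-of-thought prompting for question generation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
passage = "Photosynthesis converts light energy into chemical energy stored in glucose."

zero_shot = f"Write one multiple-choice question with four options about:\n{passage}"

chain_of_thought = (
    f"Read the passage:\n{passage}\n"
    "Step 1: List the key concepts a student must understand.\n"
    "Step 2: Pick the concept most likely to be misunderstood.\n"
    "Step 3: Write one multiple-choice question targeting it, with one correct "
    "answer and three distractors based on plausible misconceptions."
)

for name, prompt in [("zero-shot", zero_shot), ("chain-of-thought", chain_of_thought)]:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, "->", reply.choices[0].message.content[:120])
```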
Generative AI has the potential to scale a number of educational practices, previously limited by resources. One such instructional approach is mastery learning, a pedagogy emphasizing proficiency before progression that is highly resource (teacher time, materials) intensive. The rise of computer-based instruction offered partial solutions, tailoring student progression and automating some facets of the mastery learning process. This work in progress considers the application of large language models for content generation tailored to mastery learning. We present a paired framework for analyzing and evaluating the generated content relative to rubrics designed by the teacher. Recognizing the potential of large language models, we critically assess the potential of improving mastery-based instruction. We close our discussion by considering the applications and limitations of this approach.
Although automated item generation has gained a considerable amount of attention in a variety of fields, it is still a relatively new technology in ELT contexts. Therefore, the present article aims to provide an accessible introduction to this powerful resource for language teachers based on a review of the available research. Particularly, it will give a brief introduction to different types of automated item generation approaches, provide a summary of previous ELT studies on this technology, and introduce three different AI-powered tools, along with practical tips for ELT practitioners. We conclude by calling for more empirical research on automated item generation from the ELT community and encouraging language teachers to take an interest in this technology themselves.
The AI-Enhanced Learning Assistant Platform is a revolutionary system designed to enhance learning, with cutting-edge features like question-and-answer generation, answer evaluation, identification of weak areas, recursive testing, an integrated query forum, and expert chat support. This platform makes use of artificial intelligence (AI) technology to try to satisfy the many needs that students and teachers have. Using natural language processing and machine learning, the platform's question and answer generating feature generates relevant questions on its own from the provided content. This encourages participation and in-depth subject understanding. The answer evaluation section provides quick feedback for improvement by utilizing AI algorithms to assess the accuracy and caliber of student responses. One of this platform's key advantages is its capacity to identify students' areas of weakness. Through the analysis of performance patterns and root causes, the system can generate customized recommendations and learning materials to help overcome those constraints. The property of recurring testing facilitates continuous assessment and reinforcement of knowledge. Through repeated practice, the program gradually pushes students to increase their understanding of the material by creating adaptive exams. Through the integrated query forum, students can collaborate and ask for assistance from others by asking questions and receiving answers from teachers and their peers. Furthermore, by enabling real-time communication between users and subject matter experts, the expert chat support tool fosters an engaging and motivating learning environment. To sum up, the AI-Enhanced Learning Assistant Platform offers a wide range of features designed to maximize learning. With AI technology, it helps students learn more effectively and retain what they have learned, promotes active learning, and provides the support they need for a good educational experience.
Natural and idiomatic expressions are essential for fluent, everyday communication, yet many second-language learners struggle to acquire and spontaneously use casual slang despite strong formal proficiency. To address this gap, we designed and evaluated an LLM-powered, task-based role-playing game in which a GPT-4o-based Game Master guides learners through an immersive, three-phase spoken narrative. After selecting five unfamiliar slang phrases to practice, participants engage in open-ended dialogue with non-player characters; the Game Master naturally incorporates the target phrases in rich semantic contexts (implicit input enhancement) while a dedicated Practice Box provides real-time explicit tracking and encouragement. Post-session, learners receive multi-level formative feedback analyzing the entire interaction. We evaluated the system in a between-subjects study with 14 international graduate students, randomly assigned to either the RPG condition or a control condition consisting of a traditional AI-led virtual classroom. Results from an immediate post-test show that the RPG group achieved greater gains in both comprehension of the target phrases and their accurate, contextual use in sentences. Quantitative analysis of in-activity word-usage frequency, combined with qualitative survey responses, further indicates that the game-based approach provided more practice opportunities and higher perceived engagement, resulting in a more natural learning experience. These findings highlight the potential of narrative-driven LLM interactions in vocabulary acquisition.
No abstract available
The application of social cognitive theory has expanded to the boundaries of human-computer interaction research. However, existing research has scarcely addressed mutual cognitive facilitation between humans and personalized educational large language model (LLM) agents. This study explored how educational LLM agents influence teachers’ curriculum design and content creation, based on a sample of 464 teachers from coastal regions of China, along with semi-structured interviews with 23 participants. Quantitative analysis of the survey data revealed that the involvement of educational LLM agents positively predicts teachers’ ability to create content in curriculum design. Additionally, teachers’ self-efficacy mediated this relationship, while both school support and self-efficacy together created a chain mediation effect. Qualitative findings from the interviews supported the quantitative results and further highlighted individual differences and contextual nuances in teachers’ use of educational LLM agents. In summary, the findings indicated that educational LLM agents positively impact teachers’ curriculum design and content creation, with school support and teachers’ self-efficacy acting as a chain mediator in this process.
In this work, a thorough mathematical framework for incorporating Large Language Models (LLMs) into gamified systems is presented, with an emphasis on improving task dynamics, increasing user engagement, and improving reward systems. Personalized feedback, adaptive learning, and dynamic content creation are all made possible by the integration of LLMs and are crucial for improving user engagement and system performance. A simulated environment is used to test the framework's adaptability and demonstrate its potential for real-world applications in a variety of industries, including business, healthcare, and education. The findings demonstrate how LLMs can offer customized experiences that raise system effectiveness and user retention. This study also examines the difficulties this framework aims to solve, highlighting its importance in maximizing involvement and encouraging sustained behavioral change in a range of sectors.
We explore the automatic generation of interactive, scenario-based lessons designed to train novice human tutors who teach middle school mathematics online. Employing prompt engineering through a Retrieval-Augmented Generation approach with GPT-4o, we developed a system capable of creating structured tutor training lessons. Our study generated lessons in English for three key topics: Encouraging Students' Independence, Encouraging Help-Seeking Behavior, and Turning on Cameras, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks. The generated lessons were evaluated by two human evaluators, who provided both quantitative and qualitative evaluations using a comprehensive rubric informed by lesson design research. Results demonstrate that the task decomposition strategy led to higher-rated lessons compared to single-step generation. Human evaluators identified several strengths in the LLM-generated lessons, including well-structured content and time-saving potential, while also noting limitations such as generic feedback and a lack of clarity in some instructional sections. These findings underscore the potential of hybrid human-AI approaches for generating effective lessons in tutor training.
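A minimal sketch of the task decomposition idea: lesson generation is broken into ordered sub-tasks, with each prompt consuming the previous step's output. The `generate` function is a stub standing in for a GPT-4o call with retrieved training material, and the sub-task breakdown and wording are illustrative, not the authors' exact pipeline.

```python
# Task decomposition prompting: chain sub-task prompts instead of one big prompt.
def generate(prompt: str) -> str:
    """Stub for an LLM call (e.g., GPT-4o with retrieved tutor-training context)."""
    return f"[LLM output for: {prompt.splitlines()[0][:60]}...]"

topic = "Encouraging Students' Independence"

# Sub-task 1: objective and outline
outline = generate(f"Draft a learning objective and outline for a tutor-training lesson on '{topic}'.")
# Sub-task 2: scenario grounded in the outline
scenario = generate(f"Using this outline, write a realistic tutoring scenario with a student message:\n{outline}")
# Sub-task 3: assessment items about the tutor's response
questions = generate(f"Write two open-response questions and one MCQ about how the tutor should respond:\n{scenario}")
# Sub-task 4: feedback for correct and incorrect answers
feedback = generate(f"For each question, write research-based feedback for correct and incorrect answers:\n{questions}")

lesson = "\n\n".join([outline, scenario, questions, feedback])
print(lesson)
```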
STEM education, particularly programming and coding, is of great importance in today's technological landscape. Turtle graphics, an effective tool for teaching programming concepts to children, is widely used in languages such as Python, known for its simplicity and readability. However, coding can be challenging for young learners, necessitating individualized support from teachers. Large language models (LLMs), which are already employed in debugging, present an opportunity to enhance educational support systems by providing personalized hints without revealing answers, thus preserving the educational value. This proposal aims to explore the use of LLMs to generate tailored hints and explanations for different age groups and skill levels, creating a dynamic and responsive learning environment. Additionally, the proposed system includes task creation that adapts to the student's previous performance and completed tasks, ensuring continuous and appropriately challenging learning experiences. The goal of our research is to design a support system that leverages LLM technology to improve children and young students' learning in Python Turtle graphics. This system promises personalized educational support and adaptive task generation, enhancing the overall learning experience for young programmers. Future studies are necessary to test this system with real users, evaluate its effectiveness, and refine its design based on practical feedback.
Involving subject matter experts in prompt engineering can guide LLM outputs toward more helpful, accurate, and tailored content that meets the diverse needs of different domains. However, iterating towards effective prompts can be challenging without adequate interface support for systematic experimentation within specific task contexts. In this work, we introduce PromptHive, a collaborative interface for prompt authoring designed to better connect domain knowledge with prompt engineering through features that encourage rapid iteration on prompt variations. We conducted an evaluation study with ten subject matter experts in math and validated our design through two collaborative prompt writing sessions and a learning gain study with 358 learners. Our results elucidate the prompt iteration process and validate the tool’s usability, enabling non-AI experts to craft prompts that generate content comparable to human-authored materials while reducing perceived cognitive load by half and shortening the authoring process from several months to just a few hours.
With the rapid development of artificial intelligence, Large Language Models (LLMs) such as ChatGPT have demonstrated strong capabilities in natural language understanding and generation, providing new possibilities for innovative teaching in higher education. This study explores the integration of LLMs into task-based English teaching to enhance students' language competence through interactive, meaningful, and contextualized learning activities. After analyzing the theoretical foundation of Task-Based Language Teaching (TBLT) and the pedagogical affordances of LLMs, the system design incorporates a modular pipeline consisting of a prompt pre-processor, an LLM-based task response engine, and an adaptive feedback module, allowing for seamless integration into existing teaching platforms. The system was deployed experimentally in two undergraduate English courses, with one group using the LLM-enhanced system and a control group relying on conventional task-based instruction. Quantitative results show that the experimental group outperformed the control group in Task Performance Score and Learner Engagement Index with statistical significance. Furthermore, qualitative feedback from learners and instructors indicates increased engagement and confidence in linguistic creativity. The results suggest that LLMs support learner autonomy and engagement and improve linguistic accuracy and fluency; the study also discusses over-reliance on AI and the teacher's evolving role, and offers suggestions for the future integration of LLMs in higher education English pedagogy.
No abstract available
Large Language Models (LLMs) have revolutionized the way natural language tasks are handled, with big potential applications in the context of education. LLMs can save educators time and effort, for instance, in content creation and exam generation. Although promising, LLMs' integration into educational products brings some risks that companies must mitigate. In the context of an industrial project, we investigate the effectiveness of LLMs to generate educational multiple-choice questions. The experiments include 16 commercial and open-source LLMs, rely on standard metrics to assess the accuracy (F1 and BLEU) and linguistic quality (perplexity and diversity) of the generated questions, and compare with five specialized models. The results suggest that recent LLMs can outperform the fine-tuned models for question generation, that open-source LLMs are very competitive with the commercial ones, with Meta Llama models being the best performing, and that DeepSeek performs on par with recent GPT-4 models. This promising empirical evidence encourages us to focus on advanced prompting strategies, for which we report relevant open challenges we aim to address in the short term.
This paper addresses the challenge of improving interaction quality in dialogue-based learning by detecting and recommending effective pedagogical strategies in tutor-student conversations. We introduce PedagoSense, a pedagogy-grounded system that combines a two-stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine-grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human-annotated tutor-student dialogues, augmented with additional non-pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine-grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM-based response generation for more adaptive educational technologies.
One key challenge for instructors is creating high-quality educational content, such as programming practice questions for introductory programming courses. While Large Language Models (LLMs) show promise for this task, their output quality can be inconsistent, and it is often unclear how to systematically improve their performance. In this experience report, we present the development process for ContentGen, an open-source tool that generates programming questions within the context of data science instructional materials. We describe our process of designing the tool and iteratively improving the tool through prompt engineering. To evaluate our changes, we designed and open-sourced a dataset of 91 test cases based on our course materials and developed three metrics to assess the generated questions: Correctness, Contextual Fit, and Coherence. We compare three prompting strategies and find that providing detailed instructions and an automatically generated summary of recently covered instructional materials to the LLM substantially improves the quality of the generated questions across our metrics. A usability study with six data science instructors further suggests that our final prototype is perceived as usable and effective. Our work contributes a case study of evidence-based prompt engineering for an educational tool and offers a practical approach for instructors and tool designers to evaluate and enhance LLM-based content generation.
Recent studies [48, 72] have demonstrated that Large Language Models (LLMs), like ChatGPT [3, 46] and LLAMA [59], can assist with routine teaching tasks and have the potential to revolutionize traditional education. However, other studies [35] highlight that LLMs often contain inaccuracies and demonstrate limited effectiveness in educational contexts. To address this issue, we propose a unified Education LLM Framework that integrates LLM into classroom teaching practice to enrich high-quality dialogical content and teacher-student interactions. Unlike complex data-driven models that require vast amounts of data, our framework can quickly enhance educational engagement and teaching strategies by utilizing a few carefully selected teaching examples from master teachers with our prompting techniques. We focus on two typical classroom teaching scenarios that require AI-generated content: Dialogue Completion and Expertise Transfer Learning. The former scenario requires generating contextually appropriate dialogues, while the latter scenario requires migrating the instructional styles and organization to new teaching topics. We demonstrate the effectiveness of our data quality-centered approach in generating semantically clear and factually accurate content as organized instructions for teaching materials. We comprehensively evaluate these materials by utilizing Perplexity-based Statistical Evaluation, Human Evaluation with Questionnaires, BertScore, Rouge, and BLEU. Experiments on two self-collected datasets show that our method significantly improves various metrics in Dialogue Completion and Expertise Transfer Learning tasks, enhancing the overall utility of AI for educational purposes.
In-IDE learning became a popular approach, integrating programming education with professional development tools in a seamless environment. Kotlin Notebook extends this concept by enabling highly interactive lessons within an industrial IDE while leveraging its capabilities, such as code quality inspections or refactorings. Kotlin Notebook structures programming content into interactive sections, enhancing both engagement and comprehension. This talk explores the combination of in-IDE learning and Kotlin Notebook with the integration of LLMs to create a powerful tool for interactive learning within an industrial-grade IDE. We propose a method for automatically generating exercises, visual materials, and contextual explanations directly within Kotlin Notebook. This approach not only streamlines lesson creation but also allows students to stay within the IDE and interact with its professional features. Additionally, previous research has shown that integrating LLMs with IDE functionality can enhance the quality and control of LLM outputs through static analysis and validation. This combination represents a novel and scalable approach to improving programming education and interactive learning experiences.
This paper presents a novel multimodal quiz generation framework that integrates audio, visual, and textual data using a Retrieval-Augmented Generation (RAG) architecture. The system leverages LLaVA for vision-language understanding and LLaMA 3.1 for text generation to produce contextually relevant and pedagogically meaningful multiple-choice questions (MCQs) from lecture videos. This approach addresses key limitations of traditional text-only quiz generation models by capturing richer, multimodal information. The system was tested on a real-world use case, generating 15 MCQs from the first lecture in an introductory computer science course. To evaluate the effectiveness of the generated quizzes, we designed a two-stage evaluation framework. In the first stage, we assessed retrieval and generation performance using standard metrics such as Hit Rate, Mean Reciprocal Rank (MRR), Correctness, Relevance, and Faithfulness. In the second stage, we examined how closely AI evaluations align with human expert judgments. We involved four human raters and three LLM-as-Judge models—Claude 3 Sonnet, GPT-4, and LLaMA 3.1—to evaluate each question. To analyze agreement, we used Percentage Agreement, Cohen's Kappa, Spearman's Rho, and Krippendorff's Alpha, capturing both exact matches and ordinal consistency. Our results show high retrieval accuracy and reasonable alignment between LLM based and human assessments, particularly in factual and procedural questions. However, discrepancies emerged in questions requiring deeper reasoning or visual interpretation, where human raters exhibited stronger consistency. These findings highlight the strengths of LLMs in scalable content generation, while reinforcing the need for human oversight in evaluating complex educational tasks. This work takes a significant step toward more human-aligned and effective AI-driven assessment systems.
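The second evaluation stage boils down to standard inter-rater agreement statistics. The sketch below compares one human rater's quality ratings with one LLM-as-Judge's ratings on 15 generated MCQs using percentage agreement, Cohen's kappa, and Spearman's rho; the ratings are fabricated for illustration only.

```python
# Agreement between a human rater and an LLM-as-Judge on per-question ratings.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

human = np.array([5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 5, 4, 3, 5, 4])       # 1-5 ratings, one per MCQ
llm_judge = np.array([5, 4, 5, 3, 4, 2, 4, 5, 3, 4, 4, 4, 3, 5, 5])

agreement = (human == llm_judge).mean()          # percentage agreement (exact matches)
kappa = cohen_kappa_score(human, llm_judge)      # chance-corrected categorical agreement
rho, _ = spearmanr(human, llm_judge)             # ordinal consistency

print(f"Agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}, Spearman's rho: {rho:.2f}")
```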
This article describes new results of an application using transformer-based language models to automated item generation (AIG), an area of ongoing interest in the domain of certification testing as well as in educational measurement and psychological testing. OpenAI's gpt2 pre-trained 345M parameter language model was retrained using the public domain text mining set of PubMed articles and subsequently used to generate item stems (case vignettes) as well as distractor proposals for multiple-choice items. This case study shows promise and produces draft text that can be used by human item writers as input for authoring. Future experiments with more recent transformer models (such as Grover, TransformerXL) using existing item pools are expected to improve results further and to facilitate the development of assessment materials.
ChatGPT is gaining widespread acceptance in many disciplines since its launch at the end of 2022. The impact of ChatGPT on education is evident, but there is a dearth of knowledge on how English as a Foreign Language (EFL) teachers benefit from this technology. Therefore, this study investigates the use of ChatGPT to generate exam questions among EFL educators in Saudi Arabia. Through a mixed-methods approach that included an online questionnaire and an experimental design, the study attempted to gain insights from educators on using artificial intelligence (AI) technology for assessment. An online questionnaire was shared with 200 public school EFL teachers at various grade levels in the Eastern Province of Saudi Arabia. The findings revealed a varied landscape of perspectives, with some educators approving ChatGPT’s efficiency in generating exam questions, whereas others expressed concerns about its limited application. A further examination of the instructor-designed and ChatGPT-generated test items revealed that ChatGPT has the potential to stimulate critical thinking and expand assessment formats. The results indicate that educators require professional development to leverage AI technology responsibly. Furthermore, this study highlights the importance of navigating the emerging ChatGPT in EFL classrooms to ensure reliability and consistency of the evaluation process.
The integration of Artificial Intelligence (AI) technologies has initiated a new era in language assessment practices, revolutionizing the field with its innovative approaches. This study introduces an advanced Automated Item Generation (AIG) system that utilizes word families as a foundation to automatically generate test items. The primary objective of this research is to investigate the effectiveness of the AIG system in producing high-quality questions through a comprehensive evaluation that combines both quantitative and qualitative measures. The AIG system is developed using cutting-edge machine learning and deep learning techniques, enabling it to enhance and facilitate the language assessment process by generating a substantial number of items. To assess the quality of the generated questions, a group of 30 experienced English teachers participated in the evaluation process. The participants assessed the quality of multiple-choice and fill-in-the-blank questions generated by the AIG system using a 4-point scale. To supplement the quantitative analysis, interviews were conducted to capture the perspectives of the teachers concerning the integration of AIG in language assessment. The findings demonstrate highly promising outcomes in terms of question quality, validating the efficacy of employing word families as a linguistic basis for generating test items. By shedding light on the advantages and effectiveness of utilizing word families as a fundamental lexical unit for AIG, this study contributes to the field of automated item generation in language assessment.
Given the increasing interest in automated item generation in the second language assessment field, this study investigated the potential of two automated item generators for L2 reading assessment. The first generator, KR-Item-Generator, was developed by the authors, who used a free chatbot builder. The second, Q-Craft, was developed using GPT-4 API and employs an all-in-one method to generate questions and passages. A total of 83 pre-service teachers at a college of education in South Korea were asked to generate English reading passages and test items using both generators. They were then given a post-task survey on varying aspects of the two generators. The results of the study demonstrated that both generators were positively perceived regarding the naturalness of the sentences in the passages and the level of completion of the test items, although Q-Craft was rated significantly more positively in terms of the latter. Given these findings, we discuss the pedagogical implications and offer key directives for further L2 AIG research.
The use of generative AI, specifically large language models (LLMs), in test development presents an innovative approach to efficiently creating technical, knowledge‐based assessment items. This study evaluates the efficacy of AI‐generated items compared to human‐authored counterparts within the context of employee selection testing, focusing on data science knowledge areas. Through a paired comparison approach, subject matter experts (SMEs) were asked to evaluate items produced by both LLMs and human item writers. Findings revealed a significant preference for LLM‐generated items, particularly in specific knowledge domains such as Statistical Foundations and Scientific Data Analysis. However, despite the promise of generative AI in accelerating item development, human review remains critical. Issues such as multiple correct answers or ineffective distractors in AI‐generated items necessitate thorough SME review and revision to ensure quality and validity. The study highlights the potential of integrating AI with human expertise to enhance the efficiency of item generation while maintaining psychometric standards in high‐stakes environments. The implications for psychometric practice and the necessity of domain‐specific validation are discussed, offering a framework for future research and application of AI in test development.
Open-ended assessment items require students to freely articulate their thinking as opposed to, for instance, multiple choice questions. Such free generation of answers by students enables what we may call true assessment because these answers offer a direct view of learners' mental models. Nevertheless, assessing open-ended learner responses is extremely challenging, e.g., if done manually by experts it becomes prohibitively expensive to scale up to millions of learners. To address this scalability challenge, automated methods to assess students' free responses are being explored. To this end, we present a novel solution to automatically assess open-ended learner responses based on recent advances in computational linguistics and optimization algorithms. Our proposed solution accounts for linguistic phenomena such as anaphora resolution and negation in order to reach a deeper level of semantic interpretation of student answers. This is a key advantage compared to previous methods that focus primarily on distributional semantic representations of texts. Furthermore, our method provides both a holistic score as well as a detailed explanation of the score by performing a concept-level analysis of student responses. We present results obtained with the proposed method on a dataset that is widely used to evaluate automated methods for assessing open-ended learner responses. The results indicate that our method is extremely competitive or surpasses the performance of previously proposed methods. Furthermore, by being able to pick up on concepts students have yet to articulate, it enables the development of more personalized and dynamic generation of feedback in intelligent tutoring systems.
Educational assessment is an essential task within the educational process, and generating correct, well-formed assessment content is a decisive step within it. Creating an automated generation method that performs comparably to an experienced human operator (teacher) involves a complex series of issues. This paper presents a compiled set of methods and tools used to generate educational assessment content in the form of assessment tests. The methods use various structures (e.g., trees, chromosomes and genes, and genetic operators) and algorithms (graph-based, evolutionary, and genetic) for the automated generation of educational assessment tests. The research is developed in the context of several simultaneous requirements (e.g., degree of difficulty, item topic), which adds further complexity to the problem. The paper presents a short literature review, followed by a description of the models developed in the authors' previous research. The final part reports the implementations of these models, along with results and performance. Several conclusions were drawn from this compilation, the most important being that tree- and genetic-based approaches yield promising results for both performance and assessment content generation.
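The following is a minimal, self-contained sketch of the chromosome-and-genetic-operator idea the abstract describes, assuming a hypothetical item pool, test length, target difficulty, and fitness weighting; the authors' actual structures and operators are more elaborate.

```python
# Toy genetic algorithm for test assembly: a chromosome is a set of item indices,
# and fitness rewards matching a target mean difficulty while covering topics.
import random

ITEM_POOL = [{"difficulty": random.uniform(0.2, 0.9), "topic": random.choice("ABC")} for _ in range(60)]
TEST_LEN, TARGET_DIFFICULTY, REQUIRED_TOPICS = 10, 0.6, {"A", "B", "C"}

def fitness(chromosome):
    items = [ITEM_POOL[i] for i in chromosome]
    mean_diff = sum(it["difficulty"] for it in items) / len(items)
    topic_coverage = len({it["topic"] for it in items} & REQUIRED_TOPICS) / len(REQUIRED_TOPICS)
    return -abs(mean_diff - TARGET_DIFFICULTY) + topic_coverage  # higher is better

def crossover(a, b):
    cut = random.randrange(1, TEST_LEN)
    return list(dict.fromkeys(a[:cut] + b))[:TEST_LEN]  # keep unique items, fixed length

def mutate(chromosome, rate=0.1):
    chromosome = list(chromosome)
    for pos in range(len(chromosome)):
        if random.random() < rate:
            unused = [j for j in range(len(ITEM_POOL)) if j not in chromosome]
            chromosome[pos] = random.choice(unused)
    return chromosome

population = [random.sample(range(len(ITEM_POOL)), TEST_LEN) for _ in range(40)]
for _ in range(100):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(crossover(*random.sample(parents, 2))) for _ in range(30)]

best = max(population, key=fitness)
print("Best test items:", best, "fitness:", round(fitness(best), 3))
```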
High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. This method offers a scalable, pre-deployment evaluation without requiring student data, but its predictive validity concerning empirical IRT parameters is underexplored. To address this gap, we conducted a study involving 7,126 multiple-choice questions across various STEM subjects (physical science, mathematics, and life/earth sciences). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life/earth and physical science domains. We further observed how specific IWF criteria can impact item quality more and less severely (e.g., negative wording vs. implausible distractors) and how they might make a question more or less challenging. Overall, our findings establish automated IWF analysis as a valuable supplement to traditional validation, providing an efficient method for initial item screening, particularly for flagging low-difficulty MCQs. Our findings show the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.
Practice tests for high-stakes assessment are intended to build test familiarity and reduce construct-irrelevant variance, which can interfere with valid score interpretation. Generative AI-driven, automated item generation (AIG) scales the creation of large item banks and multiple practice tests, enabling repeated practice opportunities. We conducted a large-scale observational study (N = 25,969) using the Duolingo English Test (DET) -- a digital, high-stakes, computer-adaptive English language proficiency test -- to examine how increased access to repeated test practice relates to official DET scores, test-taker affect (e.g., confidence), and score-sharing for university admissions. To our knowledge, this is the first large-scale study exploring the use of AIG-enabled practice tests in high-stakes language assessment. Results showed that taking 1-3 practice tests was associated with better performance (scores), positive affect (e.g., confidence) toward the official DET, and increased likelihood of sharing scores for university admissions for those who also expressed positive affect. Taking more than 3 practice tests was related to lower performance, potentially reflecting washback -- i.e., using the practice test for purposes other than test familiarity, such as language learning or developing test-taking strategies. Findings can inform best practices regarding AI-supported test readiness. Study findings also raise new questions about test-taker preparation behaviors and relationships to test-taker performance, affect, and behavioral outcomes.
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
VocQGen is an automated tool designed to generate multiple-choice cloze (MCC) questions for vocabulary assessment in second language learning contexts. It leverages several natural language processing (NLP) tools and OpenAI’s GPT-4 model to produce MCC items quickly from user-specified word lists. To evaluate its effectiveness, we used the first sublist in the Academic Word List (AWL) to generate 60 questions with VocQGen. Then we compared the quality of 60 autogenerated questions with 40 manually created ones through expert reviews and through pilot testing with 68 students. Expert review results indicate that automatically generated questions exhibit higher grammatical accuracy and clearer contexts in question stems. However, the tool occasionally produces distractors that are acceptable as correct responses. Pilot testing results show that in general the number of correct responses is higher in autogenerated questions, indicating the less challenging nature of these questions. The study concludes that manual check is still required for questions generated by VocQGen and future work should focus on improving distractor effectiveness.
This descriptive study scrutinizes the impact of ChatGPT on English Language Teaching (ELT) assessment and examines the extent to which it presents both opportunities and threats. A systematic review covering 963 academic publications published between December 2023 and December 2024 was carried out to identify these opportunities and threats. Of the 963 publications, the 150 most relevant articles were selected to address the use of ChatGPT in online language assessment in ELT. Document analysis and thematic coding were used, and 12 recurring themes were identified: six opportunities and six threats. The six opportunities were automated grading, personalized feedback, practice partner simulation in speaking and writing, assessment item generation, engagement and motivation, and multimodal & inclusive assessment; the six threats were academic dishonesty, validity & reliability concerns, algorithmic bias, overdependence & de-skilling, data privacy & institutional gaps, and adaptability. The study concludes that ChatGPT's utilization in ELT assessment is both promising and problematic. The findings suggest this duality could be addressed through pedagogical guidelines, interdisciplinary collaboration, curriculum calibration, and ethical frameworks that harness its potential while safeguarding educational integrity.
We present an empirical study evaluating the quality of multiple-choice questions (MCQs) generated by Large Language Models (LLMs) from a corpus of video transcripts of course lectures in an online data science degree program. With our database of thousands of generated questions, we conducted both human and automated judging of question quality on a representative sample using a broad set of criteria, including well-established Item Writing Flaw (IWF) categories. We found the number of average IWFs per MCQ ranged from 1.6 (rule-based verification) to 2.18 (LLM-based). Among the most frequently identified MCQ flaws were lack of enough context (17%) or answer choices with at least one implausible distractor (57%). Both human and automated assessment identified implausible distractors as one of the most frequent flaw categories. Results from our human annotation study were generally more positive (51--65% good items) compared to our automated assessment study results, which tended toward greater flaw identification (15--25% good items), depending on evaluation method.
Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best solution generated by the LLM. The process is then refined through iterative review and response with another LLM until the evaluation metric values converge, yielding more reliable quality estimates while fully automating the evaluation task. Correlation scores show that the proposed method improves correlation with human judgments compared to the baseline method. Error analysis shows that metrics such as relevance and appropriateness improve significantly relative to human judgments when using STRIVE.
Personalized education systems increasingly rely on structured knowledge representations to support adaptive learning and question generation. However, existing approaches face two fundamental limitations. First, constructing and maintaining knowledge graphs for educational content largely depends on manual curation, resulting in high cost and poor scalability. Second, most personalized education systems lack effective support for state-aware and systematic reasoning over learners' knowledge, and therefore rely on static question banks with limited adaptability. To address these challenges, this paper proposes a Generative GraphRAG framework for automated knowledge modeling and personalized exercise generation. It consists of two core modules. The first module, Automated Hierarchical Knowledge Graph Constructor (Auto-HKG), leverages LLMs to automatically construct hierarchical knowledge graphs that capture structured concepts and their semantic relations from educational resources. The second module, Cognitive GraphRAG (CG-RAG), performs graph-based reasoning over a learner mastery graph and combines it with retrieval-augmented generation to produce personalized exercises that adapt to individual learning states. The proposed framework has been deployed in real-world educational scenarios, where it receives favorable user feedback, suggesting its potential to support practical personalized education systems.
As Large Language Model (LLM) chatbots have become increasingly accessible, their misuse for academic dishonesty has raised growing concern. Current methods that attempt to detect LLM-generated text are unreliable and risk producing false positives, which can unfairly harm genuine students. This paper offers an alternative by developing an “inoculation” process that generates paraphrased questions to find semantically similar ones that LLMs answer incorrectly. We use Llama 3.2 3B to create and evaluate paraphrases for MMLU questions, then test GPT-4o mini on them to identify effective inoculated questions. The approach successfully finds inoculations for 26.7% of correctly answered questions, requiring review of no more than 20 paraphrases per question, and exposes weaknesses in the target LLM's responses at low cost.
Focusing on virtual experiment teaching, this paper proposes a personalized learning closed-loop with an LLM as the core. A simulation engine provides a verifiable factual baseline, while the LLM undertakes semantic interpretation, two-phase pathway generation (skeleton-verification-refinement), fact-grounded judgement and feedback, and explanatory summarization. To enhance robustness and compliance, the framework employs retrieval-augmented generation (RAG), structured outputs, and a second-pass verifier as guardrails. At the learner-modeling layer, we fuse LLM semantic increments with BKT/IRT steady estimates to obtain a fine-grained yet stable representation that drives adaptive replanning. The engineering design covers windowed reporting and fact checks, an orchestration service with template interfaces, result caching and tiered inference (small model first), minimal-necessary data collection with anonymization, and classroom-oriented batching and rate limiting. Although large-scale evaluation remains for future work, the framework connects the key chain "interpretation → modeling → path → judgement → explanation," demonstrating interpretability, controllability, and deployment feasibility.
Activities that engage learners to articulate their answers often make them reflect. However, evaluating such activities and providing feedback is time-consuming for teachers. For text analysis, various data-driven indicators, such as cohesion and coherence, evaluate linguistic measures and the semantic understanding of artefacts created. However, for drawing-based activities, defining such indicators is still underexplored. In this research, we conducted a draw-and-write activity that engaged students to express their understanding of a concept through writing and drawing. The question was, “What is data science?”. The human raters analyzed the artefacts generated (n=40), and then a learning analytics approach was taken to define data-driven indicators. The study proposes a data processing pipeline involving a large language model (LLM) and defines indicators to understand the coherence of written text and drawn diagrams. Further, a clustering analysis of the collected artefacts highlighted differences in the participants' expressions of data science (task context). The discussion compares automated and human classification and its implications for assessment and feedback. Future work aims to integrate the pipeline in an online learning environment that affords drawing and text input from the learners.
The IEEE P2807.6 Education Knowledge Graph (EduKG) standard defines a semantic infrastructure to represent educational knowledge, resources, and pedagogy in a unified graph format. This paper expands on the core EduKG architecture, detailing its ontology design and key entities (Learning Points, Resource Items, and Pedagogical Rules) that collectively model the domain, content, and instructional strategies of learning systems. We further explore how EduKG can be integrated with advanced AI technologies, including large language models (LLMs) and retrieval-augmented generation (Graph-RAG) via embedding databases, to enable intelligent behavior such as semantic search, question answering, and dynamic content generation. These integrations position EduKG as a central component in next-generation smart education systems, wherein knowledge graphs work in concert with intelligent agents and adaptive instructional systems to deliver fully automated, personalized, and interactive learning experiences. By leveraging the standardized graph-structured representation and semantic reasoning capabilities of EduKG, such systems can achieve interoperability across platforms and support complex AI-driven tutoring and training scenarios. This work provides a comprehensive overview of the EduKG framework and highlights its role in empowering adaptive, cognitive, and collaborative learning solutions for the future of digital education.
Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
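A hedged sketch of the dual-loss training strategy described above, written against the classic sentence-transformers fit API; the base model, example pairs, labels, and output path are assumptions, not the study's data or configuration.

```python
# Sketch: one objective with MultipleNegativesRankingLoss on (question, syllabus sentence)
# pairs and one with CosineSimilarityLoss on labelled similarity pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical training pairs (the study used 3,197 curated/LLM-assisted pairs).
mnrl_examples = [
    InputExample(texts=["Who teaches the course?", "Instructor: Dr. Smith, office ENG-210"]),
    InputExample(texts=["When is the midterm?", "The midterm exam is held in week 8"]),
]
cos_examples = [
    InputExample(texts=["Who is the TA?", "Teaching assistant: J. Doe"], label=0.9),
    InputExample(texts=["Who is the TA?", "Late submissions lose 10% per day"], label=0.1),
]

mnrl_loader = DataLoader(mnrl_examples, shuffle=True, batch_size=2)
cos_loader = DataLoader(cos_examples, shuffle=True, batch_size=2)

model.fit(
    train_objectives=[
        (mnrl_loader, losses.MultipleNegativesRankingLoss(model)),  # ranking objective
        (cos_loader, losses.CosineSimilarityLoss(model)),           # similarity calibration
    ],
    epochs=1,
    warmup_steps=10,
)
model.save("edu-syllabus-embedder")  # hypothetical output path
```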
Pedagogical questions are crucial for fostering student engagement and learning. In daily teaching, teachers pose hundreds of questions to assess understanding, enhance learning outcomes, and facilitate the transfer of theory-rich content. However, even experienced teachers often struggle to generate a large volume of effective pedagogical questions. To address this, we introduce TutorCraftEase, an interactive generation system that leverages large language models (LLMs) to assist teachers in creating pedagogical questions. TutorCraftEase enables the rapid generation of questions at varying difficulty levels with a single click, while also allowing for manual review and refinement. In a comparative user study with 39 participants, we evaluated TutorCraftEase against a traditional manual authoring tool and a basic LLM tool. The results show that TutorCraftEase can generate pedagogical questions comparable in quality to those created by experienced teachers, while significantly reducing their workload and time.
This study explores the optimization of Automated Question Generation (AQG) for educational assessments using Large Language Models (LLMs) and ontologies. Three approaches are evaluated: template-based structured ontology question generation, LLM-based structured ontology question generation, and LLM-based flat concept list question generation, using BERT Precision, Recall, F1-score, and Semantic Similarity as performance metrics. The results show that: i) the template-based structured ontology approach achieved a BERT Precision of 0.833, Recall of 0.844, and F1-score of 0.838, with a Semantic Similarity of 0.563, ii) the LLM-based structured ontology method showed improvements with a BERT Precision of 0.856, Recall of 0.863, and F1-score of 0.859, but a lower Semantic Similarity of 0.534, and iii) the LLM-based flat concept list approach provided the best results, achieving BERT Precision, Recall, and F1-score of 0.859, along with the highest Semantic Similarity of 0.567. Despite the higher semantic similarity of the LLM-based flat concept list, qualitative analysis revealed that the unstructured ontology sometimes produced hallucinated or unrelated questions. These findings suggest that LLM-based methods provide a balance of relevance and diversity in question generation, with the LLM-based flat concept list offering the best results for question generation, while the LLM-based structured ontology approach strikes a balance between Precision and Recall.
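For readers who want to reproduce the metrics used above, this is a minimal sketch of computing BERT Precision/Recall/F1 with the bert-score package and an embedding-based Semantic Similarity with sentence-transformers; the generated and reference questions are hypothetical, and the study's exact evaluation setup may differ.

```python
# Sketch: BERTScore and embedding similarity on hypothetical question pairs.
from bert_score import score
from sentence_transformers import SentenceTransformer, util

generated = ["What is the role of the mitochondria in a cell?"]
reference = ["Explain the function of mitochondria within eukaryotic cells."]

P, R, F1 = score(generated, reference, lang="en")
print(f"BERT Precision={P.mean().item():.3f} "
      f"Recall={R.mean().item():.3f} F1={F1.mean().item():.3f}")

embedder = SentenceTransformer("all-MiniLM-L6-v2")
sim = util.cos_sim(embedder.encode(generated), embedder.encode(reference))
print(f"Semantic similarity={sim.item():.3f}")
```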
Motivation and Background. Many K–12 students struggle with programming concepts. While LLMs offer scalable, timely support, overly direct answers can reduce reasoning and engagement [8], prompting the question: How can LLMs support learning without encouraging overreliance? In our study with 105 students, 31.4% showed misconceptions about variable assignment and data types, and in another survey, only 20% correctly solved conditional problems. This highlights the need for scaffolding to address conceptual gaps in K–12 programming. To address these gaps, we designed an answer-aware hint generation system using LLMs to support learning without reducing cognitive demand. We developed the system for CodeKids—an open-source, curriculum-aligned platform built with Virginia Tech and local public schools. It helps students practice grade-level programming through interactive activities, using LLM-generated hints to guide thinking without revealing answers [1, 11]. Based on Vygotsky’s Zone of Proximal Development [12], our approach balances support and autonomy through structured prompting that preserves productive struggle. Methodology. Building on research showing that machine learning supports K–12 learners without compromising cognitive development [15], we implemented a mindful answer-aware prompting approach [5, 7] grounded in two principles. The first principle, cognitive scaffolding, draws from ZPD and ITS research [10, 12], and ensures hints progress from general to specific while preserving learner autonomy. The second principle, technical safeguards, applies semantic similarity thresholds and constraint-based prompting to prevent answer leakage [13]. The system is deployed across 12 advanced CodeKids books covering core topics like variables, data types, conditionals, loops, and logical operators. Hints are concise, pedagogically sound, and generated by GPT-4 when students request help or load a page. Each request includes the topic, question, answer choices, and correct answer sent to the LLM, enabling context-aware adaptation to the activity and content. Our prompt design constrains hints to one sentence, emphasizes conceptual clarity, and gradually increases specificity to preserve student agency. This aligns with research on scaffold types—such as sense-making, elaboration, and motivational cues—that support self-regulated learning [9]. To support diverse learners, the system includes text-to-speech for reading hints aloud. Our approach combines learning sciences and prompt engineering to foster scalable support, student agency, and conceptual understanding. Evaluation. We evaluated semantic hint alignment using sentence embeddings: 98.1% of hints scored ≥ 0.30 in content alignment and 44.2% ≥ 0.20 in answer alignment, indicating strong relevance with minimal over-reliance. GPT-4, used as an LLM-as-a-judge due to its > 85% agreement with human ratings [14], gave an average score of 0.958 for hints on convergence, pedagogical value, and context. Combining LLM and cosine scores (0.7/0.3), we computed a Hint Quality Score of 0.749 [3]. To assess real-world impact, we developed surveys to collect feedback on clarity, usefulness, and learning [4]. Ongoing Work and Vision. We are investigating hint convergence across LLMs (e.g., Claude 3, Gemini 1.5 Pro) and exploring alternative prompting strategies to improve diversity. 
Future work includes personalizing hints through difficulty adaptation and embedding-based models for curriculum-aligned scaffolding [6], reducing reliance on proprietary LLMs, and incorporating retrieval-augmented generation (RAG) for contextualization [2].
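A minimal sketch of the alignment checks and the 0.7/0.3 weighted Hint Quality Score described in the evaluation above; the hint, activity content, answer string, leakage threshold, and placeholder judge score are all assumptions for illustration.

```python
# Sketch: cosine-based alignment checks plus a weighted Hint Quality Score.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

hint = "Think about which value the variable holds after the second assignment."
content = "This activity practices variable assignment and data types."
answer = "x = 7"

content_alignment = util.cos_sim(embedder.encode(hint), embedder.encode(content)).item()
answer_alignment = util.cos_sim(embedder.encode(hint), embedder.encode(answer)).item()

# Guardrail: flag hints that look too similar to the answer (threshold is an assumption).
leaks_answer = answer_alignment > 0.8

llm_judge_score = 0.95  # placeholder for a 0-1 rating returned by an LLM-as-a-judge
hint_quality = 0.7 * llm_judge_score + 0.3 * content_alignment
print(f"content={content_alignment:.2f} answer={answer_alignment:.2f} "
      f"leak={leaks_answer} HQS={hint_quality:.2f}")
```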
Background. Crafting quality assessment questions in medical education is a crucial yet time-consuming, expertise-driven undertaking that calls for innovative solutions. Large language models (LLMs), such as ChatGPT (Chat Generative Pre-Trained Transformer), present a promising yet underexplored avenue for such innovations. Aims. This study explores the utility of ChatGPT to generate diverse, high-quality medical questions, focusing on multiple-choice questions (MCQs) as an illustrative example, to increase educators' productivity and enable self-directed learning for students. Description. Leveraging 12 strategies, we demonstrate how ChatGPT can be effectively used to generate assessment questions aligned with Bloom's taxonomy and core knowledge domains while promoting best practices in assessment design. Conclusion. Integrating LLM tools like ChatGPT into generating medical assessment questions like MCQs augments but does not replace human expertise. With continual instruction refinement, AI can produce high-standard questions. Yet, the onus of ensuring ultimate quality and accuracy remains with subject matter experts, affirming the irreplaceable value of human involvement in the artificial intelligence-driven education paradigm.
Purpose. To evaluate the feasibility of using synthetic data generated by large language models for training automated classifiers of text responses in educational and professional testing. Methods. The experiment involved generating 100 response examples using LLMs, followed by text preprocessing (tokenization, stemming, TF-IDF) and training two classification models - logistic regression and RBF network, with subsequent evaluation on a test dataset. Results. The models achieved accuracy of 80% and 65-90% respectively. Systematic limitations were identified: high keywords dependency, insensitivity to semantic inversions, and contextual blindness in classification. Conclusions. The approach shows promise for developing auxiliary assessment tools, though current limitations prevent complete replacement of human evaluators. Further refinement is needed for practical implementation.
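A small sketch of the first classifier described above (TF-IDF features followed by logistic regression) trained on a few hypothetical synthetic responses; the study's actual 100 generated examples, preprocessing steps, and RBF network are not reproduced here.

```python
# Sketch: TF-IDF + logistic regression on toy LLM-generated response examples.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic responses labelled correct (1) or incorrect (0).
texts = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Photosynthesis is when plants breathe in oxygen at night.",
    "Chlorophyll absorbs light to drive the synthesis of glucose from CO2 and water.",
    "Plants eat soil to make their food during photosynthesis.",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Classify a new (toy) response.
print(clf.predict(["Light energy is converted into glucose by the plant."]))
```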
The ability of children to ask curiosity-driven questions is an important skill that helps improve their learning. For this reason, previous research has explored designing specific exercises to train this skill. Several of these studies relied on providing semantic and linguistic cues to train them to ask more of such questions (also called divergent questions). But despite showing pedagogical efficiency, this method is still limited as it relies on generating the said cues by hand, which can be a very long and costly process. In this context, we propose to leverage advances in the natural language processing field (NLP) and investigate the efficiency of using a large language model (LLM) for automating the production of key parts of pedagogical content within a curious question-asking (QA) training. We study generating the said content using the "prompt-based" method that consists of explaining the task to the LLM in natural text. We evaluate the output using human expert annotations and comparisons with hand-generated content. Results indeed suggested the relevance and usefulness of this content. We then conduct a field study in primary school (75 children aged 9–10), where we evaluate children's QA performance when having this training. We compare 3 types of content: 1) hand-generated content that proposes "closed" cues leading to predefined questions; 2) GPT-3-generated content that proposes the same type of cues; 3) GPT-3-generated content that proposes "open" cues leading to several possible questions. Children were assigned to either one of these groups. Based on human annotations of the questions generated, we see a similar QA performance between the two "closed" trainings (showing the scalability of the approach using GPT-3), and a better one for participants with the "open" training. These results suggest the efficiency of using LLMs to support children in generating more curious questions, using a natural language prompting approach that affords usability by teachers and other users who are not specialists in AI techniques. Furthermore, results also show that open-ended content may be more suitable for training curious question-asking skills.
We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system-provided reflective prompts adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing their potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.
Schema Study: A Large Language Model (LLM) Application for Asynchronous Student Learning and Inquiry
Undergraduate biology educators face a critical challenge: providing immediate, personalized formative feedback to increasingly large, diverse classes. Large Language Models (LLMs) offer potential solutions, but open-ended chat interfaces pose challenges including curricular misalignment and equity gaps. We developed Schema Study, a free, no-code, open-source web application where instructors upload course terms and context via a single spreadsheet to create an AI-powered chatbot. Our LLM tutor uses evidence-based teaching practices and Socratic questioning to deepen understanding, correct misconceptions, and encourage students to find connections among course concepts. During Winter 2025, we integrated Schema Study into an introductory biology course, embedding it within structured assignments and updating content weekly. Pre- and post-surveys (N=225) indicated strong student satisfaction; 72% would reuse Schema Study in future biology courses. Each additional day per week students used Schema Study more than doubled the likelihood they would recommend it. Schema Study enhanced students' AI self-efficacy and their belief that AI is relevant to their education and careers. Through iterative, classroom-based refinement, we updated the application based on student feedback, highlighting best practices for integrating LLM chatbots: clear structured messaging, AI literacy training, curricular alignment, and scaffolded active learning opportunities. The tool provides formative practice through question-led dialogue; independent performance is evaluated in secure assessments outside the app. Schema Study offers a scalable, accessible strategy for biology educators to leverage generative AI's benefits while mitigating its risks.
Knowledge graphs (KGs) are a powerful way of representing information for digital humanities. However, non-technical users often struggle at the outset of exploration, a challenge defined as the Initial Exploration Problem. The Tús Maith framework addresses this issue through curated natural language questions and answers (CuQAs) created from Competency Questions (CQs) that aim to convey the scope of a KG and provide meaningful entry points into it. While prior work has explored using large language models (LLMs) for CQ template generation, the template-filling step, where questions and answers are instantiated with entity information, remains a key challenge. In this paper, we evaluate whether LLMs have the capacity to support domain experts in this stage, focusing on the Virtual Record Treasury of Ireland (VRTI) KG, where accuracy, provenance, and robustness are crucial for practical use. Using structured JSON inputs derived from popular search terms and expert-authored templates, we generated and assessed 24,900 question-answer pairs across four LLMs (GPT-5, DeepSeek-V3.1, Gemini 2.0 Flash, Qwen-2.5-72B) under two provenance conditions (basic vs. full). Our evaluation considers slot fidelity, semantic similarity, completeness, hallucination rates, and runtime efficiency, with statistical tests conducted per run per LLM, and additional batch-level analysis (n = 68) to isolate provenance requirement effects. We further show that a lightweight JSON validation check is an effective proxy for ground truth semantic evaluation of factual question-answer pairs. These LLM-generated, validated questions form an intermediate step in the lifecycle from abstract CQ templates to filled-in questions and answers intended to be reviewed and refined by the VRTI KG’s domain experts (historians) to produce the final user-facing questions (CuQAs). To demonstrate the practical impact, we present a prototype (TMv1) of the Tús Maith framework and highlight the design implications for curator-facing interfaces: provenance-transparent interaction, validation-integrated workflows, and performance-transparent model selection.
Background Developmental dysplasia of the hip (DDH) is a common pediatric orthopedic disease, and health education is vital to disease management and rehabilitation. The emergence of large language models (LLMs) has provided new opportunities for health education. However, the effectiveness and applicability of LLMs in education with DDH have not been systematically evaluated. Objective This study conducted an integrated 2-phase evaluation to assess the quality and educational effectiveness of LLM-generated educational materials. Methods This study comprised 2 phases. Based on Bloom's taxonomy, a 16-item DDH question bank was created through literature analysis and collaboration. Four LLMs (ChatGPT-4 [OpenAI], DeepSeek-V3, Gemini 2.0 Flash [Google], and Copilot [Microsoft Corp]) were questioned using standardized prompts. All responses were independently evaluated by 5 pediatric orthopedic experts using 5-point Likert measures of accuracy, fluency, and richness, the scales of Patient Education Materials Assessment Tool for Printable Materials, and DISCERN. The readability was measured by a formula. The data were examined using Kruskal-Wallis tests, ANOVA, and post hoc comparisons. In phase 2, an assessor-blinded, 2-arm pilot randomized controlled trial was conducted. A total of 127 caregivers were randomized into an LLM-assisted education group or a web search control group. The intervention included structured LLM training, supervised practice, and 2 weeks of reinforcement training. Measured at baseline, postintervention, and 2 weeks following, the outcomes were eHealth literacy (primary), DDH knowledge, health risk perception, perceived usefulness, information self-efficacy, and health information-seeking behavior. Cohen d effect sizes and linear mixed-effects models were used in an intention-to-treat manner. Results There were significant differences among the 4 LLMs concerning accuracy, richness, fluency, Patient Education Materials Assessment Tool for Printable Materials Understandability, and DISCERN (P<.05). ChatGPT-4 (median 63.67, IQR 63.67-64.67) and DeepSeek-V3 (median 63.67, IQR 63.33-64.67) generated more accurate text than Copilot (median 59.00, IQR 58.67-59.67). DeepSeek-V3 (median 64.00, IQR 64.00-64.00) was richer in language than Copilot (median 52.33, IQR 51.33-52.67). Gemini 2.0 Flash (median 72.67, IQR 72.33-73.00) was more fluent than Copilot (median 65.67, IQR 63.33-65.67). In phase 2, the intervention group showed higher eHealth literacy at T1 (33.62, 95% CI 32.76-34.49; d=0.20, 95% CI 0.13-0.56) and T2 (33.27, 95% CI 32.38-34.17; d=0.36, 95% CI 0.01-0.80), greater DDH knowledge at T1 (7.87, 95% CI 7.48-8.25, d=0.71, 95% CI 0.33-1.11) and T2 (7.12, 95% CI 6.72-7.51; d=0.54, 95% CI 0.17-0.96), and slight improvements in health risk prediction and perceived usefulness. Conclusions Mainstream LLMs demonstrate varying capacities in generating educational content for DDH. They generated DDH caregiver education materials that were associated with modest improvements in eHealth literacy and knowledge. Although LLMs can address general informational needs, they cannot completely substitute clinical evaluation. Future research should focus on optimizing plain language, refining dialogue design, and enhancing audience personalization to improve the quality of LLMs' materials. Trial Registration Chinese Clinical Trial Registry ChiCTR2500108410; https://www.chictr.org.cn/showproj.html?proj=271987
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
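Not the SMART implementation, but an illustrative sketch of the final step the abstract describes: fitting a Rasch-style IRT model to a simulated student-by-item response matrix and recovering item difficulties. The simulation parameters and the logistic-regression formulation are assumptions.

```python
# Sketch: Rasch (1PL) difficulty recovery from a simulated response matrix,
# expressed as a logistic regression over one-hot student and item indicators.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_students, n_items = 200, 8
true_ability = rng.normal(0, 1, n_students)
true_difficulty = rng.normal(0, 1, n_items)

# Simulate correct/incorrect responses under the Rasch model.
logits = true_ability[:, None] - true_difficulty[None, :]
responses = (rng.random((n_students, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)

# One row per (student, item) response; one-hot blocks for students and items.
X = np.zeros((n_students * n_items, n_students + n_items))
y = responses.reshape(-1)
for s in range(n_students):
    for i in range(n_items):
        row = s * n_items + i
        X[row, s] = 1.0               # student ability indicator
        X[row, n_students + i] = 1.0  # item indicator

# Light regularization for numerical stability; item difficulty = -item weight.
model = LogisticRegression(C=100.0, max_iter=2000, fit_intercept=False).fit(X, y)
est_difficulty = -model.coef_[0][n_students:]
est_difficulty -= est_difficulty.mean()  # center to fix the scale

print("recovery correlation:",
      round(float(np.corrcoef(est_difficulty, true_difficulty - true_difficulty.mean())[0, 1]), 3))
```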
This study aims to compare the quality assessment of Thai reading comprehension diagnostic tests created by Claude AI versus those developed by humans. The sample consisted of 735 seventh-grade students from Secondary educational service areas in the central northeastern region of Thailand. The methodology applied Rasch Model Analysis integrated with Turing Test procedures. The findings revealed that diagnostic tests created by both Claude AI and humans demonstrated comparable measurement quality in terms of validity, reliability, and item-model fit. Both tests exhibited low measurement error, allowing for accurate estimation of students' Thai reading abilities close to their true proficiency levels. Furthermore, both test versions showed good distribution of difficulty levels, covering nearly the full spectrum of student ability levels. These characteristics make the tests particularly suitable for students with slightly above-average proficiency. Nevertheless, certain test items require refinement to enhance their assessment efficiency according to established criteria. While some educators may remain hesitant about implementing AI-generated tests for formal student evaluation, Claude AI-created tests can effectively serve as practice exercises for student development.
In this article, we explore the transformative impact of advanced, parameter-rich Large Language Models (LLMs) on the production of instructional materials in higher education, with a focus on the automated generation of both formative and summative assessments for learners in the field of mathematics. We introduce a novel LLM-driven process and application, called ItemForge, tailored specifically for the automatic generation of e-assessment items in mathematics. The approach is thoroughly aligned with the levels and hierarchy of cognitive learning objectives as developed by Anderson and Krathwohl, and takes specific mathematical concepts from the considered courses into consideration. The quality of the generated free-text items, along with their corresponding answers (sample solutions), as well as their appropriateness to the designated cognitive level and subject matter, were evaluated in a small-scale study. In this study, three mathematical experts reviewed a total of 240 generated items, providing a comprehensive analysis of their effectiveness and relevance. Our findings demonstrate that the tool is proficient in producing high-quality items that align with the chosen concepts and targeted cognitive levels, indicating its potential suitability for educational purposes. However, it was observed that the provided answers (sample solutions) occasionally exhibited inaccuracies or were not entirely complete, signalling a necessity for additional refinement of the tool's processes.
Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-plays with named students improve predictions (compared to student ids), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
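The following sketch illustrates the role-play prompting setup the abstract describes, with hypothetical names, proficiency labels, and prompt wording; the resulting letter responses would be scored and fit with an IRT model as in the difficulty-estimation sketches above.

```python
# Sketch: build role-play prompts for a simulated classroom of students.
import random

NAMES = ["Maria Garcia", "Jamal Washington", "Emily Chen", "Liam O'Brien", "Aisha Patel", "Noah Kim"]
PROFICIENCY = ["struggling", "below average", "average", "above average", "advanced"]

def student_prompt(grade, question, choices):
    name = random.choice(NAMES)
    level = random.choice(PROFICIENCY)
    option_text = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    return (
        f"You are {name}, a {grade}th-grade student whose math proficiency is {level}. "
        "Answer the question exactly as this student would, including possible mistakes.\n\n"
        f"Question: {question}\n{option_text}\n\n"
        "Reply with a single letter (A, B, C, or D)."
    )

prompt = student_prompt(8, "What is 3/4 + 1/8?", ["7/8", "4/12", "1/2", "4/8"])
print(prompt)
# Each prompt would be sent to the LLM; the letter responses across the simulated
# classroom are then scored and fit with an IRT model.
```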
Nonanchor equating presents a significant challenge in educational assessment when test forms lack common items, requiring innovative solutions to ensure score comparability across different test administrations. This study proposes a novel large language model-simulated nonequivalent groups with anchor test (LLM-SNGAT) method that leverages large language models (LLMs) to simulate test-taking samples and generate common item sets for equating purposes. The approach eliminates traditional dependencies on specialized test design and extensive demographic data collection by utilizing the inherent capabilities of LLMs to simulate diverse response patterns. We evaluated the method using Tucker and Levine equating approaches across multiple LLMs, including generative pre-trained transformer 4o (GPT-4o), O1-preview, and DeepSeek-R1. Results demonstrated the feasibility of the proposed approach, with the Tucker method showing superior performance and consistent improvements as common item coverage increased. Sensitivity analysis confirmed that model performance rankings remained consistent across varying prompt formulations. The study revealed the characteristic pattern that standard errors were smallest near the mean and grew larger farther from it, and identified optimal common item proportions of 30%–50% for stable equating performance. While current limitations include the capacity of LLMs to accurately simulate human cognitive and behavioral diversity, this proof-of-concept study provides preliminary evidence for the feasibility of the LLM-SNGAT methodology. The approach represents a paradigm shift from resource-intensive traditional methods to computationally driven solutions, offering promising prospects for addressing nonanchor equating challenges in the digital age.
Educational assessment relies heavily on knowing question difficulty, traditionally determined through resource-intensive pre-testing with students. This creates significant barriers for both classroom teachers and assessment developers. We investigate whether Item Response Theory (IRT) difficulty parameters can be accurately estimated without student testing by modeling the response process and explore the relative contribution of different feature types to prediction accuracy. Our approach combines traditional linguistic features with pedagogical insights extracted using Large Language Models (LLMs), including solution step count, cognitive complexity, and potential misconceptions. We implement a two-stage process: first training a neural network to predict how students would respond to questions, then deriving difficulty parameters from these simulated response patterns. Using a dataset of over 250,000 student responses to mathematics questions, our model achieves a Pearson correlation of approximately 0.78 between predicted and actual difficulty parameters on completely unseen questions.
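A toy, self-contained sketch of the two-stage idea described above: a response model is trained on simulated (ability, item-feature) data, then an unseen item's difficulty is derived from the simulated success rate; the feature set, data-generating process, and logit conversion are assumptions rather than the paper's model.

```python
# Sketch: stage 1 trains p(correct | ability, item features); stage 2 simulates a
# population for a new item and converts its success rate into a Rasch-style difficulty.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n = 5000
ability = rng.normal(0, 1, n)
step_count = rng.integers(1, 6, n)       # e.g. LLM-estimated solution steps
cognitive_level = rng.integers(1, 4, n)  # e.g. LLM-rated cognitive complexity
true_difficulty = 0.5 * step_count + 0.4 * cognitive_level - 2.0
p_correct = 1 / (1 + np.exp(-(ability - true_difficulty)))
y = (rng.random(n) < p_correct).astype(int)

X = np.column_stack([ability, step_count, cognitive_level])
response_model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

# Stage 2: simulate a student population for an unseen question and derive difficulty.
new_question = {"step_count": 4, "cognitive_level": 3}
sim_ability = rng.normal(0, 1, 2000)
X_new = np.column_stack([
    sim_ability,
    np.full(2000, new_question["step_count"]),
    np.full(2000, new_question["cognitive_level"]),
])
p_hat = response_model.predict_proba(X_new)[:, 1].mean()
difficulty = float(np.log((1 - p_hat) / p_hat))  # logit of failure rate ~ Rasch difficulty
print(f"Predicted success rate {p_hat:.2f} -> difficulty {difficulty:.2f}")
```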
This theoretical framework addresses the chronic educational crisis of "semantic entropy"—the systematic degradation of meaning in knowledge transmission. Drawing on Cognitive Load Theory (Sweller, 2024) and Schema Theory (Anderson, 2020), PIT (Perceptual Invariance Theory) proposes that educational failure stems not from student deficits but from "semantic noise" in instructional materials and assessments. The paper introduces three engineering solutions: (1) Generalization and Uniqueness principles for material design to achieve ≥99% comprehension fidelity; (2) Clarity-Indexed Scoring System that replaces difficulty-based assessment with clarity-based metrics; and (3) Edu Code Protocol—a universal mathematical language to eliminate natural language ambiguity. Analysis of PISA 2022 and World Bank 2024 data reveals that 40% global reading failure and 70% learning poverty in Turkey correlate more strongly with item ambiguity (r = -0.67) than student SES (r = -0.42), supporting semantic noise as the primary pathogen. The proposed Randomized Controlled Trial (N=500) framework predicts 15-20% comprehension improvement and 30% cognitive load reduction with PI-engineered materials. PIT reframes education from a probabilistic selection mechanism to a deterministic engineering discipline where 99% success becomes a design target. Implementation requires systematic material redesign, teacher training, and digital infrastructure (QR codes, AI-powered ontologies) to realize Leibniz's vision of universal language.
As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and labor-intensive, especially across large item banks. This study examines whether Large Language Models (LLMs) can accelerate this process without sacrificing accuracy. Using over 12,000 item-skill pairs in grades K-5, we tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) across three tasks that mirror real-world challenges: identifying misaligned items, selecting the correct skill from the full set of standards, and narrowing candidate lists prior to classification. In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments. In Study 2, performance remained strong in mathematics but was lower for reading, where standards are more semantically overlapping. Study 3 demonstrated that pre-filtering candidate skills substantially improved results, with the correct skill appearing among the top five suggestions more than 95% of the time. These findings suggest that LLMs, particularly when paired with candidate filtering strategies, can significantly reduce the manual burden of item review while preserving alignment accuracy. We recommend the development of hybrid pipelines that combine LLM-based screening with human review in ambiguous cases, offering a scalable solution for ongoing item validation and instructional alignment.
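A minimal sketch of the candidate-filtering strategy described above, using embedding similarity to shortlist standards before an LLM classification step; the skill list, item text, embedding model, and prompt wording are assumptions.

```python
# Sketch: shortlist the most similar standards, then hand only those to the LLM.
from sentence_transformers import SentenceTransformer, util

skills = [
    "Add and subtract fractions with unlike denominators",
    "Multiply multi-digit whole numbers",
    "Identify the main idea of a text",
    "Interpret data in a bar graph",
    "Round decimals to any place",
]
item = "Jamie ate 1/3 of a pizza and Alex ate 1/4. How much of the pizza did they eat together?"

embedder = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(embedder.encode(item), embedder.encode(skills))[0]
top5 = sorted(range(len(skills)), key=lambda i: float(scores[i]), reverse=True)[:5]

candidates = "\n".join(f"{rank + 1}. {skills[i]}" for rank, i in enumerate(top5))
prompt = (
    "Which of the following skills does this assessment item measure? "
    "Answer with the number of the best match or 'none'.\n\n"
    f"Item: {item}\n\nCandidate skills:\n{candidates}"
)
print(prompt)  # the prompt would then be sent to an LLM (e.g., GPT-4o) for classification
```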
Knowing how test takers answer items in educational assessments is essential for test development, for evaluating item quality, and for improving test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two psychometric frameworks commonly used in educational assessment: classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can become more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans on reading comprehension items than on the other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
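The temperature-scaling calibration mentioned above can be illustrated on a single multiple-choice item; in the sketch below the per-option logits and human choice rates are placeholder values, and KL divergence is used as one possible measure of how human-like the calibrated distribution is.

```python
# Temperature scaling flattens an overconfident option distribution; higher T
# moves the model's choice distribution closer to the (placeholder) human one.
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical model logits for options A-D and observed human choice rates.
llm_logits = np.array([6.0, 1.0, 0.5, 0.2])
human_props = np.array([0.55, 0.20, 0.15, 0.10])

for T in (1.0, 2.0, 4.0):
    p = softmax(llm_logits, temperature=T)
    kl = np.sum(human_props * np.log(human_props / p))   # KL(human || model)
    print(f"T={T}: model dist={np.round(p, 2)}  KL to human={kl:.3f}")
```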
Despite the globalization of educational content, language remains a significant barrier. Multilingual translation has become crucial to meeting this challenge, with an emphasis on incorporating the cultural context of the target country and the educational context of the learners. However, existing machine translation systems often fail to adequately account for these contextual factors. This study explores the potential of Large Language Models (LLMs) to improve the translation of assessment items through In-context Learning. Two prompt engineering strategies are compared: the 'assessment-aware prompt', which includes only the specifications of the assessment, and the 'curriculum-aware prompt', which includes the educational and cultural context of the target country in addition to the assessment specifications. From a comparison of linguistic features and expert reviews, we found that the curriculum-aware translation produced more valid and feasible results, highlighting the effectiveness of LLM-based automatic translation methods that integrate curriculum context.
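A minimal sketch of how the two prompt designs might differ in practice; the field names, the Korean target setting, and the wording are illustrative assumptions rather than the study's actual templates.

```python
# The curriculum-aware prompt simply extends the assessment-aware one with
# curriculum and cultural context for the target learners.
def assessment_aware_prompt(item, specs):
    return (
        "Translate the following assessment item into Korean.\n"
        f"Assessment specifications: construct={specs['construct']}, "
        f"grade={specs['grade']}, item_format={specs['format']}.\n"
        f"Item: {item}"
    )

def curriculum_aware_prompt(item, specs, curriculum):
    return (
        assessment_aware_prompt(item, specs) + "\n"
        f"Target curriculum context: {curriculum['standard']}.\n"
        f"Cultural context: adapt names, units, and scenarios for "
        f"{curriculum['country']} learners while preserving the construct."
    )

# Illustrative values only.
specs = {"construct": "proportional reasoning", "grade": 7, "format": "MCQ"}
curriculum = {"standard": "national mathematics curriculum (illustrative)",
              "country": "Korean"}
item = "A recipe uses 3 cups of flour for 4 servings. How many cups for 10 servings?"
print(curriculum_aware_prompt(item, specs, curriculum))
```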
Large language models (LLMs) such as ChatGPT and Gemini are increasingly used to generate educational content in medical education, including multiple-choice questions (MCQs), but their effectiveness compared to expert-written questions remains underexplored, particularly in anatomy. We conducted a cross-sectional, mixed-methods study involving Year 2–4 medical students at Qatar University, where participants completed and evaluated three anonymized MCQ sets (authored by ChatGPT, Google-Gemini, and a clinical anatomist) across 17 quality criteria. Descriptive and chi-square analyses were performed, and optional feedback was reviewed thematically. Among 48 participants, most rated the three MCQ sources as equally effective, although ChatGPT was more often preferred for helping students identify and confront their knowledge gaps through challenging distractors and diagnostic insight, while expert-written questions were rated highest for deeper analytical thinking. A significant variation in preferences was observed across sources (χ² (64) = 688.79, p < .001). Qualitative feedback emphasized the need for better difficulty calibration and clearer distractors in some AI-generated items. Overall, LLM-generated anatomy MCQs can closely match expert-authored ones in learner-perceived value and may support deeper engagement, but expert review remains critical to ensure clarity and alignment with curricular goals. A hybrid AI-human workflow may provide a promising path for scalable, high-quality assessment design in medical education.
As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM exam as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a significant trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade significantly on multimodal items. Crucially, we find that models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an "evaluation-before-generation" pipeline for responsible assessment design.
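Two of the three evaluation axes named above can be illustrated with a few lines of code; the sketch below uses mean signed error for absolute calibration and Spearman's rho for rank fidelity, on placeholder values rather than ENEM data.

```python
# Compare LLM-predicted difficulties against reference IRT parameters.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical IRT difficulty parameters and LLM-predicted difficulties.
irt_difficulty = np.array([-1.2, -0.4, 0.1, 0.8, 1.5, 2.1])
llm_predicted  = np.array([-1.5, -0.9, -0.3, 0.2, 0.9, 1.1])

signed_error = np.mean(llm_predicted - irt_difficulty)   # < 0 => underestimation
rho, _ = spearmanr(llm_predicted, irt_difficulty)         # rank fidelity

print(f"mean signed error: {signed_error:+.2f} (negative = difficulty underestimated)")
print(f"Spearman rank correlation: {rho:.2f}")
```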
With the growing integration of artificial intelligence in medical education, this study compares the quality and educational robustness of content generated by two large language models (LLMs), DeepSeek-V3 and ChatGPT 4.0, on the emerging, non-conventional topic of gender-affirming hormone therapy (GAHT), which is not yet covered in textbooks, across three educational phases: the preclerkship and clerkship phases of the undergraduate medical curriculum, and the master's level in pharmacology. A total of 23 prompts were designed to generate Specific Learning Objectives (SLOs), reading materials, assessment items (MCQs, SAQs, and OSPEs), and case-based learning (CBL) scenarios across the three learner stages. The outputs from both LLMs were evaluated independently using rubric-based frameworks assessing content appropriateness, pedagogical structure, assessment alignment, and inclusivity. Both LLMs produced pedagogically sound outputs; however, DeepSeek consistently demonstrated superior adherence to rubric criteria. For SLOs, DeepSeek maintained a clear hierarchical progression across phases and showed greater precision, contextual alignment, and time-bound formulation. Its objectives were more assessable and reflective of increasing cognitive complexity. ChatGPT's SLOs were inclusive and coherent but occasionally lacked time-specificity and structural clarity. In reading materials, DeepSeek outperformed by integrating clinical relevance, scaffolded structure, and interactive learning tools across all phases. It included visual aids, case vignettes, and phase-specific assessments, while ChatGPT's content was accurate and readable but leaned toward text-heavy exposition with fewer embedded learning activities. MCQs from both models adhered to core psychometric principles. DeepSeek avoided testwiseness cues more consistently and offered better stratification of difficulty and realism, especially at the master's level. ChatGPT demonstrated strong pharmacological accuracy but occasionally showed testwiseness cues and illogical distractor sequencing. In CBL and OSPE outputs, DeepSeek showed stronger alignment with instructional and assessment criteria through modular formatting, diverse patient representation, and integration of formative tools. ChatGPT's cases and OSPEs were realistic and engaging but more narrative and occasionally less standardized. While both LLMs demonstrated educational utility, DeepSeek produced more rubric-aligned, contextually rich, and assessment-ready content across all learner stages. This study supports the integration of advanced LLMs like DeepSeek and ChatGPT in curriculum design, provided there is oversight to ensure alignment with pedagogical goals and learner needs.
The growing reliance on digital learning platforms has increased the need for automated, scalable and pedagogically aligned assessment systems. Current approaches to automated question generation (QG) and grading remain fragmented, focusing on either objective items or short-answer evaluation, with limited attention to difficulty calibration and educator supervision. This paper introduces an AI-driven assessment framework that unifies question generation, automated grading and performance analytics into a single workflow. The framework accepts two input modes: (i) structured content extracted from PDF-based learning resources, with optional optical character recognition (OCR) for scanned or image-based materials, and (ii) teacher-specified topics for targeted assessments. Large language models (LLMs) produce a variety of question formats, including option-selection, text-completion, and case-based questions, while a Difficulty Index (DI) ensures alignment with the intended cognitive levels. Objective responses are graded instantly, and AI-assisted evaluation of subjective answers is proposed as a future enhancement with teacher verification. All generated assessments and student outcomes are stored in a Supabase-backed repository, enabling real-time analytics such as difficulty-wise performance, progress tracking and cohort comparisons. By integrating content parsing, difficulty-aware QG, automated grading and analytics, the proposed system reduces manual workload, supports adaptive learning and provides educators with actionable insights for classroom and online environments.
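A minimal sketch of how such a pipeline might be orchestrated, assuming pypdf for the structured-PDF input mode; `generate_questions`, `store_assessment`, and the 1-5 Difficulty Index scale are hypothetical stand-ins for the LLM step and the Supabase-backed repository described in the abstract.

```python
# Orchestration sketch: content extraction -> difficulty-aware QG -> storage.
from dataclasses import dataclass
from pypdf import PdfReader   # used for the structured-PDF input mode

@dataclass
class Question:
    stem: str
    answer: str
    difficulty_index: int      # assumed 1-5 cognitive-level scale (illustrative)

def extract_content(pdf_path: str) -> str:
    """Input mode (i): pull raw text from a PDF learning resource.
    (OCR for scanned material would be a separate branch, omitted here.)"""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def generate_questions(source_text: str, target_di: int, n: int) -> list[Question]:
    """Stub for the LLM question-generation step, constrained to a target
    Difficulty Index so items match the intended cognitive level."""
    # In practice this would prompt an LLM with the source text and target DI.
    return [Question(stem=f"[generated item {i+1} at DI {target_di}]",
                     answer="[key]", difficulty_index=target_di)
            for i in range(n)]

def store_assessment(questions: list[Question]) -> None:
    """Stub for persisting items to the analytics repository."""
    for q in questions:
        print(f"stored: DI={q.difficulty_index} | {q.stem}")

if __name__ == "__main__":
    # Input mode (ii): teacher-specified topic, bypassing PDF extraction.
    sample_text = "Photosynthesis converts light energy into chemical energy."
    items = generate_questions(sample_text, target_di=3, n=3)
    store_assessment(items)
```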
Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on prior knowledge to fill in details that are not explicitly stated in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, with inter-rater agreement above 0.90. Our results show that 93.8% of the questions GPT-4o produced were of good quality and suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.
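A minimal sketch of the few-shot prompting setup, showing how the chain-of-thought condition might differ from the plain condition; the exemplar, passage, and instructions are illustrative, and the GPT-4o call itself is omitted.

```python
# Build a few-shot prompt for bridging-inference items, optionally exposing
# the exemplar's reasoning step (the chain-of-thought condition).
FEW_SHOT_EXEMPLAR = {
    "passage": ("Maya left her umbrella at home. By the time she reached "
                "school, her notebook was soaked."),
    "reasoning": ("The item must require linking 'left her umbrella' with "
                  "'notebook was soaked' to infer it rained on the way."),
    "question": "Why was Maya's notebook wet when she arrived at school?",
}

def build_prompt(passage: str, use_cot: bool) -> str:
    parts = [
        "You write bridging-inference reading comprehension questions: the",
        "answer must require connecting information across sentences.",
        f"Example passage: {FEW_SHOT_EXEMPLAR['passage']}",
    ]
    if use_cot:  # chain-of-thought condition: show the reasoning step
        parts.append(f"Reasoning: {FEW_SHOT_EXEMPLAR['reasoning']}")
    parts += [
        f"Example question: {FEW_SHOT_EXEMPLAR['question']}",
        f"New passage: {passage}",
        "Now write one bridging-inference question"
        + (" and your reasoning first." if use_cot else "."),
    ]
    return "\n".join(parts)

passage = ("The bakery ran out of flour on Friday. On Saturday morning, "
           "customers found the shelves empty.")
print(build_prompt(passage, use_cot=True))
```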
The final grouping outlines the complete ecosystem of LLM-assisted automatic item generation: starting from foundational prompt engineering and fine-tuning techniques, it draws on retrieval-augmented generation (RAG) and multimodal methods to ensure content accuracy and diversity; it then moves into a psychometrics-centered quality-validation stage, ensuring that items exhibit sound difficulty and discrimination; at the application layer, research has deepened into subject-specific customization and extended to automated scoring and personalized scaffold generation; finally, curriculum-alignment and human-AI collaboration frameworks ground the technology in macro-level educational governance and ethical oversight.