
HSE has developed a new method for evaluating artificial intelligence in educational tasks.

Researchers from the Higher School of Economics (HSE) have introduced a novel scientific approach for assessing the competency of artificial intelligence in the field of education. This approach is grounded in psychometric principles and is exemplified through an evaluation of GPT-4. This marks the initial step toward verifying the actual readiness of generative models to serve as assistants for educators or students.

The results of the work have been published on arXiv. Each year, artificial intelligence becomes a more integral part of educational processes, raising an important question for developers: how should AI capabilities be evaluated, especially with regard to its role in learning? Researchers from the Higher School of Economics have proposed a new psychometric approach that will help create effective tests of the professional competencies of large language models (LLMs) such as GPT. The approach relies on Bloom's taxonomy, which, despite the availability of plenty of benchmarks (tests for language models), is not actively used when evaluating them.

A distinctive feature of the proposed methodology is that it includes tasks at different cognitive levels, from simple recall of knowledge to professional application of that knowledge, and the scoring of tasks takes these levels into account. This makes it possible to assess the quality of the model's recommendations in very different situations and to judge how much it can be trusted in the educational field. As part of the study, the researchers developed and tested more than 3,900 unique tasks covering 16 professional areas, including teaching methods, educational psychology, and classroom management. The experiment was conducted on the Russian-language version of GPT-4.
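To make the idea of level-aware scoring more concrete, here is a minimal illustrative sketch in Python. It is not the authors' actual pipeline: benchmark items are simply tagged with a Bloom-style cognitive level and a professional area, and the model's accuracy is reported separately for each group. The items, field names, and the stand-in model_answer function are all hypothetical.

```python
# Minimal illustrative sketch, not the HSE benchmark itself: each item carries a
# Bloom-style cognitive level and a professional area, and accuracy is reported
# separately per level and per area. The items and the model answers below are
# hypothetical placeholders.
from collections import defaultdict

# Hypothetical benchmark items: question, reference answer, cognitive level, area.
items = [
    {"question": "Define formative assessment.",
     "answer": "B", "level": "knowledge", "area": "teaching methods"},
    {"question": "A student disrupts group work; choose the best intervention.",
     "answer": "C", "level": "application", "area": "classroom management"},
    {"question": "Which theory describes the zone of proximal development?",
     "answer": "A", "level": "knowledge", "area": "educational psychology"},
]

def model_answer(question: str) -> str:
    # Stand-in for querying the model under evaluation (e.g. GPT-4) and
    # extracting its chosen option; here it returns a fixed answer.
    return "B"

def per_group_accuracy(items, key):
    """Accuracy of the model's answers, grouped by the given item field."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        group = item[key]
        total[group] += 1
        if model_answer(item["question"]) == item["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

if __name__ == "__main__":
    print("Accuracy by cognitive level:", per_group_accuracy(items, "level"))
    print("Accuracy by professional area:", per_group_accuracy(items, "area"))
```

Reporting results per level and per area, rather than as a single score, is what lets such a benchmark show where a model handles factual recall but fails at professional application.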

“We have developed a new approach that goes beyond traditional testing,” explains the project's lead author Elena Kardanova, scientific supervisor of the Center for Psychometrics and Measurements in Education at the HSE Institute of Education. “Our approach is illustrated by a new, extensive benchmark (the term used for tests of language models) for AI in pedagogy, which is built on psychometric principles and focuses on the key competencies of teaching practice.”

Modern AI systems such as ChatGPT can indeed process and generate text remarkably quickly, which makes them potential assistants in educational settings. The results, however, showed that the model struggles with more complex tasks that require deep understanding and adaptive thinking. For example, the AI performs well on tasks that test factual knowledge but is less successful in situations that demand detailed analysis and flexible reasoning in real, authentic pedagogical cases. Notably, ChatGPT does not solve even theoretical tasks with complete accuracy, including ones that are quite easy for ordinary students.

“The approach we developed vividly highlights a key problem of today's AI: you never know where to expect errors. The model can make mistakes even in the simplest tasks, ones that could be considered the core of the discipline. Our test reveals critical issues both in the knowledge domain and in practical application, and in doing so outlines a path to overcoming these key problems. Addressing them is critically important if we are to rely on such models as assistants to teachers and, especially, students. However, an assistant that, as is currently the case, requires constant verification will hardly encourage its own use,” explains Yaroslav Kuzminov, scientific supervisor of HSE.

Among the potential scenarios for using AI in education, researchers worldwide mention helping teachers create educational materials, automatically assessing student responses, building adaptive curricula, and providing timely analytics on students' academic achievement. The authors believe that AI can become powerful support for teachers, especially given their growing workload. However, the models, and the approaches to training and evaluating them, still need to be improved.

“The test we conducted helped us understand not only how to train large generative models further, but also why fears that AI will replace teachers are, at the very least, premature. At the same time, the progress of generative models as teaching assistants is worth noting: they can already attempt to draft a lesson plan or, for example, a reading list for a class, and in some cases grade assignments.

Nevertheless, we still encounter model hallucinations, when it invents answers to questions without actually having information about the phenomenon in question, as well as cases where it misunderstands the context. Overall, if we want tools based on generative models to be used in pedagogical practice and to earn epistemic trust, there is still much work to be done,” said Taras Pashchenko, head of the Laboratory for Designing Educational Content at HSE, assessing the results of the test.

In the future, the research team plans to continue working on improving the benchmark and to incorporate more complex types of tasks that can evaluate abilities such as analysis and information assessment.

“Our future articles will focus both on describing new types of benchmarks and on detailing the techniques that will be developed to further train models so as to eliminate the risks of hallucinations, loss of context, and errors in core knowledge. The most important goal we hope to achieve is to make the models robust in their knowledge, and to understand how to test that robustness with even higher accuracy; otherwise, AI will remain a tool for simplified copying and imitation of knowledge,” noted Ekaterina Kruchinskaya, senior lecturer at the HSE Department of Higher Mathematics.