New Study Suggests GPT Can Outsmart Most Exams, But It Has a Weakness

It’s been used in virtually every application imaginable, and classrooms are no exception. But how good is ChatGPT really at solving exam questions? Image generated by AI.

When ChatGPT burst onto the global scene, one of its first proving grounds was the classroom. Suddenly, students had access to a tool capable of answering questions, writing essays, and potentially bypassing traditional academic challenges.

Unsurprisingly, students loved it.

Also unsurprisingly, for educators, this raised a pressing concern: could ChatGPT help students cheat their way through education?

In a new study, researchers took the question one step further, looking at just how many courses ChatGPT could pass on its own, without any student input.

“We were surprised at the results. Nobody expected that the AI assistants would achieve such a high percentage of correct answers across so many courses. Importantly, the 65% of questions answered correctly was achieved using the most basic, no-knowledge prompt strategy, so anyone, without understanding anything technically, could achieve this. With some subject knowledge, which is typical, it was possible to achieve an 85% success rate and that was really the shock,” said Anna Sotnikova, a scientist with the NLP Lab and co-author of the paper.

STEM-GPT

The researchers analyzed over 5,500 exam and assignment questions from STEM fields. For multiple-choice questions, GPT achieved passing marks on 77% of all courses. It excelled even more in structured or text-based questions. However, open-ended tasks and creative problem-solving proved more challenging.

No degree program was invulnerable to GPT’s abilities, but those relying on more standardized tests appeared more vulnerable. In other words, students using AI could pass these courses without grasping anything from the curriculum. Courses with many participants and little human supervision were also more vulnerable.

The course pass rate for different types of GPTs. Image credits: PNAS.

Overall, the findings underscore an urgent need to adapt educational assessment methods to maintain their relevance in the current world where AI availability is widespread.

“The fear is that if these models are as capable as what we indicate, students that use them might shortcut the process through which they would learn new concepts. This could build weaker foundations for certain skills earlier on, making it harder to learn more complex concepts later.

“Maybe this needs to be a debate about what we should be teaching in the first place in order to come up with the best synergies of the technologies we have and what students will do in decades to come,” said Antoine Bosselut, head of EPFL’s NLP Lab and a co-author of the paper.

Where ChatGPT didn’t do well

Researchers were very interested in areas where GPT struggled the most. After all, the type of evaluation that’s more likely to withstand the AI assault is probably the one you want to use in your course.

The AI tends to have difficulty with higher-order thinking and creative applications, areas that align with the upper levels of Bloom’s taxonomy. Bloom’s taxonomy is a classification of the learning outcomes that educators set for their students, ordered from basic recall up to analysis, evaluation, and creation.

Higher-level tasks such as analysis, evaluation, and creation require synthesizing information, forming original insights, and applying knowledge in novel ways, all of which are currently beyond the capabilities of AI systems like GPT-4. This gap highlights an opportunity to design assignments that emphasize these complex cognitive processes, encouraging students to engage deeply with the material and develop skills that are less replicable by AI.

However, researchers say it won’t be enough to simply copy the types of assignment where ChatGPT struggled. We need a paradigm shift in assessment: making assessments harder not by raising the difficulty of individual questions, but by changing the kind of work that exams and assignments demand.

For instance, they recommend focusing more on project-based learning, as practical applications of knowledge may be more difficult for AI to replicate. Most students also consider project-based learning more difficult, possibly for the same reason.

They also suggest writing “AI-adversarial questions”: questions that demand originality and multi-disciplinary approaches. Naturally, these are traditionally considered more difficult than straightforward questions. Replacing pure memorization tasks with complex, open-ended questions is one of the first places to start.

“In the short term we should push for harder assessments — not in the sense of question difficulty, but in the sense of the complexity of the assessment itself, where multiple skills have to be taken from different concepts that are learned throughout the course during the semester and that are brought together in a holistic assessment,” Bosselut suggested. “The models aren’t yet really designed to plan and work in this kind of way and, in the end, we actually think this project-based learning is better for students anyway.”

Rather than banning AI, which is probably unrealistic by now, researchers say we should strive to integrate it as a tool that students should analyze critically, developing an ability to work responsibly alongside this technology.

A big challenge for universities

Many universities have become complacent in how they design assignments, relying on systems largely similar to those used decades ago. This is simply not going to work anymore.

University students will use ChatGPT and similar chatbots. And if they can get away with this on assignments, then the assignment is almost worthless. Of course, you can hold live exams with direct supervision and enforce a ban on AI, but this is not a long-term solution. We need to identify the tasks that AI can do easily and stop pushing students to develop those skills. Instead, we should focus on things that cannot be replicated easily by AI.

“AI challenges higher education institutions in many ways, for example: which new skills are required for future graduates, which are becoming obsolete, how can we provide feedback at scale and how do we measure knowledge? These kinds of questions come up in almost every management meeting at EPFL and what matters most is that our teams initiate projects which provide evidence-based answers to as many of them as we can,” said Pierre Dillenbourg, Vice President for Academic Affairs at EPFL.

It has already been over a year since chatbots became capable enough to help students pass exams, and most universities have been slow to react. In the long term, students will likely have even more access to AI tools they can use to cheat on exams and complete assignments.

At the same time, it doesn’t make sense to abandon these fields altogether. For instance, we’ve had calculators for decades, but we still teach mathematics and still consider it a cornerstone of education.

Calculators didn’t stop math classes

“This is only the beginning and I think a good analogy to LLMs right now are calculators. When they were introduced, there were a similar set of concerns that children would no longer learn mathematics. Now, in the earlier phases of education, calculators are usually not allowed but by high school and above, they are expected, taking care of lower-level work while students are learning more advanced skills that rely upon them,” added Beatriz Borges, a Ph.D. student in the NLP Lab and co-author of the paper.

“I think that we will see a similar, gradual adaptation and shift to an understanding of both what these systems can do for us and what we can’t rely on them to do. Ultimately, we include practical suggestions to better support students, teachers, administrators, and everyone else during this transition, while also helping to reduce some of the risks and vulnerabilities outlined in the paper,” she concluded.

The findings from EPFL serve as a wake-up call for higher education. AI is no longer a futuristic concept but an integral part of our present reality. Institutions must adapt quickly to harness its potential while safeguarding the integrity of education. It won’t be easy and will likely require some trial-and-error, but ultimately, this is what higher education needs in this day and age.

Journal Reference: Beatriz Borges et al., Could ChatGPT get an engineering degree? Evaluating higher education vulnerability to AI assistants, Proceedings of the National Academy of Sciences (2024). DOI: 10.1073/pnas.2414955121