The Wrong Test: How AI Exposed a Flaw in How We Measure Learning

Reading time: ~ 11 minutes

How generative AI exposed a flaw in the way we assess learning, and why fixing it requires more humans, not more technology.
Procedural art produced using p5.js. A Perlin noise flow field sets particle trajectories; a second, lower-frequency noise layer masks the deposit boundary; a third modulates internal density. Particles carry momentum between frames and leave line-segment trails whose colour shifts from concentrated at the boundary to translucent in the interior. The blend mode, noise octaves, and stroke weight are parameterised. Inspired by some Yorkstone slabs I found in Brick Lane, London, and polished after coming across David Ramalho’s experiment.
Author

Jon Cardoso-Silva

First Draft

11 April 2026

Modified

11 April 2026

Key Definitions used in this article
desirable difficulties: Manipulations that slow acquisition but enhance retention and transfer, such as spacing practice, interleaving task types, and testing rather than re-reading. Term from Nick Soderstrom and Robert Bjork. Conditions that make practice feel harder often produce better long-term outcomes (Soderstrom and Bjork, 2015).
learning: Used in two senses across this blog. When discussing outcomes, learning consists of relatively permanent changes in knowledge or skills that persist beyond practice and transfer to new contexts (Soderstrom and Bjork, 2015). But when discussing process, learning refers to knowledge created through the transformation of experience (Kolb, 1984), where the doing, reflecting, and experimenting are the learning.
meta-analysis: A statistical method that combines results from multiple independent studies to estimate an overall effect. A single study might be too small to be conclusive; a meta-analysis pools the evidence across many studies.
performance: Temporary changes in behaviour or knowledge observable during or immediately after practice. A student can perform well today and fail the same task next month. Performance during acquisition does not reliably indicate learning (Soderstrom and Bjork, 2015).
transfer: Using what you learned in one context to handle a different one. Includes solving a new problem unaided (direct application) and learning something new faster because of prior experience (preparation for future learning). See Bransford and Schwartz (1999).
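The pooling idea behind a meta-analysis can be sketched in a few lines. This is an illustrative fixed-effect, inverse-variance example with made-up numbers; real meta-analyses such as Deng et al. (2025) extract effect sizes from published studies and typically fit random-effects models.

```python
# Illustrative fixed-effect meta-analysis via inverse-variance weighting.
# The effect sizes and variances below are invented for the sketch.

effects = [0.30, 0.45, 0.10, 0.25]    # hypothetical standardised mean differences
variances = [0.04, 0.09, 0.02, 0.05]  # hypothetical sampling variances

weights = [1 / v for v in variances]  # more precise studies count for more
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"pooled effect = {pooled:.3f} (95% CI half-width {1.96 * pooled_se:.3f})")
```

The weighting is the whole trick: a small, noisy study (high variance) pulls the pooled estimate less than a large, precise one.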

Since ChatGPT launched in late 2022, anyone who teaches has had to reconsider how their students learn. I teach data science and programming at LSE, and the early conversation about the “death of the essay” applied to my courses too, since students write up their analyses alongside their code. But I worried about the coding at least as much, because ChatGPT could already write working code. The tools have only got better since: Claude Code, GitHub Copilot, Cursor, Lovable, and something new every few weeks.

The recent experimental evidence suggests this worry is grounded. Bastani et al. (2025) gave roughly a thousand high school maths students access to GPT-4 for practice and found that the group with unrestricted access scored 17% worse on the subsequent exam than students who never had the tool, even though their practice scores had gone up. Fan et al. (2025) ran a similar comparison with essay writing: the ChatGPT group produced higher-scoring work than those who used no AI, and even those paired with a human writing expert, but showed no advantage on a knowledge test or a task given the same day.

I use these tools myself every day, and I am excited about what they let me do that I could not do before. But it does worry me when I cannot tell whether my students have actually understood the material or whether they just handed in whatever the chatbot produced. That worry gets a bit existential when you sit down to revise your syllabus or plan the graded assignments for a new term. What am I actually measuring in an assignment? How do I design coursework that rewards the learning process more than a coherent output? This post is about grappling with those questions.

The gap between output and understanding

A meta-analysis of 69 ChatGPT experiments from 2022 to 2024 (Deng et al., 2025) finds that students using ChatGPT achieve higher academic performance, feel more motivated, and score better on higher-order thinking tasks, all while exerting less mental effort. More learning for less effort? Sounds great, right?!

But Deng et al. flag a measurement problem in their own data: of the 51 studies that contributed to the performance estimate, nine allowed students to use ChatGPT during the assessment itself, 33 did not report whether it was allowed, and only nine clearly prohibited it. The positive findings may reflect the quality of ChatGPT’s output rather than anything the students learned (Yan et al., 2025). A separate review (Walker & Vorvoreanu, 2025) reaches a similar conclusion: unstructured generative AI use in formal learning is associated with weaker memory, less critical engagement, and growing dependence on the tool 1.

1 It’s not all just bad news, though! I have a new blog post coming soon about the conditions under which Generative AI helps with learning.

When researchers remove the AI from students, the gains disappear. Akgun & Toker (2025) found that advantages measured on immediate tasks had vanished entirely three weeks later. Darvishi et al. (2024) found that students who had relied on AI produced lower-quality work after its removal than students who never had it. Even Bastani et al.’s pedagogically constrained GPT Tutor, which withheld direct answers and pushed students to reason through problems, only managed to avoid harm and did not outperform the no-AI control on the subsequent exam (Bastani et al., 2025).

Interestingly, Bastani et al. also ran an NLP classification of student messages and found that in 95% of GPT Base conversations, students asked for the answer at least once. The error pattern in the data tells us something about how they treated those answers. GPT-4 was correct only 51% of the time, making logical errors on 42% of problems and arithmetic errors on 8%. A student who was reading and evaluating the solutions would catch arithmetic mistakes more easily than logical ones, since checking a calculation is simpler than evaluating a line of reasoning. But both error types reduced practice scores by similar amounts, which suggests students were accepting the output wholesale rather than engaging with it: the harm came from copying answers, not from internalising wrong concepts they ‘learned’ from the AI. It is also telling that the students themselves never reported feeling that they had learned less, even though their exam scores had dropped by 17%.

Kosmyna et al. (2025) approached the question from a different direction 2. Using EEG to measure brain activity during essay writing across three conditions (ChatGPT, a search engine, and no tools), they found that cognitive engagement scaled down with the level of external support, with ChatGPT users showing the weakest neural connectivity. Participants who later switched from ChatGPT to working alone showed reduced neural engagement compared to those who had never used it, a pattern the researchers call cognitive debt 3. Over four months, the ChatGPT group underperformed at neural, linguistic, and behavioural levels, and struggled to recall or quote their own essays.

2 You might have come across this study before as it made quite the splash in the media and online in general.

3 There has been a proliferation of new terms to describe this in the literature! I plan to write separately about the difference between them. There’s cognitive offloading, cognitive debt, our own cognitive bypass (Sallai et al., 2024), and the newly coined cognitive surrender (Shaw & Nave, 2026).

Learning science predicted this

I would wager that none of the findings above would come as a surprise to a cognitive psychologist. The gap between how well students perform during practice and how much they actually retain has been studied for decades, long before anyone had heard of ChatGPT. Soderstrom & Bjork’s (2015) Learning versus performance: An integrative review 4 synthesises that body of work. What looks like learning during practice often is not. Students can perform well while receiving instruction but fail tests on the same topic weeks later. Conversely, and counterintuitively, students who struggle during practice often outperform their more fluent peers on delayed tests. That is, conditions that make acquisition feel easier often produce worse long-term outcomes.

4 Google Scholar counts nearly 1000 citations for this paper.

Bjork & Bjork’s (1992) “new theory of disuse” offers a possible mechanism. They distinguish storage strength (how integrated a memory is with other knowledge) from retrieval strength (how accessible it is right now). Gains in storage strength are greater when current retrieval strength is lower: the harder you work to retrieve something, the more that retrieval strengthens the memory. Bjork coined the term “desirable difficulties” 5 for manipulations that slow acquisition but enhance retention and transfer: spacing practice across sessions rather than cramming, interleaving different task types rather than practising one to mastery, and testing rather than re-reading.

5 I love this concept! My students, not so much… I might write about how I engineer pedagogical attrition in my courses in the future.
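The storage/retrieval distinction lends itself to a toy numeric sketch. To be clear, the update rules below are invented for illustration and are not Bjork & Bjork’s formal model; they only capture the qualitative claim that storage gains are larger when retrieval strength is low, and that spacing lets retrieval strength decay between sessions.

```python
# Toy sketch only: invented update rules, NOT Bjork & Bjork's model.
# Storage-strength gain per practice event shrinks as current
# retrieval strength rises; retrieval strength decays between sessions.

def practise(sessions: int, decay: float) -> float:
    storage, retrieval = 0.0, 0.0
    for _ in range(sessions):
        retrieval *= decay                # forgetting between sessions
        gain = 0.3 * (1 - retrieval)      # harder retrieval -> bigger storage gain
        storage += gain
        retrieval = min(1.0, retrieval + gain)
    return storage

massed = practise(sessions=5, decay=1.0)  # back-to-back practice, no forgetting
spaced = practise(sessions=5, decay=0.5)  # gaps between sessions

print(f"storage after massed practice: {massed:.2f}")
print(f"storage after spaced practice: {spaced:.2f}")
```

Under these made-up parameters the spaced condition ends with higher storage strength precisely because each retrieval was harder, which is the desirable-difficulties intuition in miniature.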

Generative AI provides exactly the conditions that the cognitive science research would predict to be harmful. GenAI reduces difficulty, provides constant guidance, and makes work feel fluent. The apparent fluency is itself a large part of the problem: students may believe they have learned because the task felt easy, when in fact they have not built durable memory or transferable skills.

Several mechanisms contribute to this, from the straightforward (the model does the cognitive work so the student never practises it) to the subtler (the student stops evaluating the model’s output, and then stops noticing when it is wrong). I unpack the different labels researchers have given to these processes, and how they relate to each other, in a new blog post. The short version is that these labels describe the same problem at different levels, from the general mechanism to the behaviour you see in a student’s chat log.

The decades of research that Soderstrom and Bjork review show that ease during practice is a weak guide to what people retain weeks later and to how they fare on a genuinely new task. The mismatch between performance and learning existed long before GenAI, but a chatbot can now produce the polished output that used to require understanding. When we grade what students can generate in a single sitting, with or without a model whispering in the tab, we are mostly auditing that output.

Assessment built on the wrong theory

Most university courses and the studies I described above share the same assumption about how to find out whether students learned: teach them, let them practise, then remove the support and judge how they tackle a similar task on their own. Bransford & Schwartz (1999) call this “sequestered problem solving” (SPS).

Bransford & Schwartz argue that SPS, and the “direct application” theory of transfer that accompanies it, are responsible for much of the pessimism about whether education produces transfer at all. Under SPS, transfer looks rare. Students tested in isolation frequently fail to produce adequate solutions (1999, pp. 66–68), and the conclusion is that their education did not prepare them.

But consider what SPS misses. When Bransford & Schwartz (1999, pp. 66–67) asked fifth graders and college students to create recovery plans for bald eagles, neither group produced adequate plans. Under SPS, both failed. But when asked what they would need to research, the groups diverged: fifth graders asked about individual eagles (“How big are they?”), while college students asked structural questions about ecosystems, historical threats, and the kinds of specialists needed. Their prior learning had not given them the answer, but it had prepared them to ask better questions.

Bransford & Schwartz call this “preparation for future learning” (PFL). Where direct application asks “can you apply what you know?”, PFL asks “has what you know prepared you to learn new things?” The evidence for PFL is found in process: the sophistication of questions asked, the quality of hypotheses formed, the ability to seek and use resources effectively, the trajectory of improvement when given the chance to revise.

Nobody would assess a newly qualified teacher by locking her in a room and testing whether she can recall her education courses from memory. You would watch her teach over time: how she adapts to her students, how she seeks feedback, how her practice improves. That is PFL assessment. Yet for most students, in most courses, we give them the locked-room version and call the result “evidence of learning.”

Broudy (1977, as discussed by Bransford & Schwartz (1999)) offers a useful concept here: “knowing with.” Beyond replicating facts (knowing that) and applying procedures (knowing how), people perceive and interpret the world through their accumulated knowledge. An educated person “thinks, perceives and judges with everything that he has studied in school, even though he cannot recall these learnings on demand.” You forget the details of a biology course, but the concept of bacterial infection still shapes how you interpret illness. You forget specific statistical formulas, but the idea that data has variability still shapes how you read a graph 6. This residual framework, largely tacit, is what Broudy calls “knowing with,” and it is what PFL-style assessment tries to detect.

6 This vibes a lot with the style of education Paulo Freire advocated in his critical pedagogy work.

SPS cannot tell us whether GenAI-mediated learning changed what students “know with,” whether it shifted their questions or their readiness to learn the next thing. Our conventional assessments cannot tell us either. As Bransford & Schwartz write: “Despite the value of the SPS methodology, it often comes with a set of unexamined assumptions about what it means to know and understand. The most important assumption is that ‘real transfer’ involves only the direct application of previous learning; we believe that this assumption has unduly limited the field’s perspective.”

We still treat one solo shot at the problem as the verdict on learning. Preparation for future learning looks elsewhere: at how students approach problems they have not seen before, and whether they get better when given the chance to revise.

Watching how students learn

The alternative to sequestered problem solving assessment is to assess how students learn when given the opportunity to do so, using all the resources available to them, including Generative AI. This is what Bransford & Schwartz’s PFL perspective implies, and it shifts attention from product to process 7.

7 Much of the assessment theory here draws on work from the late 1990s. From what I can tell, the arguments still hold up. A recent systematic review of reviews on problem-based learning (Amoa-Danquah & Carbonneau, 2025) confirms that process-focused, learner-centred approaches continue to show benefits for engagement and critical thinking. I plan to write separately about PBL and project-based learning.

Bransford & Schwartz (1999) give a concrete example: students attempted a geometry challenge, rated how confident they were, and chose how much help they wanted (from a brief definition up to an interactive simulation). They then tried an analogous problem. What mattered was how they responded to difficulty: did the student recognise they were stuck, pick help that addressed the gap, and improve on the second attempt?

Current assessment rarely captures any of this. A student who uses ChatGPT to produce correct code and a student who struggled with the problem, asked the AI specific questions, and modified what it returned would receive the same mark on most rubrics. The first student may not notice a gap in their understanding when the problem changes; the second has already shown that they would.


Although I do not yet have answers for how to implement process-based assessment at scale, my colleagues and I have been developing a framework to help make sense of what students actually do when they interact with GenAI. The GENIAL Framework (Cardoso-Silva et al., 2025) provides a tool for investigating the process of learning when mediated by GenAI. It describes several engagement patterns, of which two are most relevant here: “Resourceful” (students who use AI to support their own thinking by adapting suggestions, asking follow-up questions, and testing alternatives) and “Receptive” (students who delegate thinking to AI by copying output without modification and accepting answers without evaluation). In Bransford & Schwartz’s terms, Resourceful engagement looks like PFL evidence and Receptive engagement looks like SPS failure made visible in real time 8.

8 I plan to write separately about what process-based assessment looks like in practice.
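To make the Resourceful/Receptive contrast concrete, here is a deliberately crude sketch of how one might flag the two patterns in a chat log. The marker phrases and the decision rule are invented for illustration; the GENIAL framework itself is an analytical lens for reading interaction logs, not a keyword classifier.

```python
# Crude heuristic for illustration only: the marker phrases and the
# majority rule below are invented here, not part of the GENIAL
# framework (Cardoso-Silva et al., 2025).

def engagement_pattern(messages: list[str]) -> str:
    follow_up_markers = ("why", "what if", "how would", "can you explain")
    delegation_markers = ("write the code", "give me the answer", "do it for me")
    follow_ups = sum(
        any(m in msg.lower() for m in follow_up_markers) for msg in messages
    )
    delegations = sum(
        any(m in msg.lower() for m in delegation_markers) for msg in messages
    )
    return "Resourceful" if follow_ups > delegations else "Receptive"

log = ["Write the code for question 2.", "Give me the answer to part b."]
print(engagement_pattern(log))  # this toy log reads as "Receptive"
```

A real analysis would, of course, read the transcripts rather than count keywords; the sketch only shows that the two patterns leave different traces in the same log.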

Observing learning trajectories takes time. It requires reading students’ process alongside their products, examining how they interact with tools, tracking their questions and revisions. This is labour-intensive work that does not scale through automation (yet?), because the judgment required is precisely the kind that cannot be delegated to an algorithm without recreating the problem we are trying to solve. If we outsource the evaluation of learning processes to AI, we risk reproducing at the assessment level the same gap we identified at the student level: the appearance of rigour without the substance.

Educators who want to assess learning rather than performance need time to observe processes, design dynamic assessments, and evaluate trajectories. This means smaller class sizes, so that staff have the time and motivation to investigate their students’ learning processes when grading their work, or more teaching staff, or both. It means administrators recognising that AI-resilient assessment is not a technology problem with a technology solution. The answer may be, uncomfortably to some, more humans, not more AI.

That said, the evidence is not uniformly bleak. There are conditions under which students who use AI learn more than those who work without it, and the design choices that separate productive use from dependency are becoming clearer. 9

9 I plan to cover those findings in a new blog post.

References

Akgun, M., & Toker, S. (2025). Short-Term Gains, Long-Term Gaps: The Impact of GenAI and Search Technologies on Retention. arXiv. https://doi.org/10.48550/ARXIV.2507.07357
Amoa-Danquah, P., & Carbonneau, K. J. (2025). A Systematic Review of Reviews on Problem-Based Learning and Its Effectiveness. Current Issues in Education, 26(2). https://doi.org/10.14507/cie.vol26iss2.2293
Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26), e2422633122. https://doi.org/10.1073/pnas.2422633122
Bjork, R. A., & Bjork, E. L. (1992). A new theory of disuse and an old theory of stimulus fluctuation. In A. F. Healy, S. M. Kosslyn, & R. M. Shiffrin (Eds.), From learning processes to cognitive processes: Essays in honor of William K. Estes (Vol. 2, pp. 35–67). Erlbaum.
Bransford, J. D., & Schwartz, D. L. (1999). Rethinking transfer: A simple proposal with multiple implications. In Review of research in education (Vol. 24, pp. 61–100). American Educational Research Association. https://doi.org/10.2307/1167267
Cardoso-Silva, J., Sallai, D., Kearney, C., Panero, F., & Barreto, M. E. (2025). Mapping Student-GenAI Interactions onto Experiential Learning: The GENIAL Framework. SSRN Electronic Journal, 22. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5674422
Darvishi, A., Khosravi, H., Sadiq, S., Gašević, D., & Siemens, G. (2024). Impact of AI assistance on student agency. Computers & Education, 210, 104967. https://doi.org/10.1016/j.compedu.2023.104967
Deng, R., Jiang, M., Yu, X., Lu, Y., & Liu, S. (2025). Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Computers & Education, 227, 105224. https://doi.org/10.1016/j.compedu.2024.105224
Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology, 56(2), 489–530. https://doi.org/10.1111/bjet.13544
Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. arXiv. https://doi.org/10.48550/ARXIV.2506.08872
Sallai, D., Cardoso-Silva, J., Barreto, M. E., Panero, F., Berrada, G., & Luxmoore, S. (2024). Approach generative AI tools proactively or risk bypassing the learning process in higher education. LSE Public Policy Review, 3(3), 7. https://doi.org/10.31389/lseppr.108
Shaw, S. D., & Nave, G. (2026). Thinking—Fast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender. PsyArXiv. https://doi.org/10.31234/osf.io/yk25n_v1
Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An integrative review. Perspectives on Psychological Science, 10(2), 176–199. https://doi.org/10.1177/1745691615569000
Walker, K., & Vorvoreanu, M. (2025). Learning outcomes with GenAI in the classroom: A review of empirical evidence [Microsoft Aether Psychological Influences of AI (Psi) working group]. Microsoft Research. https://www.microsoft.com/en-us/research/wp-content/uploads/2025/10/GenAILearningOutcomes-Report-published-10-07-2025.pdf
Yan, L., Greiff, S., Lodge, J. M., & Gašević, D. (2025). Distinguishing performance gains from learning when using generative AI. Nature Reviews Psychology, 4, 435–436. https://doi.org/10.1038/s44159-025-00467-5