19 May 2026
When do Generative AI tools act as a catalyst for learning?


LSE AI and Education Fellow (2025–2027)
I am one of the 10 LSE Fellows in AI and Education, a programme with an ambitious goal: to test how to embed Generative AI in the teaching and learning practices of our disciplines.
* (I’m moving to the LSE Department of Methodology as an Associate Professor in September)
Although the opinions are mine, some of the data and preliminary findings come from the GENIAL study.
We asked students to share their chat logs and, in the case of my data science courses, I also collected the git histories of their assignments.

| Case study | Autumn Term (Sep–Dec 2023) | Winter Term (Jan–Mar 2024) |
|---|---|---|
| Undergraduate courses | DS105A – Data for Data Science (Quant) | DS105W – Data for Data Science (Quant) |
| | DS202A – Data Science for Social Scientists (Quant) | DS202W – Data Science for Social Scientists (Quant) |
| | ST207 – Databases (Quant) | MG317 – Leading Organisational Change (Qual) |
| Postgraduate courses | (none) | ST456 – Deep Learning (Quant) |
| | | PP422 – Data Science for Public Policy (Quant) |
| | | MG4B7 – Leading Organisational Change (Qual) |
Cohorts: 48 active participants (out of 200+) / ~160 active participants (out of 300+)
Students were very optimistic about GenAI, and the tools already seemed to be fantastically helpful for their learning.
There’s nothing to learn from such a biased sample, right?
Well..

Source: LSE DS105 website (lse-dsi.github.io/DS105)
| | Student A | Student B |
|---|---|---|
| Background | 2nd Year BSc, International Social and Public Policy | 2nd Year BSc, Economics |
| Prior coding | Took the Python pre-sessional (struggled with it) | Had prior experience with Python |
| How they used ChatGPT | Used ChatGPT to build the solution for the assignment | First reviewed the week’s content with ChatGPT, then asked for help |
Student A’s logs

Student B’s logs


The task involved writing code to:
Student A missed two crucial steps. They were misled by an inaccurate use of GenAI, clearly swayed by the chatbot’s authoritative tone.

After receiving feedback on their poor performance and on their unhelpful use of GenAI, Student A improved tremendously, earning distinction-level marks on the next two graded assignments.
Despite what I had initially thought, the pedagogical tools I adopted in the course had not failed me. They were actually what helped Student A!
Even if this was not apparent to students, the mapping and the process-driven approach to assessment made it easier for me to identify where the student had gone down the wrong path.
The continuous feedback mechanism helped them course-correct more effectively and still learn in time. The student even engaged more in class (they were someone who wanted to learn!)
I can now take what I learned from this interaction and carry out a backwards (re)design of the assignment for future iterations.
Marking student work becomes a rewarding task of discovery and less of a chore (though it is still very laborious…)
The GENIAL framework maps student interactions with GenAI onto Kolb’s experiential learning cycle.
Each stage is coded for whether the student:
Five engagement patterns along an agency spectrum:
Resistive → Receptive → Resourceful → Reflective → Riffing
Student A’s traces code as Receptive. Student B’s code as Resourceful.
Level 1: low inference
Codes each exchange on observable features:
Level 2: high inference
Groups exchanges into learning cycles (one sub-task) and asks:
Each stage gets a quality label: +, −, or skip.
Default for any unobserved stage is “skipped.” The coder needs positive evidence to code it otherwise.
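A minimal sketch of the "default is skipped" rule, assuming the stages being labelled are Kolb's four stages. The stage names, the `LearningCycle` class, and the example sub-task are hypothetical illustrations, not the study's actual codebook, and the code uses an ASCII `-` in place of the slide's minus sign.

```python
from dataclasses import dataclass, field

# Kolb's four stages; these short names are illustrative, not the study's codebook.
KOLB_STAGES = ("experience", "reflection", "conceptualisation", "experimentation")
VALID_LABELS = {"+", "-", "skip"}


@dataclass
class LearningCycle:
    """One learning cycle (one sub-task) with a quality label per stage."""

    sub_task: str
    labels: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything outside the agreed stages and labels.
        for stage, label in self.labels.items():
            if stage not in KOLB_STAGES:
                raise ValueError(f"unknown stage: {stage}")
            if label not in VALID_LABELS:
                raise ValueError(f"unknown label: {label}")
        # Any stage the coder did not positively code falls back to "skip".
        self.labels = {s: self.labels.get(s, "skip") for s in KOLB_STAGES}


# Hypothetical cycle: the coder only found evidence for two of the four stages.
cycle = LearningCycle("parse API response", {"experience": "+", "experimentation": "-"})
print(cycle.labels)
```

Filling in the default inside the constructor means a coder can never leave a stage blank by accident: the absence of evidence is always recorded explicitly as a skip.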
Two coders. Training round on students 1–5, calibration round on students 6–10. Target κ > 0.60.
The best comparable in the published literature (Oliveira et al.’s DRIVE framework) reports κ = 0.44 on its hardest categories.
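For readers less familiar with κ, here is a small sketch of Cohen's kappa for two coders, which corrects raw agreement for the agreement expected by chance. The function name and the ten labels per coder are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for ten stages, using the +, -, skip scheme.
coder_a = ["+", "+", "-", "skip", "skip", "+", "-", "skip", "+", "skip"]
coder_b = ["+", "-", "-", "skip", "skip", "+", "-", "+", "+", "skip"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}, meets target: {kappa > 0.60}")
```

In this toy case the coders agree on 8 of 10 items, but chance alone would produce 34% agreement, so kappa is lower than the raw 80% figure suggests.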
The research assistant codes Level 1 and does not know the grades. The derived variables that enter the regression come from his codes, not mine.
I code Level 2. The dual role (instructor and researcher) is documented.
| | DS105W (Data Science) | MG317 (Management) |
|---|---|---|
| Exam? | No | No |
| Assessment | Coursework only | Coursework only |
| AI access | Enterprise Claude (LSE) + personal tools | Enterprise Claude (LSE) + personal tools |
| Process data | Chat logs, git histories, reflections | Chat logs, reflections |
83 consenting students in DS105W (80% of cohort)
53 shared chat logs
9,721 student-AI exchanges
1,689 git commits
Three submission points across the term (W04, W06, W11)
What I’m building next in the second strand of the fellowship: a system that proactively reaches out to students when they look stuck, rather than waiting to be asked.
If the tutor works and grades go up, those increases look identical to grade inflation from the outside. Without process data, you cannot tell “the tutor helped them learn” from “the tutor helped them perform.” I will have to think about what to do about grade inflation.
GenAI Creates Performance Paradoxes
Bastani et al. (2025): Students using unstructured GPT-4 scored 48% higher on practice problems but 17% lower on the exam.
The ‘quick wins’ obtained early in the learning journey might not translate into real deep learning. In fact, they might make it worse. This is what happened to Student A.
Contextual Pressures Drive Problematic Dependency
Abbas et al. (2024): Time pressure and academic workload are significant predictors of ChatGPT dependency.
When under pressure, students might be more likely to resort to a more ‘Receptive’ style of engagement with GenAI.
Usage Patterns Determine Learning Outcomes
Lehmann & Cornelius (2024): Substitutive use increases coverage of material but decreases understanding. Complementary use does the opposite.
The authors argue that it’s how one uses the tools that matters.
Deng et al. (2025) on 69 ChatGPT studies
Pooled effect on academic performance: g = 0.71
Of 51 studies in that estimate:
The positive finding may reflect ChatGPT’s output quality, not student learning (Yan et al., 2025)
Maier et al. (2026) on programming, pre-registered
10 studies measuring exam scores after a GenAI-assisted learning phase: g = 0.14 (n.s.)
Exam-environment moderator (~50% of variance):
The two results agree once you ask what was being measured.
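As a reminder of what the g values above measure: Hedges' g is a standardised mean difference between two groups with a small-sample correction. The sketch below computes it for a single study. The function name and the scores are made up for illustration, and the pooled figures quoted above come from combining many such per-study estimates, which this sketch does not attempt.

```python
from math import sqrt
from statistics import mean, variance


def hedges_g(treatment, control):
    """Standardised mean difference with Hedges' small-sample correction."""
    n1, n2 = len(treatment), len(control)
    # Pooled standard deviation from the two sample variances.
    pooled_sd = sqrt(
        ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    )
    cohens_d = (mean(treatment) - mean(control)) / pooled_sd
    # The correction factor shrinks d slightly when samples are small.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d * correction


# Invented scores: a GenAI-assisted group versus a control group.
genai_group = [72, 65, 80, 70, 68]
control_group = [60, 66, 71, 58, 65]
print(f"g = {hedges_g(genai_group, control_group):.2f}")
```

The same formula applies whether the scores are AI-assisted coursework or unassisted exam results, which is why the choice of outcome matters so much when reading a pooled g.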
Chirikov (2026): 507,076 grades across 319 courses at a US research university, 2018 to 2025.
Writing and coding courses saw a 13 percentage point increase in A grades after ChatGPT’s release (~30% above 2022 baseline)
A triple-differences design ties the effect to courses where homework counted for more. Where homework weight was low, the effect was near zero.
If students were learning more, the improvement should have appeared in supervised exams too. It did not.
Chirikov (2026) calls this task displacement: AI improved what students submitted without improving what students knew. This is Student A’s story at institutional scale.
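To make the triple-differences logic concrete, here is a toy sketch under my own reading of the design: compare the before/after change in A grades between AI-exposed (writing and coding) courses and other courses, then compare that gap between high-homework and low-homework courses. Every number below is invented for illustration, and the paper's actual specification is a regression with more structure than this.

```python
# Share of A grades (in %) per cell. All values are invented for illustration.
# Key: (ai_exposed_course, high_homework_weight, period)
a_share = {
    (True, True, "pre"): 30.0,
    (True, True, "post"): 40.0,
    (True, False, "pre"): 30.0,
    (True, False, "post"): 33.0,
    (False, True, "pre"): 30.0,
    (False, True, "post"): 32.0,
    (False, False, "pre"): 30.0,
    (False, False, "post"): 32.0,
}


def change(exposed, high_hw):
    """First difference: post-release minus pre-release share of A grades."""
    return a_share[(exposed, high_hw, "post")] - a_share[(exposed, high_hw, "pre")]


def diff_in_diff(high_hw):
    """Second difference: exposed courses minus other courses."""
    return change(True, high_hw) - change(False, high_hw)


# Third difference: high-homework courses minus low-homework courses.
triple_diff = diff_in_diff(True) - diff_in_diff(False)

print(f"DiD, high homework weight: {diff_in_diff(True):.1f} pp")
print(f"DiD, low homework weight:  {diff_in_diff(False):.1f} pp")
print(f"Triple difference:         {triple_diff:.1f} pp")
```

The third difference strips out anything that lifted grades in all exposed courses regardless of how they were assessed, leaving the part of the rise tied to unsupervised homework.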
| Study | Domain | Finding |
|---|---|---|
| Bastani et al. (2025) | Maths (RCT, ~1000 students) | GPT Base group scored 17% worse on exam |
| Fan et al. (2025) | Essay writing | Better essays, same knowledge test scores |
| Kosmyna et al. (2025) | Essay writing (EEG) | Up to 55% less neural connectivity |
| Shaw & Nave (2026) | Reasoning tasks | Followed wrong AI ~80% of the time |
| Akgun & Toker (2025) | Factual recall | ChatGPT advantage at 3 weeks: gone |
These come from cognitive psychology, learning science, neuroscience, labour economics, and programming education, and none of them cite each other.
Bransford & Schwartz (1999)
Distinguished two ways to measure what education produces:
Sequestered problem solving: isolate the learner, remove all resources, test recall. Most exams work this way.
Preparation for future learning: watch how students approach new problems. Do they ask better questions? Improve when given a chance to revise?
The gap between what students hand in and what they retain has existed for decades.
Bjork & Bjork (1992)
Conditions that slow practice down produce better long-term retention.
Three mechanisms:
GenAI removes all three 🙃. It answers on demand, provides complete solutions, and narrows the interaction to a single loop.
Students feel fluent, but the fluency comes from the tool rather than from anything they will retain.
Assessment is the major lever to shape student behaviour. If we want to encourage productive use of GenAI, we need to redesign our assessments so they reward it.
I don’t like the idea of exams (“sequestered problem solving”). Instead I favour instruments that are more aligned with the authentic practices of the discipline being taught.
Higher-order tasks
AI substitution fails on more complex work. On Akgun & Toker’s (2025) hardest task (evaluate a policy document, propose improvements), ChatGPT provided no measurable benefit and AI-generated content dropped from 72% to 27%.
Observed components
One option: a short live coding session. A student who built the pipeline can modify it on the spot; a student who pasted output has to reconstruct what the code does first.
Rubrics that reward the process
DS105W uses criterion-based rubrics (Pass / Good / Really Good / WOW) covering technical judgement, interpretation, and communication alongside code correctness.
Cardoso-Silva, J., Sallai, D., Kearney, C., Panero, F., & Barreto, M. E. (2025). Mapping Student-GenAI Interactions onto Experiential Learning: The GENIAL Framework. SSRN Electronic Journal. (Under Review)
Sallai, D., Cardoso-Silva, J., et al. (2024). Approach Generative AI Tools Proactively or Risk Bypassing the Learning Process in Higher Education. LSE Public Policy Review, 3(3), 7.

Cardoso-Silva (2026) | 19 May 2026 SciencesPo, AI in the Public Sphere