The mutant algorithm: what actually went wrong with this year’s school exams
The Prime Minister described it as a “mutant algorithm”.
Social commentators were naturally outraged as it disproportionately upgraded the results of children from private schools, while downgrading those from state schools, especially in the poorest areas.
Few people actually understood how it worked, as the technical report from the Office of Qualifications and Examinations Regulation (Ofqual) purporting to explain the algorithm ran to 319 pages.
In the end, the results of the government’s attempt to recalibrate exam grades with a computer were clear enough. The model embedded biases its developers had not considered, and almost 40 per cent of students had their teacher-assessed grades adjusted down.
In the face of public outcry, the government decided to junk the algorithm altogether and revert to teachers’ grades, opening up the undesirable prospect of significant grade inflation, as well as budget chaos at universities.
How did we get here, and what can we learn?
After the exams were cancelled, Ofqual was confronted with two fundamental challenges. The first was to avoid grade inflation and standardise the scores; the second was to assess students as accurately as possible so that university admissions could proceed in as fair a manner as possible.
Unfortunately, the collective decision was made to prioritise standardisation over the ultimate purpose of the exams system: ensuring that university places are allocated on the basis of merit.
Once standardisation was made the ultimate goal, the logical step was for the regulator to come up with a statistical model to predict a distribution of entrance-exam scores for 2020 that would match the distribution from last year.
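To make that concrete, here is a minimal sketch in Python of what fitting one year’s grades to the previous year’s distribution means in practice. This is my own illustration, not Ofqual’s published model, and every name and figure in it is hypothetical: rank the students, then hand out grades so that the proportions of each grade mirror last year’s.

```python
# A minimal sketch of distribution matching, NOT Ofqual's actual model:
# rank this year's students, then award grades so that the share of each
# grade matches last year's. All names and figures here are hypothetical.

def match_distribution(ranked_students, grades, last_year_shares):
    """ranked_students: strongest candidate first.
    last_year_shares: fraction of pupils awarded each grade last year,
    in the same order as `grades`."""
    results = {}
    n = len(ranked_students)
    boundary, start = 0.0, 0
    for grade, share in zip(grades, last_year_shares):
        boundary += share
        end = round(boundary * n)          # grade boundary, by rank
        for student in ranked_students[start:end]:
            results[student] = grade
        start = end
    return results

grades = ["A*", "A", "B", "C"]
shares = [0.10, 0.25, 0.40, 0.25]          # hypothetical prior-year shares
ranked = [f"pupil_{i}" for i in range(1, 11)]
print(match_distribution(ranked, grades, shares))
```

The point of the sketch is that, under any scheme of this shape, a pupil’s grade is determined by where the historical boundaries fall rather than by anything the pupil actually did this year.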
What ensued was only ever going to be unfair to individuals, on a national scale.
So what went wrong? To use philosophical terminology, the algorithm had a fundamental category mistake built into it.
It treated the efforts, abilities, and corresponding hopes and expectations of millions of young people as inputs that could not only be quantified but also reshaped by an arbitrary set of figures and functions.
Considered more scientifically, the algorithm assigned the wrong weightings to its factors and added an arbitrary adjustment for class size.
The specifics of how it worked are rather involved, otherwise they would not have required a 319-page explanation, but there were three key inputs.
Firstly, Centre Assessment Grades (CAGs). That is to say, what grade the teachers expected each of the children to get.
Secondly, a weighting within that grade for how sure the teacher was that the child would attain that grade.
Finally, a factor to adjust for teacher bias by fitting all grades to the distribution of grades that other children at the same school attained over the previous three years.
Different rules were applied within the algorithm for different class sizes. For classes of five or fewer children, pupils were simply awarded their CAGs. For classes of between six and fifteen children, the CAGs carried some weight, as did the historical performance of other children at the same school. For classes of more than fifteen children, the CAGs had no impact at all.
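Expressed as a hedged sketch, with the thresholds taken from the description above but with weights and grade-point values that are my own illustrative assumptions (the published model was far more involved), the class-size rules look something like this:

```python
# Illustrative sketch of the class-size rules described above; the linear
# weight and the grade-point scale are assumptions, not Ofqual's numbers.

def final_grade_points(cag_points, historical_points, class_size):
    if class_size <= 5:
        # small class: the teacher's CAG stands unaltered
        return cag_points
    elif class_size <= 15:
        # medium class: blend the CAG with the school's historical record;
        # this linear taper is an assumed stand-in for the real weighting
        cag_weight = (16 - class_size) / 11
        return cag_weight * cag_points + (1 - cag_weight) * historical_points
    else:
        # large class: the CAG has no impact at all
        return historical_points

# e.g. a predicted A* (say 8 points) in a class of 30, at a school whose
# past pupils averaged a B (6 points), collapses to the historical B:
print(final_grade_points(cag_points=8, historical_points=6, class_size=30))
```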
The inevitable outcome was that children in private schools, which almost always have much smaller class sizes, benefited greatly.
Teachers set CAGs optimistically high to maintain their schools’ reputations, and those grades were either awarded outright or weighted into the final result.
Conversely, clever children in under-performing state schools could do no better than the highest mark achieved by their predecessors, the cleverest children of previous years.
So plenty of children who would have earned A* results, but came from schools with a history of lower grades, were marked down.
The government ultimately made a U-turn, just two days after insisting that it would not change course from the patchwork of fixes that briefly followed the algorithm.
The new plan is simply to rely on CAGs for each child’s final grade. This may seem fairer in many cases, but it guarantees grade inflation and introduces some positive bias towards children their teachers consider nicer.
Having spent my school years being more intelligent than amiable, I am sure that I would have been materially disadvantaged by my irate teachers having the final word on my grades.
But why didn’t they just do this to start with?
The government didn’t want to go down this road because of universities.
The Department for Education lifted the cap on university places at the same time as announcing that CAGs would become the final grades. That simply pushes already under-funded universities into harder admissions decisions in a system where grades are universally higher.
In a report published last week by the Oxford Internet Institute, researchers found that one of the most common traps organisations fall into when implementing algorithms is the belief that they will fix extremely complex structural issues.
Educational attainment is unquestionably a complex structural issue. Algorithms might usefully be developed to help on an individual basis, for example for students with learning difficulties who do not lack intelligence or ability but simply fare poorly under traditional testing methods. But using an algorithm to blanket-mark a whole generation of young people was always going to end badly.
We mustn’t blame algorithms per se, but simply the context in which they are used and the erroneous weightings they often involve.
Recognising this going forward should reduce the risk of future mutants.
Nick Maughan founded education charity the Nick Maughan Foundation.