On March 18, the government announced that, like so many annual institutions that have fallen victim to Covid-19, this summer’s exams would be cancelled.
In the exams’ place, the Office of Qualifications and Examinations Regulation (Ofqual) asked teachers to predict the grades each of their students would have achieved. As teachers tend to be optimistic about their students’ attainment – they are prone to predict the result a pupil would get “on a good day” – further refinement was deemed necessary so as not to undermine sixth forms’, universities’, and employers’ confidence in this year’s results. This is where Ofqual’s now much-maligned algorithm stepped in, as a dampener to this optimism.
Ofqual claimed that if this standardisation had not taken place, the percentage of A* grades at A-level would have risen from 7.7 per cent in 2019 to 13.9 per cent this year. But after the algorithm downgraded 39 per cent of the A-level grades predicted by teachers in England, the government has effected a total U-turn. The algorithm has been ditched, and students will belatedly be awarded their teachers’ original predicted grades. So how did the algorithm work? And what went wrong?
The model was extremely complicated. (Jeni Tennison, vice president of the Open Data Institute, breaks it down in detail here). First, Ofqual created a historic profile of the grades pupils had previously achieved in each subject offered at each school – for A-level results, for instance, it examined the last three years of results per subject. It then examined how pupils’ prior attainment related to their final grade in those subjects each year.
Then Ofqual generated three sets of grades – the actual distribution of grades from previous years, the predicted distribution of grades for past students, and the predicted distribution of grades for current students. Both predicted distributions were based on what was achieved nationally in each subject, in previous years, by children with similar prior attainment (GCSEs for A-levels, or Key Stage 2 results for GCSEs).
The algorithm calculated the difference between the predicted distributions for current and previous students, and used this to adjust the actual distribution for previous students, giving a distribution for current students.
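The core of this standardisation step can be sketched in a few lines of Python. All the grades and proportions below are invented for illustration – Ofqual’s real model worked with far more detail than this simplified four-grade example:

```python
# A simplified sketch of the adjustment step described above.
# All numbers are invented for illustration, not Ofqual's actual figures.

# The school's actual grade distribution over previous years,
# as proportions of the cohort achieving each grade.
actual_previous = {"A*": 0.10, "A": 0.20, "B": 0.40, "C": 0.30}

# Distributions predicted from national prior-attainment data
# (GCSEs, in the A-level case): one for past cohorts, one for this year's.
predicted_previous = {"A*": 0.08, "A": 0.22, "B": 0.38, "C": 0.32}
predicted_current = {"A*": 0.12, "A": 0.24, "B": 0.36, "C": 0.28}

def adjust(actual_prev, pred_prev, pred_curr):
    """Shift the school's historical distribution by the difference
    between this year's and past years' predicted distributions."""
    return {
        grade: actual_prev[grade] + (pred_curr[grade] - pred_prev[grade])
        for grade in actual_prev
    }

current_distribution = adjust(actual_previous, predicted_previous, predicted_current)
# e.g. the A* share becomes 0.10 + (0.12 - 0.08) = 0.14
```

In effect, a school’s own track record anchors this year’s results, shifted only by however much this year’s intake differs, on prior attainment, from previous intakes.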
After this, students were assigned their grades based on a ranking, provided by teachers, of all the students in a cohort from best to worst. Ofqual outlined the reasoning behind this in a 317-page report, saying people are better at making relative judgements than absolute ones. Teachers, the report states, would be more accurate when ranking students than when estimating their future attainment.
“That gives you rough grades for every student,” says Tennison. “From those grades, they work backwards to what marks they might have achieved in the exam.” The ranking system meant that even if you were predicted a B, if you were ranked fifteenth in your class and the pupils ranked fifteenth over the last three years had received a C, you would likely get a C.
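That rank-then-allocate step can also be sketched in Python. This is a deliberately crude version, with an invented five-pupil cohort; the real model worked backwards to finer-grained marks before converting to grades:

```python
# A crude sketch of allocating grades down a teacher's rank order.
# Cohort, rank order and distribution are all hypothetical.

def assign_grades(ranked_pupils, distribution):
    """Hand out grades best-first, in proportion to the
    (already standardised) grade distribution for this cohort."""
    n = len(ranked_pupils)
    slots = []
    for grade in ["A*", "A", "B", "C"]:
        slots += [grade] * round(distribution.get(grade, 0) * n)
    # Pad or trim so there is exactly one slot per pupil.
    slots = (slots + ["C"] * n)[:n]
    return dict(zip(ranked_pupils, slots))

cohort = ["pupil1", "pupil2", "pupil3", "pupil4", "pupil5"]  # best first
dist = {"A*": 0.2, "A": 0.2, "B": 0.4, "C": 0.2}
print(assign_grades(cohort, dist))
# {'pupil1': 'A*', 'pupil2': 'A', 'pupil3': 'B', 'pupil4': 'B', 'pupil5': 'C'}
```

Note that a pupil’s own predicted grade never appears: only the rank and the distribution matter, which is why the fifteenth-ranked pupil inherits whatever grade the fifteenth slot carries.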
Although the proportion of A grades rose to an all-time high of 27.9 per cent, the algorithm – and this is where the controversy lies – also downgraded almost 40 per cent of the A-level grades predicted by teachers in England.
There were multiple problems with this process. The first was that using a national average of grades penalised excellent schools. “If a school tends to do very well, in that it gets the best possible A-Level outcomes, not only for students who score exceptionally well at GCSE but also for students who score poorly at GCSE, then that kind of school will be penalised by this model, simply because the average school does not do as well,” says George Constantinides, a professor of digital computation at Imperial College.
There was also an issue with cohort sizes – these are not precisely the same as class sizes (a private school might have lots of tiny classes but still enter enough pupils to form a large cohort) but, due to the lack of data, cohorts averaging fewer than five pupils were simply given their teachers’ predictions. “We’ve seen that independent schools have done pretty well out of this because they tend to be smaller than state schools, and more frequently have lower pupil numbers,” says Philip Nye, external affairs manager at the FFT Education Datalab, an education policy think tank. “In some subjects, such as languages and music, we’re seeing results absolutely shoot up, because these are typically entered by only a few students at every school.”
Another problem with the system surrounded testing. In order to test the model’s accuracy, says Constantinides, Ofqual looked back over a similar time window and used the model to see whether it could adequately predict the 2019 results. This wasn’t that successful – for some subjects the model was only 40 per cent accurate at predicting what the 2019 cohort actually achieved. Worse, this accuracy figure relied on the ranking of where the 2019 students actually came in their cohort, based on their exam results; 2020’s ranking was just a prediction made by teachers.
“[This test] assumed perfect rank orders in order to see how well the model behaves,” says Constantinides. “So when Ofqual say its model is 40 per cent accurate, it’s actually likely to be a lot less than that, because that doesn’t take into account any inaccuracy in rank ordering.”
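The accuracy figure being discussed is, at root, an exact-match rate: the share of pupils whose modelled grade equals the grade they really achieved. A minimal sketch of the metric, with invented grades:

```python
# Illustrative back-test metric: the fraction of pupils whose
# modelled grade exactly matches the grade actually achieved.
# All grades below are made up.
def exact_match_accuracy(modelled, actual):
    return sum(m == a for m, a in zip(modelled, actual)) / len(actual)

modelled_2019 = ["A", "B", "B", "C", "A"]  # hypothetical model output
actual_2019 = ["A", "B", "C", "C", "B"]    # hypothetical real results
print(exact_match_accuracy(modelled_2019, actual_2019))  # 0.6
```

Since the back-test fed the model the true 2019 rank order, any noise in teachers’ 2020 rankings could only push real-world accuracy below a figure measured this way.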
Many of these problems could have been sorted before results day, had there been proper consultation and transparency. “I think there are still unanswered questions – schools are still scratching their heads about exactly how their results have been worked out, which isn’t fair,” says Constantinides.
The model could have been published before the results, and experts could have monitored and analysed it – Sky News reported that the Royal Statistical Society offered to help the regulator with the algorithm in April, writing to Ofqual to suggest that it take advice from external experts.
“It would have helped to include the voices of parents and children at universities and the teachers, and of education specialists and experts and statisticians and data scientists throughout that process,” says Tennison. “Part of the problem here is that these issues came out only after the grades were given to students, when we could have been having these discussions, examining the algorithm and understanding the implications of it much, much earlier.” There was no clear redress mechanism either, to save people anxiety, time and money – despite the fact that automated decision-making always produces some errors.
Now, students will receive their teachers’ predicted grades. Though this is certainly the least worst alternative, it comes with the same worries that Ofqual was trying to mitigate – high grades will be unfair on later cohorts, high grades from specific schools will be unfair on other schools, and sixth forms, universities and employers may end up taking this year’s results with a pinch of salt. “An algorithm in itself is neither good nor bad,” says Nye. “It’s the effects that it has and the way it’s been set up.”
The U-turn itself is a potential disaster, too. Beyond the stress it has inflicted on students, the government has essentially passed the administrative buck to universities, which will now have to consider honouring thousands more offers – and which have said that, despite the U-turn, it will not be possible to honour all original offers. (Ofqual did not answer a request for comment by the time of publication.)
“This whole story has really highlighted the problems that there are around automated decision making, in particular when it’s deployed by the public sector,” says Tennison. “This has hit the headlines, because it affects so many people across the country, and it affects people who have a voice. There’s other automated decision making that goes on all the time, around benefits, for example, that affect lots of people who don’t have this strong voice.”
Will Bedingfield is a staff writer for WIRED. He tweets from @WillBedingfield