The Best Accuracy Measurement for Captions Yet: The NER Model
The accuracy of live captions has been an area of rich debate (and intensive research) for some time.
Popular platforms like YouTube have received criticism from the deaf community about their inaccurate computer-generated captions, sparking movements like deaf activist Rikki Poynter’s #nomoreCRAPtions.
But there are distinctions to be made between the accuracy of computer-generated captions (not so good) and human-generated captions (usually very good) – which we have explored in this article.
And perhaps even more importantly when it comes to quality, we need to understand the systems used to measure the accuracy of captions.
Until recently, the most common model used to measure the accuracy of captions has been the Word Error Rate, or WER, model. However, this model leaves a lot to be desired in how well it measures the quality of captions, especially for people who rely on them, like deaf and hard-of-hearing audiences.
For example, the WER model would count the difference between ‘can’ and ‘can’t’ as a single ordinary error, even though the two words have opposite meanings and completely change the meaning of a sentence. At the same time, the WER model penalizes leaving out fillers such as ‘like’ or ‘you know’, yet their omission barely changes the meaning of a sentence.
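To make that concrete, here is a minimal Python sketch of a WER-style calculation. It assumes the common definition of WER (word-level substitutions, deletions and insertions divided by the number of spoken words). The function name and the sample sentences are ours, for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # spoken word missing from the captions
                curr[j - 1] + 1,     # extra word in the captions
                prev[j - 1] + cost,  # match, or one word swapped for another
            ))
        prev = curr
    return prev[-1] / len(ref)


spoken = "you know we can fix this today"
meaning_reversed = "you know we can't fix this today"
filler_dropped = "we can fix this today"

print(word_error_rate(spoken, meaning_reversed))  # one substitution in 7 words
print(word_error_rate(spoken, filler_dropped))    # two deletions in 7 words
```

Under this sketch, the harmless caption (filler dropped) scores twice as badly as the caption that reverses the speaker's meaning.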
This is why Ai-Media favors the ‘NER model’.
The NER model was created by Dr Pablo Romero-Fresco at the University of Roehampton in the UK, as a way of ‘objectively’ assessing the quality of live captions.
The NER system is the best way yet devised to measure the accuracy of live captioning. It is repeatable and usable across various industries and services, and it has been adopted in several regions around the world.
How is accuracy measured?
Don’t be concerned by the mathematics we’re about to do! We will interpret it just below.
The name ‘NER model’ comes from the equation the model uses to produce a quality score – or assessment of accuracy.
Score = (N-E-R)/N
N is the total number of words and punctuation marks in the captioned piece.
E is the sum of Edition errors. An Edition error is where a word or words have been spoken but do not appear in the captions (or sometimes where words have been added to the captions but have not been spoken).
R is the sum of Recognition errors. A Recognition error is where an incorrect word or words appear in the captions.
So, the NER measurement looks at how the number and severity of two types of errors measure up against the total number of words and punctuation marks in the captioned piece.
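As a rough illustration, the equation can be written as a few lines of Python. The function name and the sample figures are hypothetical. The individual error values are the severity weightings described in the next section.

```python
def ner_score(n: int, edition_errors: list[float], recognition_errors: list[float]) -> float:
    """Score = (N - E - R) / N, returned as a fraction between 0 and 1."""
    if n <= 0:
        raise ValueError("n must be a positive count of words and punctuation marks")
    e = sum(edition_errors)      # E: sum of weighted Edition errors
    r = sum(recognition_errors)  # R: sum of weighted Recognition errors
    return (n - e - r) / n


# Hypothetical sample of 1,000 words and punctuation marks
edition = [0.5, 0.25, 0.25]           # E = 1.0
recognition = [1.0, 0.5, 0.25, 0.25]  # R = 2.0

score = ner_score(1000, edition, recognition)
print(f"{score:.1%}")  # 99.7%
```

In this made-up sample, seven errors cost only three points in total, because most of them have a small impact on comprehension.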
How this impacts the viewer
The NER model is viewer-centric, in that it decides how bad an error is by the impact that the error has on a viewer’s understanding of the program. This is great news for people who rely on captions, like deaf and hard-of-hearing audiences.
The Edition and Recognition errors we mentioned above are each given weightings, with clear factors determining which weighting should apply.
The weightings are:
- 0.0 The discrepancy has no impact on comprehension.
- 0.25 The discrepancy has a small impact on comprehension.
- 0.5 The discrepancy impacts on comprehension.
- 1.0 The discrepancy misleads the viewer with false information.
A 0.0 weighting is assigned when spoken filler words such as ‘you know’, ‘at the end of the day’, ‘really’ or ‘I think’ are edited from the captions on purpose, or when a phrase is reworded with fewer words but conveys the same meaning.
A 0.25 weighting is assigned to errors such as clear misspellings (e.g. ‘there’, ‘their’, ‘they’re’), or to capitalisation or punctuation errors which lead to confusion.
A 0.5 weighting is assigned to errors such as missing information which would otherwise help the viewer’s understanding (“There were road closures in Dunoon today,” spoken but missing from the captions) or errors where a word is misrecognised (“…this famous pianist…” represented in the captions as “…this famous penis…”).
A 1.0 weighting is assigned to errors where the captions may appear to be correct to the viewer, but are actually providing misinformation, such as an incorrect number (1,000 vs. 1,000,000) or ‘now’ being misrecognised as ‘not’ (“Malcolm Turnbull is not the new leader…” instead of “Malcolm Turnbull is now the new leader…”).
As you might be able to tell, these weightings are assigned to phrases or ‘idea units’, rather than individual words.
If multiple errors occur within one idea unit, only the highest weighting is recorded, as one error usually destroys the meaning of a single idea unit as much as multiple errors in that idea unit.
This illustrates how the system focuses on the viewer rather than the captioner: a captioner who kicks back and completely fails to caption the spoken “Donald Trump is now the new leader…” will be pinged only 0.5, while the captioner whose valiant attempt yields the ‘lying’ caption “Donald Trump is not the new leader…” is hit twice as hard with a 1.0.
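The ‘highest weighting per idea unit’ rule can be sketched in Python as follows. This is an illustration under our own assumptions, not an official scoring tool. The names are hypothetical, and splitting a transcript into idea units is still a human judgement.

```python
ALLOWED_WEIGHTS = {0.0, 0.25, 0.5, 1.0}


def total_penalty(errors_by_idea_unit: dict[str, list[float]]) -> float:
    """Sum the single highest weighting recorded in each idea unit."""
    total = 0.0
    for unit, weights in errors_by_idea_unit.items():
        for w in weights:
            if w not in ALLOWED_WEIGHTS:
                raise ValueError(f"unexpected weighting {w} in {unit!r}")
        if weights:
            # Only the worst error in an idea unit counts towards the total.
            total += max(weights)
    return total


# The spoken idea unit is left out of the captions entirely.
skipped = {"Donald Trump is now the new leader": [0.5]}
# The caption reads "...is not the new leader...".
misleading = {"Donald Trump is now the new leader": [1.0]}
# Three separate slips inside one idea unit (hypothetical weightings).
several_slips = {"There were road closures in Dunoon today": [0.25, 0.5, 0.25]}

print(total_penalty(skipped))        # 0.5
print(total_penalty(misleading))     # 1.0
print(total_penalty(several_slips))  # 0.5
```

The last case shows why three slips in one idea unit cost no more than the worst of them.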
What’s considered a good NER score?
The team at the University of Roehampton who developed the NER model set a score of 98% as the quality threshold for what can be considered good live captioning.
The 98% benchmark is used by regulatory oversight authorities around the world, including The Office of Communications in the UK and the Canadian Radio-television and Telecommunications Commission.
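To get a feel for what 98% means in practice, the short Python sketch below turns the threshold into an ‘error budget’ for a sample of a given length. The function names and the 1,000-word sample are hypothetical.

```python
THRESHOLD = 0.98  # the 98% benchmark, written as a fraction


def error_budget(n: int, threshold: float = THRESHOLD) -> float:
    """Largest total of weighted errors (E + R) that still reaches the threshold."""
    return n * (1 - threshold)


def meets_threshold(n: int, weighted_errors: float, threshold: float = THRESHOLD) -> bool:
    """True if (N - E - R) / N is at or above the threshold."""
    return (n - weighted_errors) / n >= threshold


n = 1000  # words and punctuation marks in a hypothetical sample
print(round(error_budget(n), 2))   # weighted error points available
print(meets_threshold(n, 20))      # exactly on the threshold
print(meets_threshold(n, 20.25))   # one extra minor error tips it below
```

For a 1,000-word sample, the budget works out to about 20 weighted error points: roughly twenty misleading errors, or eighty minor ones.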
Pros of the NER system
- It addresses quality from the viewer’s perspective. (Romero-Fresco conducted exhaustive surveys which correlate the NER scores of programs with the subjective reports of viewers.)
- It is repeatable. (Different people/organisations performing NER assessments of the same program will yield similar results.) This means it can determine standardised benchmarks across an industry.
- It is efficient. Analysis of a ten-minute sample will usually indicate the NER score of the entire program.
- While being viewer-focused, the breakdown of Edition and Recognition errors provides valuable feedback to captioners as to where their problems lie.
Cons of the NER system
- Access to the captions as a text file is required. If this is not available, it can be laborious to count the words off a video screen.
- The score does not include parameters such as reading rates and synchronicity.