Should You Use Computer-Generated or Human-Generated Captions?
Captions can be made in many different ways.
But there are two main categories that they fall into: human-generated captions – which are created first-hand by humans – and computer-generated captions – which are made using Automatic Speech Recognition (or ASR) technology.
Some industries, including news media, favor human-generated captions. This is particularly true if the job requires fast, real-time turn-around and great precision. Other industries and organisations – especially those with tighter budgets and less need for high quality – might prefer to use ASR captions.
Whether you’re captioning live streams, meetings or recorded media, your needs will differ. So which one should you pick?
Here is our breakdown of human-generated versus ASR captions against five main criteria: accuracy, cost, turn-around time, audio quality, and speech variations.
1. Accuracy: How accurate do your captions need to be?
The main point of difference between human-generated and ASR captions is accuracy.
Human captioners excel at understanding human speech and generating the words they hear quickly and accurately. For example, as a professional captioning service, Ai-Media’s human-generated live captions have a rating of 99.6% accuracy.
And although many IT specialists are busy developing Automatic Speech Recognition technology as we speak, computers still struggle to achieve the same accuracy. The better ASR captioning services say that they can achieve accuracy of 96% to 99%. However, it is important to pay attention to how this accuracy is defined, and how long it takes to achieve it.
Most ASR services define accuracy using the Word Error Rate (WER): the proportion of words that differ from the source material. This is widely regarded as a flawed measure of success, because it counts every wrong word equally. For example, the difference between the words ‘can’ and ‘can’t’ would be considered an average error by this measure, while in fact these two words have opposite meanings and completely change the meaning of a sentence.
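To make the Word Error Rate concrete, here is a minimal Python sketch of how it is commonly calculated: the word-level edit distance between the reference and the captions, divided by the number of reference words. The function name and the example sentences are our own illustrations, not taken from any particular ASR service.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (hypothetical, not from a real transcript).
spoken = "we can deliver the report by friday"
meaning_reversed = "we can't deliver the report by friday"
harmless_slip = "we can deliver a report by friday"

print(f"{word_error_rate(spoken, meaning_reversed):.3f}")  # 0.143
print(f"{word_error_rate(spoken, harmless_slip):.3f}")     # 0.143
```

Both captions get the same score, even though the first one reverses the meaning of the sentence and the second barely changes it.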
Ai-Media uses the more sophisticated NER model for live captions, which gives errors various weightings.
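As a rough illustration of weighted scoring, the sketch below follows the NER formula as it is commonly published: accuracy = (N − E − R) / N × 100, where N is the number of words in the captions, E is the weighted sum of edition errors and R is the weighted sum of recognition errors. The severity weights used here (0.25, 0.5 and 1) are the commonly cited values and are an assumption on our part; this article does not state the exact weightings Ai-Media applies, and the sample figures are hypothetical.

```python
# Assumed severity weights, as commonly published for the NER model.
SEVERITY_WEIGHTS = {"minor": 0.25, "standard": 0.5, "serious": 1.0}


def ner_accuracy(word_count: int,
                 edition_errors: list[str],
                 recognition_errors: list[str]) -> float:
    """Return the NER accuracy percentage: (N - E - R) / N * 100."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    e = sum(SEVERITY_WEIGHTS[severity] for severity in edition_errors)
    r = sum(SEVERITY_WEIGHTS[severity] for severity in recognition_errors)
    return (word_count - e - r) / word_count * 100


# Hypothetical sample: 1,000 captioned words with four errors.
score = ner_accuracy(
    word_count=1000,
    edition_errors=["minor", "minor"],
    recognition_errors=["serious", "standard"],
)
print(round(score, 2))  # 99.8
```

Under this kind of scheme, an error that reverses the meaning of a sentence costs more than a slip the viewer can easily read past, which is the distinction a plain word count misses.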
And if we’re speaking about live captions, ASR takes much longer to display captions on average than human-generated captions. To get 95% accuracy with ASR, up to a 15-second delay will apply.
For human-generated live captions, you can get an accuracy of 99.6% in 3 to 5 seconds. For a 3 to 5-second display time with ASR captions, your accuracy will go down to 90% in the best-case scenario. And while 90% might seem like a decent accuracy score, it means that one word in ten, or about one word in every sentence, will be incorrect.
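The arithmetic behind that comparison can be sketched in a few lines of Python. The ten-word sentence length is our own assumption for illustration; real sentences vary, and errors are rarely spread this evenly.

```python
def expected_errors_per_sentence(accuracy_percent: float,
                                 words_per_sentence: int = 10) -> float:
    """Average number of incorrect words per sentence at a given accuracy."""
    error_rate = 1 - accuracy_percent / 100
    return error_rate * words_per_sentence


# Accuracy figures quoted in this article; sentence length is assumed.
print(round(expected_errors_per_sentence(90.0), 2))   # 1.0
print(round(expected_errors_per_sentence(99.6), 2))   # 0.04
```

At 99.6% accuracy, that works out to roughly one incorrect word in every 25 sentences of this length, compared with one in every sentence at 90%.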
If your viewers are deaf and hard-of-hearing
If you’re using captions to be accessible to people who are deaf or hard-of-hearing, high accuracy is crucial.
Unfortunately, due to their current constraints, ASR captions can be delayed for long periods of time, words don’t always match up to what is being said, and lines of text can be skipped. For deaf and hard-of-hearing audiences, this can be confusing.
Which should I favor if I need accurate captions?
Human-generated captions, at least until ASR technology is further developed.
2. Cost: How much can you spend?
Sometimes, you want to provide captions, but you don’t have much money to spend on them. This is the main reason many people turn to ASR captions.
ASR captions are far cheaper than human-generated captions. This is mostly because they do not require the engagement and payment of a human captioner. Everything can be done within a computer program.
As a result, ASR captions can usually be delivered for ten to 20 times less than human-generated captions. At the very least, they will deliver a five-fold saving on human-generated captions.
However, it should be noted that this lower cost also reflects the lower quality of the captions.
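To see what those multiples mean for a budget, here is a small Python sketch. The five-fold and 20-fold figures come from this article; the human captioning quote of 600 is a made-up number used only for illustration, not a real price.

```python
def asr_cost_range(human_cost: float,
                   min_saving: float = 5,
                   max_saving: float = 20) -> tuple[float, float]:
    """Return the (lowest, highest) likely ASR cost for a given human quote."""
    return human_cost / max_saving, human_cost / min_saving


# Hypothetical quote for human-generated captions, in any currency.
lowest, highest = asr_cost_range(600)
print(lowest, highest)  # 30.0 120.0
```

The saving is real, but it has to be weighed against the drop in accuracy described above.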
Which should I favor if I need low-cost captions?
ASR/computer-generated captions, but be prepared to make a sacrifice with accuracy.
3. Turn-around time: How quickly do you need it?
In captioning, there are two kinds of turn-around time: the time it takes to engage a captioning service for delivery, and the time it takes to display live captions on a screen. We look at both of them here.
Engaging an ASR captioning service will generally be quicker than engaging a human captioning service, as you will not need to wait for a human captioner to be available.
However, an ASR service will still need a full 24 hours to caption a piece of media to 99% accuracy, as the transcript will need to be checked by a human to achieve this level of quality.
Display time (for live captions)
If you are using live captions, there is the added concern of how long it takes to display your live captions on the screen.
The display time for human-generated captions is much quicker than with ASR captions. This is because human-generated captions are produced by a professional captioner in real time, and they require no checking. As we mentioned earlier, these captions can be displayed within 3 to 5 seconds of the person speaking the words.
In contrast, ASR services generate the captions and then use a human to check the captions and transcript for accuracy. This creates significant delays – of up to 15 seconds or more – in displaying live captions on a screen, meaning that the end user can struggle to keep up with the content.
Which should I favor if I need something quickly?
It depends. If you have a deadline and need to organise captions quickly, using ASR captions means you don’t have to wait for a captioner to be available. But if you are using live captions and your priority is a short delay between the words being spoken and the captions being displayed, human-generated captions will be faster.
4. Audio quality: How good is your file/stream?
As we know, it’s difficult to achieve perfect audio in a live stream or audio file. There is often background noise, the volume might be quiet or unpredictable, or the speaker might be turning away from a microphone as they speak.
In these situations, human ears have the benefit of experience. They can understand language, even if it’s muffled or peaky. Unless the audio is completely distorted, captioning will be possible.
Even the most-advanced ASR technology, however, struggles if the conditions aren’t ideal. This is why most demos of ASR use excellent audio.
If the audio used for ASR is not excellent, a human will need to spend much more time checking and fixing the captions after they have been generated by the computer, creating a significant time lag for live captions and doubling up on the resources needed to create them.
Which should I favor if my audio quality is low or unpredictable?
If your audio quality is low or unpredictable, human-generated captions are the safer choice. If your audio quality is good, either human-generated or ASR captions can work.
5. Accents, speech variations and technical language
There are so many variables in the human voice, including hundreds of different accents, speech patterns, pronunciations and slang terms. Humans have far less difficulty understanding these than computers do.
Computers generally need to be told what a certain accent or speech variation sounds like to know how to interpret it. Humans, on the other hand, learn this over their lifetimes, and captioners are skilled in understanding many different ways of speaking.
Technical language will also create problems for computers, because they struggle with any word that they have not been explicitly familiarized with.
For example, in the technical IT video below, the speaker is reading aloud the bottom line of code that says ‘sysctl’ and ‘sysctl.conf’, as we can see at the bottom of the screen. Instead of reflecting this, the captions read ‘see CTL – CC till calm’.
In contrast, a human captioner can infer what these captions should say from the code pictured in the video and the knowledge that this is a niche IT topic.
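Which should I favor if my audio/stream features technical language, accents or unusual speech patterns?
Human-generated captions, as human captioners can draw on context and experience in a way that computers currently struggle to match.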
Technologies are developing quickly
Automatic Speech Recognition is an area of rich research and experimentation, as people across the world become more interested in automating captioning and transcription services.
In the past several years, the accuracy of ASR has improved, to the point that YouTube now uses it as an automatic captioning tool for videos on its platform. But for the foreseeable future, human-generated captions will be able to deliver much higher-quality captioning than computers can.
However, if your priority is keeping within a budget or saving time on engaging a captioning service, ASR has a lot to offer.