Alongside the debut of the ChatGPT API, OpenAI announced the Whisper API, a hosted version of the open-source Whisper speech-to-text model the company released in September.
Whisper is an automatic speech recognition system that OpenAI says provides “robust” transcription in several languages, plus translation from those languages into English, for $0.006 per minute. It accepts files in M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM formats.
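As a minimal sketch of how the hosted API is used, the snippet below calls the Whisper endpoint through the `openai` Python package's 0.x-era `Audio.transcribe` interface and checks the file type and estimated cost beforehand; the file path and key handling are illustrative assumptions, not part of the announcement.

```python
import os

# File types and pricing as stated in OpenAI's announcement.
SUPPORTED_TYPES = {".m4a", ".mp3", ".mp4", ".mpeg", ".mpga", ".wav", ".webm"}
PRICE_PER_MINUTE = 0.006  # USD per minute of audio

def estimate_cost(duration_seconds: float) -> float:
    """Estimated Whisper API charge for a clip of the given length."""
    return duration_seconds / 60 * PRICE_PER_MINUTE

def transcribe(path: str) -> str:
    """Send an audio file to the hosted Whisper model and return its text."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported file type: {ext}")
    import openai  # pip install openai; expects OPENAI_API_KEY to be set
    with open(path, "rb") as audio_file:
        # "whisper-1" is the hosted model's identifier in the API
        result = openai.Audio.transcribe("whisper-1", audio_file)
    return result["text"]
```

A ten-minute recording, for example, would cost roughly `estimate_cost(600)`, i.e. six cents, at the announced rate.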
Whisper is hardly the first of its kind: speech recognition systems sit at the core of software and services from digital behemoths like Google, Amazon, and Meta.
According to OpenAI president and chairman Greg Brockman, Whisper’s training on 680,000 hours of multilingual and “multitask” web data has improved its recognition of distinctive accents, background noise, and technical jargon.
“We released a model, but that actually was not enough to get the entire developer community to build around it,” Brockman said in a video call with TechCrunch yesterday afternoon.
The Whisper API is a heavily optimized version of the same large model that is available as open source: much faster, and far more convenient to use.
To Brockman’s point, businesses face several obstacles in adopting speech transcription technology. According to a 2020 Statista poll, companies cite accuracy, accent- or dialect-related recognition challenges, and cost as the main reasons they haven’t adopted technology like speech-to-text.
Whisper isn’t a cure-all, however. Because the system was trained on large amounts of noisy data, it can insert words into transcriptions that were never actually spoken, presumably because it is simultaneously trying to predict the next word in the audio and transcribe the recording itself.
Furthermore, Whisper’s performance isn’t consistent across languages; it has a higher error rate with speakers of languages that aren’t well represented in the training data.
Sadly, that latter problem is nothing new in the field of voice recognition. Biases have long plagued even the best systems: a 2020 Stanford study found that systems from Amazon, Apple, Google, IBM, and Microsoft made far fewer errors, roughly 19%, with white users than with Black users.
Even so, OpenAI envisions Whisper’s transcription capabilities being used to improve existing apps, services, and tools. The AI-powered language-learning app Speak is already using the Whisper API to build a new in-app virtual speaking companion.
A significant entry into the speech-to-text market could prove highly profitable for the Microsoft-backed OpenAI. According to one report, the market could grow from $2.2 billion in 2021 to $5.4 billion by 2026.
“Our ideal is to become this all-knowing intellect,” Brockman said. “We want to be a force multiplier for that attention by having the flexibility to take in any sort of data you have and any kind of work you wish to do.”