OpenAI, the company behind the DALL-E image and meme generation program and the powerful GPT-3 autocomplete engine, has released a new open-source neural network for transcribing audio into written text (via TechCrunch). It’s called Whisper, and the company says it “approaches human-level robustness and accuracy on English speech recognition” and can also automatically recognize, transcribe, and translate other languages like Spanish, Italian, and Japanese.
As someone who constantly records and transcribes interviews, I was immediately excited by this news. I thought I would be able to write my own application to securely transcribe audio right on my computer. While cloud-based services like Otter.ai and Trint work for most things and are relatively secure, there are some interviews where I, or my sources, would feel more comfortable if the audio file stayed off the internet.
Using it turned out to be even easier than I had imagined. I already had Python and various developer tools set up on my computer, so installing Whisper was as easy as running a single Terminal command. Within 15 minutes, I was able to use Whisper to transcribe a test audio clip I had recorded. For someone relatively tech-savvy who hadn’t already set up Python, FFmpeg, Xcode, and Homebrew, it would probably take closer to an hour or two. There’s already someone working on making the process a lot simpler and more user-friendly, which we’ll talk about in a second.
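For a sense of what a first transcription looks like in practice, here is a minimal sketch rather than the exact script I ran. It assumes Whisper was installed with `pip install -U openai-whisper` and that FFmpeg is available on the system; the `base` model size and the helper names `transcribe_file`, `format_segments`, and `format_timestamp` are my own illustrative choices.

```python
# Minimal sketch: transcribe a local audio file with Whisper, entirely offline.
# Assumes `pip install -U openai-whisper` has been run and FFmpeg is installed.
# The helper names and the default "base" model are illustrative assumptions.

def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result: dict) -> str:
    """Turn Whisper's result dict into one timestamped line per segment."""
    lines = []
    for segment in result.get("segments", []):
        stamp = format_timestamp(segment["start"])
        lines.append(f"[{stamp}] {segment['text'].strip()}")
    return "\n".join(lines)


def transcribe_file(path: str, model_name: str = "base") -> dict:
    """Load a Whisper model and transcribe the audio file at `path`."""
    import whisper  # imported lazily so the formatting helpers work without it

    model = whisper.load_model(model_name)
    # The returned dict includes "text", "segments", and "language".
    return model.transcribe(path)
```

With a recording saved as, say, `interview.mp3` (a placeholder name), calling `print(format_segments(transcribe_file("interview.mp3")))` would print the transcript with a rough timestamp at the start of each segment. Larger models should be more accurate but slower.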
I compared a Whisper-generated transcript to what Otter.ai and Trint produced for the same file, and I’d say it was roughly comparable. There were enough errors in all of them that I would never copy and paste quotes into an article without double-checking the audio (which is, of course, best practice anyway, regardless of the service you’re using). But Whisper’s version would absolutely do the job for me; I can search it for the sections I need and then recheck them manually. In theory, Stage Whisper should perform exactly the same since it will use the same model, just with a GUI wrapped around it.
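That search-then-recheck workflow is easy to sketch in code. The example below assumes transcript segments shaped like the ones Whisper returns (dictionaries with `start`, `end`, and `text` keys); `find_phrase` is a hypothetical helper of my own, not part of Whisper.

```python
# Sketch: find where a phrase occurs in a transcript so the audio can be rechecked.
# Assumes segments shaped like Whisper's output: dicts with "start", "end", "text".
# `find_phrase` is an illustrative helper, not a Whisper function.

def find_phrase(segments: list[dict], phrase: str) -> list[tuple[float, float, str]]:
    """Return (start, end, text) for each segment containing `phrase`, ignoring case."""
    needle = phrase.casefold()
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in segments
        if needle in seg["text"].casefold()
    ]
```

Each hit comes back with start and end times in seconds, which is enough to jump to that spot in the recording and confirm the quote by ear.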
Sterne admitted that technology from Apple and Google could make Stage Whisper obsolete within a few years. The Pixel’s voice recorder app has been able to do offline transcriptions for years, and a version of that feature is starting to roll out to other Android devices. Apple, meanwhile, has built offline dictation into iOS (although there’s currently no good way to transcribe audio files with it). “But we can’t wait that long,” Sterne said. “Journalists like us need good automatic transcription apps today.” He hopes to have a bare-bones version of the Whisper-based app ready in two weeks.
To be clear, Whisper probably won’t make cloud-based services like Otter.ai and Trint totally obsolete, no matter how easy it is to use. For one thing, OpenAI’s model lacks one of the biggest features of traditional transcription services: the ability to label who said what. Sterne said Stage Whisper probably wouldn’t support this feature: “we’re not developing our own machine learning model.”
The cloud is just someone else’s computer – which probably means it’s a bit faster
And while you get the benefits of local processing, you also get the downsides. The main one is that your laptop is almost certainly much less powerful than the computers a professional transcription service uses. For example, I fed the audio from a 24-minute interview into Whisper, running on my M1 MacBook Pro; it took about 52 minutes to transcribe the entire file. (Yes, I made sure it was using the Apple Silicon version of Python instead of the Intel one.) Otter spat out a transcript in less than eight minutes.
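To put those timings in perspective, here is the arithmetic worked out as a real-time factor, meaning processing time divided by audio length. That framing is my own; the only inputs are the figures above, and since Otter took "less than eight minutes," its number is an upper bound.

```python
# Worked arithmetic for the timings above, as a real-time factor:
# processing time divided by audio length (below 1.0 means faster than real time).

def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """How many minutes of processing each minute of audio required."""
    return processing_minutes / audio_minutes


AUDIO_MINUTES = 24    # length of the interview
WHISPER_MINUTES = 52  # approximate time Whisper took on the M1 MacBook Pro
OTTER_MINUTES = 8     # Otter took less than this, so it is an upper bound

whisper_rtf = real_time_factor(WHISPER_MINUTES, AUDIO_MINUTES)
otter_rtf = real_time_factor(OTTER_MINUTES, AUDIO_MINUTES)
speedup = WHISPER_MINUTES / OTTER_MINUTES

print(f"Whisper: about {whisper_rtf:.2f}x real time")
print(f"Otter: under {otter_rtf:.2f}x real time")
print(f"Otter was at least {speedup:.1f}x faster")
```

In other words, Whisper needed a little over two minutes of processing per minute of audio on my machine, while Otter finished in about a third of the recording’s length or less.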
OpenAI’s technology has one big advantage, however: price. Cloud-based subscription services will almost certainly cost you money if you use them professionally (Otter has a free tier, but upcoming changes will make it less useful for people who transcribe things frequently), and the transcription features built into platforms like Microsoft Word or the Pixel require you to pay for separate software or hardware. Stage Whisper, like Whisper itself, is free and can run on the computer you already own.
Again, OpenAI has higher hopes for Whisper than being the basis of a secure transcription app, and I’m very excited about what researchers will end up doing with it or what they will learn by examining the machine learning model, which was trained on “680,000 hours of multilingual and multitask supervised data collected from the web.” But the fact that it also has a real practical use today makes it all the more exciting.