How to Transcribe Audio Files Accurately and Simply - AWS Transcribe vs Whisper for Transcription

Recently I was tasked with transcribing some podcast audio files for an NLP project, so I had to decide which service I was going to use to get the job done.

AWS Transcribe

Since I'm AWS certified and had come across Transcribe several times, I decided to give it a shot. It always seemed pretty nifty.

To keep it brief, it was alright. The console interface is simple, but like most AWS services it isn't the most intuitive design in the world. I definitely had to click around a bit to figure out how to do what I wanted to do.

There was some confusing terminology when setting up a transcription job as well, so I had to spend some time in the docs/googling to figure out what it meant. You also have to upload your audio file into an S3 bucket first - you can't upload directly to Transcribe.
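If you'd rather skip the console, the same two-step flow (upload to S3, then start a job) can be sketched with boto3. Treat this as a minimal sketch under assumptions, not the exact setup used for this project: the bucket, key, and job names in the usage comment are made-up placeholders, and it assumes AWS credentials and a default region are already configured.

```python
def build_job_request(job_name, bucket, key, media_format="mp3", language="en-US"):
    """Build the keyword arguments for Transcribe's StartTranscriptionJob call."""
    return {
        "TranscriptionJobName": job_name,  # must be unique within your account
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},  # audio must live in S3
        "MediaFormat": media_format,
        "LanguageCode": language,
    }


def upload_and_transcribe(local_path, bucket, key, job_name):
    """Upload a local audio file to S3, then start a transcription job for it."""
    import boto3  # imported lazily so build_job_request works without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    boto3.client("transcribe").start_transcription_job(
        **build_job_request(job_name, bucket, key)
    )
    return job_name


# Hypothetical usage (placeholder names, needs real AWS credentials):
# upload_and_transcribe("episode1.mp3", "my-podcast-audio", "episode1.mp3", "episode1-job")
```

The request is built in its own function so you can inspect or tweak the settings before anything is sent to AWS.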

But before long I had uploaded the audio for the first podcast and my transcription job was in progress. It took about a minute for a ~30-minute audio file, not bad. The output is a JSON file that contains the full transcript as well as a breakdown of every word transcribed, with start and end timestamps for each. That's pretty cool, and might be very useful for some jobs, but for simple transcriptions it's not really needed.
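To give a feel for that JSON output, here is a small sketch that pulls out the full transcript and the per-word timestamps. The sample document is hand-written for illustration (it is not real output from these podcasts), trimmed down to the fields the code reads.

```python
import json


def summarize_transcript(raw_json):
    """Return (full_text, [(word, start_seconds, end_seconds), ...])."""
    results = json.loads(raw_json)["results"]
    full_text = results["transcripts"][0]["transcript"]
    words = []
    for item in results["items"]:
        if item["type"] != "pronunciation":
            continue  # punctuation items carry no timestamps
        best = item["alternatives"][0]  # first alternative for this word
        words.append((best["content"], float(item["start_time"]), float(item["end_time"])))
    return full_text, words


# Hand-written sample, made up for illustration only.
sample = json.dumps({
    "results": {
        "transcripts": [{"transcript": "Hello world."}],
        "items": [
            {"type": "pronunciation", "start_time": "0.0", "end_time": "0.5",
             "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
            {"type": "pronunciation", "start_time": "0.5", "end_time": "1.0",
             "alternatives": [{"confidence": "0.98", "content": "world"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": "."}]},
        ],
    }
})

text, words = summarize_transcript(sample)
print(text)
for word, start, end in words:
    print(f"{start:6.2f}-{end:6.2f}  {word}")
```

If all you want is the plain transcript, the first return value is enough and the word list can be ignored.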

As for the most important metric - accuracy of the transcription - I wasn't that impressed. I would've thought a major cloud power like AWS would've had a little more NLP power behind it, even for its general model. The transcriptions it produced required some serious, time-consuming editing. And at over $0.02 per minute, the bill could add up fast for a bigger library of audio. I wasn't thrilled about paying decent money for transcriptions that I deemed subpar, so I went looking for an alternative. That led me to...

Whisper (OpenAI)

Whisper is an open-source speech recognition model released by OpenAI in 2022. It's really easy to use - you can find a ready-to-use app here on Hugging Face, or you can make your own and host it.

I tested out Whisper on one podcast and could immediately see the difference in quality. The biggest difference was in recognizing where to put punctuation (commas, etc.) and where to end one sentence and begin the next. The AWS-produced transcript required serious editing on that front; the Whisper one, not so much.

Chinhon's app (the one that I linked) only runs on the free-tier Hugging Face CPU, which takes nearly a minute per minute of audio. Instead of sitting around all day waiting for it to run, I duplicated the app and hooked it up to a much more powerful T4 GPU and saw serious speed upgrades, comparable to what you get with AWS. There's also the option to upgrade to the T4 medium GPU (or others, if you want), which costs a tiny $0.30 more per hour and has double the power of the small T4. If I had used the medium, both the accuracy and the speed of the transcriptions would've been much better than AWS, all for a fraction of the cost: running the app on HF for ~15 minutes to transcribe all the audio files would've cost $0.25 or so, while transcribing the same files with Transcribe would've cost over $20.
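The cost comparison is simple arithmetic, and it can be sketched as below. The rates are assumptions you should check against current pricing: $0.024 per audio minute for Transcribe (the post only says "over $.02") and $0.90 per hour for the GPU (the post only says "less than $1/hour"). The 900 minutes of audio is a hypothetical library size picked for illustration, not the actual size of this podcast set.

```python
def transcribe_cost(audio_minutes, price_per_minute=0.024):
    """Pay-per-audio-minute pricing (assumed rate; check current AWS pricing)."""
    return audio_minutes * price_per_minute


def gpu_cost(runtime_minutes, price_per_hour=0.90):
    """Pay-per-runtime pricing for a rented GPU (assumed hourly rate)."""
    return runtime_minutes / 60 * price_per_hour


audio_minutes = 900       # hypothetical amount of audio to transcribe
gpu_runtime_minutes = 15  # runtime estimate quoted in the post

print(f"Transcribe: ${transcribe_cost(audio_minutes):.2f}")
print(f"GPU rental: ${gpu_cost(gpu_runtime_minutes):.2f}")
```

The key difference is what you are billed for: Transcribe charges by the length of the audio, while a rented GPU charges by how long it runs, so a fast GPU makes the same audio much cheaper.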

Transcribe vs Whisper

For transcription jobs on the smaller/simpler side (and virtually anything you would need for just personal use), Whisper is the clear winner. I recommend duplicating this space (click the three dots in the upper right, then 'Duplicate this Space'), hooking it up to a T4 medium GPU, and watching it blast out transcriptions for less than $1/hour. You could also use the Whisper model and build an app from scratch if you wanted a little more customization.
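If you go the from-scratch route, the core of it is only a few lines with the open-source openai-whisper Python package. This is a minimal sketch and not the code behind the Space mentioned above: it assumes the package and ffmpeg are installed, the "base" model size is an arbitrary choice, and the file name in the usage comment is a placeholder.

```python
def format_segments(segments):
    """Turn Whisper-style segments into '[start -> end] text' lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_file(path, model_name="base"):
    """Transcribe one audio file; returns (full_text, timestamped_lines)."""
    import whisper  # openai-whisper package, imported lazily; also needs ffmpeg

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)
    return result["text"], format_segments(result["segments"])


# Hypothetical usage (placeholder file name):
# text, timed = transcribe_file("episode1.mp3")
# print(text)
```

Larger model sizes are generally slower but more accurate, so the model name is the first thing to experiment with.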

AWS Transcribe is more suited to industrial-level tasks. It has pricing tiers that decrease as you transcribe into the hundreds of thousands of minutes, it offers things like PII redaction and medical field-specific transcriptions, and, as noted above, it breaks down the transcription word by word, with timestamps, for further analysis and modification. You can also give it a custom vocabulary or a custom language model to make your transcriptions a bit more accurate.

So that's my take on AWS Transcribe and Whisper after my brief foray during this small NLP project. I look forward to exploring the deeper aspects of these services as well as other services (and of course blogging about them) in the future. If I'm missing anything or you have any questions, be sure to let me know!