Transcribe speech to text with the AI Whisper model
Whisper is an automatic speech recognition system created by OpenAI and trained on hundreds of thousands of hours of multilingual data collected from the web. There are many implementations of the Whisper model to choose from, depending on the price per minute of processed audio or on additional features. In this example we decided to use a model delivered by Microsoft Azure, so let's take a look at what exactly Azure offers us.
In the case of Azure, we can choose between two services that let us use Whisper: Azure OpenAI or Azure AI Speech. Which one to pick depends on your use case, and you can compare them on your own HERE.
From my own observations, Azure OpenAI provides faster and more accurate transcription, but it is limited to files of up to 25 MB, exposes fewer details about the parameters of the recording, and doesn't support real-time transcription.
Azure AI Speech, on the other hand, comes in handy when you are dealing with a bigger file or need real-time transcription of conversations, but sometimes the quality of the recognized content leaves a little to be desired.
Solution 1 - Azure OpenAI
Step 1 - Get credentials
First of all, we need to log in to the Azure control panel and create a resource to get the required credentials; more about how to create a resource HERE.
To be precise, we will need three variables, so our env file will look like this:
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_DEPLOYMENT_NAME=
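As a side note, here is a minimal sketch of how we might load and validate these variables at startup (assuming the popular dotenv package, which is not required by the Azure SDK itself):
// load .env into process.env and fail fast if a variable is missing
import "dotenv/config"

const requiredEnv = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT_NAME",
]

for (const name of requiredEnv) {
  if (!process.env[name]) {
    throw new Error(`Missing environment variable: ${name}`)
  }
}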
Step 2 - Install dependencies
To communicate with the Azure OpenAI API we will use the npm package @azure/openai:
npm install @azure/openai
Step 3 - Create api call function
Now we can create a function that takes an audio file and an output format and returns the transcription as a string, ready to display in the browser or save to a file.
import fs from "fs/promises"
import {
  AzureKeyCredential,
  OpenAIClient,
  type AudioResultFormat,
} from "@azure/openai"

// create a client authenticated with the endpoint and key from our env file
const client = new OpenAIClient(
  process.env.AZURE_OPENAI_ENDPOINT!,
  new AzureKeyCredential(process.env.AZURE_OPENAI_API_KEY!),
)

const doTranscriptionByOpenAI = async (
  file: Buffer,
  outputFormat: AudioResultFormat = "text",
) => {
  try {
    // send the audio buffer to the Whisper deployment and return the transcription
    return await client.getAudioTranscription(
      process.env.AZURE_OPENAI_DEPLOYMENT_NAME!,
      file,
      outputFormat,
    )
  } catch (error) {
    console.error(error)
  }
}
Step 4 - Call function and get transcription
Now let's import an example audio file and pass it to the doTranscriptionByOpenAI function as a Buffer.
const buffer = await fs.readFile("./sample.wav")
const transcription = await doTranscriptionByOpenAI(buffer)
console.log(transcription)
It is worth explaining the possible formats that can be returned when calling doTranscriptionByOpenAI. The getAudioTranscription method takes one of several formats as its third argument: "json" | "verbose_json" | "text" | "srt" | "vtt".
The json and verbose_json formats return the transcription as a structured object, with verbose_json additionally providing details such as the timestamps of individual text fragments. text, on the other hand, returns the transcription as a single block of text (a string). If we want subtitles ready to apply to our movie, we should use the srt or vtt format; then we get a string with timestamps included.
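For example, if we wanted subtitles for a video, a short sketch (reusing the doTranscriptionByOpenAI helper from above) could request the srt format and save the result as a subtitle file:
import fs from "fs/promises"

// for "text", "srt" and "vtt" the SDK returns a plain string
const buffer = await fs.readFile("./sample.wav")
const subtitles = await doTranscriptionByOpenAI(buffer, "srt")

if (typeof subtitles === "string") {
  await fs.writeFile("./sample.srt", subtitles)
}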
For most simple cases Azure OpenAI will perform very well, but if our file is bigger than 25 MB, Azure OpenAI can't handle it. In that case we should use Azure AI Speech, which is recommended for batch processing of large files.
Solution 2 - Azure AI Speech
Step 1 - Get credentials
As before, we need credentials, but this time they are the key and the region of your resource. Check how to get them HERE.
AZURE_AI_SPEECH_KEY=
AZURE_AI_SPEECH_REGION=
Step 2 - Install dependencies
To communicate with the Azure AI Speech API we will use the npm package microsoft-cognitiveservices-speech-sdk:
npm install microsoft-cognitiveservices-speech-sdk
Step 3 - Create api call function
Now let's create a helper function; it will take a file as a Buffer or File.
import {
ResultReason,
CancellationReason,
SpeechConfig,
AudioConfig,
SpeechRecognizer,
AutoDetectSourceLanguageConfig,
} from "microsoft-cognitiveservices-speech-sdk"
const doTranscriptionByAISpeech = (
  file: Buffer | File,
): Promise<string> => {
  return new Promise((resolve, reject) => {
    // authenticate with the key and region from our env file
    const speechConfig = SpeechConfig.fromSubscription(
      process.env.AZURE_AI_SPEECH_KEY!,
      process.env.AZURE_AI_SPEECH_REGION!,
    )

    // let the service detect which of the listed languages is spoken
    const autoDetectSourceLanguageConfig =
      AutoDetectSourceLanguageConfig.fromLanguages(["en-US", "pl-PL"])

    const audioConfig = AudioConfig.fromWavFileInput(file)
    const speechRecognizer = SpeechRecognizer.FromConfig(
      speechConfig,
      autoDetectSourceLanguageConfig,
      audioConfig,
    )

    // collect recognized text fragments as they arrive
    const fragments: string[] = []

    speechRecognizer.recognized = (s, e) => {
      if (e.result.reason === ResultReason.RecognizedSpeech) {
        const { text } = e.result
        fragments.push(text)
      } else if (e.result.reason === ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be recognized.")
      }
    }

    speechRecognizer.canceled = (s, e) => {
      if (e.reason === CancellationReason.Error) {
        console.log(
          `CANCELED: ErrorCode=${e.errorCode} ErrorDetails=${e.errorDetails}`,
        )
        reject(e.errorDetails)
      }
      speechRecognizer.stopContinuousRecognitionAsync()
    }

    // the session stops when the end of the audio file is reached
    speechRecognizer.sessionStopped = () => {
      speechRecognizer.stopContinuousRecognitionAsync()
      const transcription = fragments.join(" ")
      resolve(transcription)
    }

    // start recognition only after all handlers are registered
    speechRecognizer.startContinuousRecognitionAsync()
  })
}
Inside the function, we configure the option to automatically detect the language used in the recording: we specify the possible source languages by calling the fromLanguages method with an array of language codes. After that, we create an AudioConfig object representing the given file (in this example a .wav file) and start the recognition process by calling the startContinuousRecognitionAsync method on the recognizer.
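If we already know the language of the recording, a slightly simpler variant is possible (a sketch, not part of the solution above): skip auto-detection and set the recognition language directly on the config.
import fs from "fs/promises"
import {
  SpeechConfig,
  AudioConfig,
  SpeechRecognizer,
} from "microsoft-cognitiveservices-speech-sdk"

const speechConfig = SpeechConfig.fromSubscription(
  process.env.AZURE_AI_SPEECH_KEY!,
  process.env.AZURE_AI_SPEECH_REGION!,
)
// fix the language up front instead of detecting it from the audio
speechConfig.speechRecognitionLanguage = "en-US"

const buffer = await fs.readFile("./sample.wav")
const audioConfig = AudioConfig.fromWavFileInput(buffer)
const speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig)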
Step 4 - Call function and get transcription
Now we can import an example audio file and pass it to doTranscriptionByAISpeech.
const buffer = await fs.readFile("./sample.wav")
const transcription = await doTranscriptionByAISpeech(buffer)
console.log(transcription)
Optionally - Azure AI Speech and different output formats
Unfortunately, Azure AI Speech does not provide an option to format the returned transcription to the srt or vtt standard like Azure OpenAI does. However, if we need such functionality, it is possible to write that logic on our own, and two additional configuration options come in handy:
speechConfig.outputFormat = 1
speechConfig.requestWordLevelTimestamps()
Setting outputFormat to 1 (the detailed output format) results in a more detailed response being returned for each recognition. Calling requestWordLevelTimestamps includes word-level timestamps, which means the response carries timing details about the start point and duration of each word. On that basis, we should be able to achieve the same or similar results as with Azure OpenAI.