Transcribe speech to text with the OpenAI Whisper model

Our comparison of the Azure OpenAI and Azure AI Speech services.

Whisper is an automatic speech recognition system created by OpenAI and trained on 680,000 hours of multilingual data collected from the web. There are many implementations of the Whisper model to choose from, depending on the price per minute of processed audio or on additional features. In this example we decided to use a model delivered by Microsoft Azure, so let’s take a look at what exactly Azure offers us.




In the case of Azure we can choose between two services that expose Whisper: Azure OpenAI and Azure AI Speech. The choice depends on your use case, and you can compare them on your own HERE.

From my own observations, Azure OpenAI provides faster and more accurate transcription, but it is limited to files of at most 25 MB, exposes fewer details about the recording, and doesn't support real-time transcription.

Azure AI Speech, on the other hand, comes in handy when you are dealing with a bigger file or need real-time transcription of conversations, but sometimes the quality of the recognized content leaves a little to be desired.

Solution 1 - Azure OpenAI

Step 1 - Get credentials


First of all, we need to log in to the Azure portal and create a resource to obtain the required credentials; more about how to create a resource HERE.

To be precise, we will need three variables, so our env file should look like this:
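A minimal .env sketch — the exact variable names here are my assumption (pick whatever names your code reads); the endpoint, key, and deployment name all come from your Azure OpenAI resource:

```shell
# .env — values come from your Azure OpenAI resource in the Azure portal
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_API_KEY="<your-key>"
AZURE_OPENAI_DEPLOYMENT_NAME="<your-whisper-deployment>"
```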


Step 2 - Install dependencies


To communicate with the Azure OpenAI API we will use the npm package @azure/openai:

npm install @azure/openai

Step 3 - Create api call function


Now we can create a function that takes an audio file and an output format, and returns the transcription as a string ready to display in the browser or save to a file.

import fs from "node:fs/promises"
import {
  OpenAIClient,
  AzureKeyCredential,
  type AudioResultFormat,
} from "@azure/openai"

const client = new OpenAIClient(
  process.env.AZURE_OPENAI_ENDPOINT!,
  new AzureKeyCredential(process.env.AZURE_OPENAI_API_KEY!),
)

const doTranscriptionByOpenAI = async (
  file: Buffer,
  outputFormat: AudioResultFormat = "text",
) => {
  try {
    // The first argument is the name of your Whisper deployment
    return await client.getAudioTranscription(
      process.env.AZURE_OPENAI_DEPLOYMENT_NAME!,
      file,
      outputFormat,
    )
  } catch (error) {
    console.error(error)
  }
}

Step 4 - Call function and get transcription 


Now let’s import an example audio file and pass it to the doTranscriptionByOpenAI function as a Buffer.

const buffer = await fs.readFile("./sample.wav")
const transcription = await doTranscriptionByOpenAI(buffer)

It is worth explaining the possible formats of the result returned when calling doTranscriptionByOpenAI. The getAudioTranscription method takes as its third argument one of several formats: "json" | "verbose_json" | "text" | "srt" | "vtt"

The first two formats return the transcription as a JSON object containing the recognized text, and verbose_json additionally provides a segments array with information such as the timestamps of the individual text fragments.

On the other hand, text returns the transcription as a plain blob of text (string). But if we want subtitles ready to apply to our movie, we should use the srt or vtt format; then we get a string with timestamps included.
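To make the verbose_json shape more concrete, here is a small sketch of working with it. The interfaces below are my assumption of the relevant fields (full text plus segments with start/end in seconds); the helper and the mock result are hypothetical, for illustration only:

```typescript
// Assumed shape of a verbose_json transcription result (subset of fields)
interface VerboseSegment {
  start: number // segment start, in seconds
  end: number   // segment end, in seconds
  text: string
}

interface VerboseResult {
  text: string
  segments: VerboseSegment[]
}

// Hypothetical helper: render each segment with its time range
const listSegments = (result: VerboseResult): string[] =>
  result.segments.map(
    (s) => `[${s.start.toFixed(2)}s - ${s.end.toFixed(2)}s] ${s.text.trim()}`,
  )

// Example with a mock result instead of a real API response:
const mock: VerboseResult = {
  text: "Hello world",
  segments: [{ start: 0, end: 1.5, text: " Hello world" }],
}
console.log(listSegments(mock)[0]) // "[0.00s - 1.50s] Hello world"
```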




For most simple cases Azure OpenAI will perform very well, but if, for example, our file is bigger than 25 MB, Azure OpenAI can't handle it. Then we should use Azure AI Speech, which is recommended for batch processing of large files.

Solution 2 - Azure AI Speech

Step 1 - Get credentials


Like previously, we need credentials, but this time they are a key and the region of your resource. Check how to get them HERE.
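As before, a minimal .env sketch — the variable names are my assumption; the key and region come from your Speech resource in the Azure portal:

```shell
# .env — values come from your Azure AI Speech resource
AZURE_SPEECH_KEY="<your-key>"
AZURE_SPEECH_REGION="<your-region>"   # e.g. westeurope
```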


Step 2 - Install dependencies

To communicate with the Azure AI Speech API we will use the npm package microsoft-cognitiveservices-speech-sdk:

npm install microsoft-cognitiveservices-speech-sdk

Step 3 - Create api call function

Now let's create a helper function; it will take a file as Buffer or File type.

import {
  SpeechConfig,
  AudioConfig,
  AutoDetectSourceLanguageConfig,
  SpeechRecognizer,
  ResultReason,
  CancellationReason,
} from "microsoft-cognitiveservices-speech-sdk"

const doTranscriptionByAISpeech = (
  file: Buffer | File,
) => {
  return new Promise<string>((resolve, reject) => {
    const speechConfig = SpeechConfig.fromSubscription(
      process.env.AZURE_SPEECH_KEY!,
      process.env.AZURE_SPEECH_REGION!,
    )

    const autoDetectSourceLanguageConfig =
      AutoDetectSourceLanguageConfig.fromLanguages(["en-US", "pl-PL"])

    const audioConfig = AudioConfig.fromWavFileInput(file)

    const speechRecognizer = SpeechRecognizer.FromConfig(
      speechConfig,
      autoDetectSourceLanguageConfig,
      audioConfig,
    )

    const fragments: string[] = []

    // Fires for each recognized phrase; collect the text fragments
    speechRecognizer.recognized = (s, e) => {
      if (e.result.reason === ResultReason.RecognizedSpeech) {
        fragments.push(e.result.text)
      } else if (e.result.reason === ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be recognized.")
      }
    }

    speechRecognizer.canceled = (s, e) => {
      if (e.reason === CancellationReason.Error) {
        reject(
          new Error(
            `CANCELED: ErrorCode=${e.errorCode} ErrorDetails=${e.errorDetails}`,
          ),
        )
      }
      speechRecognizer.stopContinuousRecognitionAsync()
    }

    // Fires when the whole file has been processed
    speechRecognizer.sessionStopped = () => {
      const transcription = fragments.join(" ")
      speechRecognizer.stopContinuousRecognitionAsync()
      resolve(transcription)
    }

    speechRecognizer.startContinuousRecognitionAsync()
  })
}

Inside the function, we configure automatic detection of the language used in the recording: we specify the possible source languages by calling the fromLanguages method and providing an array of language codes.

After that, we create an AudioConfig object representing the given file (in this example a .wav file) and start the recognition process by calling the startContinuousRecognitionAsync method on the recognizer.


Step 4 - Call function and get transcription 

Now we can import example audio file and pass it to the doTranscriptionByAISpeech.

const buffer = await fs.readFile("./sample.wav")
const transcription = await doTranscriptionByAISpeech(buffer)

Optionally - Azure AI Speech and different output formats


Unfortunately, Azure AI Speech does not provide an option to format the returned transcription to the srt or vtt standard like Azure OpenAI does. However, if we need such functionality, it is possible to write the logic on our own, and two additional configuration options would be useful:

  speechConfig.outputFormat = 1 // OutputFormat.Detailed
  speechConfig.requestWordLevelTimestamps()
Setting outputFormat to 1 (OutputFormat.Detailed) results in a more detailed response being returned during recognition. Calling requestWordLevelTimestamps includes word-level timestamps, meaning the response carries details about the start point and duration of each word. On that basis, we should be able to achieve the same or similar results as with Azure OpenAI.
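The Speech SDK reports those offsets and durations in ticks (100-nanosecond units), so the "write it on our own" part boils down to converting ticks into SRT timestamps. A minimal sketch of that conversion — the helper names are mine, not part of the SDK:

```typescript
// Convert a tick count (100 ns units, as reported by the Speech SDK)
// into an SRT timestamp of the form HH:MM:SS,mmm
const ticksToSrtTime = (ticks: number): string => {
  const totalMs = Math.round(ticks / 10_000) // 10,000 ticks per millisecond
  const ms = totalMs % 1000
  const totalSec = Math.floor(totalMs / 1000)
  const s = totalSec % 60
  const m = Math.floor(totalSec / 60) % 60
  const h = Math.floor(totalSec / 3600)
  const pad = (n: number, w = 2) => String(n).padStart(w, "0")
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`
}

// Build one SRT cue from an offset/duration pair and its text
const toSrtCue = (
  index: number,
  offsetTicks: number,
  durationTicks: number,
  text: string,
): string =>
  `${index}\n${ticksToSrtTime(offsetTicks)} --> ${ticksToSrtTime(
    offsetTicks + durationTicks,
  )}\n${text}\n`

console.log(toSrtCue(1, 0, 15_000_000, "Hello world"))
// 1
// 00:00:00,000 --> 00:00:01,500
// Hello world
```

Feeding each recognized fragment (or each word, with requestWordLevelTimestamps enabled) through toSrtCue and joining the cues with blank lines yields a complete SRT file.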
Written by: Rafał D, on April 30, 2024