Getting Started With GCP's Chirp / Universal Speech Model (USM) Multilingual Speech Transcription: Transcribing A 2.5 Hour Russian Broadcast In 4 Minutes

Kalev Leetaru

3 years ago

How can you leverage GCP's powerful Universal Speech Model (USM) known as "Chirp" for multilingual speech transcription? USM is a "family of state-of-the-art speech models with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning 300+ languages" offering a wealth of opportunities for high-accuracy massively multilingual transcription. What does it look like to apply Chirp to a 2.5-hour Russian language television news broadcast? The end result is a rich transcript with word-level offsets and automatic punctuation in just 4 minutes 7 seconds flat.

View The Broadcast Through The Visual Explorer. (The translated transcript you see there was provided via the older STT V1 API + Google Translate, but you can click on any thumbnail to view a short clip from that point in order to compare to the Chirp transcription).
View The Raw Chirp JSON Transcription. (The raw JSON output direct from STT V2 using the Chirp model).
View Chirp Transcription In Text Format. (Extracted from the JSON above).

At present, Chirp does not support native MP4 input, so we'll convert our MP4 video to MP3. We'll also convert to mono to reduce processing time and costs:

time ffmpeg -i ./RUSSIA1_20230823_143000_60_minut.mp4 -ac 1 ./RUSSIA1_20230823_143000_60_minut.mp3
gsutil cp ./RUSSIA1_20230823_143000_60_minut.mp3 gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3

Now we'll write the configuration JSON file for this ASR request that specifies the language code of the speech and that we want automatic punctuation and word time offsets. We also specify the path of the MP3 file in GCS and the GCS path to output the results to. For more details, see the STT V2 and batchRecognize documentation:

echo "{
  \"config\": {
    \"auto_decoding_config\": {},
    \"language_codes\": [\"ru-RU\"],
    \"model\": \"chirp\",
    \"features\": { \"enable_automatic_punctuation\": true, \"enable_word_time_offsets\": true }
  },
  \"files\": [ {\"uri\": \"gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3\"}],
  \"recognitionOutputConfig\": { \"gcsOutputConfig\": { \"uri\": \"gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/\" } }
}" > /tmp/data.txt
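
Escaping every quote inside an echo is easy to get wrong. As a minimal sketch, the same request body could instead be produced by a small function that uses a heredoc. The function name write_chirp_config and its three arguments are illustrative assumptions, not part of gcloud or the API; the JSON fields are the same ones used above.

```shell
# write_chirp_config is a hypothetical helper (not part of gcloud or the API).
# Arguments: input audio GCS URI, output GCS prefix, language code.
write_chirp_config() {
  local audio_uri="$1" output_uri="$2" language="$3"
  # Unquoted heredoc delimiter so the shell variables are expanded.
  cat <<EOF
{
  "config": {
    "auto_decoding_config": {},
    "language_codes": ["${language}"],
    "model": "chirp",
    "features": { "enable_automatic_punctuation": true, "enable_word_time_offsets": true }
  },
  "files": [ { "uri": "${audio_uri}" } ],
  "recognitionOutputConfig": { "gcsOutputConfig": { "uri": "${output_uri}" } }
}
EOF
}

# Write the same request file used in this walkthrough.
write_chirp_config \
  "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3" \
  "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/" \
  "ru-RU" > /tmp/data.txt
```

Keeping the paths and language code as arguments makes it simpler to reuse the same request for other broadcasts or languages.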

Now we submit this configuration file to the API:

curl -X POST -H "Content-Type: application/json; charset=utf-8" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -d @/tmp/data.txt \
    https://us-central1-speech.googleapis.com/v2/projects/[YOURPROJECTID]/locations/us-central1/recognizers/_:batchRecognize

This returns the following response. The most important part is the "name" field at the top which is the JobID of this specific job. We'll need that in a moment.

{
  "name": "projects/[YOURPROJECTID]/locations/us-central1/operations/v2-aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v2.OperationMetadata",
    "createTime": "2023-08-24T17:17:06.923368Z",
    "updateTime": "2023-08-24T17:17:06.923368Z",
    "batchRecognizeRequest": {
      "recognizer": "projects/[YOURPROJECTID]/locations/us-central1/recognizers/_",
      "files": [
        {
          "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3"
        }
      ],
      "config": {
        "features": {
          "enableWordTimeOffsets": true,
          "enableAutomaticPunctuation": true
        },
        "autoDecodingConfig": {},
        "model": "chirp",
        "languageCodes": [
          "ru-RU"
        ]
      },
      "recognitionOutputConfig": {
        "gcsOutputConfig": {
          "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/"
        }
      }
    }
  }
}
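
Rather than copying the "name" value by hand, it can be pulled out of a saved response. The following is a rough sketch that assumes the response was saved to a file; the helper name extract_operation_name and the file path are hypothetical. It uses sed so that it has no dependencies, though jq -r '.name' would be the more robust choice where jq is installed.

```shell
# extract_operation_name is a hypothetical helper: it reads an API response on
# stdin and prints the value of the first "name" field it finds.
extract_operation_name() {
  sed -n 's/.*"name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage sketch (the file path is an assumption):
#   curl ... > /tmp/submit_response.json
#   OPERATION_NAME=$(extract_operation_name < /tmp/submit_response.json)
```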

Using the JobID from above, we can periodically poll the API to get the processing status and how close it is to finishing:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://us-central1-speech.googleapis.com/v2/projects/[YOURPROJECTID]/locations/us-central1/operations/v2-aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd"

Run a few seconds after submission, it returns this:

{
  "name": "projects/[YOURPROJECTID]/locations/us-central1/operations/v2-aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v2.OperationMetadata",
    "createTime": "2023-08-24T17:17:06.923368Z",
    "updateTime": "2023-08-24T17:17:06.923368Z",
    "resource": "projects/[YOURPROJECTID]/locations/us-central1/recognizers/_",
    "method": "google.cloud.speech.v2.Speech.BatchRecognize",
    "batchRecognizeRequest": {
      "recognizer": "projects/[YOURPROJECTID]/locations/us-central1/recognizers/_",
      "files": [
        {
          "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3"
        }
      ],
      "config": {
        "features": {
          "enableWordTimeOffsets": true,
          "enableAutomaticPunctuation": true
        },
        "autoDecodingConfig": {},
        "model": "chirp",
        "languageCodes": [
          "ru-RU"
        ]
      },
      "recognitionOutputConfig": {
        "gcsOutputConfig": {
          "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/"
        }
      }
    },
    "progressPercent": 3
  }
}

We keep repeating the status request above until "done" is true and a new "response" block appears at the end giving the GCS filename of the output:

{
  "name": "projects/[YOURPROJECTID]/locations/us-central1/operations/v2-aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v2.OperationMetadata",
    "createTime": "2023-08-24T17:17:06.923368Z",
    "updateTime": "2023-08-24T17:17:06.923368Z",
    "batchRecognizeRequest": {
      "recognizer": "projects/[YOURPROJECTID]/locations/us-central1/recognizers/_",
      "files": [
        {
          "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3"
        }
      ],
      "config": {
        "features": {
          "enableWordTimeOffsets": true,
          "enableAutomaticPunctuation": true
        },
        "autoDecodingConfig": {},
        "model": "chirp",
        "languageCodes": [
          "ru-RU"
        ]
      },
      "recognitionOutputConfig": {
        "gcsOutputConfig": {
          "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/"
        }
      }
    },
    "progressPercent": 100
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v2.BatchRecognizeResponse",
    "results": {
      "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut.mp3": {
        "uri": "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.json"
      }
    },
    "totalBilledDuration": "9064s"
  }
}
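
The manual polling described above could be wrapped in a loop. This is only a sketch under stated assumptions: wait_for_operation, operation_done and operation_progress are hypothetical helper names, and the fetch command passed in is assumed to be something you supply that prints the status JSON (for example, a small script wrapping the curl GET request above). The helpers use grep and sed to avoid dependencies; jq would be more robust.

```shell
# Helpers that read an operation status JSON document on stdin.
operation_done() {
  grep -Eq '"done"[[:space:]]*:[[:space:]]*true'
}

operation_progress() {
  sed -n 's/.*"progressPercent"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# wait_for_operation is a hypothetical wrapper.
#   $1: name of a command or function that prints the current status JSON
#   $2: delay between polls in seconds (default 15)
#   $3: maximum number of polls before giving up (default 240)
wait_for_operation() {
  local fetch_cmd="$1" delay="${2:-15}" max_polls="${3:-240}" status i
  for ((i = 0; i < max_polls; i++)); do
    status="$("$fetch_cmd")"
    if printf '%s\n' "$status" | operation_done; then
      printf '%s\n' "$status"   # final status, including the "response" block
      return 0
    fi
    echo "progress: $(printf '%s\n' "$status" | operation_progress)%" >&2
    sleep "$delay"
  done
  echo "gave up after ${max_polls} polls" >&2
  return 1
}
```

Progress messages go to stderr so that stdout carries only the final status document, which can then be saved or inspected for the output URI.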

We can see above that the output ASR JSON is "gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.json". Let's examine:

gsutil ls -alh gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.json
1.45 MiB 2023-08-24T17:21:14Z gs://[YOURGCSPATH]/RUSSIA1_20230823_143000_60_minut/asr/RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.json#1692897674408551 metageneration=1

Recall that we submitted the job at 2023-08-24T17:17:06 and we can see above that the output file was written at 2023-08-24T17:21:14, so the job completed in about 4 minutes 7 seconds. The totalBilledDuration is 9064 seconds, reflecting that this broadcast was 151 minutes (2.5 hours). Thus, we completed ASR on a 2.5-hour broadcast in 4m7s, a processing speed of roughly 37 times realtime!
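
The timing arithmetic above can be checked with a few lines of shell. This sketch takes the createTime from the operation metadata (17:17:06.923368) as the start and the gsutil listing timestamp (17:21:14) as the end, and assumes both fall on the same UTC day. The helper names tod_seconds and speed_report are illustrative.

```shell
# Convert an HH:MM:SS[.ffffff] time of day into seconds since midnight.
tod_seconds() {
  echo "$1" | awk -F: '{ printf "%.6f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# speed_report START END BILLED_SECONDS
# Prints the elapsed wall-clock time and the ratio of audio length to elapsed time.
speed_report() {
  awk -v s="$(tod_seconds "$1")" -v e="$(tod_seconds "$2")" -v b="$3" 'BEGIN {
    elapsed = e - s
    printf "elapsed: %dm%ds\n", int(elapsed / 60), int(elapsed % 60)
    printf "speed: %.1fx realtime\n", b / elapsed
  }'
}

speed_report "17:17:06.923368" "17:21:14" 9064
```

With the values from this run it prints "elapsed: 4m7s" and "speed: 36.7x realtime", which rounds to the 37 times realtime figure.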

You can download the final results here:

RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.json

What if we just want a text blob of the transcript rather than the JSON? Just run it through jq:

jq -r '.results[].alternatives[0].transcript' RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.json > RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.txt

You can see the final results:

RUSSIA1_20230823_143000_60_minut_transcript_aaaaaaaaa-0000-bbbb-cccc-ddddddddddddd.txt

That's all there is to it!