How does state-of-the-art speech transcription (ASR) like Google's Speech-to-Text API (STT) work across the rich diversity of languages, dialects and accents found in television news from around the world? The Internet Archive's Television News Archive offers an ideal testbed through which to explore real-world ASR performance, with global holdings spanning more than 100 channels across 50 countries and territories on 5 continents in at least 35 languages and dialects over 20 years. What would it look like to process one sample broadcast from each of these channels through the STT API? To explore this further, today we are releasing 100 fully automated transcripts generated by the STT API across a selection of television news broadcasts from around the world spanning two decades.
In collaboration with the Television News Archive, we selected one representative broadcast from each of the 100 channels available in the Visual Explorer. The majority of the Archive's international channels do not have web-playable video clips, meaning that you will only have the thumbnail gallery in the Visual Explorer to examine alongside the STT-generated transcript. However, for some international channels the Archive has over the years made one or two broadcasts playable as part of special collections, such as the 9/11 Archive, in which case we examined that playable broadcast here. This means that for some channels, the specific broadcast examined may be extremely short or less representative of the channel's overall coverage, but it has the benefit of allowing the transcript to be compared with the actual audio of the broadcast. For the other channels, we favored older broadcasts in many cases to test STT's ability to handle poorer-quality audio. Each broadcast below includes a notation beside it as to whether it has a playable video clip or not.
The audio of each broadcast was extracted from the MP4 container via ffmpeg, keeping only the left channel of the stereo track, to generate a FLAC file:
time find *.mp4 | parallel --eta 'ffmpeg -nostdin -hide_banner -loglevel panic -i ./{} -filter_complex "[0:a]channelsplit=channel_layout=stereo:channels=FL[left]" -map "[left]" -f flac ./{.}.flac'
We then submitted each resulting FLAC file to the STT API using the following query:
curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H "x-goog-user-project: [YOURPROJECTID]" https://speech.googleapis.com/v1/speech:longrunningrecognize --data "{ 'config': { 'encoding': 'FLAC', 'languageCode': '[LANGCODE]', 'enableWordTimeOffsets': true, 'enableWordConfidence': true, 'enableAutomaticPunctuation': true, 'maxAlternatives': 30, 'model': '[MODEL]' }, 'audio': { 'uri':'gs://[BUCKET]/[SHOW].flac' }, 'output_config': { 'gcs_uri':'gs://[BUCKET]/[SHOW].asr.json' } }"
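When many broadcasts are submitted, the request body above can be generated rather than typed by hand. The following is a minimal sketch, not the pipeline actually used here: the function name `build_stt_request` and the bucket name in the usage line are hypothetical, while the field names and values are copied from the curl command above.

```python
import json


def build_stt_request(show, lang_code, model, bucket):
    """Return the longrunningrecognize request body for one broadcast as a dict.

    `show` is the broadcast identifier without extension, `bucket` is the
    GCS bucket holding the FLAC file (both supplied by the caller).
    """
    return {
        "config": {
            "encoding": "FLAC",
            "languageCode": lang_code,
            "enableWordTimeOffsets": True,      # per-word timestamps
            "enableWordConfidence": True,       # per-word confidence
            "enableAutomaticPunctuation": True, # split text into sentences
            "maxAlternatives": 30,              # up to 30 alternatives
            "model": model,
        },
        "audio": {"uri": f"gs://{bucket}/{show}.flac"},
        "output_config": {"gcs_uri": f"gs://{bucket}/{show}.asr.json"},
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder, not a real bucket.
    body = build_stt_request(
        "BBCNEWS_20120101_170000_BBC_NEWS", "en-GB", "latest_long", "my-bucket"
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and passed to curl with `--data @file.json` in place of the inline string.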
Where possible, we use the "latest_long" model, which selects the most recent available model tailored for long-form spoken-word content and is roughly equivalent to the "video" model nomenclature of the Video API. For some languages and dialects only the "default" model is available, which may result in reduced accuracy.
In cases where the API determines multiple possible transcriptions for a given utterance, we request up to 30 alternatives ordered by confidence. To avoid receiving a single massive blob of text, we enable automatic punctuation, which splits the text into sentences. We also ask the API to return the precise timestamp of each recognized word and its confidence in its recognition of that word.
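The word-level timestamps and confidences described above can be pulled out of a transcript file with a few lines of code. This is a rough sketch that assumes the JSON written to GCS has the shape `results[].alternatives[].words[]`, with `startTime` and `endTime` given as strings such as "1.200s"; verify that against your own output files. The helper names and the 0.5 threshold are illustrative choices, not part of the API.

```python
import json


def parse_offset(value):
    """Convert a duration string such as '12.300s' to float seconds."""
    return float(value.rstrip("s"))


def top_words(response):
    """Yield (start, end, word, confidence) from the first alternative of each result.

    The first alternative is used because alternatives are ordered by
    confidence, so it is the API's preferred transcription.
    """
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for w in alternatives[0].get("words", []):
            yield (
                parse_offset(w.get("startTime", "0s")),
                parse_offset(w.get("endTime", "0s")),
                w["word"],
                w.get("confidence"),  # may be absent, hence None
            )


def low_confidence(response, threshold=0.5):
    """Return the words whose confidence is known and below `threshold`."""
    return [w for w in top_words(response) if w[3] is not None and w[3] < threshold]


def load_transcript(path):
    """Load one downloaded .asr.json file from local disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `low_confidence(load_transcript("SHOW.asr.json"))` would list the words most worth checking by ear, where `SHOW.asr.json` stands for any transcript downloaded from the table below.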
You can see the final results in the table below, with the language code and model used for each broadcast, along with a link to the STT-generated transcript JSON.
Channel | LangCode | Model | Visual Explorer | STT Transcript |
ABC (KGO) | en-US | latest_long | View | KGO_20120102_013000_ABC_World_News_With_David_Muir |
Algeria's Canal Algérie | fr-FR | latest_long | View | CANALALGERIE_20120101_070000 |
Azerbaijan's AzTV | az-AZ | default | View | AZTV_20150330_120000_Azerbaijani_Russian_and_English_Programming_from_Azerbaijan |
BBC News London | en-GB | latest_long | View | BBCNEWS_20120101_170000_BBC_NEWS |
Belarus 24 | ru-RU | latest_long | View | BELARUSTV_20221005_161500 |
Bloomberg | en-US | latest_long | View | BLOOMBERG_20200212_183000_Bloomberg_Markets_Americas |
CBS (KPIX) | en-US | latest_long | View | KPIX_20181018_003000_CBS_Evening_News_with_Jeff_Glor |
China's CCTV News | en-US | latest_long | View | CCTVNEWS_20120916_131312 |
China's CCTV-3 | zh | default | View | CCTV3_20010830_123000_China_Central_TV |
China's CCTV-4 | zh | default | View | CCTV4_20090903_190000 |
China's CCTV-9 News | en-US | latest_long | View | CCTV9_20120101_103000 |
CNBC | en-US | latest_long | View | CNBC_20200212_100000_Worldwide_Exchange |
CNN | en-US | latest_long | View | CNNW_20120101_220000_CNN_Newsroom |
Cubavision International | es-US | latest_long | View | CUBA_20110315_233000 |
Deutsche Welle (DW) English | en-US | latest_long | View | DW_20181017_200000_DW_News_-_News |
Dubai TV | ar-AE | latest_long | View | DUBAI_20111229_080000 |
Egypt's Al Masriyah | ar-EG | latest_long | View | ESC1_20110806_190000 |
Ethiopia's ETV | am-ET | default | View | ETV_20181011_100000 |
FOX (KTVU) | en-US | latest_long | View | KTVU_20120102_010000_News_at_5pm |
Fox Business | en-US | latest_long | View | FBC_20200212_170000_Cavuto_Coast_to_Coast |
Fox News | en-US | latest_long | View | FOXNEWSW_20200213_010000_Tucker_Carlson_Tonight |
France 24 | en-US | latest_long | View | FRANCE24_20120101_170000 |
France's ARTE | de-DE | latest_long | View | ARTEDE_20130103_230000 |
France's TV5Monde | fr-FR | latest_long | View | TV5MONDE_20090617_113000_Le_Journal_de_la_RTBF |
Germany's ARD | de-DE | latest_long | View | ARD_20130103_213000 |
Germany's WDR | de-DE | latest_long | View | WDR_20120101_181000_Aktuelle_Stunde |
Greece's ANT1 | el-GR | default | View | ANT1_20010914_043000_Antenna_1_Greece |
India's NDTV | en-IN | latest_long | View | NDTV_20111230_183000_India |
India's Zee TV | hi-IN | latest_long | View | ZEETV_20120101_050000_Hindi_New |
Iran's Al-Alam | fa-IR | default | View | ALALAM_20121028_130000 |
Iran's IRIB TV2 | fa-IR | default | View | IRIB2_20120101_070000 |
Iran's IRINN | fa-IR | default | View | IRINN_20120101_053000 |
Iran's Press TV | fa-IR | default | View | PRESSTV_20111228_130000 |
Iran's Simaye Azadi | fa-IR | default | View | SAMAYEAZADI_20120101_140100 |
Iraq TV | ar-IQ | latest_long | View | IRAQ_20010917_043000_Iraq_Satellite_Channel |
Iraq's Al Forat Network | ar-IQ | latest_long | View | ALFORAT_20111229_183000 |
Iraq's Al Iraqiya | ar-IQ | latest_long | View | ALIRAQUIA_20120101_050000 |
Iraq's Al-Etejah TV | ar-IQ | latest_long | View | ALETEJAHTV_20130817_133000 |
Iraq's Al-Fayhaa TV | ar-IQ | latest_long | View | ALFAYHAA_20120101_050100 |
Italy's RAI 1 | it-IT | latest_long | View | RAI1_20130102_050000 |
Italy's RAI International | it-IT | latest_long | View | RAI_20010313_003000_Telegiornale_RAI |
Italy's RAI News | it-IT | latest_long | View | RAINEWS_20130101_230000 |
Jordan TV | ar-JO | latest_long | View | JORDANTV_20120101_030000 |
KRON (MyNetworkTV) | en-US | latest_long | View | KRON_20120102_040000_KRON_4_News_at_9 |
Kurdistan Region's Kurdsat | ar-EG | latest_long | View | KURDSAT_20120101_170100 |
Kuwait Television | ar-KW | latest_long | View | KUWAIT_20090809_210000 |
Lebanon's Al Jadeed (New TV) | ar-LB | latest_long | View | NEWTV_20111228_120000 |
Lebanon's Future Television | ar-LB | latest_long | View | FUTURE_20111229_183000 |
Libya's LJBC | ar-EG | latest_long | View | LIBYA_20100910_170000 |
Mexico’s TV Azteca | es-ES | latest_long | View | AZT_20010917_030000_Noticiario_Hechos |
Morocco's Al Maghribia | ar-MA | latest_long | View | ALMAGHRIBIA_20120101_090000 |
Morocco's 2M Monde | ar-MA | latest_long | View | M2MOROCCO_20120101_140100 |
MSNBC | en-US | latest_long | View | MSNBCW_20120101_190000_Meet_the_Press |
NBC (KNTV) | en-US | latest_long | View | KNTV_20120119_013000_NBC_Nightly_News |
Nigeria's NTA International | en-NG | default | View | NTA_20120101_201500 |
North Macedonia's MRT Sat | mk-MK | latest_long | View | MKTV_20121024_210000 |
Oman TV | ar-OM | latest_long | View | OMAN_20120101_183000 |
Palestine Satellite Channel | ar-PS | latest_long | View | PSC_20120101_163000 |
PBS (KQED) | en-US | latest_long | View | KQED_20111231_020000_PBS_NewsHour |
Portugal’s RTP Internacional (RTPi) | pt-PT | latest_long | View | RTPI_20120101_201600 |
Qatar TV | ar-QA | latest_long | View | QATARTV_20120101_160000 |
Qatar's Al Jazeera English | en-US | latest_long | View | ALJAZ_20120101_070100 |
Radio Television of Serbia | sr-RS | default | View | RTSSAT_20120419_060000 |
Republic of Congo’s Télé Congo | fr-FR | latest_long | View | TELECONGO_20120101_200100 |
Romania's TVR Info | ro-RO | latest_long | View | TVRI_20120101_183100 |
Russia 1 | ru-RU | latest_long | View | RUSSIA1_20221005_143000_60_minut |
Russia 24 | ru-RU | latest_long | View | RUSSIA24_20221005_170200_Vesti_s_Alekseem_Kazakovim |
Russia Today | en-US | latest_long | View | RT_20120101_180100 |
Russia's 1TV | ru-RU | latest_long | View | 1TV_20221005_062000_AntiFeik |
Russia's NTV | ru-RU | latest_long | View | NTV_20221005_160000_Segodnya |
Russia's TV Rain | ru-RU | latest_long | View | TVRAIN_20180420_020000 |
Saudi Arabia's Al Saudiya | ar-SA | latest_long | View | SAUDI_20120101_190000 |
SCOLA Jordan News | ar-JO | latest_long | View | SCOLA_20120102_193000_Jordan_News |
SCOLA Lebanon News | ar-LB | latest_long | View | SCOLA3_20120101_060000_Lebanon_News |
SCOLA Qatar News | ar-QA | latest_long | View | SCOLA2_20120102_213000_Qatar_News |
SCOLA Syria News | ar-EG | latest_long | View | SCOLA4_20120102_223000_Syria_News |
SCOLA UAE News | ar-AE | latest_long | View | SCOLA5_20120101_235500_United_Arab_Emirates |
Senegal's RTS Diaspora | fr-FR | latest_long | View | RTSDIASPORA_20110805_033000 |
South Korea's KBS World | ko-KR | latest_long | View | KBSWORLD_20100613_040000_KBS_News_9 |
South Korea's MBC | ko-KR | latest_long | View | MBC_20111230_145000_MBCNewsDesk |
Southern Sudan Television | ar-EG | latest_long | View | SOUTHERNSUDAN_20120101_190000 |
Sudan State TV | ar-EG | latest_long | View | SUDAN_20120101_150000 |
Sweden's SVT1 | sv-SE | default | View | SVT1_20111027_140500_Gomorron_Sverige |
Switzerland's TSR 1 | fr-CH | default | View | TSR1_20120101_103000_Le_Journal |
Syria TV | ar-EG | latest_long | View | SYRIANTV_20120101_190000 |
Telemundo (KSTS) | es-US | latest_long | View | KSTS_20200213_013000_Noticiero_Telemundo_48 |
Thailand's Thai TV Global Network | th-TH | latest_long | View | TGN_20120102_003100 |
Tunisia's El Watania 1 | ar-TN | latest_long | View | TV7TUNIS_20120101_190000 |
Turkey's TRT 1 | tr-TR | latest_long | View | TRT1_20120101_000100 |
Turkey's TRT Türk | tr-TR | latest_long | View | TRTTURK_20120101_173100 |
Ukraine's Espreso TV | uk-UA | latest_long | View | ESPRESO_20221005_143000 |
United Arab Emirates' Sharjah TV | ar-AE | latest_long | View | SHARJAHTV_20120101_200000 |
United Kingdom's BBC Arabic Television | ar-EG | latest_long | View | BBCARABIC_20111229_161000 |
United Kingdom's Sky News | en-GB | latest_long | View | SKY_20090618_160000_Live_At_Five_With_Jeremy_Thompson |
Univision (KDTV) | es-US | latest_long | View | KDTV_20120101_170000_Al_Punto |
US-based Galavisión | es-US | latest_long | View | GALA_20121005_070000_Hasta_Que_el_Dinero_Nos_Separe |
Venezuela's teleSUR | es-US | latest_long | View | TELESUR_20120101_133000 |
VietFace TV | vi-VN | latest_long | View | VIETFACETV_20120101_070100 |
Vietnam's VTV4 | vi-VN | latest_long | View | VTV4_20111230_170000_VTV4Newsreel |
Yemen TV | ar-YE | latest_long | View | YEMENTV_20120101_130000 |
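To work with the table programmatically, for example to count how many broadcasts fell back to the "default" model, each pipe-delimited row can be split into named fields. This is a minimal sketch under the assumption that every row has exactly the five columns shown in the header; `parse_row` and `count_models` are hypothetical helper names.

```python
from collections import Counter


def parse_row(line):
    """Split one 'Channel | LangCode | Model | View | Transcript |' row into a dict."""
    # Drop the trailing pipe before splitting so no empty field is produced.
    fields = [f.strip() for f in line.strip().rstrip("|").split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    channel, lang_code, model, _view, show = fields
    return {"channel": channel, "lang_code": lang_code, "model": model, "show": show}


def count_models(lines):
    """Tally how many rows used each model."""
    return Counter(parse_row(line)["model"] for line in lines)


if __name__ == "__main__":
    # Two rows copied from the table above.
    rows = [
        "BBC News London | en-GB | latest_long | View | BBCNEWS_20120101_170000_BBC_NEWS |",
        "Sweden's SVT1 | sv-SE | default | View | SVT1_20111027_140500_Gomorron_Sverige |",
    ]
    for row in rows:
        print(parse_row(row))
    print(count_models(rows))
```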