Visual Explorer: 236,000 Broadcasts Totaling 1.1 Billion Words Now Available For Belarusian, Iranian, Russian And Ukrainian TV News

Kalev Leetaru

3 years ago

In January we published a step-by-step tutorial on downloading the machine transcriptions and translations of the 2022-present Belarusian, Russian and Ukrainian television news channels, made available in collaboration with the Internet Archive's TV News Archive. As of today that archive, together with the new archive of the Iranian channel IRINN, has reached 66,000 broadcasts totaling 270 million words of original text (3.2GB), translated into 328 million words of English. These channels are transcribed and translated using Google's Speech-to-Text and Translation APIs. The complete 2010-present archive of English-language Russia Today broadcasts (machine transcription by the Internet Archive) adds 170,000 broadcasts totaling 740 million words. In all, 236,000 broadcasts totaling roughly 1.1 billion words of English text (328 million translated plus 740 million from Russia Today) are now available for narrative mining.

To download the complete 2022-present archive, run the following in bash with GNU date, GNU parallel, wget and jq installed (change the "end" date in the first line to the current date):

start=20220325; end=20230312; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
rm -rf INVENTORIES
mkdir INVENTORIES
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/ESPRESO.{}.inventory.json -P ./INVENTORIES/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1.{}.inventory.json -P ./INVENTORIES/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24.{}.inventory.json -P ./INVENTORIES/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/1TV.{}.inventory.json -P ./INVENTORIES/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/NTV.{}.inventory.json -P ./INVENTORIES/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/BELARUSTV.{}.inventory.json -P ./INVENTORIES/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/IRINN.{}.inventory.json -P ./INVENTORIES/'
find ./INVENTORIES/ -depth -name '*.json' | parallel --eta 'jq -r ".shows[].id" {}' > IDS
wc -l IDS
rm -rf ./INVENTORIES/

mkdir ORIGINAL
mkdir TRANSLATED
time cat IDS | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/{}.transcript.txt -P ./ORIGINAL/'
time cat IDS | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/{}.transcript.en.txt -P ./TRANSLATED/'
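The seven per-channel inventory commands above differ only in the channel name, so the list of inventory URLs can be generated from a channel list instead. The following is a minimal sketch rather than part of the original workflow: the names date_range, inventory_urls and BASE are illustrative, and GNU date is assumed, as in the commands above.

```shell
#!/usr/bin/env bash
# Base URL copied from the download commands above.
BASE="https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer"

# Print every date from $1 to $2 inclusive, one YYYYMMDD per line.
# Assumes GNU date; TZ=UTC avoids daylight-saving edge cases.
date_range() {
  local d=$1 end=$2
  while [[ ! $d > $end ]]; do
    echo "$d"
    d=$(TZ=UTC date -d "$d + 1 day" "+%Y%m%d")
  done
}

# Print one inventory URL per channel per day.
# Usage: inventory_urls START END CHANNEL [CHANNEL...]
inventory_urls() {
  local start=$1 end=$2 dates ch d
  shift 2
  dates=$(date_range "$start" "$end")
  for ch in "$@"; do
    for d in $dates; do   # intentional word splitting: one date per word
      echo "$BASE/$ch.$d.inventory.json"
    done
  done
}
```

The output can then be piped to the same downloader, for example: inventory_urls 20220325 20230312 ESPRESO RUSSIA1 RUSSIA24 1TV NTV BELARUSTV IRINN | parallel --eta 'wget -q {} -P ./INVENTORIES/'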

To count the total number of words (wc prints lines, words and bytes, so the word count is the middle number):

time find ./ORIGINAL/ -type f -name '*.txt' -exec cat {} + | wc
time find ./TRANSLATED/ -type f -name '*.txt' -exec cat {} + | wc
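The wc commands above report a single total per directory. To break the count down by channel, something like the following sketch could be used. It assumes each transcript filename begins with the channel name followed by an underscore (for example CHANNEL_YYYYMMDD_...), which should be checked against the IDS file first; words_per_channel is an illustrative name.

```shell
#!/usr/bin/env bash
# Print "CHANNEL<TAB>WORDS" for every channel found in a transcript directory.
# Assumption: filenames look like CHANNEL_..., so the channel is the text
# before the first underscore.
words_per_channel() {
  local dir=$1 f ch
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue          # directory has no .txt files
    ch=$(basename "$f")
    ch=${ch%%_*}                     # keep text before the first underscore
    printf '%s\t%s\n' "$ch" "$(wc -w < "$f")"
  done | awk -F'\t' '{ n[$1] += $2 } END { for (c in n) print c "\t" n[c] }' | sort
}
```

For example, words_per_channel ./TRANSLATED prints one line per channel with its translated word count.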

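To download the complete English-language 2010-present Russia Today archive (transcribed by the Internet Archive), run the following in a separate working directory, since the commands reuse the DATES, IDS and ORIGINAL names from above: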

start=20100715; end=20230312; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
rm -rf INVENTORIES
mkdir INVENTORIES
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RT.{}.inventory.json -P ./INVENTORIES/'
find ./INVENTORIES/ -depth -name '*.json' | parallel --eta 'jq -r ".shows[].id" {}' > IDS
wc -l IDS
rm -rf ./INVENTORIES/

mkdir ORIGINAL
time cat IDS | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/{}.transcript.txt -P ./ORIGINAL/'

time find ./ORIGINAL/ -type f -name '*.txt' -exec cat {} + | wc
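Because wget runs with -q, a failed or missing download leaves no trace. One way to check the result is to compare the IDS file against the files on disk, as in this minimal sketch. The function name missing_ids and its optional suffix argument are illustrative, not part of the original workflow.

```shell
#!/usr/bin/env bash
# Print every ID from the IDS file that has no non-empty transcript on disk.
# Usage: missing_ids IDS_FILE DIRECTORY [SUFFIX]
missing_ids() {
  local ids=$1 dir=$2 suffix=${3:-.transcript.txt} id
  while IFS= read -r id; do
    [ -n "$id" ] || continue                    # skip blank lines
    [ -s "$dir/$id$suffix" ] || echo "$id"      # -s: file exists and is not empty
  done < "$ids"
}
```

Running missing_ids IDS ./ORIGINAL > RETRY produces a list that can be fed back into the same parallel wget command in place of IDS. For the translated files, pass .transcript.en.txt as the third argument.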

The resulting local archive can be fed into any text analysis system for deep narrative analysis.
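As a first pass before reaching for a full text analysis system, a keyword can be traced through the archive with standard tools. The sketch below counts how many broadcasts per day mention a term. It assumes transcript filenames follow a CHANNEL_YYYYMMDD_... pattern, which should be verified against the IDS file; mentions_per_day is an illustrative name.

```shell
#!/usr/bin/env bash
# Print "YYYYMMDD<TAB>COUNT": the number of transcripts per day containing a term.
# Matching is case-insensitive and literal (no regular expressions).
# Assumption: filenames look like CHANNEL_YYYYMMDD_..., so the date is the
# second underscore-separated field.
mentions_per_day() {
  local dir=$1 term=$2 f
  grep -rliF -- "$term" "$dir" | while IFS= read -r f; do
    f=${f##*/}          # strip the directory
    f=${f#*_}           # strip the channel prefix
    echo "${f%%_*}"     # keep the date field
  done | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

For example, mentions_per_day ./TRANSLATED "sanctions" yields a daily timeline of how many broadcasts mentioned the word.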