Visual Explorer: A Pipeline For Near-Realtime Visual Analysis Of Russian Television News

Yesterday we demonstrated running facial detection and similarity matching software across a week and a half of Russian television news to catalog Tucker Carlson appearances. For those interested in building near-realtime analytic pipelines on top of the Visual Explorer, you can catalog the visual narratives of Russian television news within a few hours of most broadcasts airing!

Use a pipeline similar to the one below set up as a cronjob to run every 30 minutes and catalog all of the latest available shows from the past three days. Most shows are added to the Internet Archive's Television News Archive shortly after airing, but sometimes shows can take as much as 72 hours to become available due to transient processing delays, so this compiles all IDs from the past three days.

The example below uses GNU parallel's job resumption ability to ensure that it only processes each show once to provide a simple-to-use example pipeline, though this requires the joblog to be regularly expired through an external process. A production application would use a more advanced filtering mechanism, but this offers a simple solution that allows the rapid construction of prototype near-realtime visual analytic pipelines.

We're tremendously excited to see what pipelines like this enable!

rm DATES;
echo $(date -d "today" "+%Y%m%d") >> DATES
echo $(date -d "today - 1 day" "+%Y%m%d") >> DATES
echo $(date -d "today - 2 days" "+%Y%m%d") >> DATES

mkdir JSON
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1.{}.inventory.json -P ./JSON/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24.{}.inventory.json -P ./JSON/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/NTV.{}.inventory.json -P ./JSON/'
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/1TV.{}.inventory.json -P ./JSON/'
rm IDS; find ./JSON/ -depth -name '*.json' | parallel --eta 'cat {} | jq -r .shows[].id >> IDS'
rm -rf ./JSON/
wc -l IDS

time cat IDS | parallel --resume --joblog ./DONE.LOG --eta './mycommand {}'