Bigtable + GCS + GCE + GCP AI APIs = Digital Twin For At-Scale AI Analytics

Behind the simple user interfaces of the Visual Explorer, Visual Explorer Lenses, Television Explorer and Television AI Explorer and our collaborations with the Internet Archive's TV News Archive lies a wealth of advanced pipelines applying state-of-the-art AI tools to make global television news searchable, from multilingual machine transcription (ASR) across more than 150 languages to onscreen text transcription (OCR) to basic object and activity detection (CV) to machine translation (NMT / LLMT) to multimodal and textual embeddings to NLP and Knowledge Graph (KG) analysis of the textual transcripts to experimental visual search algorithms, spanning a wide range of GCP AI APIs that draw from GCS with GCE preprocessing. Millions of broadcasts spanning tens of millions of minutes of video in 150 languages from over 100 channels in more than 50 countries over portions of the last 20 years must all be run through an intricate fabric of preprocessing, processing and postprocessing algorithms and APIs, many of which are orchestrated into complex pipelines in which each analytic stage depends on earlier stages, such as ASR->LangDet->NMT->NLP->KG->ESIndex.
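To make the dependency chaining concrete, here is a minimal sketch of how such a stage-dependent pipeline can be expressed. This is illustrative only, not our production orchestration code: the stage names mirror the ASR->LangDet->NMT->NLP->KG->ESIndex chain above, and run_stage is a hypothetical placeholder for a real worker.

```python
PIPELINE = ["ASR", "LangDet", "NMT", "NLP", "KG", "ESIndex"]

def run_stage(stage: str, upstream_output: str) -> str | None:
    """Hypothetical placeholder for a real worker (ASR, NMT, ...).
    Returns the stage's output, or None on failure."""
    return f"{stage}({upstream_output})"

def run_pipeline(broadcast_id: str, stages=PIPELINE) -> dict:
    """Run stages in order; each stage only runs if its predecessor succeeded."""
    results = {}
    upstream = broadcast_id  # the initial input is the raw broadcast
    for stage in stages:
        output = run_stage(stage, upstream)
        if output is None:  # stage failed: halt here so it can be retried later
            results[stage] = {"status": "FAILED"}
            break
        results[stage] = {"status": "OK"}
        upstream = output  # the downstream stage consumes this output
    return results
```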

Dozens upon dozens of processes must be run over each broadcast, with any given operation having a non-zero chance of failure. Adding complexity, different kinds of failures across different kinds of content, languages or modalities must each be handled differently, with different retry strategies and tolerances. Historically, we addressed this through a landscape of individual queue and caching systems, each of which acted as the single version of truth (SVT) for its pipeline, with systems cross-reading each others' queues and caches. Unfortunately, this approach didn't scale well, especially when processing historical archives.
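As a hedged illustration of what per-task retry tolerances can look like (the task names, attempt counts and backoff values below are hypothetical, not our production settings), a simple policy table might take this form:

```python
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int       # give up on the task after this many tries
    backoff_seconds: float  # initial delay before retrying, doubled each attempt

# Hypothetical per-task tolerances: a transient ASR failure is retried
# aggressively, while a deterministic OCR failure is retried only once more.
RETRY_POLICIES = {
    "ASR": RetryPolicy(max_attempts=5, backoff_seconds=60),
    "OCR": RetryPolicy(max_attempts=2, backoff_seconds=300),
    "NMT": RetryPolicy(max_attempts=4, backoff_seconds=120),
}

def should_retry(task: str, attempts_so_far: int) -> bool:
    """Decide whether a failed task is eligible for another attempt."""
    policy = RETRY_POLICIES.get(task, RetryPolicy(3, 60))
    return attempts_so_far < policy.max_attempts
```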

Instead, we are excited to announce that our deployment of GCP's Bigtable as a Digital Twin over this entire infrastructure is working beyond all expectations. In short, each broadcast is a row key and each task is a column that stores a JSON blob recording all key data on that task's completion and status. We can trivially add new tasks over time. Bigtable's scalability means we can have thousands of processes deployed across our entire global VM fleet all reading and writing to the Bigtable table just as they do to GCS, with Bigtable acting as our SVT tracking the status of every task. Queuing systems use this Bigtable digital twin to track in realtime processes that need to be restarted and to queue downstream processes after successful runs.
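A minimal sketch of this row-per-broadcast, column-per-task layout using the google-cloud-bigtable Python client is shown below. The project, instance, table and column-family names are hypothetical placeholders, and the status fields are illustrative, not our production schema.

```python
import json
import time
from google.cloud import bigtable  # pip install google-cloud-bigtable

# Hypothetical identifiers, not our production names.
client = bigtable.Client(project="my-project")
table = client.instance("twin-instance").table("broadcast_tasks")

def record_task(broadcast_id: str, task: str, status: dict) -> None:
    """Write one task's JSON status blob into that task's column:
    one row per broadcast, one column per task."""
    row = table.direct_row(broadcast_id.encode("utf-8"))
    row.set_cell("tasks", task.encode("utf-8"),
                 json.dumps(status).encode("utf-8"))
    row.commit()

def task_status(broadcast_id: str, task: str) -> dict | None:
    """Read back a task's latest status blob, or None if it never ran."""
    row = table.read_row(broadcast_id.encode("utf-8"))
    if row is None:
        return None
    cells = row.cells.get("tasks", {}).get(task.encode("utf-8"))
    return json.loads(cells[0].value) if cells else None

# Example: mark ASR complete for one broadcast, then check it.
record_task("CNN_20240101_120000", "ASR",
            {"status": "SUCCESS", "completed": time.time()})
print(task_status("CNN_20240101_120000", "ASR"))
```

Because every worker reads and writes the same table, a queuing system can scan rows for failed or missing task columns to decide what to restart, and watch for newly successful columns to enqueue downstream stages.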