The GDELT Project

Bigtable As Digital Twin: Benchmarking A Simple Download Status Digital Twin

As we continue our explorations of GCP's Bigtable as a digital twin storage fabric for our global video processing infrastructure, what does it look like to load a master inventory of videos into a Bigtable digital twin, along with a JSON payload for each that contains a range of status indicators and details needed for downstream processing?

We'll start by creating a new table on a brand-new single-node SSD Bigtable instance. We then take our master inventory of 6.8M videos, with associated details such as file format, status and other key operational fields, shard it into 200 evenly sized files collectively totaling 750MB of JSON, and use GNU parallel to upload all 200 shards in parallel to the instance. We use the Bigtable Python bindings and check for the existence of each row before inserting it, which adds overhead. In all, this workflow took 12m48s on a 64-core N1 VM colocated in the same region as the Bigtable instance, while the full-table-scan exact row count via the CBT CLI that we used for verification during testing takes 1m7s:

cbt -instance [MYINSTANCE] createtable digitaltwin
cbt -instance [MYINSTANCE] createfamily digitaltwin cf
rm *shard*; rm *.txt; split -n l/200 -d INVENTORY.JSON INVENTORY.shard.; wc -l *
time find *shard* | parallel --eta -j 200 "python3 ./insert.py --project-id [MYPROJ] --instance-id [MYINSTANCE] --table-id digitaltwin --input-file {} --output-file {}.results.json --overwrite yes > /dev/null 2>&1"&
time cbt -instance [MYINSTANCE] read digitaltwin | grep cf:DOWN | wc -l
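The insert.py script used above is not shown; here is a minimal sketch of what such a per-shard loader might look like, assuming the inventory shards are newline-delimited JSON records with `date` and `id` fields (hypothetical field names, not the actual GDELT schema), using the google-cloud-bigtable Python client:

```python
#!/usr/bin/env python3
# Hypothetical sketch of a per-shard loader in the spirit of insert.py.
# Field names ("date", "id") and flag handling are assumptions.
import argparse
import json


def make_row_key(capture_date: str, video_id: str) -> str:
    """Row keys are date-prefixed ("YYYYMMDD.ID") to allow prefix scans."""
    return f"{capture_date}.{video_id}"


def upsert_shard(path, project_id, instance_id, table_id, overwrite=False):
    # Imported lazily so make_row_key stays usable without GCP installed.
    from google.cloud import bigtable

    client = bigtable.Client(project=project_id)
    table = client.instance(instance_id).table(table_id)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            key = make_row_key(rec["date"], rec["id"])
            # Existence check before insert (the extra overhead noted above).
            if not overwrite and table.read_row(key) is not None:
                continue
            row = table.direct_row(key)
            row.set_cell("cf", "DOWN", json.dumps(rec).encode("utf-8"))
            row.commit()


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--project-id"); p.add_argument("--instance-id")
    p.add_argument("--table-id"); p.add_argument("--input-file")
    p.add_argument("--overwrite", default="no")
    a = p.parse_args()
    upsert_shard(a.input_file, a.project_id, a.instance_id, a.table_id,
                 overwrite=(a.overwrite == "yes"))
```

Row-by-row reads and commits like this are simple but chatty; the 200-way parallelism is what makes the wall-clock time tolerable.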

Now let's request all videos from a given day, to compare our current inventory against the available inventory and identify videos that have not yet been downloaded or that encountered errors. To allow for date-based prefix searching, we prepend each unique video identifier with its capture date, yielding row keys of the form "YYYYMMDD.ID". The query below takes just 0.76s to return not just the list of matching rows, but their columnar data as well:

time cbt -instance [MYINSTANCE] read digitaltwin prefix="20240601" > ./out.json; grep stat out.json | wc -l
1215
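Under the hood, a prefix read is just a row range scan from the prefix to the prefix with its last byte incremented. A small sketch of the same query via the Python client (the `prefix_range` helper is our own illustration, not part of the client library):

```python
def prefix_range(prefix: str) -> tuple[bytes, bytes]:
    """Convert a row-key prefix into (start, end) keys for a range scan:
    the end key is the prefix with its final byte incremented, so the
    half-open range [start, end) covers exactly the keys beginning with
    the prefix. Assumes the final byte is not 0xff."""
    start = prefix.encode("utf-8")
    end = start[:-1] + bytes([start[-1] + 1])
    return start, end


def read_prefix(project_id, instance_id, table_id, prefix):
    # Assumed google-cloud-bigtable usage; imported lazily so the helper
    # above works without GCP installed.
    from google.cloud import bigtable

    table = (bigtable.Client(project=project_id)
             .instance(instance_id).table(table_id))
    start, end = prefix_range(prefix)
    return list(table.read_rows(start_key=start, end_key=end))
```

Because the row keys sort lexically by date, the same helper serves both day-level ("20240601") and month-level ("201106") scans.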

Now let's look at a larger prefix scan, compiling a list of all videos and their statuses from the entire month of June 2011. This takes just 0.97s to return 29,776 rows totaling 6.2MB of JSON:

time cbt -instance [MYINSTANCE] read digitaltwin prefix="201106" > ./out.json; grep stat out.json | wc -l
29776

Overall, Bigtable makes for a highly performant and scalable digital twin storage fabric: bulk-loading 6.8M richly annotated rows takes minutes, and date-prefixed key scans return a full day or month of status data in under a second.