A Behind-The-Scenes Look At How We Think About Master File Formats And Timestamping

How do we internally store the myriad datasets that arrive each day? What file formats do we use and how do we think about authoritative timestamping?

MASTER FILE FORMATS

We take the approach of maintaining the most faithful possible copy of the incoming data, preserving it exactly as it arrives. Any corrections, adjustments, derivatives, etc., are stored as separate files to ensure we can always walk back through the entire provenance path from the moment an object first arrived at our borders.

For master file formats, we use three primary approaches: one built for large video and audio files that can each be many gigabytes in size, one for still imagery and one for text. We consider these to be our "master" files that preserve objects as they first arrived.

Internally, each object carries an audit block with full provenance information. Master files record every detail about how the object arrived at our borders, including information about the specific audited machine image that captured it. Derived objects include complete auditing details about the machine image the processing took place under, the version of that processing system and any nonstandard configurations or environmental conditions in place for that run. This allows us to walk any given piece of master or derived data back to its first arrival at our borders and trace its entire path from arrival through to the final data point under consideration.
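
As a rough illustration of how such an audit block supports walking a derived object back to its master, here is a minimal sketch. Every field name, identifier and image label below is hypothetical and chosen only for illustration; the actual audit block layout is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuditBlock:
    """Hypothetical provenance record attached to one stored object."""
    object_id: str
    machine_image: str                        # audited image that captured or processed it
    processor_version: Optional[str] = None   # None for master files
    parent_id: Optional[str] = None           # None for master files
    overrides: Dict[str, str] = field(default_factory=dict)  # nonstandard settings


def provenance_path(object_id: str, blocks: Dict[str, AuditBlock]) -> List[str]:
    """Follow parent links from an object back to the master it derives from."""
    path = []
    current: Optional[str] = object_id
    while current is not None:
        block = blocks[current]
        path.append(block.object_id)
        current = block.parent_id
    return path


# Hypothetical master video and a caption file derived from it.
blocks = {
    "video-1": AuditBlock("video-1", machine_image="ingest-image-a"),
    "caption-1": AuditBlock(
        "caption-1",
        machine_image="process-image-b",
        processor_version="1.0",
        parent_id="video-1",
    ),
}
print(provenance_path("caption-1", blocks))
```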

Video & Audio

Video and audio files are streamed directly into GCS and stored as their own objects. This results in a bucket structure with vast numbers of individual files, but GCS' scalability means individual object requests (the most common access pattern) scale effectively linearly. Our primary access pattern for these files is to process each in isolation, so grouping them into larger containers would not yield any optimizations.

Most importantly, these files are typically handed directly to various GCP AI APIs like the Video and Speech-to-Text APIs, which expect to see individual isolated files in GCS, meaning they have to be stored individually. This also makes it easy to backfill missing content, since missing files can be individually added without disturbing surrounding content.

GCS also provides authoritative upload timestamps and checksums, meaning it is trivial to audit when a file was added and verify that its contents match the source material. As needed, one or more text or JSON files can be uploaded under the same filename with a different extension to hold any additional metadata about the object.
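
A minimal sketch of such a content check follows. It assumes the object's metadata has already been fetched and includes an MD5 value, which GCS reports as a base64-encoded digest. The function names and sample bytes are hypothetical.

```python
import base64
import hashlib


def gcs_style_md5(data: bytes) -> str:
    """Return the MD5 of data, base64-encoded the way GCS reports md5Hash."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches_source(local_bytes: bytes, reported_md5: str) -> bool:
    """Compare a local copy against the checksum reported for the stored object."""
    return gcs_style_md5(local_bytes) == reported_md5


# Hypothetical example: in practice reported_md5 would come from object metadata.
source = b"example video bytes"
reported_md5 = gcs_style_md5(source)
print(matches_source(source, reported_md5))         # identical bytes
print(matches_source(source + b"x", reported_md5))  # altered bytes
```

For multi-gigabyte files the digest would be computed incrementally over chunks rather than over a single in-memory byte string.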

GCS offers highly scalable bucket inventorying, with a bucket of 1 million files typically taking around 2-5 minutes to fully inventory. In our case, we maintain a parallel inventory file structure of ASCII or JSON files in the bucket that provide daily inventories of all files corresponding to that day. Thus, to request a known file, one can access it directly. To perform large scans across many files, the inventory files can be used to assemble the list of relevant files and then the files processed using massively parallel processing. The scalability of GCS means that we can shard large processing across vast fleets of VMs, each operating on thousands of files in parallel, scaling to extremely high parallel workloads. In short, while there is an overhead to individual file access, GCS offers nearly linear scalability for object access, enabling massively parallel analysis. There is no penalty for random versus sequential workloads, meaning we can process millions of video or audio files in completely random order without issues.
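
The inventory-driven scan described above can be sketched as follows. This assumes a one-object-name-per-line inventory and a stable hash to assign files to workers. The inventory layout, object names and sharding rule are illustrative assumptions, not a description of the production system.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(object_name: str, num_workers: int) -> int:
    """Assign an object to a worker using a stable hash of its name."""
    return zlib.crc32(object_name.encode("utf-8")) % num_workers


def plan_shards(inventory_lines: Iterable[str], num_workers: int) -> Dict[int, List[str]]:
    """Group the object names listed in daily inventory files by worker."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for line in inventory_lines:
        name = line.strip()
        if name:  # skip blank lines
            shards[shard_for(name, num_workers)].append(name)
    return shards


# Hypothetical inventory for a single day.
inventory = [
    "video/20200101/a.mp4",
    "video/20200101/b.mp4",
    "",
    "video/20200101/c.mp4",
]
shards = plan_shards(inventory, num_workers=2)
for worker, names in sorted(shards.items()):
    print(worker, names)
```

Because access order does not matter, any assignment that spreads names evenly would serve; a hash simply keeps the assignment reproducible between runs.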

Still Imagery

Unlike video and audio files, still images are typically much smaller and arrive in endless streams of incoming files that can be extremely bursty. We store them as sequences within an XML-based binary file format using GCS' streaming write capability. An ingest process opens a binary streaming GCS file and writes an endless sequence of images, switching to a new output file every X minutes to minimize loss in case of technical issues and to minimize overhead. Within the stream, each image is written inside an XML container holding the request sent to retrieve the image and the response headers and body as-is from the remote server. A parallel ASCII TOC file is written to GCS that records the start/stop offsets of the three pieces of information for each image.

In practice, the TOC file is used to extract an image from this container, allowing offset-based reads to pull only the needed information. The XML container is provided for disaster recovery: if file corruption leaves the TOC offsets misaligned, the reader can fall back to the XML structure to regenerate a TOC file directly.
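
The interplay between container and TOC can be sketched as below, using an in-memory stream in place of a GCS streaming write. The tag names and record layout are hypothetical, since the actual container format is not specified here. The sketch also ignores the case where a raw payload happens to contain one of the tags, which a real format would need to handle through escaping or length fields.

```python
import io
import re

# Hypothetical tags wrapping the three pieces stored for each image.
PARTS = (b"request", b"headers", b"body")
PART_PATTERN = re.compile(rb"<(request|headers|body)>(.*?)</\1>", re.DOTALL)


def write_record(stream, toc, request: bytes, headers: bytes, body: bytes) -> None:
    """Append one record and note the (start, stop) offsets of its three parts."""
    offsets = []
    stream.write(b"<record>")
    for tag, payload in zip(PARTS, (request, headers, body)):
        stream.write(b"<" + tag + b">")
        start = stream.tell()
        stream.write(payload)
        offsets.append((start, stream.tell()))
        stream.write(b"</" + tag + b">")
    stream.write(b"</record>\n")
    toc.append(offsets)


def read_part(stream, span) -> bytes:
    """Offset-based read of a single part, as the TOC allows."""
    start, stop = span
    stream.seek(start)
    return stream.read(stop - start)


def rebuild_toc(data: bytes):
    """Disaster recovery: regenerate the TOC by scanning the XML structure."""
    toc, current = [], []
    for match in PART_PATTERN.finditer(data):
        current.append((match.start(2), match.end(2)))
        if len(current) == len(PARTS):
            toc.append(current)
            current = []
    return toc


stream, toc = io.BytesIO(), []
write_record(stream, toc, b"GET /a.jpg", b"Content-Type: image/jpeg", b"\xff\xd8\xff\xe0")
write_record(stream, toc, b"GET /b.png", b"Content-Type: image/png", b"\x89PNG\r\n")

print(read_part(stream, toc[1][2]))          # body of the second image
print(rebuild_toc(stream.getvalue()) == toc)
```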

The use of streaming GCS writes means that ingest systems can be small 1-core VMs with minimal memory and a 10GB boot disk, since all data is streamed directly to GCS without touching local disk. The linear streaming allows ingest servers to write images as they arrive, minimizing ingest overhead.

Analysis through the Cloud Vision API and EXIF extraction is performed by the ingest server, with the results being written to a separate JSON file. This JSON annotation data is what is used by all subsequent live processes, with the image data itself being accessed only sparingly for specialized research questions. Date stamping is synchronized between the JSON annotation and XML storage files, making it possible for the live access JSON records to be used to identify the storage file containing the actual image data.

Textual Content

Webpages undergo a similar process to still imagery. The raw HTML is written to an XML container file using streaming GCS writing, with a paired TOC file. Here the server output is again written as-is in raw binary without any encoding conversion or other modifications. This provides a pristine copy of the original server response to allow for postmortem analysis of encoding issues. For example, if an outlier site is identified that uses a broken encoding process, we can use the raw server response to diagnose and repair the encoding for downstream reprocessing.

The ingest server performs page extraction using visual rendering to generate the most faithful possible rendition of the actual body text, which is written to a separate JSON data stream that is divided into one minute files. Every minute the output JSON stream is switched to the next file and released within GCS for all downstream consumers to ingest. This allows us to operate on a fixed one minute clock using GCS as the central connective fabric. A few seconds after each minute, any process worldwide that needs access to the live stream can ingest the JSON file directly from GCS without regard to its physical location. A process running in Mumbai can access the file exactly the same as a process in Virginia, with the only difference being slightly increased latency, allowing us to abstract away from geography.
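
A fixed one minute clock makes file names predictable from a timestamp alone. The sketch below shows one way a consumer might work out which file to fetch. The naming scheme, the prefix and the settle delay are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def minute_file_name(ts: datetime, prefix: str = "stream/") -> str:
    """Name of the one-minute file that a record with timestamp ts falls into."""
    minute = ts.astimezone(timezone.utc).replace(second=0, microsecond=0)
    return f"{prefix}{minute:%Y%m%d%H%M}00.json"


def latest_released_file(now: datetime, settle_seconds: int = 10) -> str:
    """Most recent file a consumer can expect to be complete at time now.

    settle_seconds is a hypothetical allowance for "a few seconds after
    each minute" before the file is considered released.
    """
    settled = now - timedelta(seconds=settle_seconds)
    previous_minute = settled.replace(second=0, microsecond=0) - timedelta(minutes=1)
    return minute_file_name(previous_minute)


now = datetime(2020, 1, 2, 3, 5, 30, tzinfo=timezone.utc)
print(minute_file_name(now))
print(latest_released_file(now))
```

The same minute-based naming could also serve to pair a live JSON record with the storage file written during that interval.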

TIME STAMPING

How do we handle the concept of time within GDELT?

For content that we ourselves ingest, we record two master timestamps: system time that retrieval began and ended. For web-derived content we use this delta to inform our global crawling strategy in realtime, adjusting recrawl rates across our globally distributed crawling fleets with each page arrival to take into account both the speed at which each page is arriving and the underlying latency to first response and transfer bitrate.

For externally provided content, we preserve three or more timestamps: the system time that retrieval began, the system time that retrieval ended, the timestamp the data source associates with the object's creation and any other timestamps the data source provides (such as a list of timestamps of edits made to the object). For trusted sources like closed data feeds with high metadata quality, we use the attributed object creation date as our public-facing authoritative timestamp for the object. For untrusted sources like the open web, we use the retrieval completion timestamp as our public-facing authoritative timestamp, while providing the object-claimed creation timestamp as a secondary field within the data stream and all other timestamps in tertiary streams. For still images, we provide both our authoritative timestamp (retrieval completion) and the complete list of all dates embedded within the image's EXIF metadata. For textual articles, we provide the separate GEMG dataset that contains all timestamps embedded within the page (creation, last modified and potentially an audit trail of all edits).
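
The selection rule for the public-facing timestamp can be sketched as follows. The class and field names are hypothetical, and falling back to retrieval completion when a trusted source supplies no creation date is an assumption of this sketch rather than something stated above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ObjectTimes:
    """Timestamps preserved for one externally provided object."""
    retrieval_began: datetime
    retrieval_ended: datetime
    claimed_created: Optional[datetime] = None             # as asserted by the source
    other: List[datetime] = field(default_factory=list)    # e.g. edit history


def authoritative_timestamp(times: ObjectTimes, trusted_source: bool) -> datetime:
    """Pick the public-facing timestamp; all others remain secondary fields."""
    if trusted_source and times.claimed_created is not None:
        return times.claimed_created
    # Untrusted source, or (assumed here) no creation date supplied.
    return times.retrieval_ended


times = ObjectTimes(
    retrieval_began=datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    retrieval_ended=datetime(2020, 1, 2, 3, 4, 7, tzinfo=timezone.utc),
    claimed_created=datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(authoritative_timestamp(times, trusted_source=True))
print(authoritative_timestamp(times, trusted_source=False))
```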

For AI APIs, we also publicly provide the timestamp at which the object was submitted to the API, which allows for model provenance. Internally, we record an audit log of its unique job identifier, execution zone and any other internal operations data provided by the API for diagnostics.

Derived Time

For derived files that are treated as sub-masters, such as TTXT files for videos, we preserve time precisely as recorded in the video file itself. In practice, this means that TTXT files are stored as offsets from the start of the video, with each video starting at 0 seconds. In our case, the arriving videos do not include world time timecodes, only relative timecodes from the start of the video. While the ccextractor utility we use to generate the TTXT files can automatically adjust them based on the recording start timestamp, we don't use this feature. Such an adjustment relies on an external timestamp and represents an artificial change to the data that could shift over time due to changes or bugs in the ccextractor codebase. It would also signal to downstream consumers of the TTXT file that the converted timestamps were authoritative rather than imputed, since the TTXT format does not provide for multiple timecodes per captioning line that would allow us to store both the original and imputed timestamps.

For videos, the original pristine source data files contain two forms of native time: the collection of frames for fixed framerate videos and embedded timecode (such as PTS). Since both are provided with the original source file we consider them both to be authoritative time masters. We allow each utility to use the version it deems most relevant to its particular task. Note that the MPEG-TS streams we receive do not include "true timecode" from the broadcaster's own clock, but rather are set by the recording system from the start of each recording, so time is already an artificial construct. At the same time, offsets from the start of the video file are object-centric and authoritative, defined by frame advancement in the case of fixed frame rate or PTS. Regardless of the time path, offsets reflect the time construct a viewer of the video will see at standard speed and thus represent the authoritative time for the self-contained object. In contrast, were we to convert these offsets to absolute world time, we would have to use an external time master. Since our video files do not include a GPS or other high accuracy timeclock timecode, any provider-based time will be via the system clock on the recording system, whereas for postprocessing correction the time would be the system clock of the correcting system.

In other words, we treat the concept of "time" as a set of parallel timelines from different providers. The time offsets that come natively with a video file are the authoritative ones since they are source-provided and immutable, whereas any corrections introduce external influences to this closed object-centric set of times.
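
To make the parallel timelines concrete, the sketch below derives an object-centric offset from either a frame index or a PTS value, and computes an imputed world time as a separate value that never overwrites the offset. It relies on the 90 kHz clock that MPEG-TS uses for PTS. The function names are hypothetical and PTS wraparound is ignored for brevity.

```python
from datetime import datetime, timedelta, timezone
from fractions import Fraction

PTS_CLOCK_HZ = 90_000  # MPEG-TS presentation timestamps tick at 90 kHz


def offset_from_frame(frame_index: int, fps: Fraction) -> Fraction:
    """Seconds from the start of a fixed frame rate video to a given frame."""
    return Fraction(frame_index) / fps


def offset_from_pts(pts: int, first_pts: int) -> Fraction:
    """Seconds from the start of the video, based on embedded PTS values."""
    return Fraction(pts - first_pts, PTS_CLOCK_HZ)


def imputed_world_time(recording_start: datetime, offset: Fraction) -> datetime:
    """World time imputed from an external clock; kept apart from the offset."""
    return recording_start + timedelta(seconds=float(offset))


ntsc = Fraction(30000, 1001)  # a common fixed frame rate of about 29.97 fps
print(offset_from_frame(30000, ntsc))                 # offset in seconds
print(offset_from_pts(pts=180_900, first_pts=900))    # offset in seconds

# recording_start comes from the recording system's clock, an external time master.
recording_start = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(imputed_world_time(recording_start, Fraction(2)))
```

Keeping offsets as exact fractions avoids rounding drift over long recordings, with floating point introduced only at the final imputation step.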

Within our global crawler fleet, GCP enforces a set of guarantees on time accuracy and synchronization via GPS receivers and atomic clocks (see Spanner + TrueTime for more details). Thus, we can guarantee within an extremely small error band that all of our crawlers worldwide see the same UTC time at the same instant in world time. That said, each timestamp in our system contains auditing information including the machine image and physical zone in which it was sampled, allowing our global workflow to treat GCP time as authoritative while maintaining object-level audit information to assess any system-level problems.

In the end, this means that within GDELT, objects do not have a single concept of "time." Rather, they exist simultaneously on multiple timelines, allowing us to understand them through the considerations of each task and requirement.

CONCLUSION

We hope this glimpse at how we handle masters, file formats and timestamps is helpful in your own data journey!

The GDELT Project