Announcing the American Television Global Knowledge Graph (TV-GKG)

We are enormously excited to announce today our latest massive new dataset: the American Television Global Knowledge Graph (TV-GKG). The Internet Archive's Television News Archive, covering more than 150 unique English language domestic American television stations in 20 markets, some dating as far back as June 2009, has been processed through the GDELT GKG 2.0 pipeline, totaling more than 740,000 broadcasts to date. The complete raw closed captioning of each monitored news show was processed, covering every word spoken on each show. For the first time, explore the themes, emotions and geography of American television news, from the national discourse down to the narratives dominating local coverage. The dataset is updated each morning, making it possible to trace evolving narratives as they emerge. (Due to the Archive's 48-hour embargo on its TV News Archive, the most recent day available in the dataset each morning is two days prior.)

Note that closed captioning has a high level of inherent noise and often takes the form of a single contiguous stream of words lasting 30 minutes to several hours, with no punctuation, capitalization, or other divisions, and often with high levels of transcription error and missing words or sentence fragments. For this reason, analyses derived from raw captioning data traditionally have an elevated error rate compared with other textual sources, including OCR output. The list of identified person and organization names in particular is reduced due to high levels of typographical and other transcription errors and the frequent lack of contextual cues.

FORMAT

The TV-GKG dataset uses the standard GKG 2.0 format, making it identical to our primary GKG dataset; however, some fields are disabled. The SharingImage, RelatedImages, SocialImageEmbeds and SocialVideoEmbeds GKG fields are blank since they are not relevant to television. The Quotations, AllNames, Dates and Amounts fields are also disabled at this time while we explore approaches to better mitigating captioning error and to translating these fields to the visual metaphors used on television. We are currently only processing English language broadcasts, so the TranslationInfo field is also disabled at this time.
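As a quick illustration, here is a minimal Python sketch of reading a single TV-GKG record from one of the tab-delimited GKG 2.0 files. The file name is hypothetical, and the column positions assume the standard 27-column GKG 2.0 layout documented in the GKG codebook (verify them against the codebook before relying on them); the disabled fields simply come back as empty strings.

# Minimal sketch: read one tab-delimited GKG 2.0 record and pull out a few fields.
# Column positions assume the 27-column GKG 2.0 layout from the GKG codebook.
with open("tv-gkg-sample.gkg.csv", encoding="utf-8") as f:   # hypothetical local file
    for line in f:
        cols = line.rstrip("\n").split("\t")
        record = {
            "DocumentIdentifier": cols[4],   # e.g. KQED_20101007_003000_BBC_World_News
            "V2Themes":           cols[8],
            "V2Tone":             cols[15],
            "Quotations":         cols[22],  # disabled for TV-GKG: empty string
            "AllNames":           cols[23],  # disabled for TV-GKG: empty string
            "Extras":             cols[26],  # contains the CHARTIMECODEOFFSETTOC block
        }
        print(record["DocumentIdentifier"], len(record["V2Themes"].split(";")))
        break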

Given its textual roots, the GDELT GKG format encodes object location at the level of character offsets within the source document. In contrast, television content is traditionally referenced through timecode offsets. To help with this, the GKG Extras field contains a "CHARTIMECODEOFFSETTOC" block that provides a mapping between character offsets in the closed captioning stream and the actual broadcast timecodes they correspond to, relative to the start of that broadcast. The field is a series of semicolon delimited mappings, each of the form "CharacterOffset,TimecodeOffset", mapping a particular character offset to a given timecode offset. Thus, if the V2Themes field records the theme "TERRORISM" being mentioned at character offset 1830, you would use this field to find the corresponding timecode for that character offset (you may need to interpolate between the offsets). Let's say it resolves to the timecode 190, indicating about 190 seconds (about 3.2 minutes) from the start of the broadcast. You would then take the contents of the "DocumentIdentifier" field (such as "KQED_20101007_003000_BBC_World_News"), which is the unique Internet Archive identifier for this broadcast, and put them together into a URL like "https://archive.org/details/KQED_20101007_003000_BBC_World_News#start/190/end/210", which you can then point the user's browser to in order to view the actual clip through the Archive's online viewer (as a general rule of thumb, just add 10-20 seconds to the start time or center the start/end times in the URL +/- 10 seconds around the given theme or point of interest).
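To make the lookup concrete, here is a minimal Python sketch of the offset-to-timecode interpolation and URL construction described above. Only the "CharacterOffset,TimecodeOffset" pair format, the semicolon delimiter, and the archive.org URL pattern come from the description; the assumption that the block appears as an XML-style <CHARTIMECODEOFFSETTOC>...</CHARTIMECODEOFFSETTOC> element in the Extras field, and the sample mapping values, are purely illustrative and should be adjusted to the actual data.

import re

def parse_toc(extras):
    """Extract the (CharacterOffset, TimecodeOffset) pairs from the Extras field.
    Assumes the mapping is wrapped in a <CHARTIMECODEOFFSETTOC>...</CHARTIMECODEOFFSETTOC>
    block; adjust the extraction if the delimiters differ in the actual data."""
    m = re.search(r"<CHARTIMECODEOFFSETTOC>(.*?)</CHARTIMECODEOFFSETTOC>", extras, re.S)
    body = m.group(1) if m else extras
    pairs = []
    for entry in body.split(";"):
        entry = entry.strip()
        if entry:
            char_off, timecode = entry.split(",")
            pairs.append((int(char_off), float(timecode)))
    return sorted(pairs)

def timecode_for_offset(pairs, char_offset):
    """Linearly interpolate the broadcast timecode (seconds) for a caption character offset."""
    prev_c, prev_t = pairs[0]
    for c, t in pairs:
        if c >= char_offset:
            if c == prev_c:
                return t
            frac = (char_offset - prev_c) / (c - prev_c)
            return prev_t + frac * (t - prev_t)
        prev_c, prev_t = c, t
    return prev_t  # offset past the last mapping: clamp to the final timecode

def clip_url(document_identifier, timecode, pad=10):
    """Build an archive.org viewer URL centered +/- `pad` seconds around the timecode."""
    start = max(0, int(timecode) - pad)
    end = int(timecode) + pad
    return f"https://archive.org/details/{document_identifier}#start/{start}/end/{end}"

# Example from the text: a theme at character offset 1830 resolving to roughly 190 seconds.
pairs = parse_toc("<CHARTIMECODEOFFSETTOC>0,0;1500,155;2100,217</CHARTIMECODEOFFSETTOC>")
t = timecode_for_offset(pairs, 1830)   # ~189 seconds with these made-up mappings
print(clip_url("KQED_20101007_003000_BBC_World_News", t))
# -> https://archive.org/details/KQED_20101007_003000_BBC_World_News#start/179/end/199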

ACCESS THE DATA

There are two ways to access the data, which is updated each morning with a 48-hour rolling embargo (meaning the most recent data will be for two days ago):