The GDELT Project

Announcing The Radio News Ngram Dataset

UPDATE (9/26/2020): The Radio News NGram dataset has been massively expanded!

This past June we unveiled the Television News Ngram 2.0 dataset, a compilation of 1-5-gram ngram datasets spanning 23 television news stations, some stretching back more than a decade, created in collaboration with the Internet Archive's Television News Archive. Today we are tremendously excited to announce the debut of its radio equivalent: the Radio News Ngram dataset, created in collaboration with the Internet Archive's Radio News Archive. It consists of non-consumptive 1-5-gram word frequency tables compiled across more than 4 million ASR-transcribed radio news broadcasts from 2016 to the present, spanning 550 stations and totaling more than 26 billion spoken words!

Word histograms represent an ideal mechanism through which to explore large textual datasets like radio transcripts. In the case of radio, converting automated speech recognition (ASR) transcripts to 10-minute word and phrase histograms enables non-consumptive analysis of underlying topical and linguistic patterns without access to the actual spoken word itself.

These non-consumptive datasets open the door to an incredible variety of exciting new kinds of analyses, from comparing television, radio and web narrative styles to analyzing the ebb and flow of the topics and language covered on radio news over the past several years. Simply by plotting mentions of each word or phrase on a timeline, all kinds of trend analyses become possible, such as surfacing trending topics, uncovering narratives more popular on one station than another, and even identifying words whose usage over time appears to be highly correlated or anti-correlated.
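As a rough illustration of the kind of timeline analysis described above, the following minimal sketch correlates two daily mention-count series for a pair of words using pandas. The counts below are purely hypothetical stand-ins; in practice they would be tallied from the ngram files or BigQuery tables described later in this post.

import pandas as pd

# Hypothetical daily mention counts for two words (stand-in values only).
dates = pd.date_range("2020-06-01", periods=5, freq="D")
word_a = pd.Series([120, 340, 290, 410, 380], index=dates, name="word_a")
word_b = pd.Series([80, 300, 250, 390, 360], index=dates, name="word_b")

# Pearson correlation of the two timelines: values near +1 suggest the words
# trend together, values near -1 suggest they trend in opposition.
print(word_a.corr(word_b))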

This inaugural dataset is based on automated speech recognition (ASR) transcripts of each broadcast. Unlike the television news ngrams, which are based on human-keyed transcripts, it is therefore limited by the accuracy of machine speech recognition: it may perform differently across the wide variety of accents, dialects and speech patterns of the rich, vibrant array of speakers captured in this dataset, and it may experience a lag or errors in recognizing novel names, or spell them phonetically or non-traditionally.

To create these ngrams, the machine-generated transcript of each broadcast is lowercased and aggregated into 10-minute blocks in the UTC timezone. Words that begin with punctuation (the regular expression "[!\"()*+,./:;<=>?[\\]^_`{|}~\-]") have it stripped off and are treated as bigram/trigram/quadgram/5-gram dividers (the word is considered the start of a new bigram/trigram/quadgram/5-gram). Words that end with punctuation (the same symbols as above) have it stripped off (except for words containing periods, like "a.m." or "ph.d."), but unlike the television ngrams, trailing punctuation such as periods and commas is NOT treated as the end of a bigram/trigram/quadgram/5-gram. Thus, the transcript line "He did. He really did" would for television news yield the three 2-grams "he did", "he really" and "really did", whereas for radio the period is ignored and the resulting 2-grams are "he did", "did he", "he really" and "really did". This is because for television news punctuation is added by human transcriptionists, whereas for radio all punctuation is machine-added based on probabilistic models that may or may not be correct for a given context.
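To make the punctuation rules concrete, here is a minimal sketch (not GDELT's production code) of how 2-grams could be generated under the two regimes: the television-style rule, where trailing punctuation closes the window, and the radio-style rule described above, where trailing punctuation is merely stripped. The function name and structure are illustrative only.

import re

# The punctuation characters listed in the regular expression above.
PUNCT = '!"()*+,./:;<=>?[\\]^_`{|}~-'
LEAD = re.compile("^[" + re.escape(PUNCT) + "]+")
TRAIL = re.compile("[" + re.escape(PUNCT) + "]+$")

def bigrams(line, reset_at_trailing_punct):
    segments = [[]]                                   # each segment is an uninterrupted run of words
    for raw in line.lower().split():
        if LEAD.match(raw):                           # leading punctuation: strip it, start a new window
            raw = LEAD.sub("", raw)
            segments.append([])
        ends = bool(TRAIL.search(raw))
        stripped = TRAIL.sub("", raw)
        word = raw if "." in stripped else stripped   # keep words like "a.m." or "ph.d." intact
        if word:
            segments[-1].append(word)
        if ends and reset_at_trailing_punct:          # television-style rule only
            segments.append([])
    return [" ".join(seg[i:i + 2]) for seg in segments for i in range(len(seg) - 1)]

# Television-style (resets at the period) vs. radio-style (period ignored):
print(bigrams("He did. He really did", reset_at_trailing_punct=True))
# -> ['he did', 'he really', 'really did']
print(bigrams("He did. He really did", reset_at_trailing_punct=False))
# -> ['he did', 'did he', 'he really', 'really did']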

Unigrams are computed simply by tallying up the number of times each unique word is spoken in that show's ASR transcript in each given 10-minute increment.

Bigrams, trigrams, quadgrams and 5-grams are computed the same way but use a moving window (again, unlike the television news ngrams, these do NOT reset at each punctuation symbol). Thus, a transcript line "At the White House today, the president…" would yield the unigrams "at", "the", "white", "house", "today" and "president"; the bigrams "at the", "the white", "white house", "house today", "today the" and "the president" (since the window does NOT reset at the comma); the trigrams "at the white", "the white house", "white house today", "house today the" and "today the president"; the quadgrams "at the white house", "the white house today", "white house today the" and "house today the president"; and the 5-grams "at the white house today", "the white house today the" and "white house today the president".
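As a small illustrative sketch (again, not the production pipeline), the moving window can be expressed as a slice over tokens that have already been lowercased and punctuation-stripped as described earlier; the unigram tallies fall out of the same token stream:

from collections import Counter

def moving_window_ngrams(tokens, n):
    # Slide an n-word window across the token stream without resetting at punctuation.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "at the white house today the president".split()
print(Counter(tokens))                    # unigram tallies ("the" is counted twice)
for n in range(2, 6):
    print(n, moving_window_ngrams(tokens, n))
# n=2 includes "today the" because the comma does not reset the window, and n=5
# yields "at the white house today", "the white house today the" and
# "white house today the president", matching the example above.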

A separate set of five ngram files (unigram, bigram, trigram, quadgram, 5-gram) is saved for each show in the following format, with each row representing a unique ngram in that given 10-minute period:

Note that the files are stored in UTF-8 to account for the fact that some stations' transcripts contain non-ASCII characters.
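As a quick sketch of reading one of the downloaded files, the snippet below streams a gzipped TSV in UTF-8. The filename is a hypothetical placeholder (the real per-show filenames come from the inventory files described next), and the columns are left unnamed since the exact layout is documented with the dataset itself.

import gzip

# "SHOW.1gram.tsv.gz" is a placeholder name, not an actual file in the dataset.
with gzip.open("SHOW.1gram.tsv.gz", mode="rt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        # ... each row describes one ngram within one 10-minute block ...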

To determine which shows are available, a daily inventory file, named using the date in YYYYMMDD format, is available for each day from August 11, 2016 through the present and lists all of the ngram files for that day. For example, for June 1, 2020 the file would be:

The inventory files for the most recent 72 hours are updated every 20 minutes as shows complete processing.

There is a 24-48 hour delay after each broadcast before it is indexed into the ngram files. Thus, the previous three days are typically incomplete and are updated every 20 minutes as new broadcasts finish processing. You can check the available shows for the last few days to see how complete coverage is at a given moment as processing completes. Note that not all stations are still being monitored and only a selection of shows may have been transcribed on a given day. As with the television news archive, there may also be technical issues that yield brief outages, so make sure to check the baseline to verify a given query's results.
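For completeness checks like the one described above, one could loop over the last few days' inventory names, along the lines of the sketch below. The base URL and file extension here are placeholders rather than the documented endpoint; only the YYYYMMDD naming convention comes from the text above.

from datetime import datetime, timedelta, timezone
import urllib.request

BASE = "http://example.com/radio-ngrams/"                     # hypothetical base path
for days_back in range(3):                                    # the previous three days may still be filling in
    day = datetime.now(timezone.utc) - timedelta(days=days_back)
    inventory_url = BASE + day.strftime("%Y%m%d") + ".txt"    # ".txt" is a guess at the extension
    with urllib.request.urlopen(inventory_url) as resp:
        files = resp.read().decode("utf-8").splitlines()
    print(day.strftime("%Y%m%d"), len(files), "ngram files listed so far")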

All ngram files can be downloaded as gzipped TSV files per the inventory files above and are also available in a trio of BigQuery tables:
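For those working in BigQuery, a query along these lines is one way to pull a per-day timeline for a single phrase; the project, table and column names below are placeholders only, so substitute the actual names from the table listing above.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT DATE(block_start) AS day, SUM(occurrences) AS mentions
    FROM `PROJECT.DATASET.RADIO_NGRAMS_PLACEHOLDER`   -- placeholder table and column names
    WHERE ngram = 'white house'
    GROUP BY day
    ORDER BY day
"""
for row in client.query(sql).result():
    print(row.day, row.mentions)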

We're incredibly excited to see what you're able to do with this immense new dataset!

Here is the list of stations being actively updated as of September 2020:

And here are past stations that are not currently being updated and are available for historical research: