What might it be like to use deep learning algorithms to non-consumptively "read" the closed captioning transcripts of more than a decade of television news broadcasts and compile a minute-by-minute summary of the key entities and concepts they mention, resolving spelling variants, alternative names and pronoun and noun references to common disambiguated entries? What kinds of deeper analyses, especially around disinformation and contested narratives research, might be enabled by a rich deep learning-produced entity graph over television news?
Today we are incredibly excited to unveil the results: the new Television News Global Entity Graph 2.0, a massive non-consumptive graph of more than 470 million entity annotations spanning 303,000 broadcasts totaling more than 259,000 hours of airtime in the Internet Archive's Television News Archive, including BBC News (January 2017 to present), CNN, MSNBC and Fox News (July 2009 to present) and ABC, CBS and NBC evening news broadcasts (July 2010 to present), all non-consumptively analyzed by Google's Cloud Natural Language API in a special digital library system.
Each broadcast was analyzed by Google's deep learning-powered Cloud Natural Language API to compile a list of all of the major concepts and entities it mentions, using the surrounding context of each mention to disambiguate it. Thus, a mention of "Cambridge" in the UK will be distinguished from a mention of Cambridge, Massachusetts. A mention of "The Fed" will be recognized as an invocation of the US Federal Reserve, "Donald Trump" as being the same as "President Trump" and so on.
Even more powerfully, the Natural Language API performs complete pronoun and noun coreference resolution. A broadcast that mentions "President Trump" and then repeatedly refers only to "the president" will see each of those subsequent references resolved back to "President Trump" – something not possible with traditional keyword analysis.
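To give a sense of what these annotations look like at the API level, here is a minimal Python sketch of the kind of entity analysis call involved, using the google-cloud-language client. The caption text and chunking strategy here are assumptions; this is a sketch of the general approach, not the Archive's actual non-consumptive processing pipeline.

```python
# A minimal sketch, not the Archive's actual pipeline: it assumes caption text
# has already been extracted and chunked, and simply runs Cloud Natural
# Language entity analysis over it (pip install google-cloud-language).
from google.cloud import language_v1

def annotate_entities(caption_text: str):
    """Return the entities the API identifies in a block of caption text."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=caption_text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.analyze_entities(document=document)
    for entity in response.entities:
        # For well-known, disambiguated entities the metadata map carries the
        # Knowledge Graph "mid" and, usually, a Wikipedia URL.
        print(
            entity.name,
            entity.type_.name,
            round(entity.salience, 4),
            entity.metadata.get("mid", ""),
            entity.metadata.get("wikipedia_url", ""),
        )
    return response.entities
```

Each returned entity carries a salience score and its individual mentions, which is how repeated references like "the president" above end up grouped under a single disambiguated name.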
The result is a list of the major concepts and entities (public figures, locations, organizations, events like the Arab Spring, publications like the Mueller Report and dates) found in each broadcast in 15 second intervals, along with a unique identifier for each concept and even a link to its relevant Wikipedia page if available. In short, a non-consumptive minute-by-minute chronology of the topics and figures driving the news agenda across some of the world's most influential television news stations.
Using the unique IDs assigned to key concepts and entities, mentions on television news can be linked to mentions in worldwide online news and global online news imagery, making it possible to connect the online and offline worlds and the textual and visual worlds. An emerging narrative can be tracked as it spreads across online and broadcast news, traverses languages and spans the visual and written word. For the first time it will be possible to assemble the most complete picture yet of the spread of contested narratives, disinformation and inorganic campaigns and compare them against authoritative, organic and uncontested narratives, bridging the textual world of online news with the visual world of television news.
Ultimately, a viral online meme that spread through online news coverage yesterday can be connected through this dataset to authoritative vetted mentions of the same event, topic and entities on television news. Similarly, complex framing questions can be answered, such as which major international political figures are mentioned most frequently alongside discussions of Joe Biden and Ukraine compared with Rudy Giuliani and Ukraine, offering a rich understanding of narrative framing and agenda setting.
This past December we unveiled the prototype 1.0 release of this dataset. Based on the lessons we learned from that prototype, this new 2.0 release has two major changes. The first is that we've moved from 1 minute resolution to 15 second resolution, the same resolution used by the Television Explorer, making it much easier to track narrative transitions across the fast-paced world of television news. The second is that the original dataset was organized into daily files, with a single file for each day containing all of the annotations for all shows that day. Since some shows can take 3-4 days for the Television News Archive to complete processing, this meant a rolling five day delay for updates. The new 2.0 release instead organizes annotations by show, with one file per show released as soon as it finishes processing after a rolling 24 hour window, meaning most show annotations become available just over 24 hours after airing. This brings the dataset into line with our new television news processing architecture and enables much quicker analyses of breaking events in this ever more fluid news environment.
The dataset consists of one file per broadcast, released on a rolling 24 hour delay. Each file is in newline delimited JSON format, with each row representing one 15 second interval of the broadcast and containing an array of all of the entities identified by the API in that 15 second block of airtime, ordered by their overall semantic salience to the broadcast as a whole. Each record contains the following fields:
- date. The date and time in UTC of a specific 15 second interval of a given broadcast, rounded to the 15 second boundary.
- lang. The Google-provided language code of the broadcast as used by the API, allowing filtering by language.
- iaShowId. The unique identifier assigned by the Internet Archive to this broadcast.
- station. The station on which this broadcast aired.
- showName. The human-readable name of the show.
- iaClipUrl. The URL of the Internet Archive's Television News Archive page to view the clip of one minute of this broadcast beginning with the start of this 15 second interval.
- entities. A JSON array containing the list of distinct entities identified by the API in this 15 second clip of this broadcast. Multiple references to the same entity of the same type within this interval are grouped together. Thus, if "White House" is mentioned 10 times in this 15 seconds as type "Location" it will only appear once here, but if it is mentioned as a "Location" in some contexts and an "Organization" in others, it will appear twice, once for its Location context and once for its Organization context.
- name. The entity as identified by the API. Note that this may be different from the actual literal utterance in the broadcast. A broadcast that mentions "Barack Obama" and then refers only to "the president" for the rest of the broadcast would list "Barack Obama" in this field for each of those subsequent mentions instead of the literal utterance "the president."
- type. The entity "type" as determined by the API.
- mid. Provides the unique Google-assigned ID for entities for which Google has assigned an identifier (typically only for well-known entities). Absent for entities without a MID. Note that the presence of a MID is typically accompanied by a wikipediaUrl entry, but not always, and users should not assume that a MID guarantees a wikipediaUrl.
- wikipediaUrl. Provides the URL of the Wikipedia entry for this entity if Google has mapped it to its corresponding Wikipedia page (typically only for well-known entities). Absent for entities without a Wikipedia mapping.
- numMentions. The number of times this entity+type mapping was mentioned during this 15 second interval. Thus, if "Ebola" was mentioned 4 times in this interval, it will have a 4 in this field.
- avgSalience. The average "salience" score for this entity across all of its mentions during this 15 second interval, recording how central and "important" the entity was to the overall broadcast.
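To make the format concrete, here is a minimal Python sketch of reading one of these per-show files. The filename is a placeholder; the field names follow the documentation above.

```python
# A minimal sketch, assuming a per-show annotation file has already been
# downloaded locally (the path below is a placeholder). Each line is one
# JSON record describing a 15 second interval of the broadcast.
import json
from collections import Counter

def top_entities(ndjson_path, n=25):
    """Tally the most-mentioned entity name/type pairs across one broadcast."""
    counts = Counter()
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            interval = json.loads(line)
            for entity in interval.get("entities", []):
                key = (entity["name"], entity["type"])
                counts[key] += entity.get("numMentions", 1)
    return counts.most_common(n)

# Example usage with a placeholder filename:
# for (name, etype), mentions in top_entities("SOME_SHOW.geg.json"):
#     print(mentions, etype, name)
```

The same loop could just as easily aggregate avgSalience, collect iaClipUrl links for the most salient intervals or compare co-mention patterns across stations.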
Each day's files are recorded in an inventory listing for that day, named in the format YYYYMMDD.txt and ranging from July 2, 2009 through present:
- http://data.gdeltproject.org/gdeltv3/iatv/gegv2/YYYYMMDD.txt
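The sketch below shows one way to work with these daily listings in Python. It assumes each inventory file is a plain-text list of per-show file URLs, one per line; adjust accordingly if the listing format differs.

```python
# A minimal sketch, assuming each daily inventory file is a plain-text list
# of per-show annotation file URLs, one per line.
import urllib.request

INVENTORY_URL = "http://data.gdeltproject.org/gdeltv3/iatv/gegv2/{date}.txt"

def fetch_day(date):
    """Yield (url, raw_bytes) for every per-show file listed for a YYYYMMDD date."""
    with urllib.request.urlopen(INVENTORY_URL.format(date=date)) as resp:
        listing = resp.read().decode("utf-8")
    for url in listing.splitlines():
        url = url.strip()
        if not url:
            continue
        with urllib.request.urlopen(url) as resp:
            yield url, resp.read()

# Example usage: download one day's files and report their sizes.
# for url, raw in fetch_day("20190101"):
#     print(url, len(raw), "bytes")
```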
The dataset is also available in Google's BigQuery:
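For those who prefer to work in BigQuery, the sketch below shows the general shape of a query against the nested entity records. The table name is a placeholder and the nested "entities" schema is assumed to mirror the JSON structure documented above.

```python
# A minimal sketch using the google-cloud-bigquery client. The table name is a
# placeholder and the nested "entities" schema is assumed to mirror the JSON
# format documented above.
from google.cloud import bigquery

def top_entities_for_day(table, day, limit=25):
    """Return the most-mentioned entity name/type pairs for one UTC day."""
    client = bigquery.Client()
    sql = f"""
        SELECT entity.name AS name, entity.type AS type,
               SUM(entity.numMentions) AS totalMentions
        FROM `{table}` AS t, UNNEST(t.entities) AS entity
        WHERE DATE(t.date) = @day
        GROUP BY name, type
        ORDER BY totalMentions DESC
        LIMIT {int(limit)}
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("day", "DATE", day)]
    )
    return list(client.query(sql, job_config=job_config).result())

# Example usage with a placeholder table name:
# for row in top_entities_for_day("my-project.my_dataset.geg_v2", "2019-12-01"):
#     print(row.totalMentions, row.type, row.name)
```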
We're tremendously excited to see the kinds of advanced cross-modality analyses you're able to do with this powerful new non-consumptive dataset!