We are incredibly excited to announce the official release of the GDELT Global Knowledge Graph (GKG) Version 2.0. This marks a watershed moment for the GKG, transitioning it for the first time to production status, opening it up beyond purely news media, adding in new capabilities for richer emotions, citations, dates, and other kinds of data, and adding for the first time proximity context to allow for vastly richer contextualization.
The new GKG 2.0 format will be phased in over the coming weeks. It debuts in two special collections to be released later today and will become available for the daily updates in a few weeks, after which the historical GKG files will be reprocessed to convert them to the 2.0 format. We will also continue to generate the GKG 1.0 files into the future to ensure backwards compatibility with existing systems built for the 1.0 format.
It has been just short of eleven months since the original prototype introduction of the GKG 1.0 system on November 3, 2013, and in those eleven months the GKG system has found application in an incredible number and diversity of fields. The GKG indicators are unique in capturing the latent dimensions of society that precede physical unrest, and that uniqueness, together with their global scope, has enabled applications we could not have anticipated. We’ve learned a lot over the past year about the features and capabilities of greatest interest to the GKG community. With this Version 2.0 release, we are both integrating those new features and moving the GKG into production status from its original Alpha Experimental Release status of the past eleven months, in recognition of the widespread production use of the system today.
The new GKG format preserves most of the previous fields in their existing format for backwards compatibility (and we will continue to generate the Version 1.0 files in parallel into the future), but adds a series of new capabilities that greatly enhance what can be done with the GKG data, opening entirely new analytic opportunities. Some of the most significant changes:
- Proximity Context. Perhaps the greatest change is the introduction of the new Proximity Context capability. The GKG records an enormously rich array of contextual details from the news, encoding not only the people, organizations, locations and events driving the news, but also functional roles and underlying thematic context. However, with the previous GKG system it was difficult to associate those various data points with one another. For example, the GKG might record that Barack Obama, John Kerry, and Vladimir Putin all appeared somewhere in an article, that the United States and Russia appeared in that article, and that the roles of President and Secretary of State were mentioned in it, but there was no way to associate each person with the corresponding location and functional role. GKG 2.0 addresses this by providing the approximate character offset of each reference to an object in the original article. While not allowing for deeper semantic association, this new field allows for simple proximity-based contextualization. In the example article above, the mention of United States likely occurs much closer to Barack Obama and John Kerry than to Vladimir Putin, while United States and Secretary of State likely occur much closer to John Kerry than to the others. In this way, critical information on role, geographic, and thematic associations and other connectivity can be explored. Pilot tests have already demonstrated that these proximity indicators can be highly effective at recovering these kinds of functional, thematic, and geographic affiliations.
- Extensible XML Block. GDELT has historically relied primarily on mainstream news coverage for its source material. Whether from print, broadcast, or web-based media, news coverage across the world is relatively consistent in the kinds of information it captures. As GDELT encodes an ever-increasing range of materials, including academic journal articles and government reports, additional types of information are available to codify. As a first example of this, Leetaru, Perkins and Rewerts (2014) apply the GKG to encode more than 21 billion words of academic literature, including the entire contents of JSTOR, DTIC, CORE, CiteSeerX, and the Internet Archive’s 1.6 billion PDFs relevant to Africa and the Middle East. Academic literature contains a list of cited references at the bottom of each article that indicates the papers cited within that paper. This citation list is extremely valuable in constructing citation graphs over the literature to better understand trends and experts. Yet such citation lists are unique to this class of literature and will not be found in ordinary news material, so it would be cumbersome to add additional fields to the GKG file format to handle each of these kinds of specialized data types. Instead, the GKG now includes a special field called V2EXTRASXML that is XML formatted and includes these kinds of specialized data types that are applicable only to subsets of the collection. Moving forward, this will allow the GKG to encode highly specialized enhanced information from specialized input streams.
- Global Content Analysis Measures (GCAM). There has been enormous interest in, and application of, the GKG Tone fields over the past year, with some really exciting results. We’ve heard from many of you that you’d love to move beyond just positive/negative tone towards a far richer and more diverse ecosystem of emotional measures. In answer to this, we have added the Global Content Analysis Measures (GCAM) system, which runs each document through an array of content analysis systems to assess several thousand latent dimensions for each article.
- Date Mentions. Many of you have asked us to encode the list of date references found in news articles and documents in order to identify repeating mentions of specific dates as possible “anniversary violence” indicators. All day, month, and year dates are now extracted from each document.
- Unique Record Identifiers. To bring the GKG in line with the practices of the GDELT Event Database, every GKG record is now assigned a unique identifier. As with the event database, sequential identifiers do not indicate sequential events, but an identifier uniquely identifies a record across the entire collection. The addition of unique record identifiers to the GKG will make it easier to uniquely refer to a particular GKG record.
- Single Data File. Previously there were two separate GKG data files, one containing Counts only and one containing the full GKG file. The original rationale for having two separate files was that users interested only in counts could download a much smaller daily file, but in practice nearly all applications use the full GKG file in order to make use of its thematic and other data fields to contextualize those counts and to tie them into the GDELT Event Database. Thus, we are eliminating the separate counts-only file to simplify the GKG data environment.
- Production Status. The GKG has now moved out of Alpha Experimental Release status and into production status. This means that the file format is now stabilized and will not change.
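To make the Proximity Context idea above more concrete, here is a minimal Python sketch of proximity-based association. It assumes the character offsets have already been pulled out of a GKG 2.0 record into simple `(name, offset)` pairs. The offsets, the `nearest_by_offset` helper, and the `max_distance` cutoff are all hypothetical and chosen for illustration; the sketch does not parse the actual GKG file layout.

```python
from typing import Dict, List, Optional, Tuple

# A mention is (entity name, approximate character offset in the article).
Mention = Tuple[str, int]


def nearest_by_offset(
    targets: List[Mention],
    candidates: List[Mention],
    max_distance: Optional[int] = None,
) -> Dict[str, str]:
    """Map each target name to the candidate name mentioned closest to it.

    Pairs farther apart than max_distance (if given) are ignored, so a
    target with no nearby candidate is simply left out of the result.
    """
    best: Dict[str, Tuple[int, str]] = {}
    for t_name, t_off in targets:
        for c_name, c_off in candidates:
            dist = abs(t_off - c_off)
            if max_distance is not None and dist > max_distance:
                continue
            if t_name not in best or dist < best[t_name][0]:
                best[t_name] = (dist, c_name)
    return {name: cand for name, (_, cand) in best.items()}


# Hypothetical offsets for the example article discussed above.
persons = [("Barack Obama", 120), ("John Kerry", 410), ("Vladimir Putin", 1350)]
locations = [("United States", 150), ("Russia", 1300)]
roles = [("President", 100), ("Secretary of State", 395), ("President", 1335)]

person_to_location = nearest_by_offset(persons, locations, max_distance=300)
person_to_role = nearest_by_offset(persons, roles, max_distance=300)

print(person_to_location)
print(person_to_role)
```

The distance cutoff is a design choice for this sketch: without it, every person would be paired with some role or location even when the nearest one is far away in the text, which is exactly the kind of spurious association that simple proximity cannot rule out on its own.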