We are tremendously excited to announce today the debut of the GDELT Global Similarity Graph (GSG), which computes the pairwise similarity of each online news article monitored by GDELT in the 65 languages it live-translates, yielding a realtime multilingual graph of the similarity of the world's news landscape. Every 15 minutes, GDELT compiles all of the coverage it has monitored in the last 15 minutes in its 65 live-translated languages and compares each article to every other article published in those 15 minutes and the prior 15 minutes, recording each pair of articles that exceeds a minimum similarity threshold.
Using this dataset you can perform a number of powerful tasks:
- Find Articles Like This. Given the URL of a news article, you can scan this dataset for related coverage to identify a broader range of perspectives, narratives and framings across communities, countries and languages.
- Find Alternative Language Coverage. While machine translation has advanced in leaps and bounds in recent years, it is still imperfect and can struggle with the nuanced detail of complex stories. Using this dataset you can take an article in a given language and scan for highly similar articles in a language you speak. Alternatively, you might be interested in coverage of a given story in a specific language and can use this dataset to take an article in one language and identify highly similar coverage in another language.
- Group Articles By Storylines. Applying graph analysis and clustering tools to this dataset you can break the global news landscape into topical clusters or even discrete storylines, making it easier to identify emerging events, narratives and stories.
How It Works
The Global Similarity Graph performs brute-force pairwise similarity scoring across all coverage monitored every 15 minutes against itself and the previous 15 minutes. While extremely computationally demanding, this brute-force approach offers the highest possible resolution for understanding the similarity of global news coverage and avoids skewing results through random clustering seeds or the edge cases of approximative approaches.
To compute the similarity of each article, the GSG uses either the original text (for English language articles) or the machine translated version (for GDELT's 65 live-translated languages). Given that an article's lead paragraph is typically a concise summary of its most important details, we primarily focus on the opening text of each article. We filter the text to reduce it to primarily nouns since they are typically the most distinguishing features of a news article and compute the cosine similarity of that filtered text, along with the number of unique words they share in that filtered opening text. At present, only articles with a cosine similarity of 0.3 or greater and a minimum of 5 unique shared words are included in the GSG. While there are myriad similarity metrics, after several years of in-depth experimentation across global news coverage, we've found that limiting the similarity analysis to the opening article text, reducing to nouns and using a dual cosine similarity + unique shared words metric offers the greatest flexibility in the kinds of possible downstream analyses and the greatest robustness against translation error and the linguistic, descriptive and narrative differences of the world's languages. In particular, the naive who-what-where-when-why entity grouping that is traditionally used to group news coverage yields a similarity stream that is far too coarse for narrative analysis, grouping for example all coverage mentioning Joe Biden and the US Congress together in a given day and collapsing myriad independent storylines together. In contrast, the approach used here, while significantly more computationally intensive, allows applications to devise their own clustering approaches that can yield as fine-grained or coarse clusters as desired.
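To make the dual metric above concrete, here is a minimal sketch in Python. It assumes the opening text has already been reduced to a list of lemmatized nouns and that the vectors are raw word counts; the actual weighting used in the GSG pipeline is not specified here, so treat both as illustrative assumptions. The function names are hypothetical.

```python
import math
from collections import Counter

MIN_COSINE = 0.3  # minimum cosine similarity stated above
MIN_SHARED = 5    # minimum number of unique shared words stated above

def cosine_and_shared(words_a, words_b):
    """Return (cosine similarity, unique shared word count) for two word lists."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    shared = set(counts_a) & set(counts_b)
    # Assumption: raw term counts as vector weights.
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return dot / (norm_a * norm_b), len(shared)

def passes_thresholds(words_a, words_b):
    """Apply both thresholds: a pair must clear each one to be kept."""
    score, shared = cosine_and_shared(words_a, words_b)
    return score >= MIN_COSINE and shared >= MIN_SHARED
```

Requiring both conditions means that two very short leads with a high cosine score but only a handful of words in common are still excluded.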
In addition to content similarity analysis, articles which share an identical title are also grouped together, since articles with identical titles typically represent wire stories, syndicated content or content shared across a group of news outlets with shared ownership.
It is important to note that machine translation error and linguistic differences can have an outsized impact on clustering and you may observe systematic differences in clustering scores across language pairs that may partially reflect the linguistic similarity of those language pairs. Errors in part of speech filtering, document extraction and lead extraction may further impact the accuracy of the scores presented here. Cosine similarity represents just one possible mechanism for measuring the "similarity" of two textual passages and is an approximation based on their literal word overlap. Words are lemmatized, but no semantic normalization is performed, meaning the system for example doesn't know that a "golden retriever" and a "dog" refer to the same animal. This can particularly impact similarity scores for non-proper nouns and decrease the similarity of language pairs that have substantial divergences in noun use. Thus, all similarity scores here represent a rough approximation and are imperfect.
Only article pairs with a cosine similarity of 0.3 or greater and 5 or more unique shared words are included in the GSG. Only articles for which a "high confidence" machine translation is available and whose translation passes certain linguistic tests are considered, meaning not all articles are scored for similarity. Thus, if an article pair appears in the GSG, that pair has sufficient similarity, but if a pair does not appear, it does not necessarily mean the articles are not similar: it simply means that, based on the processing pipeline, methodology and thresholds used here, they did not pass the similarity threshold.
While the Global Similarity Graph only examines similarity over a running 30-minute window, you can extend this window further back by transitively chaining similarity scores. A given URL will only have similarity scores comparing it to articles published in the same 15-minute window and the previous 15 minutes. To transitively chain articles into larger clusters through time, first compile a list of all URLs sufficiently similar to the URL of interest. For each of those articles, compile a list of all articles similar to them. Then compile a list of all articles similar to those, and so on backwards through time for the desired duration. Note that this can quickly cause topical drift: if article A is 50% similar to B, B is 50% similar to C and C is 50% similar to D, you may pull larger and larger collections of topics into your cluster. You can address this by requiring higher similarity scores, at the expense of excluding a lot of relevant coverage, or simply design your application to accommodate this drift. While more computationally intensive, an alternative is to apply time-aware graph clustering techniques, which use the position and context of each additional article in the overall graph to determine when drift has exceeded a preset maximum.
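The chaining procedure above amounts to a breadth-first walk over similarity edges. The sketch below assumes the edges have already been loaded as `(from_url, to_url, score)` tuples; the `min_score` and `max_hops` parameters are hypothetical knobs for limiting drift, not part of the dataset.

```python
from collections import defaultdict, deque

def chain_similar(edges, seed_url, min_score=0.5, max_hops=3):
    """Breadth-first expansion from seed_url over similarity edges.

    edges: iterable of (from_url, to_url, score) tuples.
    Returns {url: number of hops from the seed} for every article reached.
    """
    neighbors = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score:
            # Assumption: similarity is treated as symmetric.
            neighbors[a].add(b)
            neighbors[b].add(a)

    hops = {seed_url: 0}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors[url]:
            if nxt not in hops:
                hops[nxt] = hops[url] + 1
                queue.append(nxt)
    return hops
```

Raising `min_score` or lowering `max_hops` trades coverage for tighter topical focus, which is the drift tradeoff described above.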
Clustering Into Stories
Using the transitive grouping above, you can cluster articles into larger "stories" such as a major event or narrative. For example, select a single article about a major news event and transitively follow its similarity edges out several levels to arrive at an instant cluster of coverage about that story across countries and languages. One challenge in such clustering lies in how to handle "bridge articles" which connect clusters. For example, one cluster of stories might detail the Biden-Putin summit, while another cluster focuses on Biden's domestic agenda before Congress. A bridge article might discuss both stories and how they relate, connecting the two clusters as it is equally similar to articles in both clusters. How to handle such bridge articles depends on the specific application. Some applications might wish to discard bridge articles, assign them to both clusters without connecting the clusters themselves, identify these bridges as special kinds of articles or merge clusters. Graph clustering techniques are highly relevant here and offer myriad ways of handling such cases.
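One possible way to flag bridge articles is to treat them as cut vertices (articulation points): articles whose removal would split an otherwise connected cluster. The brute-force sketch below is only an illustration of that idea under the assumption that edges are given as `(url_a, url_b)` pairs. The function names are hypothetical, and large graphs would call for a linear-time algorithm instead.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (url_a, url_b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        graph[a].add(b)
        graph[b].add(a)
    return graph

def reachable(graph, start, skip=None):
    """Return the set of nodes reachable from start without passing through skip."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt != skip and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def find_bridge_articles(edges):
    """Return articles whose removal disconnects their neighbors from each other."""
    graph = build_graph(edges)
    bridges = set()
    for node in list(graph):
        others = graph[node]
        if len(others) < 2:
            continue
        start = next(iter(others))
        # If any neighbor becomes unreachable once node is removed, node is a bridge.
        if not others <= reachable(graph, start, skip=node):
            bridges.add(node)
    return bridges
```

Once flagged, an application can discard these articles, assign them to both clusters or merge the clusters, as discussed above.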
Eventually the Global Similarity Graph will update every minute, but for the moment it updates every 15 minutes and processes coverage monitored 15 minutes before that. Applications should thus check for new files every 15 minutes and look for the files covering the minutes from 30 minutes earlier.
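As a rough sketch of this polling schedule, the helper below computes which per-minute timestamps an application might look for at a given moment. It assumes a timezone-aware input, and the `lag_minutes` and `window_minutes` parameters are hypothetical names for the 30-minute lag and 15-minute batch described here. How a timestamp maps to a file name is not covered by this sketch.

```python
from datetime import datetime, timedelta, timezone

def minutes_to_fetch(now, lag_minutes=30, window_minutes=15):
    """Return the per-minute UTC timestamps of the batch expected to be available at `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc).replace(second=0, microsecond=0)
    # Round down to the most recent 15-minute marker (:00, :15, :30, :45).
    marker = now - timedelta(minutes=now.minute % 15)
    start = marker - timedelta(minutes=lag_minutes)
    return [start + timedelta(minutes=i) for i in range(window_minutes)]
```

Because most files appear in the first few minutes after each marker, many of the timestamps returned may have no corresponding file, so an application should tolerate missing minutes.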
The GSG is distributed as a series of UTF8 JSON-NL files, one per minute (though most files will appear in the minutes after each 15 minute marker, so :00, :01, :02, :03, :15, :16, etc.). Each record includes the following fields:
- Type. Either "sim" indicating a cosine similarity match or "title" indicating that the two articles share the same title.
- simScore. For sim matches this ranges between 0.3 and 1.0 and is the raw unitized cosine similarity score. For title matches it is always 1.0.
- simWords. For sim matches this is the number of unique words the two articles share in the lead text. For title matches it is always -1.
- fromDate. The date/time GDELT saw the first article.
- fromUrl. The URL of the first article.
- fromTitle. The title of the first article. May be blank.
- fromLang. The CLD2 human-readable language name of the first article's primary publication language. Typically uppercase, but may be mixed case for some languages.
- fromImage. The primary illustrative image of the first article (such as its Open Graph Image). May be blank.
- toDate. The date/time GDELT saw the second article.
- toUrl. The URL of the second article.
- toTitle. The title of the second article. May be blank.
- toLang. The CLD2 human-readable language name of the second article's primary publication language. Typically uppercase, but may be mixed case for some languages.
- toImage. The primary illustrative image of the second article (such as its Open Graph Image). May be blank.
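Since each file is JSON-NL (one JSON object per line), records can be read with a few lines of Python. The sketch below assumes the field names listed above. Because the list capitalizes "Type" while the other fields are camelCase, it accepts either spelling as a precaution. The function name and its parameters are hypothetical.

```python
import json

def read_gsg_records(lines, min_score=0.3, include_titles=True):
    """Yield parsed records from an iterable of JSON-NL lines.

    Similarity matches below min_score are dropped; title matches are
    kept or dropped according to include_titles.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        # Assumption: the key may be spelled "Type" or "type".
        kind = rec.get("Type", rec.get("type"))
        if kind == "title":
            if include_titles:
                yield rec
        elif kind == "sim" and float(rec.get("simScore", 0)) >= min_score:
            yield rec
```

For example, passing an open file handle as `lines` with `min_score=0.5` and `include_titles=False` would yield only the stronger content-based matches.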
You can download the files directly in the following format. The first available file is from midnight July 2, 2021. There is one file per minute, but as noted above, most files will cluster in the minutes right after each 15 minute mark and will always lag by 30 minutes (so at 3:00PM UTC you would search for the minutes from 2:30PM UTC).
You can also query the dataset in BigQuery:
We are immensely excited by the opportunities this new dataset brings to understand global news narratives!