The GDELT Project

Announcing The GDELT Article List & RSS Feed

We are excited to announce today a new dataset called the GDELT Article List (GAL), based on the Global Embedded Metadata Graph (GEMG), that provides a standardized and minimized basic set of metadata for every article GDELT monitors, along with an RSS feed of all of the URLs we monitor. Instead of the entire rich raw collection of META and JSON-LD metadata compiled in the GEMG, the new GDELT Article List standardizes this data, looking across the GEMG metadata fields for each article to produce a single standardized record. For example, it looks across all of the various ways of specifying the author for an article to produce one single "author" field, performing all of the normalization tasks for you.
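To make the idea of collapsing many author encodings into one field concrete, here is a minimal sketch of that kind of normalization. It is not GDELT's actual implementation: the key names in `AUTHOR_KEYS` and the helper `normalize_author` are hypothetical, chosen only to illustrate looking across several candidate metadata fields in priority order.

```python
from typing import Any, Optional

# Hypothetical candidate keys in priority order; the real GEMG field
# names and precedence rules may differ.
AUTHOR_KEYS = ["jsonld:author", "meta:author", "meta:article:author"]


def normalize_author(metadata: dict) -> Optional[str]:
    """Return the first usable author string found, or None."""
    for key in AUTHOR_KEYS:
        value: Any = metadata.get(key)
        # JSON-LD often nests the author as an object with a "name" member.
        if isinstance(value, dict):
            value = value.get("name")
        if isinstance(value, str):
            # Collapse runs of whitespace and trim the ends.
            cleaned = " ".join(value.split())
            if cleaned:
                return cleaned
    return None
```

With this sketch, `normalize_author({"meta:author": "  Jane   Doe "})` yields `"Jane Doe"`, and a record with none of the candidate keys yields `None`.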

The goal of this minimized dataset is to make it easy to build rich graphical displays that link users to relevant news coverage. For example, perhaps you've built a "find more like this" browser plugin using our Global Similarity Graph (GSG) Document Embeddings that displays a list of other articles highly similar to the current one. The GSG contains just the URL and title of each article, which is sufficient for a simple text link but does not allow for the kinds of rich graphical "card" displays so common on today's web. With the new GAL dataset, you can instead look up each article's image, full outlet name, outlet logo, one-sentence summary description and author, and display them to the user in a compact news card.

The final dataset is a UTF-8 JSON-NL file produced each minute (though at present the files typically cluster in the minutes following each 15-minute mark). Each line is a single article and contains the following fields:

A new file can be downloaded directly each minute as a JSON file with the following URL structure, with the timestamp represented as "YYYYMMDDHHMMSS" in the UTC timezone. The first available file is shown below.

You can also query the dataset in BigQuery:

Note that this dataset currently contains a high level of duplicate URLs while we are in the process of updating it to GDELT 3.0; these duplicates can be safely filtered out. The historical backfile contains a substantially elevated number of duplicate records due to how it was computed. As with the frontfile, these can be safely filtered out.
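As a rough illustration of filtering duplicates out of a JSON-NL file, the sketch below reads one JSON record per line and keeps only the first record seen for each URL. The field name `"url"` and the helper `unique_articles` are assumptions made for this example, so check them against the actual field list before relying on it.

```python
import json
from typing import Iterable, Iterator


def unique_articles(lines: Iterable[str], url_key: str = "url") -> Iterator[dict]:
    """Yield the first record for each URL from JSON-NL lines.

    `url_key` is an assumed field name, not confirmed by the dataset docs.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        url = record.get(url_key)
        # Records without a URL, or with one already seen, are skipped.
        if url is None or url in seen:
            continue
        seen.add(url)
        yield record
```

For a downloaded file, this could be driven by something like `unique_articles(open(path, encoding="utf-8"))`. When merging many files, the `seen` set would need to be shared across them, at the cost of holding every URL in memory.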

As part of the GAL dataset, we are unveiling a new RSS feed endpoint that updates every minute and lists all of the URLs we have monitored over a rolling 15-minute window. This is a standard RSS feed containing the URL and title of each article. It is designed for those who simply want to ingest a mirror of what GDELT is discovering across the world, such as web archives preserving global news.
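For readers who want to consume such a feed, here is a small sketch that extracts (URL, title) pairs from RSS text using Python's standard library. It assumes the conventional RSS 2.0 layout of `<item>` elements with `<link>` and `<title>` children. The helper `parse_feed` is illustrative, and fetching the feed is left out because the endpoint URL is not reproduced here.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text: str) -> list:
    """Return (link, title) tuples for each <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        title = item.findtext("title", default="")
        if link and link.strip():
            items.append((link.strip(), title.strip()))
    return items
```

Because the window rolls forward while the feed refreshes every minute, consecutive polls will overlap heavily, so a consumer polling each minute would likely want to track the links it has already seen.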

We are tremendously excited to see what kinds of powerful new applications this new dataset enables!