We are excited to announce today a new dataset called the GDELT Article List (GAL), based on the Global Embedded Metadata Graph (GEMG), that provides a standardized and minimized basic set of metadata for every article GDELT monitors, along with an RSS feed of all of the URLs we monitor. Instead of the entire rich raw collection of META and JSON-LD metadata compiled in the GEMG, the new GDELT Article List standardizes this data, looking across the GEMG metadata fields for each article to produce a single standardized record. For example, it looks across all of the various ways of specifying the author for an article to produce one single "author" field, performing all of the normalization tasks for you.
The goal of this new minimized dataset is to make it easy to build rich graphical displays that link users to relevant news coverage. For example, perhaps you've built a "find more like this" browser plugin using our Global Similarity Graph (GSG) Document Embeddings that displays a list of other articles highly similar to the current one. The GSG contains just the URL and title of each article, which is sufficient for a simple text link, but does not allow for the kinds of rich graphical "card" displays so common on today's web. Using the new GAL dataset instead, you can look up the article's image, full outlet name, outlet logo, one-sentence summary description and author, and display them to the user in a compact news card.
The final dataset is a UTF-8 JSON-NL file produced each minute (though at present the files typically cluster in the minutes following each 15-minute mark). Each line is a single article and contains the following fields:
- date. The timestamp the article was published. Roughly 30% of articles specify their exact publication timestamp; for the rest, this is the time we first saw the article. Some articles may have older timestamps, which can mean either that we just saw them for the first time or that older content was republished under a new URL.
- url. The URL of the article.
- domain. The full domain name of the URL.
- outletName. For the 71% of articles that include the full human-readable name of the news outlet (such as "The Wall Street Journal" instead of "wsj.com"), this field will contain that name. Otherwise it will repeat the domain name.
- outletLogo. The thumbnail image to display as the logo of the news outlet that published the article. Around 87% of articles include this information. Note that this image may not be the same as the standard "/favicon.ico" site image.
- outletTwitter. The Twitter handle of the news outlet that published the article. Around 49% of articles include this information, allowing the creation of rich display cards that connect users to the outlet's social media presence.
- title. The title of the article.
- image. The primary image of the article. Around 82% of articles specify an image to be displayed when sharing or linking to the article.
- desc. The contents of the "desc" metadata field. More than 91% of articles include this field. In some cases it may simply be the first sentence of the article, but in many cases it is a one-sentence summary of the general gist and focus of the article.
- lang. The language code returned by CLD2 for this article.
- author. Around 25% of articles specify the author(s) of the page. The format of this field differs widely across outlets and many contain multiple authors, outlet names, etc.
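As a rough illustration of how the fields above map onto a news card, here is a minimal Python sketch. It assumes only what is described above: one JSON object per line, using the field names listed. The helper names (parse_gal_lines, to_card), the card keys and the sample lines are made up for illustration and are not real GAL output.

```python
import json


def parse_gal_lines(text):
    """Parse JSON-NL text: one JSON object per non-empty line."""
    records = []
    # Split on "\n" only, since other line-break characters may
    # legally appear inside JSON string values.
    for line in text.split("\n"):
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def to_card(record):
    """Pick out the fields a compact news card needs.

    Most fields are present for only a share of articles, so every
    lookup tolerates a missing key.
    """
    domain = record.get("domain", "")
    return {
        "headline": record.get("title", ""),
        "link": record.get("url", ""),
        # Defensive fallback to the domain if no outlet name is present.
        "outlet": record.get("outletName") or domain,
        "logo": record.get("outletLogo"),
        "image": record.get("image"),
        "summary": record.get("desc", ""),
        "byline": record.get("author"),
    }


# Invented sample lines, for illustration only.
SAMPLE = "\n".join([
    '{"url": "https://example.com/a", "domain": "example.com", '
    '"outletName": "Example Times", "title": "First story", '
    '"desc": "A one-sentence summary.", "author": "A. Writer"}',
    '{"url": "https://example.org/b", "domain": "example.org", '
    '"title": "Second story"}',
    "",
])

cards = [to_card(record) for record in parse_gal_lines(SAMPLE)]
```

A display layer can then skip the image, logo or byline whenever the corresponding card value is None.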
The dataset can be downloaded directly every minute as a JSON file with the following URL structure, with the date represented as "YYYYMMDDHHMMSS" in the UTC timezone. The first available file is seen below.
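For those scripting downloads, the timestamp portion of the file name can be generated along these lines. This is only a sketch of the "YYYYMMDDHHMMSS" UTC format described above: the URL template is a hypothetical placeholder that the caller must replace with the real GAL URL structure, and truncating to the whole minute is an assumption based on the files being produced each minute.

```python
from datetime import datetime, timedelta, timezone


def gal_timestamp(moment):
    """Format a timezone-aware datetime as YYYYMMDDHHMMSS in UTC.

    Seconds are zeroed on the assumption that the per-minute files
    are stamped on whole minutes.
    """
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc).replace(second=0, microsecond=0)
    return utc.strftime("%Y%m%d%H%M%S")


def gal_file_url(template, moment):
    """Fill a caller-supplied template containing '{timestamp}'."""
    return template.format(timestamp=gal_timestamp(moment))


# Hypothetical placeholder, NOT the real GAL endpoint.
PLACEHOLDER_TEMPLATE = "https://example.invalid/{timestamp}.json"

# A local time two hours ahead of UTC is converted before formatting.
local = datetime(2020, 1, 1, 1, 30, 45, tzinfo=timezone(timedelta(hours=2)))
example_url = gal_file_url(PLACEHOLDER_TEMPLATE, local)
```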
You can also query the dataset in BigQuery:
Note that a high level of duplicate URLs will appear in this dataset at this time while we are in the process of updating it to GDELT 3.0. You can safely filter these duplicates out. The historical backfile contains a substantially elevated number of duplicate records due to how it was computed; as with the frontfile, these can be safely filtered out.
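One simple way to filter the duplicates is to keep only the first record seen for each URL. The sketch below assumes records have already been parsed into Python dictionaries with a "url" key; the function name dedupe_by_url is made up for illustration, and keeping the first occurrence rather than the last is an arbitrary choice.

```python
def dedupe_by_url(records):
    """Yield each record whose URL has not been seen before.

    Works on any iterable, so it can be applied to one file or to a
    stream of records spanning several files.
    """
    seen = set()
    for record in records:
        url = record.get("url")
        if url is None:
            # No URL to compare on, so pass the record through untouched.
            yield record
            continue
        if url in seen:
            continue
        seen.add(url)
        yield record


# Invented records, for illustration only.
rows = [
    {"url": "https://example.com/a", "title": "First"},
    {"url": "https://example.com/b", "title": "Second"},
    {"url": "https://example.com/a", "title": "First (duplicate)"},
]
unique_rows = list(dedupe_by_url(rows))
```

Because the set of seen URLs grows with the input, a very large backfile may call for deduplicating in the database instead.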
As part of the GAL dataset, we are unveiling a new RSS feed endpoint that updates every minute and lists all of the URLs we have monitored over a rolling 15-minute window. This is a standard RSS feed containing the URL and title of each article. It is designed for those who simply want to ingest a mirror of what GDELT is discovering across the world, such as web archives preserving global news.
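Because the feed covers a rolling 15-minute window but updates every minute, consecutive polls will mostly overlap, so a consumer will want to remember which URLs it has already handled. The sketch below shows one way to do that with Python's standard library. It assumes the conventional RSS 2.0 layout, with each article in an item element holding link and title children; the function names and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text):
    """Return (url, title) pairs from RSS text, assuming <item> elements
    with <link> and <title> children."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        title = item.findtext("title", default="")
        if link and link.strip():
            items.append((link.strip(), title.strip()))
    return items


def new_items(items, seen):
    """Return only items whose URL is not yet in `seen`, and record them."""
    fresh = []
    for link, title in items:
        if link not in seen:
            seen.add(link)
            fresh.append((link, title))
    return fresh


# Invented feed content, for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First story</title><link>https://example.com/a</link></item>
    <item><title>Second story</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""

seen_urls = set()
first_poll = new_items(parse_feed(SAMPLE_FEED), seen_urls)
second_poll = new_items(parse_feed(SAMPLE_FEED), seen_urls)  # nothing new
```

A long-running consumer would also want to expire old entries from the seen set, since URLs drop out of the feed once they leave the 15-minute window.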
We are tremendously excited to see what kinds of powerful new applications this new dataset enables!