Announcing the GDELT Global Difference Graph (GDG): Planetary Scale Change Detection For The Global News Media

Just under a decade ago I showed how coupling web archiving with large scale automated change analysis could be used to track how the White House was quietly rewriting America’s history in real time. The success of this model led to many other projects like ProPublica’s ChangeTracker and NewsDiffs. Yet, beyond a few efforts focused on a small number of articles from a handful of mostly US outlets, there has been little attention paid to the grander question: planetary scale change detection for the global news media.

What would it look like to essentially perform a global “diff” on a growing fraction of the entire daily output of the world’s online news outlets each day and each week to construct a running catalog of every change made to the online news coverage that forms the basis of our understanding of the world around us?

Such a catalog would allow us to see for the first time just how much of our online news landscape is rewritten on a daily basis, what kinds of changes are made and whether specific topics or stories undergo more rewriting, or even deletion, than others, and it would allow us to begin to quantify both the scope and impact of the growing practice of “stealth editing.”

Today we are immensely excited to announce the alpha release of the GDELT Global Difference Graph (GDG) that does exactly that. The GDG recrawls every news article monitored by GDELT after 24 hours and again after one week and compiles a running catalog of every change made to the article, from deletion to URL redirection to title changes to the article text itself being rewritten. When cataloging textual changes, only changes to the actual article text itself are considered – the surrounding navigation bars, headers, footers, trending sections, insets and other sections are excluded. This means that instead of flagging every time the trending stories box changes or an ad is rotated, the GDG only reflects actual changes to the story itself.

At a technical level, every 60 seconds the GDG compiles a list of every article monitored by GDELT during that one-minute period the previous day and 7 days prior and recrawls each of the URLs, checking for any changes. (Until GDELT 3.0 launches, the GDG will update every minute, but will be largely limited to seeing new content only every 15 minutes due to GDELT 2.0’s 15 minute update cycle.) Changes in the title or body text are compiled using an optimized fast differencing engine that compares at either the “word” level (for space-delimited languages) or the character level (currently for Burmese, Chinese, Dzongkha, Japanese, Khmer, Laothian, Thai, Tibetan and Vietnamese, with more being added as GDELT 3.0 launches). The text of any hyperlinks found in the article is included in the analysis, but text found in image captions is currently excluded to focus results on the body text (due to the number of outlets that display only partial captions and the high rate of caption editing on some sites, which results in large numbers of counterintuitive edits – we may add a dedicated "captions" category in the future).
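As a rough illustration of the two comparison modes, here is a minimal sketch using Python’s standard difflib. The actual GDG engine is an optimized custom differencer, so this is only meant to show how word-level and character-level tokenization differ:

```python
import difflib

def tokenize(text, unit):
    """Split text into comparison units: whitespace-delimited "words" for
    space-delimited languages, or individual characters otherwise."""
    return text.split() if unit == "word" else list(text)

def diff_texts(old, new, unit="word"):
    """Return a list of (old_run, new_run) replacement pairs between two versions."""
    old_toks, new_toks = tokenize(old, unit), tokenize(new, unit)
    matcher = difflib.SequenceMatcher(a=old_toks, b=new_toks, autojunk=False)
    joiner = " " if unit == "word" else ""
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((joiner.join(old_toks[i1:i2]), joiner.join(new_toks[j1:j2])))
    return changes

print(diff_texts("the minister said on Monday", "the minister resigned on Tuesday"))
# [('said', 'resigned'), ('Monday', 'Tuesday')]
```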

The GDG currently records four classes of change: download errors, redirects, title changes and text changes. Download errors record cases where the crawlers were unable to fetch the original page due to a 404, 410 or 451 response (it is too difficult to separate other error classes from IP-targeted responses or transient network errors, but we may expand the list of reported responses over time). Redirects record when the server redirects the client to a different URL (in which case the crawlers follow the redirect and continue their change analysis). Title changes reflect any edits made to the page title and text changes reflect edits made to the body text of the article itself. Pages that were found to be unchanged are recorded as either having exactly the same HTML code or having exactly the same text (in cases where the surrounding HTML of the page has changed, but the article text itself was not edited).

For web archives we also offer an RSS feed that updates every 15 minutes and provides a list of all of the changed URLs identified over the previous 15 minutes. This makes it trivial for archivers to ensure they recrawl changed pages as quickly as possible.

In the end, this is an alpha release, so it is important to note that the results won’t always be perfect – we’d love your feedback, error reports and suggestions.

 

ALPHA RELEASE

This inaugural release of the GDELT Global Difference Graph (GDG) should be considered an alpha-grade release. Based on user feedback and our own monitoring over time, we will continue to improve this service as we roll it towards beta and may make breaking changes to the data format, assumptions, workflows and endpoints. It also means that while we have worked hard to make this service as accurate as possible, the sheer number of moving parts and the enormous complexity of tasks like document extraction at global scale mean there will always be some level of error in the resulting data. There may also be brief outages from time to time as we roll out major infrastructure changes to the system, fine-tune our crawler fleet and adjust outputs and assumptions. The alpha GDG release also enforces a versioning requirement on document extraction upgrades, which means that after major updates to our extraction infrastructure there will be a 24-hour pause on text change comparisons (see the section below for more details); this requirement will be removed as we move towards beta.

To put it more succinctly, we’ve done our best to create a service that is as accurate as we can make it and that we think will be tremendously powerful for understanding the fluidity of the global news landscape, but keep in mind that as an alpha release it will have some rough edges and errors.

 

ACCURACY

The GDG identifies a number of different kinds of changes to articles. Server errors (currently only 404, 410 and 451 errors) are recorded as-is from the remote server. It is important to remember that many kinds of errors may be transient, and errors like “403 Forbidden” or a timeout may be unique to a specific crawler’s interaction with the server rather than a broader error response. GDELT’s crawler fleet is distributed across Google’s immense global data center footprint, meaning at any given moment a new crawler might launch on an IP address that previously belonged to a different user that was blocked by a given site, inadvertently affecting that crawler for the duration of its life (typically 30 minutes to an hour), or a crawler may encounter a temporary issue with the remote server that causes it to time out or receive a server error. Transient network errors can also result in a code 200 response that disconnects before the page content itself is transferred. Most permanent error responses like 404s will not change, but keep in mind that the GDG merely reports the experience of our crawlers when attempting to access the page at that specific instant in time and you may see different results from your own IP addresses. For precision tasks you may wish to feed error URLs into your own crawlers to test them yourself.

GDG crawlers use a 30-second timeout to give the best possible chance to very slow websites. For timeouts and 500-series errors, the crawlers will retry the URL a second time using a brand-new connection and allow another 30-second window to receive the results.
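A rough sketch of this fetch-and-retry behavior is below, using Python’s requests library. The 30-second timeout and the single retry on a fresh connection mirror the description above; everything else (no custom headers, no politeness delays) is simplified for illustration:

```python
import requests

TIMEOUT = 30  # seconds, matching the crawler behavior described above

def fetch_with_retry(url):
    """Fetch a URL, retrying once on a timeout or 500-series response,
    using a fresh connection for the retry."""
    resp = None
    for attempt in range(2):
        try:
            # A new Session per attempt ensures a brand-new connection.
            with requests.Session() as session:
                resp = session.get(url, timeout=TIMEOUT)
            if resp.status_code < 500:
                break  # success, a redirect already followed, or a non-retryable 4xx
        except requests.exceptions.Timeout:
            resp = None  # retry once, then give up
    return resp  # None if both attempts timed out
```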

To perform cross-domain or longitudinal analyses, use the UNCHANGED_HTML and UNCHANGED_CONTENT statuses (documented below) as a normalization factor, comparing the number of statuses indicating content changes to the number of UNCHANGED_* statuses. This way, if a site has a problem where most requests time out for a few hours and there are only a handful of change reports during that period, you can see from the low number of UNCHANGED_* statuses during those hours that the reduced number of change reports is not a result of the site performing less editing, but rather that our total monitoring volume from that site was reduced during that period of time.
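As a minimal sketch of this normalization, assuming you have already parsed the GDG JSON records into Python dictionaries (the status names follow the table later in this post; note that a URL with both a title and a text change contributes two rows, so the ratio is approximate):

```python
from collections import Counter

CHANGED = {"PAGE_TITLECHANGE", "PAGE_TEXTCHANGE"}
UNCHANGED = {"UNCHANGED_HTML", "UNCHANGED_CONTENT"}

def change_rate_by_domain(records):
    """Estimate each domain's edit rate as change statuses divided by all
    compared statuses, so hours of reduced crawl volume (few UNCHANGED_* rows)
    don't masquerade as hours of reduced editing."""
    changed, total = Counter(), Counter()
    for rec in records:
        domain = rec.get("page_domain_root", "")
        if rec["status"] in CHANGED:
            changed[domain] += 1
            total[domain] += 1
        elif rec["status"] in UNCHANGED:
            total[domain] += 1
    return {d: changed[d] / total[d] for d in total if total[d]}
```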

When evaluating change reports, especially those involving large scale changes to an article’s text, it is important to consider whether the page also has a redirection status indicating that the original URL now redirects to a new URL. In the vast majority of cases redirections point to an updated version of the article, but in some cases sites may redirect to administrative pages, such as an article that now redirects to a GDPR compliance page (instead of just displaying the notice as a popover on the original page like most sites), which causes a page text changed status to report that the entire article text has been replaced. Some sites also do not correctly return 404s for deleted pages and instead return a 200 with a page telling the user, in textual or visual terms, that the original page no longer exists. This can cause the page text changed status to report that the entire page has changed. Checking the replacement text for boilerplate mentions of "404" or a "page not found" error in its respective language can help identify these cases.
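A crude illustration of that kind of boilerplate check appears below; the phrase list is purely a placeholder and a real check would need patterns for every language you monitor:

```python
# Placeholder phrase list -- a real check would need per-language patterns.
SOFT_404_MARKERS = [
    "404", "page not found", "page no longer exists",
    "página no encontrada", "seite nicht gefunden",
]

def looks_like_soft_404(replacement_text):
    """Heuristic: flag replacement text that reads like an error page rather
    than a rewritten article (e.g. a 200 response returned for a deleted story)."""
    lowered = replacement_text.lower()
    return any(marker in lowered for marker in SOFT_404_MARKERS)
```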

Given the simplicity of extracting page titles, changes to title text identified by the GDG can typically be trusted as-is. For precision tasks it is important to note that GDELT combines the HTML <TITLE></TITLE> tags with other elements of the page like the OG:TITLE and other attributes to determine the “best” title and to handle TITLE tag abuses by some sites, as well as attempting to remove extraneous elements like the site name from the title. However, in nearly all cases title changes can be accepted as accurate. This can include the outright removal of a page’s title, in which case a page that formerly had a title is, for unknown reasons, republished without one.

Article body text changes, on the other hand, are entirely dependent on the document extraction infrastructure’s ability to correctly, robustly and repeatably extract the article text from the rest of the page, and to do so with absolutely stable precision.

There are many factors that can negatively affect our ability to precisely extract the body text, introducing error, particularly around the exact start and stop boundaries of the text and the removal of insets, allowing extraneous content to slip in that may change between page fetches. For example, the server might truncate sections of the page without reporting a problem. If a sufficient number of critical sections are removed, that may impact how we see the page, or even remove the text entirely if it is buried at the bottom of a lengthy HTML page that is truncated. Major structural changes to the page can impact the extractor’s ability to correctly locate document boundaries. In extreme cases a temporary misconfiguration of a server or CMS template might inject sufficiently large blocks of badly encoded text into a page (or even corrupt the encoding of the entire page) that it interferes with our ability to correctly detect and decode the character set encoding of the page, leading us to incorrectly detect the entire page as changed. In practice, certain changes to a page may cause our extraction infrastructure to change its estimate of what precisely constitutes the body of an article, allowing small amounts of unrelated text to seep in that differ between page fetches.

In short, the sheer number of moving parts involved in correctly and precisely extracting each article’s body text across the incredible stylistic, technical and linguistic diversity of the world’s news outlets means there will necessarily be a certain percentage of errors in our extractions, no matter how accurate the extractor is overall. Stable errors are those that persist despite small changes to the page between fetches. Unstable errors are those where small but critical changes to the page yield small shifts in the extracted text’s start and end boundaries or in inset removal, leading to false positive change detections.

For example, some sites have quirks in their HTML that can make them difficult to properly extract, such as embedded “popover” text that is rendered in HTML as standard body text, with JavaScript code that dynamically hides it at runtime or “last updated” text that reports how many hours it has been since the page was published and is integrated into the body text with no styling or other visual or structural indicators to allow it to be separated.

In the end this means that not all changes the algorithms detect are correct – there will always be a certain percentage of false positives. For precision comparison needs, we recommend you manually compare the current version of the page against a historical version and search for the changes we report. Since 2014 GDELT has sent a realtime stream of all of the URLs we monitor to the Internet Archive for ingestion into their Wayback Machine, which as of last November totaled more than 221TB across 5.3 billion URLs. However, there can be a delay of several hours between the URL being received by the Archive and when they are able to crawl it, meaning that changes made in the first hour or two after publication may not be reflected in the Archive’s version of the page. We are currently working on several fronts to make it easier to manually verify changes for high precision use cases.
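For example, the Internet Archive exposes a simple public availability endpoint that returns the archived snapshot closest to a given timestamp; a minimal sketch of using it to locate a historical copy for manual comparison might look like the following (the timestamp format is YYYYMMDDhhmmss):

```python
import requests

def closest_wayback_snapshot(url, timestamp=None):
    """Query the Internet Archive's availability API for the snapshot of a URL
    closest to the given YYYYMMDDhhmmss timestamp (or the most recent one)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    resp = requests.get("https://archive.org/wayback/available",
                        params=params, timeout=30)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

# e.g. closest_wayback_snapshot("http://example.com/story.html", "20180827133100")
```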

Caveats aside, in the majority of cases when you see a text change recorded in the GDG it means there was a legitimate change to the article’s text.

 

DOCUMENT EXTRACTION INFRASTRUCTURE UPGRADES

Correctly and robustly extracting the body text of a news article from the rest of the page and doing so accurately at planetary scale, spanning a growing fraction of accessible online news outlets in every country and more than 65 languages, requires an immensely complex and powerful document extraction infrastructure that we are constantly improving over time. Ordinarily these continual improvements are invisible – the quality of extractions just continues to improve over time. Periodically, however, we release substantial upgrades that materially improve handling for a large number of cases. These kinds of large changes create a problem when comparing the article text extracted today using the upgraded infrastructure with the article text extracted yesterday using the original infrastructure – the changes between the two text versions may merely reflect adjustments in the functioning of the algorithm rather than legitimate changes to the article itself.

To address this, the alpha GDG release will only compare versions of an article that were extracted using the same document extraction infrastructure version. This means that when we roll out a major upgrade to the extraction infrastructure, the version numbers of its various components change. If we release an upgrade today, the GDG crawlers will immediately begin extracting using the new codebase, but the version signature of the extracted text won’t match that of the previous versions of the article extracted yesterday and one week ago. In this case, the GDG crawlers will still perform the HTTP and HTML checks but will not proceed to document extraction and title and text comparison.

Thus, during the alpha phase, any major upgrade to the extraction infrastructure will mean a 24-hour period from that point forward where the GDG system will not perform title or text comparison checks. After 24 hours the versions of the articles being dispatched for their 24-hour check will now match the previous version and the comparisons will resume and after one week the weekly checks will resume as well.
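Conceptually, the gating works something like the sketch below; the "extractor_version" field name is purely illustrative and is not part of the published data format.

```python
def should_compare_text(orig_record, check_record):
    """Only compare extracted titles and text when both versions were produced
    by the same extraction engine version; otherwise apparent "changes" could
    merely reflect improvements to the extractor itself. The field name
    'extractor_version' is illustrative, not part of the GDG output."""
    return (orig_record.get("extractor_version") ==
            check_record.get("extractor_version"))
```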

As we progress through the alpha phase towards beta, we are working on a fallback infrastructure that will allow the GDG crawlers to directly fetch the previously crawled original HTML and re-perform document extraction using the new engine so that the results are comparable. This approach poses considerable infrastructure and computational requirements, so it likely will not be implemented until the beta release, but we hope to have such dynamic re-extraction in place by then.

 

TECHNICAL DETAILS AND FILE FORMAT

The GDG is available as newline-delimited JSON files, produced once per minute, or as a BigQuery table updated every 15 minutes. The BigQuery table is loaded directly from the JSON files, so it mirrors their format precisely. An additional RSS feed of the changed URLs is also available.

 

  • JSON Files. All files are in newline-delimited gzipped UTF8 JSON format and are available at http://data.gdeltproject.org/gdeltv3/gdg/YYYYMMDDHHMMSS.gdg.v3.json.gz, starting from the first file http://data.gdeltproject.org/gdeltv3/gdg/20180827133100.gdg.v3.json.gz and running through one minute ago. The “SS” (seconds) component of the date is always zero and files are added one per minute. To download all available files, iterate from the first file to one minute ago in one-minute increments (see the download sketch after this list). Note that due to the 15 minute update cycle of GDELT 2.0, the GDG receives a new batch of URLs every 15 minutes and may finish processing them within a few minutes, so there may be stretches with no files while the system waits for the next input batch – be aware that many minute files are simply missing, representing periods of no data.
  • BigQuery. The entire dataset is also available as a public BigQuery table at gdelt-bq:gdeltv2.gdg_partitioned, partitioned on the “fetchdate_check” field to allow you to minimize the amount of data you have to query. The BigQuery table is populated directly from the JSON files, so it is an exact duplicate in terms of the meaning of each of the fields.
  • RSS Feed. For web archives and others that want a continually updating stream of changed URLs, there is also an RSS 2.0 feed at http://data.gdeltproject.org/gdeltv3/gdg/RSS-GDG-15MINROLLUP.rss that updates every 15 minutes (typically a few minutes after :00, :15, :30 and :45 past the hour) and contains a list of all of the URLs that the GDG detected in the last 15 minutes as having changed in some fashion since they were originally crawled. The feed contains only those URLs found in the last 15 minutes, meaning it must be polled every 15 minutes (ideally at :05, :20, :35 and :50 past the hour to allow for the minute or two it may take the system to generate the file). At this time the feed includes all URLs flagged as having any kind of change, so archives that wish to record only specific kinds of changes (such as only 404s) should process the JSON files and filter on the status code.
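Below is a minimal sketch of iterating the minute files over a time range and tolerating the missing-minute gaps described above; the one-hour range is just an example.

```python
import gzip
import json
from datetime import datetime, timedelta

import requests

BASE = "http://data.gdeltproject.org/gdeltv3/gdg/{:%Y%m%d%H%M%S}.gdg.v3.json.gz"

def fetch_gdg_records(start, end):
    """Yield parsed GDG records for every minute file in [start, end),
    skipping minutes for which no file was published."""
    current = start
    while current < end:
        resp = requests.get(BASE.format(current), timeout=60)
        if resp.status_code == 200:
            for line in gzip.decompress(resp.content).decode("utf-8").splitlines():
                if line.strip():
                    yield json.loads(line)
        # A 404 simply means no batch completed during this minute.
        current += timedelta(minutes=1)

# Example: one hour of data starting from the first available file.
for record in fetch_gdg_records(datetime(2018, 8, 27, 13, 31),
                                datetime(2018, 8, 27, 14, 31)):
    print(record["status"], record["page_url"])
```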

 

The JSON format is relatively simple. Each JSON file contains a series of records, one per line, each documenting a specific page status observed by the GDG system. A page status might indicate that the HTML of the page is unchanged, that the HTML changed but the text has not been edited, that the title has changed, that the URL now redirects to a new URL or that the URL now yields an error, among other indicators. Each record includes a “status” field that indicates the type of status it represents and then a series of additional fields distinct to that status. Some fields are present in all statuses, while many are present only in some statuses. A given URL may have multiple entries in the JSON file reflecting multiple observed status changes. For example, a URL that now redirects to a new URL and has edits to its title and body text would appear as three separate rows in the JSON file, one each for the redirect, title and body text changes.
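Because a single URL can therefore yield several rows, consumers will typically want to group records by URL. A minimal sketch, assuming the lines of a decompressed minute file as input:

```python
import json
from collections import defaultdict

def group_by_url(json_lines):
    """Group newline-delimited GDG records by page_url so that a redirect,
    title change and text change for the same article end up together."""
    by_url = defaultdict(list)
    for line in json_lines:
        if line.strip():
            record = json.loads(line)
            by_url[record["page_url"]].append(record)
    return by_url

# statuses = {r["status"] for r in group_by_url(lines)["http://example.com/story.html"]}
# e.g. {"HTTP_REDIRECT", "PAGE_TITLECHANGE", "PAGE_TEXTCHANGE"}
```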

The table below is organized by the value of the “status” field and shows under each status the fields that are present (though some fields may be blank).

 

  • HTTP_ERROR. This indicates a fatal error in retrieving the page. Since the page content could not be successfully downloaded, no further analysis will be performed. Presently only code 404, 410 and 451 errors are returned – all other errors are too transient or related to potential IP targeting to reliably return at this time, though we will likely update this list over time.
    • page_url. The URL of the article being checked.
    • page_title. The original title of the article.
    • page_domain_full. The full domain of the article, including all subdomains (such as “arabic.cnn.com”).
    • page_domain_root. The root domain of the article, minus subdomains (“arabic.cnn.com” would be recorded as just “cnn.com” in this field).
    • page_lang. The human readable name of the language the article is written in (as detected when it was first crawled).
    • fetchdate_orig. The precise UTC time (to the second) that the article was originally crawled.
    • fetchdate_check. The precise UTC time (to the second) that the article was crawled by the GDG system to perform the comparison check.
    • http_code. The HTTP code reported by the server.
    • http_size. The total number of bytes returned by the server.
  • HTTP_REDIRECT. This indicates that when the URL was fetched, the server redirected the client to a new URL. Redirects that purely change from HTTP to HTTPS or only insert a “www” at the front of the domain name or a slash at the end of the URL are not reflected here. The GDG crawler automatically follows the redirect to the new URL and otherwise treats it as a normal URL, meaning that an article that appears with this code will still be checked for title and content changes. In the case that a server redirects the client multiple times, only the final redirected URL is recorded.
    • page_url. The URL of the article being checked.
    • page_title. The original title of the article.
    • page_domain_full. The full domain of the article, including all subdomains (such as “arabic.cnn.com”).
    • page_domain_root. The root domain of the article, minus subdomains (“arabic.cnn.com” would be recorded as just “cnn.com” in this field).
    • page_lang. The human readable name of the language the article is written in (as detected when it was first crawled).
    • fetchdate_orig. The precise UTC time (to the second) that the article was originally crawled.
    • fetchdate_check. The precise UTC time (to the second) that the article was crawled by the GDG system to perform the comparison check.
    • http_code. The HTTP code reported by the server.
    • http_size. The total number of bytes returned by the server.
    • redirect_url. The final URL the client was redirected to. In the case of multiple redirects, only the last one is recorded.
  • UNCHANGED_HTML. This indicates that the HTML of the entire article page remains unchanged. Since this necessarily indicates that there are no content changes to the page, the comparison process is ended for the article at this point.
    • page_url. The URL of the article being checked.
    • page_title. The original title of the article.
    • page_domain_full. The full domain of the article, including all subdomains (such as “arabic.cnn.com”).
    • page_domain_root. The root domain of the article, minus subdomains (“arabic.cnn.com” would be recorded as just “cnn.com” in this field).
    • page_lang. The human readable name of the language the article is written in (as detected when it was first crawled).
    • fetchdate_orig. The precise UTC time (to the second) that the article was originally crawled.
    • fetchdate_check. The precise UTC time (to the second) that the article was crawled by the GDG system to perform the comparison check.
  • UNCHANGED_CONTENT. This indicates that the HTML of the page changed, but that neither the title nor the article text changed. Typically this means a change to headers, footers, related content insets, etc.
    • Same fields as UNCHANGED_HTML.
  • PAGE_TITLECHANGE. This indicates that one or more changes were detected in the page title. If the body text also changed there will also be a separate entry for PAGE_TEXTCHANGE.
    • page_url. The URL of the article being checked.
    • page_title. The original title of the article.
    • page_domain_full. The full domain of the article, including all subdomains (such as “arabic.cnn.com”).
    • page_domain_root. The root domain of the article, minus subdomains (“arabic.cnn.com” would be recorded as just “cnn.com” in this field).
    • page_lang. The human readable name of the language the article is written in (as detected when it was first crawled).
    • fetchdate_orig. The precise UTC time (to the second) that the article was originally crawled.
    • fetchdate_check. The precise UTC time (to the second) that the article was crawled by the GDG system to perform the comparison check.
    • title_new. The new title of the article.
    • num_changes. The total number of “changes” in the page where a change is a run of text that can be a single word/character or a large block of them.
    • change_unit. This will either be “word” indicating that the document was split into “words” (specifically the text is divided on spans of spaces) or “char” indicating that the document was split into individual characters (specifically the Unicode definition of a discrete “character” in the given language). For space-segmented languages, word-level comparison yields more semantically meaningful results, while character-level segmentation ensures fidelity to the original text for non-space-segmented languages (since word segmentation algorithms are imperfect and would introduce error to the process).
    • from_numchars. Total length of the original text in characters.
    • to_numchars. Total length of the new text in characters.
    • from_changedchars. Total number of characters of the original text that were replaced.
    • to_changedchars. Total number of characters of the new text that were different from the original.
    • tot_changedchars. Total number of characters changed in both the original and new text (calculated as “from_changedchars + to_changedchars”).
    • perc_changedchars. Percent of the total number of characters in both the original and new text that changed (calculated as “tot_changedchars / (from_numchars + to_numchars)”). A worked sketch follows this table.
    • changes. This is a JSON array that contains a series of objects representing each of the individual changes identified by the differencing algorithm.
      • This is the first 75 characters of the original text block (text longer than this is truncated with a “…” added to the end).
      • This is the first 75 characters of the new text block that replaced it (text longer than this is truncated with a “…” added to the end).
      • from_range. This has the format “start-end” and records the starting token position and ending token position of the text in the original version that was changed. If “change_unit” is “word” this will record word offsets, otherwise character offsets. Offsets begin with 0. A single word change will result in both start and end having the same value. A value of “3-3” with “change_unit” being “word” means that word position 3 (the fourth word in the document since offsets begin with 0) was replaced. A value of “0-3” means the words at positions 0, 1, 2 and 3 were all replaced (a four word span). The purpose of this field is to allow you to understand where approximately in the text the change occurred (at the start, at the end, in the middle, etc) and its length in words or characters to estimate its overall size.
      • to_range. Same format and meaning as “from_range” but applies to the replacement text.
  • PAGE_TEXTCHANGE. This indicates that one or more changes were detected in the article text. If the page title also changed there will also be a separate entry for PAGE_TITLECHANGE.
    • Same fields as PAGE_TITLECHANGE.
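To make the change metrics concrete, here is a small worked sketch recomputing the summary fields from the formulas above and interpreting a from_range value; all numbers are fabricated purely for illustration.

```python
# Illustrative values only -- not real data.
change = {
    "change_unit": "word",
    "from_numchars": 3200,     # length of the original text in characters
    "to_numchars": 3260,       # length of the new text in characters
    "from_changedchars": 120,  # original characters that were replaced
    "to_changedchars": 180,    # new characters that differ from the original
}

# tot_changedchars = from_changedchars + to_changedchars
tot_changedchars = change["from_changedchars"] + change["to_changedchars"]   # 300

# perc_changedchars = tot_changedchars / (from_numchars + to_numchars)
perc_changedchars = tot_changedchars / (change["from_numchars"] + change["to_numchars"])
print(tot_changedchars, round(perc_changedchars, 4))   # 300 0.0464

# A from_range of "0-3" with change_unit "word" means the first four words of
# the original text were replaced (offsets are zero-based and inclusive).
start, end = (int(x) for x in "0-3".split("-"))
span_length = end - start + 1   # 4 words
```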

 

We’re tremendously excited by the potential this incredible new dataset offers and we can’t wait to see what you are able to do with it!