Looking Back On Four Months Of The Global Difference Graph

Four months ago we launched the GDELT Global Difference Graph, which recrawls each online article we monitor after 24 hours and again after one week and assesses whether the page has been deleted, moved or changed in any way. Over that time we have recrawled a total of 88,027,716 distinct URLs. With Google BigQuery, each query takes just 6 seconds to search across 116GB of data and collapse the master URL list into distinct URLs.

Some interesting results:

  • Deletions: 1.5% of articles were deleted within a week (0.68% within 24 hours). In all, 42% of deletions come within the first 24 hours after an article is published.
  • Redirections: 2.59% of articles redirected to a different URL within a week (2% within 24 hours). In all, 75% of redirections come in the first 24 hours.
  • Errors + Redirections: Combined, errors and redirections account for 4.12% of URLs within a week (2.71% in 24 hours). In all, 63% of URL-level changes occur within 24 hours.
  • Unchanged HTML: 4.63% of articles had completely unchanged HTML at the one-week recrawl (6.43% at the 24-hour recrawl).
  • Title Change: 6.06% of articles had changes to their title within a week (5.09% within 24 hours). Roughly 80% of title changes come within the first 24 hours.
  • Body Text Change: 17.85% of articles had changes to their body text within a week (15.10% within 24 hours). Roughly 81% of body text changes come within the first 24 hours.
  • Title + Body Text Change: 19.52% of articles had changes to their title, body or both within a week (16.5% within 24 hours). Roughly 81% of title and body text changes come within the first 24 hours.
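
As a minimal sketch of the arithmetic behind the list above, the query below recomputes the three deletion figures from the raw counts listed under TECHNICAL DETAILS. It reads no table, so it can be run as-is in BigQuery. The column names are illustrative assumptions, and the choice of denominators (all distinct URLs for the overall rate, URLs rechecked in under 30 hours for the 24-hour rate) is inferred from the published percentages rather than stated in the post.

```sql
-- Counts copied from the results under TECHNICAL DETAILS.
-- Column names are hypothetical, chosen only for this illustration.
WITH counts AS (
  SELECT
    88027716 AS total_urls,   -- all distinct URLs
    83889393 AS urls_24h,     -- distinct URLs rechecked under 30 hours after the first crawl
    1358028  AS errors_all,   -- distinct URLs with status HTTP_ERROR at any recheck
    572737   AS errors_24h    -- distinct URLs with status HTTP_ERROR at a recheck under 30 hours
)
SELECT
  ROUND(100 * errors_all / total_urls, 2) AS pct_deleted_overall,
  ROUND(100 * errors_24h / urls_24h, 2)   AS pct_deleted_24h,
  ROUND(100 * errors_24h / errors_all, 0) AS pct_of_deletions_in_24h
FROM counts
```

With these inputs the three columns come out at roughly 1.54, 0.68 and 42, in line with the Deletions entry. Swapping in the redirect, title or body text counts reproduces the other entries the same way.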

Of course, as noted in the Global Difference Graph documentation, comparing body text is particularly difficult: any error in the document extraction process can cause misalignments, so that changing insets, headers, footers or advertisements trigger false change alerts. Taken as a whole, however, these results should be reasonable.

The extremely low percentage of articles with an exact HTML match is a reminder that naive change detection through exact page matches simply does not work. Accurate change detection requires robust global page extraction infrastructure that works universally across news outlets from every country and in every language, on page structures ranging from basic HTML to the most complex dynamic layouts.

It is worth noting that most deletions (58%) occur after the first 24 hours, while redirections and content changes are most common within the first 24 hours. This suggests that most edits are typographical corrections, updates to live stories, or tonal or factual changes made as the story gains traction and attention in the 24 hours after publication. We'll be taking a look at the most common kinds of changes in a future blog post, so stay tuned!

 

TECHNICAL DETAILS

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned`
88027716

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='HTTP_ERROR'
1358028

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='HTTP_REDIRECT'
2280999

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='UNCHANGED_HTML'
5643060

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='UNCHANGED_CONTENT'
71412678

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='UNCHANGED_HTML' or status='UNCHANGED_CONTENT'
75320115

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='PAGE_TITLECHANGE'
5334506

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='PAGE_TEXTCHANGE'
15708965

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='PAGE_TITLECHANGE' or status='PAGE_TEXTCHANGE'
17181117

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='HTTP_ERROR' or status='HTTP_REDIRECT'
3624945
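
The whole-dataset queries above each scan the table once. As a sketch of an alternative, and not the approach used to produce the numbers in this post, the same counts could be gathered in a single scan with conditional distinct counts. The output column names are illustrative assumptions; the table, the `page_url` and `status` fields and the status values are taken from the queries above.

```sql
-- Sketch: every whole-dataset count from the queries above in one scan.
-- IF(cond, page_url, NULL) yields NULL for non-matching rows, and
-- COUNT(DISTINCT ...) ignores NULLs, so each column counts the distinct
-- URLs that have at least one row matching its condition.
SELECT
  COUNT(DISTINCT page_url) AS total_urls,
  COUNT(DISTINCT IF(status = 'HTTP_ERROR', page_url, NULL)) AS http_error,
  COUNT(DISTINCT IF(status = 'HTTP_REDIRECT', page_url, NULL)) AS http_redirect,
  COUNT(DISTINCT IF(status IN ('HTTP_ERROR', 'HTTP_REDIRECT'), page_url, NULL)) AS error_or_redirect,
  COUNT(DISTINCT IF(status = 'UNCHANGED_HTML', page_url, NULL)) AS unchanged_html,
  COUNT(DISTINCT IF(status = 'UNCHANGED_CONTENT', page_url, NULL)) AS unchanged_content,
  COUNT(DISTINCT IF(status IN ('UNCHANGED_HTML', 'UNCHANGED_CONTENT'), page_url, NULL)) AS unchanged_any,
  COUNT(DISTINCT IF(status = 'PAGE_TITLECHANGE', page_url, NULL)) AS title_change,
  COUNT(DISTINCT IF(status = 'PAGE_TEXTCHANGE', page_url, NULL)) AS text_change,
  COUNT(DISTINCT IF(status IN ('PAGE_TITLECHANGE', 'PAGE_TEXTCHANGE'), page_url, NULL)) AS title_or_text_change
FROM `gdelt-bq.gdeltv2.gdg_partitioned`
```

Each column applies the same condition as the WHERE clause of the corresponding query, so the results should match the individual counts as long as the table has not changed between runs.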

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30
83889393
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30
77537534

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='HTTP_ERROR' or status='HTTP_REDIRECT')
2275033
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='HTTP_ERROR' or status='HTTP_REDIRECT')
3407012

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='UNCHANGED_HTML' or status='UNCHANGED_CONTENT')
71737836
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='UNCHANGED_HTML' or status='UNCHANGED_CONTENT')
63198603

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='HTTP_ERROR')
572737
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='HTTP_ERROR')
1291587

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='HTTP_REDIRECT')
1702860
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='HTTP_REDIRECT')
2116666

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='UNCHANGED_HTML')
5393207
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='UNCHANGED_HTML')
3587120

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='UNCHANGED_CONTENT')
66749071
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='UNCHANGED_CONTENT')
59821505

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='PAGE_TITLECHANGE')
4271006
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='PAGE_TITLECHANGE')
4608771

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='PAGE_TEXTCHANGE')
12666246
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='PAGE_TEXTCHANGE')
13689925

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='PAGE_TITLECHANGE' or status='PAGE_TEXTCHANGE')
13844794
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='PAGE_TITLECHANGE' or status='PAGE_TEXTCHANGE')
14976911
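
The paired queries above differ only in the time filter and the status condition, so they could also be folded into one grouped query. This is a sketch under that assumption and was not used to produce the results listed here. The `checks` CTE, the `hours_elapsed` and `recheck_window` names and the window labels are hypothetical; the 30-hour cutoff, field names and status values come from the queries above.

```sql
-- Sketch: regroups the per-window queries above into a single scan.
WITH checks AS (
  SELECT
    page_url,
    status,
    TIMESTAMP_DIFF(fetchdate_check, fetchdate_orig, HOUR) AS hours_elapsed
  FROM `gdelt-bq.gdeltv2.gdg_partitioned`
)
SELECT
  -- Same 30-hour cutoff as the original queries; the labels are illustrative.
  IF(hours_elapsed < 30, 'within_30h', 'after_30h') AS recheck_window,
  COUNT(DISTINCT page_url) AS urls,
  COUNT(DISTINCT IF(status = 'HTTP_ERROR', page_url, NULL)) AS http_error,
  COUNT(DISTINCT IF(status = 'HTTP_REDIRECT', page_url, NULL)) AS http_redirect,
  COUNT(DISTINCT IF(status IN ('HTTP_ERROR', 'HTTP_REDIRECT'), page_url, NULL)) AS error_or_redirect,
  COUNT(DISTINCT IF(status = 'UNCHANGED_HTML', page_url, NULL)) AS unchanged_html,
  COUNT(DISTINCT IF(status = 'UNCHANGED_CONTENT', page_url, NULL)) AS unchanged_content,
  COUNT(DISTINCT IF(status IN ('UNCHANGED_HTML', 'UNCHANGED_CONTENT'), page_url, NULL)) AS unchanged_any,
  COUNT(DISTINCT IF(status = 'PAGE_TITLECHANGE', page_url, NULL)) AS title_change,
  COUNT(DISTINCT IF(status = 'PAGE_TEXTCHANGE', page_url, NULL)) AS text_change,
  COUNT(DISTINCT IF(status IN ('PAGE_TITLECHANGE', 'PAGE_TEXTCHANGE'), page_url, NULL)) AS title_or_text_change
FROM checks
-- The original filters (< 30 and > 30) both skip rows at exactly 30 hours
-- and rows with a missing timestamp; this predicate drops the same rows.
WHERE hours_elapsed != 30
GROUP BY recheck_window
ORDER BY recheck_window DESC
```

The query returns one row per recheck window, with the `within_30h` row first. A URL rechecked in both windows is counted once in each row, just as it is in the separate queries.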