Four months ago we launched the GDELT Global Difference Graph, which recrawls each online article we monitor after 24 hours and after one week and assesses whether the page has been deleted, moved or changed in any way. Over that time we have recrawled a total of 88,027,716 distinct URLs. Using Google BigQuery to explore the dataset, it takes just 6 seconds per query to search across 116GB of data and collapse the master URL list into distinct URLs.
Some interesting results:
- Deletions: 1.5% of articles were deleted within a week (0.68% within 24 hours). In all, 42% of deletions come within the first 24 hours after an article is published.
- Redirections: 2.59% of articles redirected to a different URL within a week (2% within 24 hours). In all, 75% of redirections come in the first 24 hours.
- Errors + Redirections: Combined, errors and redirections account for 4.12% of URLs within a week (2.71% in 24 hours). In all, 63% of URL-level changes occur within 24 hours.
- Unchanged HTML: 4.63% of articles had completely unchanged HTML at the one-week recrawl (6.43% at the 24-hour recrawl).
- Title Change: 6.06% of articles had changes to their title within a week (5.09% within 24 hours). Roughly 80% of title changes come within the first 24 hours.
- Body Text Change: 17.85% of articles had changes to their body text within a week (15.10% within 24 hours). Roughly 81% of body text changes come within the first 24 hours.
- Title + Body Text Change: 19.52% of articles had changes to their title, body or both within a week (16.5% within 24 hours). Roughly 81% of title and body text changes come within the first 24 hours.
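The percentages above can be reproduced from the raw counts listed under TECHNICAL DETAILS. The sketch below is one way to do that arithmetic. It assumes the "within a week" figures divide by all 88,027,716 distinct URLs and the "within 24 hours" figures divide by the 83,889,393 URLs that have a recrawl under 30 hours after the first fetch. That choice of denominators is inferred from the numbers rather than stated anywhere, and the names `COUNTS`, `pct` and `summarize` are ours.

```python
# Distinct-URL counts copied from the TECHNICAL DETAILS queries.
TOTAL_URLS = 88_027_716      # all distinct URLs
RECRAWLED_24H = 83_889_393   # URLs with a recrawl under 30 hours after the first fetch

# status -> (count over all recrawls, count in the under-30-hour recrawl)
COUNTS = {
    "HTTP_ERROR": (1_358_028, 572_737),
    "HTTP_REDIRECT": (2_280_999, 1_702_860),
    "HTTP_ERROR or HTTP_REDIRECT": (3_624_945, 2_275_033),
    "PAGE_TITLECHANGE": (5_334_506, 4_271_006),
    "PAGE_TEXTCHANGE": (15_708_965, 12_666_246),
    "PAGE_TITLECHANGE or PAGE_TEXTCHANGE": (17_181_117, 13_844_794),
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def summarize(counts=COUNTS):
    """Compute the three figures quoted in each bullet for every status."""
    rows = {}
    for status, (overall, first_day) in counts.items():
        rows[status] = {
            # Assumed denominator: all distinct URLs.
            "week_pct": pct(overall, TOTAL_URLS),
            # Assumed denominator: URLs recrawled within 30 hours.
            "day_pct": pct(first_day, RECRAWLED_24H),
            # Share of all affected URLs already affected at the first recrawl.
            "share_in_first_day": pct(first_day, overall),
        }
    return rows


if __name__ == "__main__":
    for status, row in summarize().items():
        print(status, row)
```

The Unchanged HTML bullet is left out of the sketch because its two figures appear to be computed per recrawl rather than against the overall total.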
Of course, as noted in the Global Difference Graph documentation, comparing body text is particularly difficult: any error in the document extraction process can misalign the two versions, so that changing insets, headers, footers or advertisements trigger false change alerts. Taken as a whole, however, these results should be reasonable.
The extremely low percentage of articles with an exact HTML match is a reminder that naive change detection through exact page matches simply does not work. Accurate change detection requires robust global page extraction infrastructure that works universally across news outlets from every country and in every language, on page structures ranging from basic HTML to the most complex dynamic layouts.
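A toy illustration of why exact matching fails: in the sketch below, two fetches of the same page differ only in a script token, so their raw HTML hashes differ even though the visible text is identical. The two HTML snippets and the `visible_text` and `digest` helpers are hypothetical and exist only for this example. Production extraction across arbitrary news sites is far more involved than skipping `<script>` and `<style>` blocks, and this is not a description of how the Global Difference Graph does it.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return whitespace-normalized text with script/style content dropped."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def digest(text):
    """SHA-256 hex digest of a string, used as an exact-match fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical first fetch and recrawl: only the script token differs.
FIRST_FETCH = (
    '<html><body><script>var token="a1";</script>'
    "<h1>Storm hits coast</h1><p>Rain fell overnight.</p></body></html>"
)
RECRAWL = (
    '<html><body><script>var token="b2";</script>'
    "<h1>Storm hits coast</h1><p>Rain fell overnight.</p></body></html>"
)

# Exact HTML comparison reports a change; text comparison does not.
raw_changed = digest(FIRST_FETCH) != digest(RECRAWL)
text_changed = visible_text(FIRST_FETCH) != visible_text(RECRAWL)

if __name__ == "__main__":
    print("raw HTML changed:", raw_changed)
    print("visible text changed:", text_changed)
```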
It is worth noting that most deletions occur after the first 24 hours, while redirections and content changes are most common within the first 24 hours. This suggests that most edits are typographical corrections, live-updating stories, or tonal or factual changes made as the story gains traction and attention in the 24 hours after publication. We'll be taking a look at the most common kinds of changes in a future blog post, so stay tuned!
TECHNICAL DETAILS
SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned`
88027716

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='HTTP_ERROR'
1358028

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='HTTP_REDIRECT'
2280999

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='UNCHANGED_HTML'
5643060

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='UNCHANGED_CONTENT'
71412678

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='UNCHANGED_HTML' or status='UNCHANGED_CONTENT'
75320115

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='PAGE_TITLECHANGE'
5334506

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='PAGE_TEXTCHANGE'
15708965

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='PAGE_TITLECHANGE' or status='PAGE_TEXTCHANGE'
17181117

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where status='HTTP_ERROR' or status='HTTP_REDIRECT'
3624945

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30
83889393

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30
77537534

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='HTTP_ERROR' or status='HTTP_REDIRECT')
2275033

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='HTTP_ERROR' or status='HTTP_REDIRECT')
3407012

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='UNCHANGED_HTML' or status='UNCHANGED_CONTENT')
71737836

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='UNCHANGED_HTML' or status='UNCHANGED_CONTENT')
63198603

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='HTTP_ERROR')
572737

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='HTTP_ERROR')
1291587

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='HTTP_REDIRECT')
1702860

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='HTTP_REDIRECT')
2116666

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='UNCHANGED_HTML')
5393207

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='UNCHANGED_HTML')
3587120

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='UNCHANGED_CONTENT')
66749071

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='UNCHANGED_CONTENT')
59821505

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='PAGE_TITLECHANGE')
4271006

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='PAGE_TITLECHANGE')
4608771

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='PAGE_TEXTCHANGE')
12666246

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='PAGE_TEXTCHANGE')
13689925

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) < 30 and (status='PAGE_TITLECHANGE' or status='PAGE_TEXTCHANGE')
13844794

SELECT count(distinct page_url) FROM `gdelt-bq.gdeltv2.gdg_partitioned` where TIMESTAMP_DIFF(fetchdate_check,fetchdate_orig, HOUR) > 30 and (status='PAGE_TITLECHANGE' or status='PAGE_TEXTCHANGE')
14976911
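Each of the queries above scans the table separately. As a rough sketch, the same thirty counts could be requested in a single pass by wrapping each condition in `COUNT(DISTINCT IF(condition, page_url, NULL))`. The small Python helper below only builds that query string. The table, column names, status values and 30-hour cutoff come from the queries above, while the column aliases, the `WINDOWS` and `GROUPS` dictionaries and `build_query` itself are our own invention. The generated SQL has not been run against the live table, so treat it as a starting point rather than a verified replacement.

```python
TABLE = "gdelt-bq.gdeltv2.gdg_partitioned"

# Recrawl windows, using the same 30-hour cutoff as the original queries.
WINDOWS = {
    "all": "TRUE",
    "lt30h": "TIMESTAMP_DIFF(fetchdate_check, fetchdate_orig, HOUR) < 30",
    "gt30h": "TIMESTAMP_DIFF(fetchdate_check, fetchdate_orig, HOUR) > 30",
}

# Hypothetical alias prefix -> status values counted together.
GROUPS = {
    "error": ["HTTP_ERROR"],
    "redirect": ["HTTP_REDIRECT"],
    "error_or_redirect": ["HTTP_ERROR", "HTTP_REDIRECT"],
    "unchanged_html": ["UNCHANGED_HTML"],
    "unchanged_content": ["UNCHANGED_CONTENT"],
    "unchanged_any": ["UNCHANGED_HTML", "UNCHANGED_CONTENT"],
    "title_change": ["PAGE_TITLECHANGE"],
    "text_change": ["PAGE_TEXTCHANGE"],
    "title_or_text_change": ["PAGE_TITLECHANGE", "PAGE_TEXTCHANGE"],
}


def count_distinct(condition, alias):
    """One output column: distinct URLs among rows matching the condition."""
    return f"COUNT(DISTINCT IF({condition}, page_url, NULL)) AS {alias}"


def build_query(table=TABLE, windows=WINDOWS, groups=GROUPS):
    """Build one SELECT with a total and one column per group for each window."""
    columns = []
    for window, window_cond in windows.items():
        columns.append(count_distinct(window_cond, f"total_{window}"))
        for group, statuses in groups.items():
            status_list = ", ".join(f"'{s}'" for s in statuses)
            cond = f"{window_cond} AND status IN ({status_list})"
            columns.append(count_distinct(cond, f"{group}_{window}"))
    return "SELECT\n  " + ",\n  ".join(columns) + f"\nFROM `{table}`"


if __name__ == "__main__":
    print(build_query())
```

Three windows times ten columns (one total plus nine status groups) gives the same thirty figures as the individual queries.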