Over the past two years we've witnessed a significant rise in the number of "451 Unavailable For Legal Reasons" errors our crawlers receive. The errors affect around 560 news outlets, primarily small local US news outlets, and appear across the most mundane topics imaginable, like a local restaurant opening announcement or a school's winter weather closing. That makes the errors all the more remarkable, since 451's were originally intended to represent cases of extraordinary government censorship. While they comprise only up to 0.8% of all requests on a typical day, they appear regularly enough and account for enough requests that we wanted to better understand this behavior. Worse, when we fetched the same URLs later on we could not replicate the errors, suggesting they might represent a complex transient networking behavior that could be extremely difficult to diagnose.
What could possibly be driving this steady increase in 451 errors?
GDELT's crawler fleets are globally distributed, meaning they operate across most of GCP's global datacenter footprint. Each individual crawler is ephemeral, launching for a given period of time and then exiting and being replaced by a new crawler. Each time a new crawler launches it has a new IP address, meaning we range across the full IP space of GCP. GCP hands out IP addresses at random for ephemeral VMs (servers use fixed IPs), meaning crawlers have no control over the IP address they land on when they start up.
Initially we thought this rise in 451's could simply be bad luck: crawlers landing more often on IP addresses that a previous cloud user had used for something bad, leaving us to inherit the derating of that IP. This seemed to be bolstered by the fact that if we fetched the URL again from a different crawler immediately after receiving a 451 response, the error resolved and the same URL returned a 200 OK. At the same time, derated IPs typically return errors like 429 rather than 451, making an IP reputation issue less likely.
Upon further investigation we discovered that the errors occur exclusively in our fleets located in EU member countries; crawlers outside the EU are not affected. Looking more closely at our refetches of 451's, we determined that when a crawler received a 451 and the URL was retried from a different crawler, those retries were, by random chance, happening on VMs outside of the EU. It wasn't an IP issue; it was a geographic issue.
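To illustrate the kind of breakdown that separates an IP reputation problem from a geographic one, here is a minimal Python sketch that tallies the share of 451 responses by whether the crawler ran in an EU region. The record layout, the sample data, and the function name are all hypothetical, and the region set is only an illustrative subset of GCP regions in EU member states; none of this is drawn from our actual logs or tooling.

```python
from collections import defaultdict

# Illustrative subset of GCP regions located in EU member states (assumption;
# extend this to match the regions a given fleet actually uses).
EU_REGIONS = {"europe-west1", "europe-west3", "europe-west4", "europe-north1"}


def rate_451_by_geography(fetch_log):
    """Return the fraction of fetches that got a 451, split into EU and non-EU."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for url, region, status in fetch_log:
        bucket = "EU" if region in EU_REGIONS else "non-EU"
        totals[bucket] += 1
        if status == 451:
            blocked[bucket] += 1
    return {bucket: blocked[bucket] / totals[bucket] for bucket in totals}


# Hypothetical sample records: (url, crawler_region, http_status)
sample_log = [
    ("https://example.com/a", "europe-west1", 451),
    ("https://example.com/a", "us-central1", 200),
    ("https://example.com/b", "europe-west3", 451),
    ("https://example.com/b", "asia-east1", 200),
    ("https://example.com/c", "europe-west4", 200),
]

rates = rate_451_by_geography(sample_log)
for bucket, rate in sorted(rates.items()):
    print(f"{bucket}: {rate:.0%} of fetches returned 451")
```

If the 451 rate is nonzero only in the EU bucket, and spread across many different IP addresses within it, geography rather than any single IP's history is the more plausible explanation.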
That this error occurs only for crawlers in EU member state data centers suggests targeted geofencing, perhaps related to GDPR.
Indeed, we discovered that some of the affected news websites have disclaimers like this one on KPC News, while the majority of them share common hosting platforms that likely enforce the geofencing at the platform level:
You may see this message while you are traveling in Europe:
Error 451 Unavailable For Legal Reasons
Due to specific regulations (GDPR) this site is not intended for use by persons located within the European Economic Area (EEA). We do not request or accept personal information concerning or supplied by persons who are located within the EEA at the time they access this Site. If you have accessed this Site from within the EEA, you should immediately discontinue your use. If you have supplied personal information to us in violation of this provision, whether through the registration/subscription of new user accounts or otherwise, please contact us at email@example.com for a refund or further explanation of this policy.
Fetching several of the pages at random from an EU IP address yields this common message, confirming the geofencing is real and is shared across the sites:
451: Unavailable due to legal reasons
We recognize you are attempting to access this website from a country belonging to the European Economic Area (EEA) including the EU which
enforces the <a href="https://gdpr-info.eu/">General Data Protection Regulation</a> (GDPR) and therefore access cannot be granted at this time.
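A crawler could flag responses like these automatically rather than treating them as transient failures. The following is a minimal Python sketch, assuming the status code and response body are already in hand. The function name and the marker phrases are our own illustrative choices, taken from the two messages quoted above, and would likely need extending for other hosting platforms.

```python
# Phrases drawn from the block messages quoted above (illustrative, not exhaustive).
GEOFENCE_MARKERS = (
    "gdpr",
    "general data protection regulation",
    "european economic area",
)


def looks_like_gdpr_geofence(status_code, body):
    """Heuristically decide whether a response is a GDPR-motivated geoblock."""
    if status_code != 451:
        return False
    text = body.lower()
    return any(marker in text for marker in GEOFENCE_MARKERS)


sample_body = (
    "451: Unavailable due to legal reasons. We recognize you are attempting "
    "to access this website from a country belonging to the European "
    "Economic Area (EEA) including the EU"
)
print(looks_like_gdpr_geofence(451, sample_body))
```

A URL flagged this way is a candidate for a retry from a crawler outside the EU, rather than for repeated retries from the same region.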
Thus, two and a half years after GDPR's implementation, there are still a number of US-based news outlets that are restricting access to their content in the EU rather than comply with GDPR.
Interestingly, this even affects university-affiliated newspapers under university domains, including PSU's The Collegian (www.collegian.psu.edu), which returns 451's for EU visitors, in effect blocking potential students and collaborators in the EU from seeing the latest university news.
Each day we encounter yet another fascinating hidden story of the web.