GDELT 3.0: More HTTP 451 Curiosities

We noted two weeks ago that the majority of the HTTP 451 errors our crawlers confront each day are the result of GDPR geofencing: small American news outlets block visitors physically located in the EU so that they do not have to comply with GDPR privacy regulations. As we have been examining anomalies in our routing data, we have seen more interesting cases of HTTP 451 at work beyond simple GDPR geofencing.

It appears some outlets use HTTP 451 to control access to sponsored content, in which an advertiser pays to publish what looks like an ordinary article on the news outlet's site (sites typically denote these via a separate URL, subdomain and/or notation at the top). With Sfgate.com, for example, "New blood test poised to change how cancer is found" and "Enter For Your Chance to Win an Apple Watch – PayByPhone Makes Contactless Parking in SF Easy" were both blocked from US-based crawlers, one on the West Coast and one in the Midwest, respectively. It is unclear whether this is a unique effort by Sfgate or a broader initiative by news outlets to constrain the visibility of sponsored content. The first article is marketing copy, while the second does contain a competition that may have geographic constraints. However, both are available to Comcast users on the East Coast and are indexed in Google Web Search, so the purpose of the HTTP 451 restrictions is unclear, though we confirmed they appear to target GCS IP ranges across multiple data centers.

Another interesting example is this article from Peruvian news outlet Libero, "Debate Presidencial del JNE: conoce las propuestas de Urresti, Humala, Alcántara, Castillo y De Soto", which returned an HTTP 451 error to an EU crawler. Requesting the same URL from the same country a day later did not yield an HTTP 451 error, yet the crawler's raw audit logs clearly show the 451 headers and response body that the website returned to a crawler in that country just hours before. It is therefore unclear why the HTTP 451 rejection was lifted, or whether it was a temporary block, an errant geofence or something else.
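
For readers who want to watch for this kind of transient behavior themselves, the following is a minimal sketch of how a fetch could be recorded so that a 451 response's headers and body are preserved for later review. It is illustrative only and is not the code our crawlers run; the function names, the `vantage` label and the `example-audit/0.1` user agent are all hypothetical, and it uses only Python's standard library.

```python
import time
import urllib.error
import urllib.request


def audit_record(url, status, headers, body, vantage):
    """Build a JSON-serializable audit entry for one fetch attempt."""
    return {
        "url": url,
        "vantage": vantage,  # hypothetical label for where the request came from
        "fetched_at": int(time.time()),
        "status": status,
        "blocked_451": status == 451,
        # Repeated header names collapse to the last value in this simple sketch.
        "headers": dict(headers),
        "body_preview": body[:500].decode("utf-8", errors="replace"),
    }


def fetch_with_audit(url, vantage, timeout=10):
    """Fetch a URL and return an audit record, keeping error responses intact."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-audit/0.1"}  # hypothetical user agent
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return audit_record(
                url, response.status, response.headers.items(), response.read(), vantage
            )
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the exception still carries the
        # status code, headers and body that the server sent.
        return audit_record(url, err.code, err.headers.items(), err.read(), vantage)
```

Calling `fetch_with_audit` on the same URL from different networks, or from the same network on different days, and comparing the `blocked_451` field of the stored records is one way to tell a standing geofence apart from a block that comes and goes.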

It seems the story of HTTP 451 errors on the web becomes curiouser and curiouser.