Archiving the World: GDELT Joins the Internet Archive’s “No More 404” Initiative

The GDELT Project’s mission is to create a free, open computational observatory to watch our world unfold, cataloging the events and underlying context driving global society each day. Among its many streams of information, GDELT constantly scours the open web, monitoring media outlets across the globe to capture events as they unfold, even in the most remote regions of the earth.

One of the greatest challenges when working with web-based news sources is their ephemerality: URLs often go dead a few weeks or months after GDELT has processed an article. This is a problem: if GDELT records that a violent riot took place somewhere, a scholar wishing to learn more about that riot a few months later will frequently be left staring at nothing more than a blank “404 Page Not Found” error. In some cases, upwards of 70% of the URLs that GDELT processed on a particular day no longer function today.

Articles can disappear from the web or change URLs for a variety of reasons, ranging from a routine site redesign to an armed hostile takeover of a media outlet. With the average webpage lasting just 100 days as of 2003, the preservation of online news in particular has been of great interest to the library community. For GDELT this means that the majority of the underlying source material for the events it catalogs from web-based outlets disappears within a few months of GDELT finding and processing it, making it impossible to refer back to an event's details and context down the road. This would be akin to academic journal articles being moved and retitled every few months, instantly breaking all of the citations to a work.

In 2012 I gave the opening keynote at the General Assembly of the International Internet Preservation Consortium (IIPC) at the Library of Congress, where I spoke on how web archives will fundamentally reshape how we conduct research on the web at large (transcript: Part 1, Part 2, Part 3). Could we leverage the work of the internet preservation community to eliminate the 404?

In October 2013 the Internet Archive launched an initiative to do precisely this, called the “No More 404” project. Partnering with WordPress.com and Wikipedia.org, it constantly monitors for new hyperlinks added to public blogs hosted on WordPress.com’s platform or to articles contributed to Wikipedia, and creates a snapshot of each linked page to add to the Internet Archive’s 400-billion-page collection. In this way, no page linked from a hosted public WordPress.com blog or a Wikipedia article will ever become unavailable: there will always be a copy preserved and available via the Archive’s Wayback Machine.

Today we are incredibly excited to join the Archive’s “No More 404” (NO404) initiative. Each morning the Internet Archive now runs a special crawl of all of the URLs from which GDELT extracted events the previous day and adds them to the Wayback Machine’s 400-billion-strong index of the web. Even when whole media outlets are silenced by conflict, they will live on, preserved in the Archive, and those researching conflict across the globe will no longer have an entire side of the story disappear overnight. The NO404 initiative will largely make dead links a relic of the past for GDELT and make it possible for scholars to research global conflict and world history without worrying about the record of history being swept under the digital rug.
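For readers who hit a dead source URL in a GDELT record, here is a minimal Python sketch of how one might look up the archived copy. It assumes the Wayback Machine's public availability endpoint at archive.org/wayback/available and the JSON shape it returned at the time of writing. The function names and the example URL are hypothetical, and this is not a description of how the Archive's crawl or GDELT itself works.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the lookup URL for a source link.

    `timestamp` (YYYYMMDD, optional) asks for the snapshot closest to
    that date, e.g. the day GDELT recorded the event.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived URL from a lookup response, or None.

    Assumes the response looks like:
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

To use it, fetch the address returned by availability_query (for example with urllib.request.urlopen) and pass the response body to closest_snapshot. A result of None means no snapshot was reported for that link.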

To a brave new archived world!