The Incredibly Short Lifespan of an Online News Article

A year ago this past Friday we announced a new collaboration with the Internet Archive to preserve the world's online journalism. Today it is the largest program in the world dedicated to preserving the online journalism of the non-Western world. In the last six months alone we've sent over 109 million URLs of online news coverage from nearly every corner of the earth to the Archive for permanent preservation.

Around 1.5-2% of the articles we monitor on a typical day will no longer exist two weeks later, rising to 5% after a month and around 7-14% after two months. Just over 7% result in "404 Not Found" errors, while another 7% yield everything from "500 Internal Server Error" responses to servers being offline or non-responsive (both of which could represent transient errors) to domains no longer existing (more frequent in certain areas of the world). Given that GDELT finds around 650,000 to one million news articles a day that it deems relevant, this means that on a day at the upper end of that range, 70,000 to 140,000 news articles are being lost. Over just the last six months, between 7.6 and 14 million articles have been lost, which represents up to twice the total output of the New York Times over the last half century. That's an incredible amount of news coverage being lost each day, forever. In some areas of the world the numbers can be considerably higher due to government censorship and interference in the press sector.
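For readers who want to check the daily figures, here is a minimal sketch of the arithmetic in Python. It assumes the 7-14% two-month loss rate is applied directly to the daily article count; the function name `daily_loss_range` and the constants are illustrative and are not part of GDELT's tooling.

```python
def daily_loss_range(articles_per_day, pct_low, pct_high):
    """Return (low, high) articles lost for one day's output.

    Percentages are whole numbers so the arithmetic stays in integers.
    """
    return (articles_per_day * pct_low // 100,
            articles_per_day * pct_high // 100)


# Figures quoted in the post; treating them as exact is an assumption.
DAILY_ARTICLES_LOW = 650_000
DAILY_ARTICLES_HIGH = 1_000_000
LOSS_PCT_LOW = 7    # share gone after two months, lower estimate
LOSS_PCT_HIGH = 14  # share gone after two months, upper estimate

busy_day = daily_loss_range(DAILY_ARTICLES_HIGH, LOSS_PCT_LOW, LOSS_PCT_HIGH)
quiet_day = daily_loss_range(DAILY_ARTICLES_LOW, LOSS_PCT_LOW, LOSS_PCT_HIGH)

print(busy_day)   # (70000, 140000)
print(quiet_day)  # (45500, 91000)
```

The 70,000 to 140,000 range corresponds to a day with one million relevant articles. On a day with 650,000 articles the same rates give 45,500 to 91,000.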

That's why we're incredibly proud and excited about our collaboration with the Internet Archive to preserve the world's online journalism, especially the journalism of the non-Western world and of countries with high levels of retroactive government censorship, in order to preserve these voices and this critical historical record in perpetuity. Happy First Birthday to an incredible collaboration!