Learning At Web-Scale: Errors & Edge Cases In Realtime News Coverage

One of the challenges in having algorithms learn about the world through the web is that even when they are limited to narrow domains like major mainstream news outlets, they can quickly be led astray by the myriad inadvertent errors across the media each day. News publishing is an inherently fast-paced industry and digital news publishing is increasing shifting towards a live-editing model in which articles are published and then edited over time to correct or expand their contents.

This means that each day our crawlers encounter myriad errors, from major reporting failures that are later retracted to isolated typographical errors and everything in between.

Sometimes these errors are due to accurate reporting of inaccurate information. For example, after a State Department briefer updated the Senate earlier today to say there were 4,100 Americans remaining in Afghanistan, a wave of news coverage reported this number until the State Department later retracted it and said its brief "misspoke." We see these kinds of retractions and updates every day across the world and they can dominate early information in fast-breaking events, making it difficult to achieve a high-level view of events. On the other hand, our ability to look across sources from international down to hyperlocal across languages allows us to identify these contradictions, but it reminds us that systems that analyze the news in realtime must contend with a steady stream of errors.

On the other hand, a constant daily source of error lies in simple editorial mistakes. While typically found in the text of articles, we also see a certain number of miscaptioned images each day. For example, this Politico article is headlined today by an image of Representative Peter Meijer captioned "Rep. Peter Meijer (R-Mich.) is one of two lawmakers who have traveled to Afghanistan | AP Photo/Carlos Osorio." Yet when the article was first published yesterday and for almost half an hour afterwards, the image was captioned "Speaker of the House Nancy Pelosi | AP Photo/Carlos Osorio" as you can see in this snapshot of the page taken at 8:46PM EST from the Internet Archive.

This leads to situations where we index a page based on information that is later removed from the article and is one of the reasons we strongly recommend users monitor the Global Difference Graph (GDG) dataset's change catalog which surfaces these kinds of article changes.

What are the implications of a news media environment in which information is ever more fluid and rewriting articles is now routine? These are key questions we hope to explore soon, tracking the spread of information across the global news landscape to understand how statements spread.