Announcing Our Newest Generation Document Extraction Platform

We are enormously excited to announce that over the last several days we have gradually phased in the newest generation of our document extraction platform, which identifies and extracts the body of each web page from its surrounding headers, footers, navigation bars, insets, related inserts, advertisements, and the myriad other clutter that fills the screen. This lets us separate the body of a news article from the rest of the page, so that we process it correctly and eliminate false positives from breaking news insets and other unrelated content.
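
To make the idea of separating an article body from page clutter concrete, here is a deliberately minimal sketch in Python. It is not how the GDELT platform works: it simply skips text nested inside a few HTML tags that commonly hold headers, footers, navigation, and insets. The tag list, the `BodyTextExtractor` class, and the `extract_body` function are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page clutter rather than article body.
# This list is an illustrative assumption, not GDELT's actual rule set.
CLUTTER_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside any clutter tag."""

    def __init__(self):
        super().__init__()
        self.clutter_depth = 0  # number of clutter tags currently open
        self.chunks = []        # text fragments kept as article body

    def handle_starttag(self, tag, attrs):
        if tag in CLUTTER_TAGS:
            self.clutter_depth += 1

    def handle_endtag(self, tag):
        if tag in CLUTTER_TAGS and self.clutter_depth > 0:
            self.clutter_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found outside every clutter tag.
        if text and self.clutter_depth == 0:
            self.chunks.append(text)


def extract_body(html):
    """Return the text of a page with clutter-tag contents removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A tag-based filter like this only works on pages that use semantic markup consistently, and many real news sites do not, so it should be read as a toy baseline for the problem rather than a workable solution.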

This fall will mark 21 years that GDELT founder Kalev Leetaru has been building systems to crawl and data mine the open web, dating back to launching his first startup less than a year after the debut of Mosaic itself. Just a few years later, in 2000, while still a freshman in computer science, he launched what was at the time one of the largest academic web mining platforms in the world at NCSA, Mosaic's birthplace. This unique in-the-trenches experience building global-scale web mining platforms for more than two decades is one of the driving forces in GDELT's pioneering reshaping of how we think about the computational processing of global information to understand human society, from mass translation to realtime event coding to contextualizing the thematic and emotional narratives that undergird society.

This week we have quietly unveiled the newest generation of the document extraction platform that powers GDELT, creating a massive new system to carry GDELT into the coming years. The new system fully renders each page visually, with full support for the latest CSS and HTML5 standards, to identify the core content of each news article with accuracy and robustness beyond any current system. It supports news sites from every corner of the earth, from sites still running 1990s-era hosting technology through the most powerful dynamic responsive sites of today, in all 65 core GDELT languages.

We are immensely excited about this new platform and all of the new services and capabilities it will enable over the coming months. Keep an eye out for a number of announcements over the coming weeks!

The GDELT Project
