As we continually evolve our underlying crawling and document identification infrastructure, we regularly evaluate the state of core internet technologies: changing web server and CDN behaviors like geotargeting and personalization, centralized hosting with poorly designed load balancing, shifts in file formats and format feature use such as the move toward forced WEBP transcoding, and stylistic and markup changes that influence both the visual rendering of a page and its embedded structural and semantic enrichments. We've watched Schema.org markup grow from a small selection of mostly tier 1 international and large national outlets, largely in the US and Western Europe, toward steadily more globalized use. It has now reached the point where enough articles carry Schema.org markup to warrant extracting it into its own dataset.
Thus, we're excited to announce that the new GDELT 3.0 crawlers automatically compile all JSON-LD blocks found in each article. All blocks within a page are gathered into a single array, preserving access to the different kinds of blocks that sites differentiate, such as site-level and article-level blocks, ClaimReview blocks, and blocks describing specific content sections within the page. While many sites still use only the most basic JSON-LD markup, providing no information beyond the URL and page title, others embed a much richer set of insights into the article's contents, so we're excited to see what this data enables.
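We haven't published the internals of the new crawlers, but the core idea of compiling a page's JSON-LD blocks into an array can be sketched with the standard library alone. The sketch below (the class and function names are our own, purely illustrative) scans a page for `<script type="application/ld+json">` elements, parses each one, and skips any block whose JSON is malformed rather than failing the whole page:

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collects the contents of all <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []   # one entry per successfully parsed JSON-LD block
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        # html.parser delivers raw <script> contents through handle_data.
        if self.in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                # Skip malformed blocks instead of aborting the page.
                pass


def extract_jsonld(html: str) -> list:
    """Return all JSON-LD blocks in a page, compiled into a single array."""
    parser = JSONLDExtractor()
    parser.feed(html)
    return parser.blocks


if __name__ == "__main__":
    page = """
    <html><head>
    <script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
    <script type="application/ld+json">{"@type": "NewsArticle", "headline": "Hello"}</script>
    </head></html>
    """
    print(extract_jsonld(page))
```

Keeping the blocks as an array rather than merging them is what preserves the distinction between, say, a site-level `WebSite` block and an article-level `NewsArticle` or `ClaimReview` block.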
Look for more details coming soon about this new dataset!