Replacing Synthetic Data With GDELT's Real-World Data For Algorithmic Development

Much of today's research into scaling algorithms to new dataset sizes, complexities, and questions relies on synthetic or highly constrained sampled data. This is largely out of necessity: there are precious few large open datasets that are both based on real-world data and widely accessible to the academic community.

Lab-constructed datasets simply cannot capture the chaotic, conflicting cacophony of the real world. Constrained sampled data is often repurposed from research efforts far removed from its new application, carrying inherent biases and representativeness challenges that make its results difficult to interpret. For example, while many multilingual NLP models perform extremely well on widely used benchmark datasets, they perform vastly worse, in some cases abysmally so, on GDELT's real-world data reflecting actual language use in the wild.

Even the largest open datasets are extremely small by modern standards and are typically one-off static extracts. Static datasets present unique challenges for algorithmic benchmarking, since some models are quietly trained on the entire dataset or on non-random selections, making it difficult to compare them against models that used a standardized testing holdout. In contrast, GDELT's combination of historical data for training and live realtime data means there is always new, unseen data available to test models in realtime for competitions and other evaluations.
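
To make that concrete, here is a minimal sketch (in Python, using the third-party requests library) of the train/test pattern this enables: a model is trained offline on GDELT's historical archives, then scored on each new 15-minute events update as it is published, guaranteeing the test data could not have been seen at training time. It polls the public GDELT 2.0 update feed at http://data.gdeltproject.org/gdeltv2/lastupdate.txt; the evaluate function below is a hypothetical placeholder for whatever model is being benchmarked.

    import io
    import time
    import zipfile

    import requests  # third-party: pip install requests

    # Public GDELT 2.0 update feed listing the most recent 15-minute files.
    LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"


    def latest_export_url() -> str:
        """Return the URL of the newest 15-minute events export file."""
        lines = requests.get(LASTUPDATE_URL, timeout=30).text.splitlines()
        # Each line is "<size> <hash> <url>"; the events file ends in .export.CSV.zip.
        for line in lines:
            url = line.split()[-1]
            if url.endswith(".export.CSV.zip"):
                return url
        raise RuntimeError("no events export listed in lastupdate.txt")


    def fetch_rows(url: str) -> list[list[str]]:
        """Download one update and return its tab-delimited event records."""
        payload = requests.get(url, timeout=60).content
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            raw = zf.read(zf.namelist()[0]).decode("utf-8", errors="replace")
        return [line.split("\t") for line in raw.splitlines()]


    def evaluate(rows: list[list[str]]) -> None:
        """Hypothetical hook: score a pre-trained model on the fresh batch."""
        print(f"scoring model on {len(rows)} never-before-seen event records")


    last_seen = None
    while True:
        url = latest_export_url()
        if url != last_seen:  # a genuinely new, unseen test batch has arrived
            evaluate(fetch_rows(url))
            last_seen = url
        time.sleep(15 * 60)  # GDELT 2.0 publishes a new batch every 15 minutes

Because every participant in a competition receives the same never-before-seen batches at the same moment, this setup sidesteps the quiet-overfitting problem that static extracts invite.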

GDELT today encompasses more than 8 trillion datapoints, with vast historical archives that update in realtime, making it an ideal testbed for even the largest algorithmic development work. Its datasets span every field, from graphs and geography to text and visual assessment, across more than 150 languages, covering news and events from almost every corner of the globe.

Most existing training datasets have strong US- and European-centric biases, whereas GDELT's datasets capture a globally inclusive view of the diversity of the world's societies, making them uniquely suited to testing the inclusiveness of new algorithms.

We'd love to hear from researchers interested in better understanding how they can use GDELT's myriad datasets to test new algorithmic developments, from approximate nearest neighbor (ANN) search to graph analytics, as well as from algorithmic competitions seeking large train+test benchmark datasets!