The GDELT Project

AI Bias & Why It Is So Important To Understand The Intrinsic Biases Of Real World Data

Powering the exponential rise of AI over the past decade has been the digital revolution that has made unimaginably large archives of human society accessible to train ever-larger neural networks. Yet few of these datasets are bespoke creations designed specifically to train AI systems. Instead, most AI systems are trained on existing massive datasets, and in doing so they absorb the intrinsic biases of the data they learn from. This makes it critical to better understand the latent biases in the datasets that increasingly power our AI systems.

Over the last few days we've explored the use of pronouns in American television news, from "I/me versus us/we" to "us versus them" to the massive gender bias of "he over she." Why do these pronoun imbalances matter? They matter because they reflect the existential biases that underlie even the most austere and curated data: journalism.
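
As a rough illustration of how pronoun imbalances like these can be measured, the sketch below counts a few pronoun groups in a snippet of text and computes the ratio between two of them. It is a minimal example under stated assumptions, not the method used in the analyses mentioned above: the group names, the word lists, the helper functions and the sample sentence are all hypothetical, and simple word matching will miscount ambiguous cases (for example, "her" as a possessive or "US" as a country abbreviation).

```python
import re
from collections import Counter

# Hypothetical pronoun groups chosen for illustration only.
PRONOUN_GROUPS = {
    "first_singular": {"i", "me"},
    "first_plural": {"we", "us"},
    "third_plural": {"they", "them"},
    "masculine": {"he", "him"},
    "feminine": {"she", "her"},
}


def count_pronoun_groups(text):
    """Count how many tokens in `text` fall into each pronoun group."""
    # Lowercase and split into simple word tokens. This is a naive tokenizer:
    # it cannot tell the pronoun "us" from the abbreviation "US".
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for group, words in PRONOUN_GROUPS.items():
            if token in words:
                counts[group] += 1
    return counts


def ratio(counts, numerator, denominator):
    """Return counts[numerator] / counts[denominator], or None if undefined."""
    if counts[denominator] == 0:
        return None
    return counts[numerator] / counts[denominator]


# Made-up sample sentence, not drawn from any real broadcast.
sample = "He said we should go. She told him that they would meet us there, and he agreed."
counts = count_pronoun_groups(sample)
he_over_she = ratio(counts, "masculine", "feminine")
we_over_i = ratio(counts, "first_plural", "first_singular")
```

Run over a full transcript archive rather than one sentence, ratios of this kind give a single number per station or per day that can be tracked over time, which is the sort of summary statistic the comparisons above rest on.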

Unlike random pages harvested from the web or the anything-goes world of social media, journalism is some of the most editorially curated material in existence, with layer upon layer of editorial oversight. If such heavily curated material can suffer from such existential biases, it suggests that emerging efforts to construct bias-free datasets will struggle to overcome these innate challenges, and that a better approach may be to measure and model such biases so that statistical methods can mitigate at least some of them. Yet when datasets skew so strongly towards a particular gender, group dynamic and personalization, despite all of their editorial safeguards and curation, it raises the question of just how well even statistical modeling can address biases that reflect such wholesale skews.

Moreover, to the degree that the strong pronoun imbalance of journalism reflects systematic biases in what the world's media consider "newsworthy," such biases are likely entrenched across the digital landscape, suggesting that the challenges confronting the AI community are even greater than acknowledged.