Readability Scores Dataset For Worldwide Online News Coverage in 2020

As we seek new ways of assessing and understanding how news conveys the world each day, one thread of our work revolves around the characteristics and behaviors of reporting that enable trust and accessibility and help create a more informed society. We've been intrigued by how wildly news coverage around the world and across topics can vary, not just in its framing and contextualization but in how accessible it is to the public. A COVID-19 article about mask guidelines written in dense medical terminology, incorporating reams of specialized statistical methods and referencing copious clinical peer-reviewed journals, might on the one hand convey a greater sense of trust and gravitas than a colloquial piece written by a general assignment reporter, but on the other hand is far less accessible to the general public and thus potentially less effective in influencing public behaviors.

This raises the question of just how accessible or "readable" daily news coverage is today, whether there are major differences across topics and genres (such as science versus politics), across countries and publications and over time. There are myriad conceptualizations of "readability" from classic metrics like the Fog Index to a plethora of modern developments from new grade level measures to entirely new ways of measuring accessibility using comparisons against benchmark language models. There are models using dictionaries of "easy" and "hard" words, repurposing of sentiment dictionaries around notions like "abstract" and "concrete", syllable measures, use of grammatical structures and tense and even attempts at assessing the cognitive expenditure required to internalize a given passage of text. In short, there are almost infinite ways of estimating how a given reader might internalize a given passage and the ease with which they might do so, with new or tweaked metrics being invented almost daily. Languages like English have vast arrays of available metrics, while many languages have few or no widely accessible algorithmic readability implementations.

Which readability measures might make the most sense applied to global news coverage, and what might it look like to try to assess readability at a global scale?

To seed a broader dialog around how we might assess readability across the global news landscape, today we're excited to unveil a brand-new "readability" dataset composed of basic readability scores for much of the worldwide English-language online news coverage GDELT has monitored thus far this year, totaling around 88 million articles. Only English-language articles were considered, and only those with a high CLD2 language score and for which GNU style returned results were included. We chose GNU style because we were interested in providing a wide range of basic metrics spanning classic readability grades, basic counts and a handful of grammatical cues like passivity and nominalization, and because we wanted a well-known algorithmic implementation whose details and assumptions are fully documented and widely available for comparison baselines.

Below are the results of applying style to the CNN article "White House vaccine chief says first Americans could be vaccinated next month", showing all of the metrics it computes for each article.

readability grades:
     Kincaid: 14.4
     ARI: 17.3
     Coleman-Liau: 10.8
     Flesch Index: 51.1/100
     Fog Index: 18.6
     Lix: 55.3 = school year 11
     SMOG-Grading: 14.5
sentence info:
     4069 characters
     870 words, average length 4.68 characters = 1.44 syllables
     26 sentences, average length 33.5 words
     50% (13) short sentences (at most 28 words)
     34% (9) long sentences (at least 43 words)
     20 paragraphs, average length 1.3 sentences
     0% (0) questions
     80% (21) passive sentences
     longest sent 84 wds at sent 2; shortest sent 6 wds at sent 7
word usage:
     verb types:
     to be (33) auxiliary (32)
     types as % of total:
     conjunctions 5% (40) pronouns 7% (62) prepositions 13% (109)
     nominalizations 2% (18)
sentence beginnings:
     pronoun (2) interrogative pronoun (2) article (2)
     subordinating conjunction (1) conjunction (0) preposition (1)
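
As a rough sanity check on the grades above, the Kincaid grade and Flesch Index can be approximated from the summary counts in the "sentence info" block. The sketch below uses the standard published Flesch-Kincaid and Flesch Reading Ease formulas; the function names are our own, and because the syllable average shown above is rounded to two decimals, the results should land close to, but not necessarily exactly on, the reported 14.4 and 51.1.

```python
def kincaid_grade(words, sentences, syllables_per_word):
    """Flesch-Kincaid grade level from aggregate counts."""
    return 0.39 * (words / sentences) + 11.8 * syllables_per_word - 15.59


def flesch_reading_ease(words, sentences, syllables_per_word):
    """Flesch Reading Ease (higher means easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * syllables_per_word


# Counts taken from the sample output above: 870 words, 26 sentences,
# and an average of 1.44 syllables per word (already rounded by style).
words, sentences, syllables_per_word = 870, 26, 1.44

print("Kincaid (approx.):", round(kincaid_grade(words, sentences, syllables_per_word), 2))
print("Flesch (approx.): ", round(flesch_reading_ease(words, sentences, syllables_per_word), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why a piece with 33-word sentences scores as college-level reading even when its vocabulary is fairly ordinary.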

The complete dataset is available in one-minute files spanning January 1, 2020 through November 21, 2020 via the following URL structure, where the timestamp ranges from "20200101000100" to "20201122234800", inclusive:

  • http://data.gdeltproject.org/gdeltv3/readability/YYYYMMDDHHMMSS.readability.json.gz
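
A minimal sketch of enumerating the file URLs for a date range at one-minute resolution, using the URL pattern above. The function name `readability_urls` is hypothetical, and since it is not stated whether a file exists for every minute, a downloader built on this should probably tolerate missing files.

```python
from datetime import datetime, timedelta

BASE_URL = "http://data.gdeltproject.org/gdeltv3/readability/"


def readability_urls(start, end):
    """Yield one file URL per minute from start to end, inclusive."""
    current = start
    while current <= end:
        # Timestamps in the file names follow the YYYYMMDDHHMMSS pattern.
        yield f"{BASE_URL}{current:%Y%m%d%H%M%S}.readability.json.gz"
        current += timedelta(minutes=1)


# Example: the first five minutes of the dataset.
urls = list(readability_urls(datetime(2020, 1, 1, 0, 1, 0),
                             datetime(2020, 1, 1, 0, 5, 0)))
for url in urls:
    print(url)
```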

To download the complete dataset, request each file in the desired date range at one-minute resolution. The entire dataset is 17GB compressed and more than 150GB uncompressed. Each record has the following fields:

  • date. The date/time the article was seen by GDELT. Due to the way the dataset was constructed there may be duplicate entries for some articles with multiple monitoring dates – in those cases, just use the first entry.
  • url. The URL of the article.
  • title. The original title of the article at the time it was seen by GDELT.
  • readabilityScores. A JSON-escaped string containing the complete, unmodified output of GNU style, as seen above.
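
The fields above can be read with a few lines of Python. This is a sketch under stated assumptions: it assumes each file is gzip-compressed with one JSON object per line (the exact file layout is not spelled out here), and the helper names `read_records`, `first_entries` and `parse_grades` are our own. It keeps only the first entry seen for each URL, as suggested above, and pulls the numeric readability grades out of the raw style text.

```python
import gzip
import json
import re

# Matches lines such as "Kincaid: 14.4" or "Flesch Index: 51.1/100".
GRADE_PATTERN = re.compile(
    r"^\s*(Kincaid|ARI|Coleman-Liau|Flesch Index|Fog Index|Lix|SMOG-Grading):"
    r"\s*(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def read_records(path):
    """Yield records from one file (assumes one JSON object per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def first_entries(records):
    """Drop duplicate articles, keeping the first entry seen for each URL."""
    seen = set()
    for record in records:
        if record["url"] not in seen:
            seen.add(record["url"])
            yield record


def parse_grades(style_output):
    """Extract the numeric readability grades from raw GNU style output."""
    return {name: float(value) for name, value in GRADE_PATTERN.findall(style_output)}


# Small demonstration on a fragment of the sample output shown earlier.
sample = (
    "readability grades:\n"
    "     Kincaid: 14.4\n"
    "     ARI: 17.3\n"
    "     Flesch Index: 51.1/100\n"
    "     Lix: 55.3 = school year 11\n"
)
print(parse_grades(sample))
```

When combining multiple files, processing them in chronological order before calling `first_entries` keeps the earliest monitoring date for each article.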

The entire dataset is also available in BigQuery:

We're enormously excited to see what kinds of new questions this dataset enables and to hear your recommendations about other readability metrics that would be the most valuable to your work! We are also considering updating this dataset in realtime – let us know if having realtime updates to this dataset would be of interest.