One of the amazing things that happens when you monitor the world at GDELT's scale is that you begin to observe language evolving in realtime. New words and grammatical constructs come into being, older words and structures fade from use, punctuation rules change, and the topics, contexts, and ways in which language is used are in constant flux. Perhaps most uniquely, GDELT operates what is likely the world's deepest monitoring program for local sources in local languages across the world, processing more than 100 languages in total, 65 of which are live machine translated in realtime. This means that GDELT is uniquely able to reach far beyond the "usual suspect" languages to offer insights into the myriad global languages for which computational resources and linguistic corpora are few and far between.
As a first glimpse of the kinds of open community linguistic resources that we are starting to create, we wanted to see what it would take to make a trigram dataset for Arabic. Following in the footsteps of our new NGrams At BigQuery Scale tutorial, we used the same approach to process the more than 9.5 billion words of worldwide Arabic news coverage that GDELT monitored from February 2015 to June 2016, resulting in an Arabic trigram table of the 6,444,208 trigrams that appeared more than 75 times over that period. Perhaps most amazing of all, the entire process took just a single line of code and 186 seconds (just over 3 minutes) from start to finish, including the time it took to write the results to the final table!
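For those curious what such a query might look like, the sketch below shows one way to compute trigram counts in BigQuery Standard SQL. It is not GDELT's actual query from the tutorial; the table `my_dataset.arabic_articles`, its columns `doc_id` and `doc_text`, and the whitespace-only tokenization are all assumptions for illustration.

```sql
-- Minimal sketch (not GDELT's production query): count trigrams across a
-- hypothetical table of article texts using BigQuery Standard SQL.
-- Assumed schema: my_dataset.arabic_articles(doc_id STRING, doc_text STRING).
SELECT
  CONCAT(word, ' ', next1, ' ', next2) AS trigram,
  COUNT(*) AS occurrences
FROM (
  SELECT
    word,
    -- Look ahead to the next two tokens within the same document.
    LEAD(word, 1) OVER (PARTITION BY doc_id ORDER BY pos) AS next1,
    LEAD(word, 2) OVER (PARTITION BY doc_id ORDER BY pos) AS next2
  FROM (
    SELECT doc_id, word, pos
    FROM `my_dataset.arabic_articles`,
      -- Naive whitespace tokenization; the real pipeline also splits on
      -- punctuation, as described later in this post.
      UNNEST(SPLIT(doc_text, ' ')) AS word WITH OFFSET AS pos
    WHERE word != ''
  )
)
WHERE next1 IS NOT NULL AND next2 IS NOT NULL
GROUP BY trigram
HAVING COUNT(*) > 75
ORDER BY occurrences DESC;
```

The final `HAVING` clause mirrors the "appeared more than 75 times" threshold used for the released table; materializing the output is just a matter of setting a destination table on the query job.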
You can access this trigram dataset in two different ways:
- Download the UTF-8 CSV (61MB / 203MB uncompressed)
- Access Through Google BigQuery
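Once the table is in BigQuery, pulling back the most frequent trigrams is a one-liner. The table path and column names below are placeholders for illustration, not the published schema; substitute the actual table name shown in the BigQuery console.

```sql
-- Hypothetical example: list the 20 most frequent Arabic trigrams.
-- Table path and column names are placeholders.
SELECT trigram, occurrences
FROM `gdelt-bq.extra.arabic_trigrams`
ORDER BY occurrences DESC
LIMIT 20;
```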
Due to limitations of CLD2, which groups all Arabic under the same language code, we cannot currently break this down further into dialectal versus Modern Standard Arabic (MSA), so what you are seeing here is a mixture of all Arabic worldwide monitored by GDELT from February 2015 to June 2016. In addition, for this first experimental trigram table we put spaces around all characters found in the Unicode "Punctuation" class, meaning that words with internal apostrophes and some other special cases may be incorrectly split (illustrated below). We are currently working on a more robust mechanism for punctuation splitting. Note that you will also see the occasional non-Arabic word, since some Arabic news outlets write certain words in English. Some non-words will also occur due to errors in the monitoring pipeline, especially in identifying the core article body text.
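As a rough illustration of how that naive punctuation splitting behaves (an assumption about how the rule plays out, not the pipeline's actual code), every character in the Unicode Punctuation class gets spaces placed around it, which is why internal apostrophes break words apart:

```sql
-- Every Unicode Punctuation character gets surrounded by spaces, so a word
-- with an internal apostrophe splits into three tokens.
SELECT REGEXP_REPLACE("don't stop", r'(\p{P})', r' \1 ') AS split_text;
-- Returns "don ' t stop", which then tokenizes as: don | ' | t | stop
```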
Based on user feedback, we plan to release a series of 1-grams, 2-grams, 3-grams, 4-grams and 5-grams across all 65 core languages that GDELT monitors, in the hopes of spurring additional NLP and digital humanities research beyond the handful of "core" languages for which ngrams and other linguistic resources are currently available. Stay tuned!