Announcing GDELT’s New “Tools for Journalists” Initiative

Kalev Leetaru

11 years ago

UPDATE 5/23/2015: In discussion with all of the users of this program, we've found that all of you have now migrated to the new GDELT Trends Dashboard, which offers an identical export option that is far more powerful, including advanced filters. We're therefore halting new update files for this program as of 5/23/2015. Please contact us with any questions.

What an incredible week it has been! This past Thursday we announced the public debut of GDELT 2.0 and GDELT Translingual, and just two days later, we are incredibly excited to announce the release this afternoon of the first phase of GDELT’s new “Tools for Journalists” initiative!

We’ve been hearing from many of you for a while now that you find GDELT to be an unprecedented and immensely powerful tool for exploring global events and narratives, but that the technical barrier to entry is simply too high for you to use it in your day-to-day work. That holds whether you are a journalist trying to access local reporting on the ground where a situation is unfolding, an NGO trying to understand how your aid programs are being received or discussed in the daily narrative of the areas where you are working, an aid worker trying to map the local security and aid climate, or even a private citizen on the ground trying to better understand what the world is saying about your region.

Your requests have been heard loud and clear and over the next few months we will be rolling out a wide array of new tutorials and tools to make it easier than ever to use GDELT to visualize, analyze, and explore the world even if you don’t have extensive technical experience working with large datasets.

Today we release the very first results of this new initiative. While we’re calling them “tools for journalists”, in fact these tools are designed for use by anyone from NGOs to aid workers to citizens on the ground. Our first tool is designed to make it possible to leverage GDELT 2.0 and GDELT Translingual’s ability to listen to media coverage in 65 languages from the farthest corners of the earth, without having to be an expert in the analysis of massive datasets. This new tool generates a daily CSV spreadsheet each morning containing a summarized and human-friendly reformatting of the GDELT Global Knowledge Graph for the previous day. Each article monitored by GDELT across all 65 languages has its own row in the spreadsheet, listing the timestamp, URL, wordcount, language, locations, names, and themes for the article. Due to the enormous volume of material monitored by GDELT, there are two spreadsheets generated for each day, one covering only English-language coverage and the other covering all other languages. To minimize the size of the files, an entry that appears two or more times in a field is limited to just its first two mentions. Thus, if an article mentions "Estonia" 12 times, it will be listed only twice in the Locations field. Differentiating between a single mention and two or more mentions makes it possible to separate articles that reference something only in passing from those that discuss it in greater depth.
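The two-mention cap described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not GDELT's actual code: the function names `cap_mentions` and `discussed_in_depth` are hypothetical, and the field is assumed to already be split into a list of entries.

```python
from collections import Counter


def cap_mentions(entries, limit=2):
    """Keep at most `limit` occurrences of each entry, preserving order."""
    seen = Counter()
    kept = []
    for entry in entries:
        if seen[entry] < limit:
            kept.append(entry)
            seen[entry] += 1
    return kept


def discussed_in_depth(entries, name):
    """True when `name` appears at least twice in an (already capped) field."""
    return entries.count(name) >= 2


# An article that mentions Estonia 12 times and Latvia once.
locations = ["Estonia"] * 12 + ["Latvia"]
capped = cap_mentions(locations)
print(capped)
```

After capping, "Estonia" still appears twice and "Latvia" once, so a reader can still tell the repeated mention from the passing one even though the other ten occurrences were dropped.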

You can load this CSV spreadsheet directly into Microsoft Excel and use Excel’s “Filter” capability to interactively slice-and-dice the data on your desktop or laptop computer to rapidly find coverage of interest. While these files are large (typically around 50MB compressed and 150MB uncompressed) and require Microsoft Excel 2010 or newer (due to the large number of rows), we’ve tested these files extensively on a three-year-old MacBook Air laptop without issue, meaning that most users with access to a broadband internet connection should be able to use these files even while on the road using a laptop. Note that unlike the “.csv” files used in the raw GDELT files, these spreadsheets are true CSV files that can be read directly into Microsoft Excel without trouble.

This is only a first step, and we realize that having to load a 150MB spreadsheet into Microsoft Excel on a laptop or desktop is still far from ideal, but at the very least it allows you to make use of GDELT’s powerful filters without needing any technical background in data analytics. In the next month we hope to release several interactive web-based visualization and analytics tools that will make it possible to work with GDELT in a vastly more powerful and user-friendly format that doesn’t require downloading any data files and gives access to a much broader range of capabilities.

You can download the latest daily update files, generated by around 1-2AM EST each morning, from the URL below:

http://data.gdeltproject.org/toolsforjournalists/

Getting Started

To help you get started quickly with these files, the following brief tutorial will demonstrate how to use the daily spreadsheet to quickly filter for specific geographies, languages, and topics.

Download the most recent daily spreadsheets from http://data.gdeltproject.org/toolsforjournalists/ and open the “.translated.csv” file in Microsoft Excel. It may take a minute or more to fully load into Excel on a machine that does not have much memory. Once the file is loaded, you will see all of the coverage that GDELT monitored the previous day in languages other than English from which it was able to extract at least one piece of information. Choose the Data tab in Excel’s top menubar and then choose the “Filter” button to access Excel’s Filter feature. This will add a tiny dropdown button to each column header. Click on the dropdown beside your column of interest to have Excel filter the spreadsheet to just those rows containing the desired value in that column. (See this tutorial for more details on how to use the Filter feature.)

For example, let’s say you are interested in yesterday’s coverage of Boko Haram. Once you’ve chosen the Filter button from the Data tab of the menubar at the top of Excel, you should see the dropdown beside each column heading. Click on the dropdown beside the Names column heading. If you are using the Windows version of Excel, choose “Text Filters” and then “Contains” from the popup menu that appears. If you are using the Macintosh version of Excel, choose “Contains” from the dropdown under the Filter tab of the popup menu that appears. Then type “Boko Har” and hit enter (we leave off the ending of “Haram” to catch variant spellings like “Harem” vs “Haram”). Within a few seconds (sometimes a bit longer depending on the speed of your computer), you should see a filtered version of the spreadsheet where only rows containing a mention of Boko Haram are visible. Most of these articles only mention the terror organization a single time in passing, so you can change the search phrase to “Boko Har*Boko Har” (no spaces around the “*”) and this will tell Excel to only match rows where “Boko Har” appears twice in the Names field (an article that mentions the group twice is much more likely to actually be about that group).
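For readers who would rather script this step than click through Excel, the same “Contains” filter with a “*” wildcard can be approximated with Python’s standard library. This is a minimal sketch under stated assumptions: the sample rows are made up, the header names (`URL`, `Language`, `Names`) and the “;” separator inside the Names field are guesses about the file layout, and `contains_filter` and `filter_rows` are hypothetical helper names. Matching is case-insensitive, as Excel’s text filters are.

```python
import csv
import io
import re

# Made-up sample rows for illustration only. The header names and the ";"
# separator inside the Names field are assumptions about the file layout.
SAMPLE_CSV = """URL,Language,Names
http://example.com/a,Arabic,Boko Haram;Boko Haram;Person A
http://example.com/b,French,Boko Harem
http://example.com/c,Spanish,Person B
"""


def contains_filter(pattern):
    """Build a case-insensitive 'Contains' test; '*' matches any run of text."""
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    regex = re.compile(".*".join(pieces), re.IGNORECASE)
    return lambda text: regex.search(text) is not None


def filter_rows(rows, column, pattern):
    """Return the rows whose `column` value matches the pattern."""
    matches = contains_filter(pattern)
    return [row for row in rows if matches(row[column])]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Any mention at all, catching both spellings.
any_mention = filter_rows(rows, "Names", "Boko Har")
# At least two mentions in the Names field.
two_mentions = filter_rows(rows, "Names", "Boko Har*Boko Har")

print([row["URL"] for row in any_mention])
print([row["URL"] for row in two_mentions])
```

With a real daily file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle, and the column names adjusted to match the actual header row.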

Using the Boko Haram example above, click on the Language column’s filter dropdown and add a filter to only display rows containing “Arabic” in the Language column. At this point the rows still visible are those that contain Boko Haram in the Names column and Arabic in the Language column (Arabic-language coverage of Boko Haram). Because filters are cumulative, remember to click on the filter dropdown on each column and choose “Clear Filters” to remove any filters that you no longer want.
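The cumulative behavior of filters described above amounts to a logical AND across columns. A rough Python equivalent, again using made-up rows and assumed header names (`apply_filters` is a hypothetical helper, not part of any GDELT tool), might look like this:

```python
import csv
import io

# Made-up sample rows; header names are assumptions about the file layout.
SAMPLE_CSV = """URL,Language,Names
http://example.com/a,Arabic,Boko Haram;Boko Haram
http://example.com/b,French,Boko Haram
http://example.com/c,Arabic,Person A
"""


def apply_filters(rows, filters):
    """Keep rows where every listed column contains its search text.

    `filters` maps a column name to a case-insensitive substring. Each
    filter narrows the result of the previous one, so they are cumulative.
    """
    kept = rows
    for column, needle in filters.items():
        kept = [row for row in kept if needle.lower() in row[column].lower()]
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

names_only = apply_filters(rows, {"Names": "Boko Har"})
names_and_language = apply_filters(rows, {"Names": "Boko Har", "Language": "Arabic"})

print(len(names_only), len(names_and_language))
```

Dropping a key from the `filters` dictionary plays the role of “Clear Filters” on that column: the remaining filters still apply.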

Remove the Language filter and select the entire Language column. Copy the column to the clipboard and create a new tab in Excel and then do a “Paste Special – Values Only” into the new tab. This should paste a new column in the new tab that contains only the contents of the Language column for the rows that matched your filters (all Boko Haram coverage). Now create a PivotTable from this new column and you’ll get a histogram of the language breakdown of the coverage GDELT monitored that day discussing Boko Haram, showing which languages emphasized the group more in their reporting (at least according to what GDELT was able to monitor). (The Macintosh versions of Excel occasionally generate an error when trying to copy-paste the column values, so you may need to try this several times.) Note that this breakdown tends to reflect the overall volume of coverage available in each language and so should be interpreted carefully (there is simply more Spanish-language news coverage than there is Swahili coverage).
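The PivotTable step is essentially a count of matching rows per language. As a hedged sketch of the same tally outside Excel, the snippet below uses `collections.Counter` over made-up rows; the header names and the “;” separator are assumptions, and the counts come from the invented sample, not from any real GDELT file.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; header names and the ";" separator are assumptions.
SAMPLE_CSV = """Language,Names
Arabic,Boko Haram;Boko Haram
Arabic,Boko Haram
French,Boko Harem
Spanish,Person A
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Count languages only for rows whose Names field matches the search text.
language_counts = Counter(
    row["Language"] for row in rows if "boko har" in row["Names"].lower()
)

# Print the breakdown from most to least frequent language.
for language, count in language_counts.most_common():
    print(f"{language}: {count}")
```

The same caveat applies as with the PivotTable: raw counts mostly track how much coverage exists in each language, so dividing by the total number of rows per language would give a fairer comparison.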

Alternatively, try narrowing down to Farsi coverage of Cyber (“Farsi” in the Language column and “CYBER” in the Themes column) or Arabic coverage of Estonia (“Arabic” in the Language column and “Estonia” in the Locations column) or French coverage of Israel (“French” in the Language column and “Israel” in the Locations column).

While these are simple examples, they demonstrate how to quickly use these daily spreadsheets to rapidly triage the world’s news media as monitored by GDELT across 65 languages and should allow you to perform basic filters and queries over GDELT without requiring any technical experience in large data analytics.

Once again, we recognize that even this system still requires downloading and working with a fair amount of data and so will not be ideal for all users or usage scenarios. So stay tuned over the coming weeks and months as we release a number of new tools that will allow you to fully explore and analyze GDELT 2.0 entirely in the cloud without writing a line of code.