BigQuery makes it possible to explore data at scales limited really only by your imagination and to help get you started, we've pulled together some advanced analyses below courtesy of the incredible Felipe Hoffa.
FINDING THE TOP CITIES WORLDWIDE WITH CURRENT DISASTER SITUATIONS (GKG)
Earlier we found that we can use BigQuery’s advanced string functions to parse the Global Knowledge Graph’s delimited structure and create a histogram of the top geographic locations mentioned in coverage matching a particular query. Recall that the following query will return a ranked list of the top 100 city- and state-level geographic locations mentioned in global coverage from 10AM through 8PM UTC on March 6, 2015:
SELECT location, COUNT(*) FROM ( select REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#(.*?)#') as location from [gdelt-bq:gdeltv2.gkg] where DATE>20150306100000 and DATE < 20150306200000 ) where location is not null group by location ORDER BY 2 DESC LIMIT 100
This is an example of a “characterization” query where we have a specific search we’re interested in and we want to use the GKG to tell us more about the results – in this case we want the GKG to help us understand the geography of that particular 10 hour period. However, in a simple date-based search like the one above, the results will largely correspond to the world’s major cities in order of population and economic impact.
What if instead we wanted to use BigQuery to tell us what’s happening anywhere in the world right now that is “interesting” and that we might want to take a closer look at? In other words, instead of ranking cities by number of mentions, we want to find those cities that may not have the highest volumes of coverage, but stand out for some particular reason.
One great example would be to use the GKG to find the latest disaster situations worldwide on a rolling basis. Using a single line of SQL, can we ask BigQuery to search through the GKG and return to us a list of worldwide cities most closely associated with discussion of disasters within a given time interval (such as the last 10 hours)? The query below does exactly this:
select location, percentdisaster, numwithdisaster, total from ( select location, numwithdisaster / total * 100 as percentdisaster, numwithdisaster, total from ( SELECT location, sum(hasdisaster) as numwithdisaster, COUNT(*) total FROM ( select REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#(.*?)#') as location, IF(V2Themes like '%DISASTER%DISASTER%', 1, 0) as hasdisaster from [gdelt-bq:gdeltv2.gkg] where DATE>20150306100000 and DATE < 20150306200000 ) where location is not null group by location ORDER BY 2 DESC ) where total > 50 ) where percentdisaster > 25 order by percentdisaster desc, total desc
Congratulations! In a single line of SQL and in less than 2.5 seconds, BigQuery has compiled in rank order the list of worldwide cities for which the majority of mentions in worldwide news coverage during this 10 hour period were with regards to a disaster of some kind. In essence, you’ve built yourself your very own “find all active global disasters” dashboard in a single line of code!
You’ll notice the very first result is “Galena, Illinois,” where a large train derailment of crude oil tanker cars caught fire and required an evacuation. This received considerable news attention due to the tanker cars being a new generation car supposedly immune from leaks or fire. Other top results include “Buchanan County, Virginia”, where massive flooding made roads impassable and swept a car off the road, and “Irondale, Alabama” with a 20-vehicle accident on I-459.
At first the query may seem a bit complex with its use of nested SELECT() statements, but those simply add additional layers of filtering and sorting to the output results. If you look closely, the innermost SELECT() is nearly identical to our original one, but adds a new IF() call to the SELECT():
IF(V2Themes like '%DISASTER%DISASTER%', 1, 0) as hasdisaster
What this does is add a flag to each location mention that indicates whether that mention occurred in an article that mentioned “DISASTER” at least twice in the article text. You can change this IF() statement to filter for any topic of interest or to narrow the filter to look only for disasters that involved injury, death, or property destruction. The reason we use an IF() statement to filter for DISASTER themes instead of making this part of the WHERE clause that does the DATE search is that we need to count all mentions of each location, both in disaster-related articles and in non-disaster-related articles so that we can determine the percentage of the location’s mentions that were disaster-related.
The date filter gives the date range to run the analysis for – in a realworld application, you would set this to be the most recent few hours (experiment with the time window to get the best results based on how wide or narrow you make your filters). The output of this query is then passed into another SELECT() statement that counts the total number of mentions of each location across all articles and separately counts the total number of mentions of each location in articles that mention the “DISASTER” theme at least twice. This is passed to another SELECT() that converts this to a percentage that represents the percentage of the mentions of each location that occurred in coverage of disasters, and finally the outermost SELECT() that drops locations with less than 25% of their coverage being from disaster articles and ranks the final results by their volume and affinity with disaster coverage.
COMPUTING AN ANNUAL RANKING OF DESTABILIZING COUNTRIES (EVENTS)
Following in the footsteps of our last query, what if you wanted to rank all countries in the world by the annual change in the total volume of events in that country from the previous year? In other words, instead of ranking countries by the total number of events from that country (which favors larger and more media-rich nations), this query computes the number of events by country and in each year lists those countries experiencing the greatest increase in event volume from the previous year.
The query below does precisely this. The innermost SELECT() returns the total number of Material Conflict events (QuadClass = 4) by country and year, while the remaining SELECT()’s perform the rank and rank change calculations and finally return the results.
SELECT YEAR, GROUP_CONCAT(country) countries, GROUP_CONCAT(STRING(rank)) rankings, GROUP_CONCAT(STRING(jump)) jumps FROM (SELECT YEAR, country, rank, jump FROM (SELECT YEAR, country, rank, prevrank, prevrank-rank jump, (prevrank-rank)/rank jumpratio, RANK() OVER(PARTITION BY YEAR ORDER BY jumpratio DESC) jumprank FROM ( SELECT YEAR, ActionGeo_CountryCode, rank, LAG(rank) OVER(PARTITION BY ActionGeo_CountryCode ORDER BY YEAR) prevrank FROM (SELECT YEAR, ActionGeo_CountryCode, RANK() OVER(PARTITION BY YEAR ORDER BY c DESC) rank, c FROM (SELECT YEAR, ActionGeo_CountryCode, COUNT(*) c FROM [gdelt-bq:full.events] WHERE ActionGeo_CountryCode IS NOT NULL and QuadClass=4 GROUP BY 1, 2))) a JOIN [gdelt-bq:extra.countryinfo] b ON ActionGeo_CountryCode=fips WHERE prevrank-rank > 5 AND rank < 50) WHERE jumprank<6 ORDER BY 1,rank) GROUP BY 1 ORDER BY 1 DESC
Congratulations! In just around 2.5 seconds BigQuery has analyzed the entire 298 million record Event database, computed the number of Material Conflict events by country and year and each country’s change in conflict event number from the previous year, and output a master ranked list of the countries experiencing the greatest destabilization each year. Keep in mind that an increase in destabilizing events may not necessary mean an increase in the risk of total country collapse, and in some cases a relatively low-fatality event (such as the Charlie Hebdo attacks in Paris) can lead to intense global attention spanning weeks or event months, increasing the rank of a country like France in 2015.
If all went well, you should see the following results (the results for 2015 will likely differ, as these results are circa-March 2015):
|2||2014||Iraq,Ukraine,Nigeria,Palestinian Territory,South Sudan||5,9,10,38,44||8,40,6,19,15|
|17||1999||Serbia,East Timor,Australia,Macedonia,Serbia and Montenegro||17,22,30,32,47||28,72,22,47,53|
|19||1997||Pakistan,Albania,Cambodia,Democratic Republic of the Congo,Zambia||7,11,23,26,44||10,48,14,32,33|
|23||1993||Bosnia and Herzegovina,Egypt,Somalia,Angola,Haiti||4,9,12,17,32||8,14,17,13,35|
|24||1992||Turkey,Bosnia and Herzegovina,Azerbaijan,Moldova,Tajikistan||6,12,15,35,49||8,99,30,70,96|
|31||1985||South Africa,Mexico,Uganda,Taiwan,New Zealand||2,22,35,42,46||7,16,28,31,60|
|32||1984||India,Iraq,Sri Lanka,Nigeria,Saudi Arabia||3,12,31,35,42||17,11,17,15,27|
ESTIMATING THE LOCATION OF NEWS OUTLETS BASED ON GEOGRAPHIC FOCUS (EVENTS)
There are many cases where you need to estimate the likely geographic location of a given news outlet, either for analytic reasons (such as determining which countries portrayed a situation positively vs negatively) or disambiguation (knowing a news outlet is based in Ohio, USA allows a reference to “Urbana” to be disambiguated to “Urbana, Ohio” instead of to “Urbana, Illinois”).
GDELT constantly assesses the unambiguous geographic focus of every news outlet it monitors across all 65 languages and uses this to estimate its likely physical location on earth. While imperfect, this is one of the few approaches that works at a global scale. GDELT uses the Global Knowledge Graph to assess all geographic mentions from each outlet, but it is also possible to use the Event database to replicate this “news outlet geocoding” process. In fact, the query below does precisely this:
SELECT domain, GROUP_CONCAT(CONCAT(ActionGeo_CountryCode, '(', STRING(INTEGER(CEIL(ratio*100))), '%)' )) countries_50pct_news FROM ( SELECT domain, ActionGeo_CountryCode, ratio FROM ( SELECT domain, ActionGeo_CountryCode, ratio, SUM(ratio) OVER(PARTITION BY domain ORDER BY ratio) ratio_sum FROM ( SELECT domain, ActionGeo_CountryCode, c, RATIO_TO_REPORT(c) OVER(PARTITION BY domain) ratio FROM ( SELECT DOMAIN(SOURCEURL) domain, ActionGeo_CountryCode, COUNT(*) c FROM [gdelt-bq:full.events] WHERE SOURCEURL CONTAINS 'http' GROUP BY 1, 2 ) ) ) WHERE ratio_sum > 0.5 ORDER BY domain, ratio DESC ) GROUP BY 1 ORDER BY 1
What this query does is to process the entire 298 million record Event database, taking the “ActionGeo_CountryCode” field of each event (the FIPS country code where the event took place) and the web domain of the URL the event was found mentioned in, and constructs a histogram that counts how many events from each country are recorded in news coverage from each domain. The RATIO_TO_REPORT and OVER(PARTITION BY) functions are more advanced BigQuery functions that allow the query to determine the percentage of events from each domain that occurred in each country (doing this in a single query instead of requiring multiple queries). Note that since the GDELT Events database doesn’t include source information before March 2013 (when it switched to daily updates – though in early Summer 2015 it will be releasing a new version of the historical backfile with all of the source information back to its start date of January 1, 1979) and not all coverage is from web-based sources (such as print and broadcast outlets) the “WHERE SOURCEURL CONTAINS ‘http’” clause ensures that only events from web-based outlets are considered.
For each unique news outlet that GDELT monitored one or more events from, a list of countries is compiled that collectively represent 50% of the events extracted from that outlet’s news coverage. In parentheses beside each country code is the percent of events from that outlet that occurred in that country. Thus, the returned row for “02elf.net” returns “US(39%),UK(11%),CH(9%)” for the list of countries indicating that events occurring in these three countries collectively accounted for 50% of the events from that news outlet and that 39% of events occurred in the United States, 11% in the United Kingdom, and 9% in Switzerland.