New One Minute Maps: BigQuery UDF + CartoDB

We're incredibly excited to announce today the ability to generate rich interactive geographic maps in less than 60 seconds from the GDELT Global Knowledge Graph 2.0 spanning 150 million articles in 65 languages over the last 8 months and updating every 15 minutes. This incredible new capability comes through Google BigQuery's new User Defined Function (UDF) capability which allow you to define an arbitrarily-complex MapReduce "map" function in JavaScript and have BigQuery execute it inline as part of your SQL query, running it directly on your data across potentially tens of thousands of processors.

Up until now creating maps from GDELT required running an initial query on BigQuery to get the matching list of documents, then downloading potentially hundreds of gigabytes of data to your local computer and then running a PERL script on your machine that could require tens of gigabytes of RAM and the greater part of an hour to complete. The PERL script was needed because part of making a map from the GKG requires processing the list of locations and themes found in each document to identify the themes appearing closest to each location mention, in order to map just those locations mentioned in proximity to the theme of interest.

Now, with just two lines of code, BigQuery handles the entire process, including the CPU-intensive task of associating each location mentioned in an article with the themes mentioned closest to it, all of the aggregation and filtering, and generates a final output CSV file that can be drag-dropped directly into CartoDB to make the final map. The entire process, from pasting the initial query into BigQuery through publishing the final map from CartoDB, can be accomplished in less than 60 seconds.

We'd like to thank Google BigQuery Tech Lead Jordan Tigani for writing the initial UDF and guiding us through the best practices on UDF use.

RUNNING THE QUERY

This tutorial makes use of Google's BigQuery data analysis platform, which requires that you sign up for Google BigQuery account (as of the time of this writing you get 1TB of free data querying per month, of which the tutorial below will use around 150GB).

Once you have signed up for Google BigQuery, navigate to the GDELT GKG 2.0 table hosted in BigQuery. Click on the grey "Query Table" button at the top right of the page. You should now see a text entry box. Click on the "UDF Editor" button at the top right and the textbox should display some sample UDF JavaScript code. Select-all and delete the same code. Then paste the entire block of code below as-is into the textbox (it is quite lengthy, so make sure you copy-paste the ENTIRE block of code and don't miss any of it). You can adjust the "MAX_DISTANCE" variable at the top to adjust the window that the function searches to connect a theme to a location. Essentially what this UDF does is take each input document, break it into locations, and for each location compile a list of the themes that appeared within 250 characters before or after the theme in the document. In other words, it finds the themes mentioned closest to each location mention. For each document it outputs a list of rows, one per location mentioned in the document, with a column containing a list of all themes mentioned within 250 characters on either side of the theme. It also perhaps a set of filtering and cleaning tasks on the SharingImage and DocumentIdentifier URLs.

function locationParserFun(row, emitFn) {
  var MAX_DISTANCE = 250;
  if (row.V2Locations === null || row.V2Themes === null) {
    return;
  }
  var locations = String(row.V2Locations).split(';');
  var themes = String(row.V2Themes).split(';');
  // emitFn({Location: locations[0], V2Themes: themes.join(';')})
  for (var location of locations) {
    var locationFields = location.split('#');
    
    if (locationFields.length < 2) {
      continue;
    }
    var locationOffset = parseInt(locationFields[locationFields.length - 1], 10);
    var closeThemes = [];

    for(var theme of themes) {
      var themeFields = theme.split(',');

      if (themeFields.length < 2) { continue; }
      var themeName = themeFields[0];
      var themeOffset = parseInt(themeFields[1], 10);
      if ((locationOffset > themeOffset && locationOffset - themeOffset < MAX_DISTANCE)
         || (locationOffset < themeOffset && themeOffset - locationOffset < MAX_DISTANCE)) {
        //closeThemes.push(themeName + ',' + themeOffset); this emits the theme offsets if desired
        closeThemes.push(themeName);
      }
    }

      if (closeThemes.length > 0) {
      //output the final results if we had a set of affiliated themes...

      //clean up our SharingImage...
      var sharingimage = row.SharingImage;
      if (sharingimage != null) {
          var match = sharingimage.toLowerCase();
          if (match.indexOf('button') > -1 || match.indexOf('bttn') > -1 || match.indexOf('.gif') > -1 || match.indexOf('template') > -1 || match.indexOf('default') > -1 || match.indexOf('logo') > -1 || match.indexOf('img_fb') > -1 || match.indexOf('facebook') > -1 || match.indexOf('figaro') > -1 || match.indexOf('og-angop') > -1 || match.indexOf('gannett-cdn') > -1 || match.indexOf('no_preview') > -1 || match.indexOf('risingkashmir') > -1 || match.indexOf('townnews.com') > -1 || match.indexOf('-square') > -1 || match.indexOf('apmobile') > -1 || match.indexOf('mynorthwest.com') > -1 || match.indexOf('the_tribune_sq') > -1 || match.indexOf('g-mtn.png') > -1 || match.indexOf('story-thumb-large') > -1 || match.indexOf('top_stories_stopimg') > -1 || match.indexOf('fb_image') > -1 || match.indexOf('cubadebate-ipad') > -1 || match.indexOf('jdsupra.com') > -1 || match.indexOf('nohotlinks') > -1 || match.indexOf('og_image_meridianstar') > -1 || match.indexOf('strat_n') > -1 || match.indexOf('skins') > -1 || match.indexOf('banner') > -1 || match.indexOf('analytics') > -1 || match.indexOf('.icon') > -1 || match.indexOf('-icon') > -1 || match.indexOf('/icon') > -1 || match.indexOf('blank.') > -1 || match.indexOf('fb.png') > -1 || match.indexOf('opengraph_default') > -1 || match.indexOf('fai2.png') > -1 || match.indexOf('http-equiv') > -1 || match.indexOf('deramit.jpg') > -1 || match.indexOf('recomendovano.jpg') > -1 || match.indexOf('story-thumb-large.jpg') > -1 || match.indexOf('no_image.jpg') > -1 || match.indexOf('30001090.cms') > -1 || match.indexOf('pic_fb.jpg') > -1 || match.indexOf('mw-dr328_mw_soc_ns_20150801233302.png') > -1 || match.indexOf('fft81_mf2574590.Jpeg') > -1 || match.indexOf('udn_baby.png') > -1 || match.indexOf('news_no_image') > -1 || match.indexOf('nzhfbcover') > -1 || match.indexOf('no_headline') > -1 || match.indexOf('lotus_img') > -1 || match.indexOf('tr_white_square') > -1 || match.indexOf('favicon') > -1 || match.indexOf('top_stories_thumb') > -1 || match.indexOf('/0.jpg') > -1 || match.indexOf('newog600x600') > -1 || match.indexOf('/w.png') > -1 || match.indexOf('eldiaesfb') > -1 || match.indexOf('donga_icon') > -1 || match.indexOf('dot200.png') > -1 || match.indexOf('_stopimg.jpg') > -1 || match.indexOf('herald_h.jpg') > -1 || match.indexOf('squaresun_web') > -1 || match.indexOf('avatar.jpg') > -1 || match.indexOf('news.jpg') > -1 || match.indexOf('fb/wm.png') > -1 || match.indexOf('share_noimage') > -1 || match.indexOf('twp-3000x1568.jpg') > -1 || match.length > 1200) { sharingimage = ''; }
	  sharingimage = sharingimage.replace(/\\/g,'/');
	  sharingimage = sharingimage.replace(/['"]/g,'');
	  sharingimage = sharingimage.replace(/[^A-Za-z0-9./%$&!?#():;-_=]/g,'');
      } else { sharingimage = ''; }

      //output back to BigQuery...
      var result = {LocationName: locationFields[1], Latitude: locationFields[5], Longitude: locationFields[6], AssociatedThemes: closeThemes.join(';'), DocumentIdentifier: row.DocumentIdentifier, SharingImage: sharingimage};
      emitFn(result)
    }
  }
}
bigquery.defineFunction(
  'GKGThemeListByLocation',               // Name of the function exported to SQL
  ['V2Locations', 'V2Themes', 'DocumentIdentifier', 'SharingImage'],                    // Names of input columns
  [{'name': 'LocationName', 'type': 'string'},  // Output schema
   {'name': 'Latitude', 'type': 'string'},
   {'name': 'Longitude', 'type': 'string'},
   {'name': 'AssociatedThemes', 'type': 'string'},
   {'name': 'DocumentIdentifier', 'type': 'string'},
   {'name': 'SharingImage', 'type': 'string'}
  ],
  locationParserFun // Reference to JavaScript UDF
);

Now click on the "Query Editor" button at the top right of the page and you should see some sample SQL code – once again select-all and delete the same SQL code and copy-paste in the SQL query below (as with the UDF code above, this query is quite long, so make sure you copy-paste the ENTIRE block of code). This query actually involves two separate queries joined together. The first query compiles the list of documents mentioning the theme of interest (in this case it is a complex query designed to filter for articles that are really about poaching and not just mentioning it casually), which are then fed to the UDF above to compile the list of all locations within those documents and the themes mentioned alongside of them, which in turn is then filtered to just those locations mentioned within 500 characters of the POACHING theme. This output is collapsed on Latitude+Longitude to compile a list of documents mentioning each unique location in the context of poaching. The DENSE_RANK() analytic window operator is used to limit the number of articles returned per location to a maximum of 50 (since a major location like New York City might have tens of thousands of mentions). The second query performs the same tasks, but within the limit of 50 results per city, and counts the actual total number of matching documents. These are then joined together so that the final result returns up to 50 articles per location, along with the actual total number of matching documents.  Change the two inner WHERE queries to map a different topic, change the "WHERE AssociatedThemes like" query to place additional requirements on the themes that must appear alongside each location (such as requiring the theme to be mentioned twice within 500 characters), and change "where Rank <= 50" to control the maximum number of documents returned for each location, and finally, change "where b.NumArticles >= 2" at the bottom of the query to filter out locations with too few articles mentioning them.

select a.LocationName LocationName, b.NumArticles NumArticles, a.Latitude Latitude, a.Longitude Longitude, a.ArticleList ArticleList, a.SharingImage SharingImage from (

select max(LocationName) LocationName, Latitude, Longitude, count(distinct(DocumentIdentifier)) NumArticles, GROUP_CONCAT_UNQUOTED(UNIQUE(DocumentIdentifier),'
') ArticleList, max(SharingImage) SharingImage from (
select DENSE_RANK() OVER (PARTITION BY Latitude, Longitude ORDER BY DocumentIdentifier DESC) Rank, LocationName, Latitude, Longitude, CONCAT('<a href="', DocumentIdentifier, '" target="blank">Article Link</a>') DocumentIdentifier, SharingImage from (
 SELECT LocationName, Latitude, Longitude, AssociatedThemes, DocumentIdentifier, SharingImage  FROM
 GKGThemeListByLocation((
     SELECT V2Locations, V2Themes, DocumentIdentifier, SharingImage FROM [gdelt-bq:gdeltv2.gkg] where DocumentIdentifier like 'http%' AND
     (V2Themes like '%ENV_POACH%' and (V2Themes like '%ENV_POACH% ENV_POACH %' or V2Themes like '%SEIZE%' or V2Themes like '%ENVIR%' or V2Themes like '%BLACK_MARKET%' or V2Themes like '%ENV_SPECIESENDANGERED%' or V2Themes like '% SLFID_NATURAL_RESOURCES%'))
 ))
) where AssociatedThemes like '%POACHING%' 
) where Rank <= 50 group by Latitude, Longitude order by NumArticles desc ) a JOIN EACH ( select Latitude, Longitude, count(distinct(DocumentIdentifier)) NumArticles from ( select Latitude, Longitude, DocumentIdentifier from ( SELECT LocationName, Latitude, Longitude, AssociatedThemes, DocumentIdentifier, SharingImage FROM GKGThemeListByLocation(( SELECT V2Locations, V2Themes, DocumentIdentifier, SharingImage FROM [gdelt-bq:gdeltv2.gkg] where DocumentIdentifier like 'http%' AND (V2Themes like '%ENV_POACH%' and (V2Themes like '%ENV_POACH% ENV_POACH %' or V2Themes like '%SEIZE%' or V2Themes like '%ENVIR%' or V2Themes like '%BLACK_MARKET%' or V2Themes like '%ENV_SPECIESENDANGERED%' or V2Themes like '% SLFID_NATURAL_RESOURCES%')) )) ) where AssociatedThemes like '%POACHING%' ) group by Latitude, Longitude ) b ON a.Latitude = b.Latitude and a.Longitude = b.Longitude where b.NumArticles >= 2 order by b.NumArticles DESC

Finally, click on the bright red "Run Query" button in the row under the textbox. BigQuery will now run your query, which should take just under one minute (during periods when BigQuery is under heavy load or if your query returns an especially large number of results, it may take slightly longer than one minute to complete).

Once the query has completed, the results will display in a table at the bottom of the page. Once you see the results, click on the "Download as CSV" button just above the query results table. If you get an error that the results are too large to download as a CSV, modify the "b.NumArticles >= 2" portion of the SQL query above to set it to a higher number to filter out locations with too few mentions and reduce the number of returned rows.

MAPPING THE RESULTS

Now, sign up for a CartoDB.com account and log into your account and click on the dropdown at the top of the page to switch into your "Datasets" dashboard. Now drag-drop the CSV file you just downloaded from BigQuery onto the page and CartoDB will upload it to your account and import it into a new table. Once the table opens, click on the "Map View" button at the top center of the page. Click on the Map Wizard button on the right-hand toolbar, pick "Bubble" as the map type and set to the following parameters:

one-minute-maps-bigquery-udf-cartodb-1

The screen should now look like this:

one-minute-maps-bigquery-udf-cartodb-2

Now, click on the InfoWindow tab and set to the following options (note you will have to drag-reorder the rows):

one-minute-maps-bigquery-udf-cartodb-3

That's it! You should now have the following interactive clickable map! Happy Mapping!