GeoJSON API Migration To ElasticSearch

First announced in May 2015, our GeoJSON API has become an immensely popular way of visualizing and interacting with the geography of the global news media through GDELT. As part of our migration to our new ElasticSearch infrastructure, we will be slowly transitioning all of our legacy APIs to the new system before rolling out the new V2 APIs, which support an incredible array of new features. The V1 APIs that you've grown to know and love will continue as before: we will be maintaining them as-is, with a few minor modifications both to accommodate the nuances of ElasticSearch and to remove features that have not been heavily used.

Several key changes will be occurring to the GeoJSON API:

  • Limiting Max Returned Records. Our old APIs made use of a massively optimized data flow architecture that allowed effectively unlimited streaming of records with little throughput cost, but high search cost. Our new ElasticSearch system inverts this equation: search is now incredibly cheap, but returning the complete contents of hundreds of thousands or millions of records is extremely expensive, even on SSDs. In our own extensive experiments over the past year, we've found that when map densities are increased to these levels, the useful insights are typically drowned out. A map with a few thousand points capturing the most significant datapoints is nearly always far more useful than a map with 10 million points. Thus, we are reducing the cap on MAXROWS. This will also affect "location+time" (animated) maps, in that they will internally consider fewer rows. Few users have been hitting these upper caps, and in those cases we've found ways to transition them to more efficient interfaces to the GKG data that better meet their needs.
  • Dropping WordCount Field. The "wordcount" field is not widely used, as most people are more interested in the number of matching articles than in the total word count of those articles. We believe we've reached out to all users currently using this field and found other ways of meeting their needs. This field will be phased out later this week. To ensure we don't break current applications, the field will still appear in the JSON, but will always be 0 from now on.
  • Dropping NumCounts Field. Like the "wordcount" field, this field has not been widely used, and we've worked with all known current users to transition away from it. It will be set to 0 from now on.
  • Dropping GCAMVAR. The GCAM system itself has found immense application in multidimensional modeling of global emotion and in teasing apart realtime reactions to, and historical trends around, major events. However, we've found that most users prefer to interact with multiple GCAM variables at once as part of larger modeling programs, rather than through the GeoJSON API. In short, instead of simply drawing GCAM variables on a map, users are bringing multiple emotional threads together as input to larger forecasting and modeling efforts. Those who do map GCAM are typically mapping multiple GCAM emotions at once through new synthetic dimensions constructed from complex blends of GCAM variables, and thus are using the raw GKG 2.0 files rather than the GeoJSON API. The immense size and dimensionality of the GCAM data is also not well aligned with ElasticSearch's internals. We are therefore removing the GCAMVAR field from the GeoJSON API; instead, the forthcoming V2 API suite will feature a new field giving the GKG 2.0 file offsets for each record, which can be used to pull its full GCAM data as desired.
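
For client applications that still read the zeroed fields or that previously relied on very large result sets, a minimal defensive sketch might look like the following. It assumes the response is a standard GeoJSON FeatureCollection and that the zeroed fields appear under the property keys "wordcount" and "numcounts"; the exact key casing, the helper names, and the "count" property used for ranking are illustrative assumptions, not confirmed details of the API.

```python
# Assumed property keys for the fields that now always return 0.
# Keys are compared case-insensitively because the exact casing in the
# API's JSON output is not stated here.
ZEROED_PROPERTIES = {"wordcount", "numcounts"}


def strip_zeroed_properties(feature_collection):
    """Return a copy of a FeatureCollection without the always-zero fields."""
    cleaned = []
    for feature in feature_collection.get("features", []):
        props = feature.get("properties") or {}
        kept = {k: v for k, v in props.items()
                if k.lower() not in ZEROED_PROPERTIES}
        cleaned.append({**feature, "properties": kept})
    return {**feature_collection, "features": cleaned}


def top_features(feature_collection, limit, weight_key="count"):
    """Keep only the `limit` features with the largest `weight_key` value.

    `weight_key` is a hypothetical numeric property (e.g. a count of
    matching articles); substitute whatever your responses contain.
    """
    features = feature_collection.get("features", [])
    ranked = sorted(
        features,
        key=lambda f: (f.get("properties") or {}).get(weight_key, 0),
        reverse=True,
    )
    return {**feature_collection, "features": ranked[:limit]}


# Hand-written sample data for illustration only (not real API output).
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [2.35, 48.85]},
         "properties": {"name": "Paris", "count": 12,
                        "wordcount": 0, "numcounts": 0}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-0.13, 51.51]},
         "properties": {"name": "London", "count": 30,
                        "wordcount": 0, "numcounts": 0}},
    ],
}

slimmed = top_features(strip_zeroed_properties(sample), limit=1)
print([f["properties"]["name"] for f in slimmed["features"]])
```

Ranking and trimming on the client in this way follows the observation above that a map of the most significant points is usually more useful than one showing every record.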

Based on extensive testing, we believe that these changes will affect only a very small number of users, and we've already worked with those users to transition their applications to more efficient interfaces with the GKG 2.0 data so that they will be unaffected by this migration. The rest of you should see no change other than your queries returning much faster.