As we work to extend the searchable date range of the DOC 2.0 API beyond a rolling three-month window and to expand its capabilities based on all of the feedback you've given us, we will also be phasing out some older features that are simply not being used. Doing so allows us to rearchitect our search infrastructure to support the new capabilities that you've made clear are critical to your work.
Phasing Out Native Language Search
Since the introduction of the DOC 2.0 API six months ago, 99.6% of all searches have been conducted in English. The 0.4% of searches that used the "&searchlang=" parameter to specify keywords in a language other than English were overwhelmingly for rare terms that returned no results. At the same time, maintaining a full-text search infrastructure that correctly and robustly supports keyword and keyphrase search across 65 languages, including nonsegmented Asiatic languages, requires considerable algorithmic complexity as well as computational and storage resources. By phasing out native language search we can make architectural changes that considerably expand the time horizon searchable by the DOC 2.0 API without any additional hardware.
We've heard you loud and clear that English is how you want to search the DOC 2.0 API, so we will be phasing out its native language search capabilities shortly. A new API is forthcoming that provides some of the key functionality of the DOC API but is custom-built for foreign-language searching and will be far more amenable to this kind of search activity.
Adjusting Textual WordClouds
Word clouds take the top 75 most relevant documents returned for a query and generate a histogram of the 200 most common non-stopwords that appear in them. They can be quite useful for assessing the overall topical focus of a given search, but are also limited in the actual insights they provide.
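As a rough illustration of the histogram described above, here is a minimal sketch. It is not the API's actual implementation: the `word_cloud_histogram` function, the tiny stopword list, and the ASCII-only tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for"}


def word_cloud_histogram(ranked_docs, max_docs=75, max_words=200):
    """Count non-stopwords across the top-ranked documents.

    ranked_docs is assumed to be a list of document texts already
    sorted by relevance, most relevant first.
    """
    counts = Counter()
    for text in ranked_docs[:max_docs]:
        # Simplistic tokenizer: lowercase runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    # Keep only the most frequent words, as (word, count) pairs.
    return counts.most_common(max_words)


docs = [
    "The election results in the capital",
    "Election officials count the votes",
    "Votes for the election are in",
]
histogram = word_cloud_histogram(docs)
```

Because only the first `max_docs` documents contribute, the histogram is only as meaningful as the relevance ranking that chose them.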
It is important to remember that when using word clouds with queries that do not include a keyword, the resulting word cloud has limited to no meaning and can be highly misleading. For example, if you generate a word cloud from a search for Russian-language articles published in the last 24 hours, the word cloud will be based on 75 randomly selected articles that may or may not bear any resemblance to the major themes emerging over the last 24 hours in the Russian press.
We will be making some changes to how word clouds function, including limiting the time horizon they consider and potentially shifting them to rely more on the title text. These changes come as we transition to a new indexing architecture that allows us to considerably expand our searchable time horizon, and as we adjust how certain functions work to help guide users towards the most productive and meaningful outputs.
Stemming
Some of you have asked if we plan to support word stemming in the foreseeable future to make it easier to search across word forms. The short answer is that this is an area in which we are actively experimenting, but to date we have found that, given GDELT's global scope, the edge cases of current-generation stemmers are just too detrimental to typical search behavior in GDELT to make it possible to implement them without severely harming accuracy. Algorithmic stemmers in particular tend to suffer edge cases that are closely aligned with the kind of searches common to GDELT. For example, the Porter2 algorithm (ElasticSearch's default English stemmer is a base Porter stemmer) stems all six of the words "general,generalities,generality,generalization,generally,generals" to "general." Thus a search for military "generals" will return any document talking about vague "generalizations." Similarly, "intern,internal,internally,international" all resolve to "intern," meaning that searches for "international" agreements, "internal" medicine, and "intern" work all yield the same stem. Dictionary-based approaches struggle to cope with the immense variety of global language, especially the flow of new words and contexts entering emergent use each day, given that GDELT covers the entire planet and nearly every topic in human conversation.
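The collision can be sketched with a toy inverted index. The stem table below is simply copied from the examples above rather than computed by a real stemmer, and the function names and two sample documents are hypothetical.

```python
from collections import defaultdict

# Stems taken from the examples in the text. This is a lookup table for
# illustration, not an implementation of the Porter2 algorithm.
EXAMPLE_STEMS = {
    "general": "general",
    "generalities": "general",
    "generality": "general",
    "generalization": "general",
    "generally": "general",
    "generals": "general",
    "intern": "intern",
    "internal": "intern",
    "internally": "intern",
    "international": "intern",
}


def stem(word):
    """Return the example stem, or the word itself if it is not in the table."""
    return EXAMPLE_STEMS.get(word, word)


def build_index(docs, use_stemming):
    """Map each (optionally stemmed) word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            key = stem(word) if use_stemming else word
            index[key].add(doc_id)
    return index


def search(index, term, use_stemming):
    """Look up a single term, stemming the query the same way as the index."""
    key = stem(term.lower()) if use_stemming else term.lower()
    return sorted(index.get(key, set()))


# Hypothetical two-document collection.
docs = {
    "military": "the generals met to plan the campaign",
    "healthcare": "critics called the claims a generalization",
}

exact = search(build_index(docs, False), "generals", False)
stemmed = search(build_index(docs, True), "generals", True)
```

In this sketch the exact-match search returns only the military document, while the stemmed search also returns the healthcare document, because "generals" and "generalization" share the index key "general".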
Non-research document retrieval services like Google's search engine, which restrict users to a small number of the most relevant results, make heavy use of stemming to maximize recall. This works poorly in a research-oriented environment like GDELT, where a typical use case involves tracking the popularity of a topic over time and looking exhaustively at its usage. Having a search for military generals show a large spike a month ago due to a conversation about generalizations in the healthcare debate is problematic to say the least.
There are myriad stemmers, both algorithmic and dictionary-based, as well as more sophisticated newer approaches, such as neural language comprehension, that are able to provide the contextual disambiguation needed to support robust lemmatization. We are exploring a number of these approaches, especially the newer methodologies that are able to integrate contextual disambiguation, but to date our experiments still show confusion rates that are simply too high for production use in a research-grade environment like GDELT.
Moreover, it turns out that the majority of requests we receive along these lines do not involve simple cases like mapping "russians" to "russian." Some ask to map "russian" and "russians" and "russia" to the same root, which conflates their very different meanings. Others ask to map a "car" to the same as a "truck" to the same as a "vehicle," which involves semantic class mappings that, while accessible for portions of the language through datasets like WordNet, impede searches where such precision is important. Most commonly, they ask to map "russia,russian,russians,kremlin,moscow,putin,kgb,fsb" and myriad other "synonyms" to the general concept of "russia." The problem with synonym mappings in particular is that, while partially accessible through WordNet, word embeddings and other approaches, they are typically highly domain dependent, such that a set of "synonyms" for the concept of "russia" that works well for one research question might be completely mismatched for a different question.
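To make the domain dependence concrete, here is a minimal sketch of query expansion driven by per-domain synonym sets. Everything in it is an assumption for illustration: the domain names, the way the terms from the list above are split between them, the `expand_query` helper, and the parenthesized OR output format.

```python
# Hypothetical per-domain expansion sets. The terms come from the list in
# the text; the split into two research domains is invented for this sketch.
EXPANSIONS = {
    "government": {"russia": ["russia", "kremlin", "moscow", "putin"]},
    "intelligence": {"russia": ["russia", "kgb", "fsb"]},
}


def expand_query(term, domain):
    """Expand a term into an OR group using the synonym set for one domain.

    Falls back to the bare term when the domain has no entry for it.
    """
    term = term.lower()
    synonyms = EXPANSIONS.get(domain, {}).get(term, [term])
    if len(synonyms) == 1:
        return synonyms[0]
    return "(" + " OR ".join(synonyms) + ")"


government_query = expand_query("russia", "government")
intelligence_query = expand_query("russia", "intelligence")
unexpanded_query = expand_query("russia", "sports")
```

The same input term yields a different query for each domain, so a single global synonym table would serve one research question at the expense of another.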
The short answer is that we are exploring a wide array of approaches, including neural-enhanced lemmatization and leveraging word embeddings to provide machine-assisted domain-specific dynamic query expansion based both on static embeddings and the specifics of the query results being returned. We are at the earliest stages of this journey, but hope to roll out some early experiments in 2018.