As part of our transition of all of GDELT's current and forthcoming full text search features to an ElasticSearch infrastructure, we are making some adjustments to how we handle foreign language search. Through its Translingual platform, GDELT today monitors and mass machine translates news coverage in 65 languages (plus an additional 35 lower-volume, hand-translated languages). Full text search of the original native-language content is far more difficult than plain ASCII English search, given the immense creativity and complexity of the world's languages and scripts.
To this end, although it is a minor change, we will be processing native language text through ElasticSearch's ICU Analysis Plugin. Specifically, we will be using its "icu_tokenizer" and "icu_normalizer" features, the latter of which transforms the text into NFKC_CF form (Compatibility Decomposition, Canonical Composition, Case Folding), so that visually or semantically equivalent character variants, such as full-width and half-width letters or upper and lower case, match one another at search time. Note that for Asian languages that do not use spaces to separate words (such as Chinese, Japanese, Thai and Vietnamese, the last of which places spaces between syllables rather than words), text is presegmented by GDELT Translingual using the Stanford Chinese Word Segmenter (Peking University Standard) for Chinese, KyTea for Japanese, SWATH for Thai and JVnSegmenter for Vietnamese. Searches in those four languages may yield better results if you manually split your search terms into individual words.
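To give a rough sense of what this looks like in practice, the sketch below shows one way an index analyzer combining "icu_tokenizer" and "icu_normalizer" could be declared, along with a small standard-library approximation of what NFKC_CF does to a string. This is an illustration under stated assumptions rather than GDELT's actual configuration: the analyzer name `native_text` and filter name `nfkc_cf_normalizer` are hypothetical, and the helper `approx_nfkc_cf` only approximates ICU's NFKC_CF (for example, it does not strip default-ignorable code points the way ICU does).

```python
import unicodedata

# Hypothetical index settings: the names "native_text" and
# "nfkc_cf_normalizer" are placeholders chosen for this sketch.
# "icu_tokenizer" and "icu_normalizer" are provided by the ICU Analysis Plugin.
ICU_ANALYSIS_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "nfkc_cf_normalizer": {
                    "type": "icu_normalizer",
                    "name": "nfkc_cf",
                }
            },
            "analyzer": {
                "native_text": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["nfkc_cf_normalizer"],
                }
            },
        }
    }
}


def approx_nfkc_cf(text: str) -> str:
    """Approximate NFKC_CF using only the standard library.

    Applies compatibility normalization, then case folding, then
    normalizes again because folding can produce unnormalized output.
    This is an approximation of ICU's behavior, not a replacement.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    for sample in ["Ｆｕｌｌ", "Straße", "\ufb01le"]:
        print(repr(sample), "->", repr(approx_nfkc_cf(sample)))
```

With normalization of this kind applied to both indexed text and queries, a search typed in full-width characters or in a different case can still match the stored form of the same word.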
We are tremendously excited about these new capabilities, which we will be rolling out over the next few weeks, so stay tuned!