Using The Global Similarity Graph To Bootstrap Categorization Models Using Web NGrams 3.0

A common question from organizations building document classifiers on top of the Web NGrams 3.0 dataset is how to accelerate the model training process. Some organizations already have robust suites of existing classification models that can typically be adapted, through a few different approaches, to work with the ngram dataset's KWIC environment. But for many, GDELT represents the first time they have attempted to apply document classifiers at scale or to expand them to a larger selection of topics, especially over realtime, real-world multilingual content.

The simplest approach to building document classifiers for GDELT's Web NGrams 3.0 dataset is to use a variety of SME-guided keyword searches and snowball expansion to identify candidate articles and then have human analysts categorize each article. This yields the highest quality results, but also demands the most human effort, since analysts must classify every article in the training and test datasets.

An alternative that can rapidly yield high-quality results, and in some cases even do a better job of capturing edge cases, is to use the Global Similarity Graph (GSG) Document Embeddings dataset. Human analysts identify a small number of highly relevant articles, and these "seed" articles are then scanned against the broader GSG corpus for highly similar articles as a form of "more like this" semantic search. This rapidly yields a much smaller selection of typically more relevant articles, greatly reducing the number of articles that human analysts must review. For proof-of-concept classifiers, this workflow can often yield a workable MVP right out of the box.
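The seed-based "more like this" scan described above can be sketched roughly as follows. This is only an illustrative sketch, not GDELT's own implementation. It assumes the document embeddings have already been loaded into memory as a mapping from article URL to vector, uses cosine similarity as the similarity measure, and keeps any article whose best match against a seed clears a cutoff. The function names, the toy three-dimensional vectors, the example URLs and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(seed_vectors, corpus, threshold=0.8, top_k=None):
    """Return (url, score) pairs for corpus articles similar to any seed.

    seed_vectors: embeddings of the analyst-chosen seed articles.
    corpus: dict mapping article URL -> embedding (assumed already loaded).
    threshold: hypothetical cutoff; tune it against analyst review.
    """
    scored = []
    for url, embedding in corpus.items():
        # Score each article by its closest seed.
        best = max(cosine(embedding, seed) for seed in seed_vectors)
        if best >= threshold:
            scored.append((url, best))
    # Most similar first, so analysts review the strongest candidates first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k] if top_k else scored


# Toy stand-ins for real embeddings (assumption: real vectors are much longer).
seed_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
corpus = {
    "https://example.com/a": [0.95, 0.05, 0.0],
    "https://example.com/b": [0.0, 1.0, 0.0],
    "https://example.com/c": [0.7, 0.7, 0.0],
}

candidates = more_like_this(seed_vectors, corpus, threshold=0.8)
for url, score in candidates:
    print(f"{score:.3f}  {url}")
```

The resulting candidate list is what the human analysts would then review and label. Lowering the threshold trades a larger review queue for better coverage of edge cases.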