Custom Document-Level GCAM-Like Sentiment Analysis Over The News Using Web NGrams 3.0

Most traditional sentiment analysis systems use a "bag of words" approach to assessing the emotional undercurrents of texts, stripping away punctuation, converting the text to lowercase, splitting it into words and counting how many appear in a predefined dictionary. While incredibly simplistic, this basic workflow often yields results surprisingly close to those of state-of-the-art neural approaches, though neural models typically handle linguistic edge cases more accurately. Best of all, this unordered word lookup is a perfect match for the Web NGrams 3.0 dataset. In fact, you can create your own GCAM-like sentiment modeling of global news coverage right out of the box!
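
As a toy illustration of that core loop, here is a minimal Perl sketch (the two-word dictionary and sample sentence are made up purely for illustration):

use strict; use warnings;

my %SENT = ("wonderful" => 1, "terrible" => -1);    # toy dictionary: word => score

my $text  = "A wonderful ending to an otherwise terrible, terrible week.";
my $score = 0;
foreach my $word (split /\s+/, lc $text) {
    $word =~ s/^\p{Punct}+|\p{Punct}+$//g;          # strip leading/trailing punctuation
    $score += $SENT{$word} if exists $SENT{$word};
}
print "Document score: $score\n";                   # 1 - 2 = -1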

Computing document-level sentiment scores using the NGrams dataset is simple for space-segmented languages:

  1. For each record, skip it if its language does not match that of the sentiment dictionary. Otherwise, take the "ngram" field, strip away any punctuation from the start or end of the word (a Unicode-aware regular expression makes this trivial), convert it to lowercase (for uncased dictionaries), look it up in the sentiment dictionary and record its score if found.
  2. If the dictionary also contains two-word phrases, split the "post" field into "words," extract the first word and concatenate it to the "ngram" field with a space in between. Strip punctuation from the start and end of the concatenated phrase (not the two component words), convert to lowercase as needed and look up in the sentiment dictionary. Only punctuation appearing at the start/end of the phrase is removed to ensure that phrases do not span sentences. Thus, in the snippet "…missed them terribly. Entertaining is the new…" where "ngram" is "terribly." and "post" is "Entertaining is the new," the concatenated two-word phrase will be "terribly. entertaining," with the period in between preventing it from matching "terribly entertaining" in the sentiment dictionary. This can be extended to multiple words to allow for multi-word dictionaries. (A Perl sketch of both steps appears right after this list.)
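
To make these two steps concrete, here is a minimal Perl sketch of the per-record logic, assuming each NGrams record has already been decoded from JSON (for example via JSON::XS) into a hashref $rec and the dictionary loaded into a hashref of word/phrase scores. The "lang" field name, the English-only check and the score_record() helper are illustrative assumptions, not the demo script's internals:

use strict; use warnings;

sub score_record {
    my ($rec, $sent) = @_;                          # $sent: hashref mapping word/phrase => score
    return 0 unless ($rec->{lang} // '') eq 'en';   # skip records not in the dictionary's language

    my $score = 0;

    # Step 1: single-word lookup on the "ngram" field.
    my $word = lc($rec->{ngram});
    $word =~ s/^\p{Punct}+|\p{Punct}+$//g;          # strip punctuation from start/end only
    $score += $sent->{$word} if exists $sent->{$word};

    # Step 2: two-word phrase = "ngram" + first word of the "post" snippet.
    my ($next) = split /\s+/, ($rec->{post} // '');
    if (defined $next && length $next) {
        my $phrase = lc("$rec->{ngram} $next");
        $phrase =~ s/^\p{Punct}+|\p{Punct}+$//g;    # phrase-edge punctuation only, so
                                                    # "terribly. entertaining" still won't match
        $score += $sent->{$phrase} if exists $sent->{$phrase};
    }
    return $score;
}

Summing score_record() over every record that shares the same article URL then yields that article's document-level score.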

For non-space segmented languages, the simplest option is to concatenate the "ngram" and "post" fields together, then use the Unicode word segmentation dictionaries and logic or a language-specific word segmentation tool to identify the boundaries of the first (and optionally second) "words" in the ngram+post snippet and then proceed according to the workflow above.
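
In Perl that outer step might look something like the following sketch, where segment_words() is purely a placeholder for whichever segmenter you choose (a UAX #29 implementation, a tool like MeCab for Japanese, etc.) rather than a real module call:

my $snippet = $rec->{ngram} . $rec->{post};         # rejoin the record's text
my @words   = segment_words($snippet);              # hypothetical segmenter call
if (@words) {
    $score += $SENT{$words[0]} if exists $SENT{$words[0]};   # first "word"
    if (@words > 1) {
        my $phrase = $words[0] . $words[1];         # optional two-"word" phrase, no separator
        $score += $SENT{$phrase} if exists $SENT{$phrase};
    }
}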

To showcase this workflow in action for space-segmented languages, we've created a simple Perl script that does all of this. Download "demo-docsentiment.pl", make it executable, and install several additional Perl modules and supporting tools:

chmod 755 ./demo-docsentiment.pl
apt-get -y install pigz
apt-get -y install curl
apt-get -y install libjson-xs-perl
apt-get -y install liblingua-en-tagger-perl

Also download the two toy sentiment dictionaries, "SENTIMENT-POS.TXT" and "SENTIMENT-NEG.TXT", to the same directory. These dictionaries contain just two words each and thus won't yield useful results. Replace their contents with any off-the-shelf sentiment dictionary or any list of words of your own, one word per line (the files are read as UTF-8 Unicode).
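
If you want to load such dictionaries in your own code rather than relying on the demo script, a minimal loader might look like the following sketch (load_dict() and the +1/-1 scoring scheme are assumptions for illustration, not the demo script's internals):

use strict; use warnings;

sub load_dict {
    my ($file, $score) = @_;
    my %dict;
    open(my $fh, '<:encoding(UTF-8)', $file) or die "Cannot open $file: $!";
    while (my $word = <$fh>) {
        chomp $word;
        $dict{lc $word} = $score if length $word;   # one word (or phrase) per line
    }
    close $fh;
    return %dict;
}

# +1 for positive words, -1 for negative words
my %SENT = (load_dict('SENTIMENT-POS.TXT', 1), load_dict('SENTIMENT-NEG.TXT', -1));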

Now run the script each minute:

./demo-docsentiment.pl

It will automatically download the latest Web NGrams 3.0 and GDELT Article List files, compute document-level sentiment scores and write them to the "./RESULTS/" subdirectory. Remember that not all minutes have data and thus you will typically see clusters of output files every 15 minutes, with gaps in between.

You can set this up to run in realtime each minute by following the cron instructions in our tutorial on keyword searching from earlier this week.
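
For reference, a minimal crontab entry to run the script once a minute looks like the following (the installation path is a placeholder; adjust it to wherever you placed the script and dictionaries):

* * * * * cd /path/to/your/install && ./demo-docsentiment.pl >> ./demo-docsentiment.log 2>&1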