Terascale Sentiment Analysis: BigQuery + Tone Coding Books

Kalev Leetaru

10 years ago

As we continue to add new sentiment dictionaries to GDELT on a regular basis, a common request has been the ability to extend the dictionaries backwards over time, especially over historical collections like books. It turns out that Google BigQuery is exceptionally adapted to this kind of massive coding workflow and with just a single SQL query and a table with your tone dictionary, you can sentiment analyze vast quantities of text at incredible speeds.

In fact, using the query below and changing it to run over 1800 to 1922 (WARNING: doing so will consume half of your monthly free BigQuery quota) takes just 196 seconds (three minutes and 16 seconds) to process 785,586 books totaling 416GB and 67 billion words, which works out to tone coding at around 341 million words per second. There is a substantial startup cost to this query, meaning that scaleup is not linear, so coding a small amount of text will not be linearly faster than a larger amount of text. But, at the scales where BigQuery is most useful, such as tens of billions of words, the ability to tone code at 341 million words per second is quite extraordinary.

The query below uses a toy tone dictionary with nine words to calculate the "tone" of each of the 785,586 historical books digitized by the Internet Archive from 1800 to 1922 for which fulltext is available. It counts the density of those nine tone words in each book, reporting the result as a density percentage, replicating a traditional content analysis topic measurement. Each word has a floating point numeric score associated with it scoring how positive or negative it is (or whatever the emotion the dictionary measures) and the average score of all matching words found in the document is also reported, replicating tone coding via a value-based tone dictionary.

Here the tone dictionary is a so-called "toy" dictionary that contains made-up data, but all you have to do is create your own table with the contents of a real tone dictionary to be able to perform sentiment analysis at terascale. Happy sentiment analysis!

select DocumentIdentifier, TotWordCount, TotalMatchingWords, SumToneScore, (TotalMatchingWords/TotWordCount*100) ToneIntensity, (SumToneScore/TotalMatchingWords) ToneScore from (

select DocumentIdentifier, max(TotWordCount) TotWordCount, sum(ThisWordCount) TotalMatchingWords, sum(ThisWordScore) SumToneScore from (

select a.DocumentIdentifier DocumentIdentifier, a.totwordcount TotWordCount, a.word Word, a.count ThisWordCount, b.Score ThisWordScore from (

select DocumentIdentifier, word, count(*) as count, totwordcount FROM (
SELECT DocumentIdentifier, SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText),'[^a-z]', ' '), ' ') as word, count(SPLIT(REGEXP_REPLACE(LOWER(BookMeta_FullText),'[^a-z]', ' '), ' ')) as totwordcount
FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1800" AND "1800"')) 
) group EACH by DocumentIdentifier, word, totwordcount

) a JOIN EACH (
select Word, Score from [gdelt-bq:extra.toytonelookup]
) b on a.word = b.Word

) group EACH by DocumentIdentifier

)