Using Television News Ngrams To Analyze Station-Specific Language

The trio of CNN, MSNBC and Fox News offers a fascinating microcosm of the language differences that define society at large. Is it "us" and "we" or "those" and "them"? Are they "peaceful" "protesters" or "violent" "looters"? Is it "immigration" or "borders"? By looking at how mentions of each word differ across the three stations, we can understand how station-specific each word is. Some words might appear more often on a given station because they form part of the name of an on-air personality or frequent guest. And because a percentage of commercials are closed captioned, a higher prevalence of a term could also reflect different advertising priorities. At the same time, CNN, MSNBC and Fox News represent a triad encapsulating the US political spectrum, from left to middle to right. Thus, after normalizing for the factors above, word frequency skews across the three stations can offer clues to the partisanship of a given term.

With the new Television News Ngram 2.0 Dataset, a single SQL query can count how many times each word appeared on CNN, MSNBC and Fox News.

Here is the query for this year thus far, capturing the Covid-19 pandemic, from January 1, 2020 through yesterday:

-- Sum each word's mentions per station since January 1, 2020.
-- Each subquery counts one station and zero-fills the other two columns.
SELECT NGRAM, SUM(CNN) AS CNN, SUM(MSNBC) AS MSNBC, SUM(FOXNEWS) AS FOXNEWS FROM (
  SELECT NGRAM, SUM(COUNT) AS CNN, 0 AS MSNBC, 0 AS FOXNEWS FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE DATE(TIMESTAMP) >= "2020-01-01" AND STATION='CNN' GROUP BY NGRAM
  UNION ALL
  SELECT NGRAM, 0 AS CNN, SUM(COUNT) AS MSNBC, 0 AS FOXNEWS FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE DATE(TIMESTAMP) >= "2020-01-01" AND STATION='MSNBC' GROUP BY NGRAM
  UNION ALL
  SELECT NGRAM, 0 AS CNN, 0 AS MSNBC, SUM(COUNT) AS FOXNEWS FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE DATE(TIMESTAMP) >= "2020-01-01" AND STATION='FOXNEWS' GROUP BY NGRAM
) GROUP BY NGRAM ORDER BY CNN+MSNBC+FOXNEWS DESC

The results can be downloaded as "1grams-cnnmsbcfox-20090702-20200703.csv", which counts how often each word was used on each of the three stations thus far this year. You can divide each word's count by the total count of all words on that station to yield a density that normalizes for differences such as speaking rate.
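As a rough illustration of that normalization, here is a minimal Python sketch that turns raw per-station counts into densities. It assumes the CSV has a header row with the columns NGRAM, CNN, MSNBC and FOXNEWS (the column aliases used in the query); the sample rows are made up for illustration and are not real results, and the function name is hypothetical.

```python
import csv
import io

STATIONS = ("CNN", "MSNBC", "FOXNEWS")


def station_densities(rows):
    """Map each ngram to {station: count / total count of all words on that station}."""
    rows = list(rows)
    # Total mentions of all words on each station (the denominator).
    totals = {s: sum(int(r[s]) for r in rows) for s in STATIONS}
    densities = {}
    for r in rows:
        densities[r["NGRAM"]] = {
            # Guard against an empty station column to avoid dividing by zero.
            s: (int(r[s]) / totals[s] if totals[s] else 0.0)
            for s in STATIONS
        }
    return densities


# Made-up sample in the assumed column layout; NOT real counts.
SAMPLE_CSV = """NGRAM,CNN,MSNBC,FOXNEWS
the,50,40,60
protesters,30,40,20
looters,20,20,20
"""

densities = station_densities(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for ngram, by_station in densities.items():
    print(ngram, by_station)
```

To run this against the downloaded file instead of the sample, pass `csv.DictReader(open(path, newline=""))` for the local path of the CSV, again assuming the header matches the query's column names. Comparing a word's density across the three stations then gives a skew that does not depend on how many total words each station aired.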

Alternatively, dropping the date restrictions from the query above gives a query over the full archive:

-- Sum each word's mentions per station across the entire archive (no date filter).
-- Each subquery counts one station and zero-fills the other two columns.
SELECT NGRAM, SUM(CNN) AS CNN, SUM(MSNBC) AS MSNBC, SUM(FOXNEWS) AS FOXNEWS FROM (
  SELECT NGRAM, SUM(COUNT) AS CNN, 0 AS MSNBC, 0 AS FOXNEWS FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE STATION='CNN' GROUP BY NGRAM
  UNION ALL
  SELECT NGRAM, 0 AS CNN, SUM(COUNT) AS MSNBC, 0 AS FOXNEWS FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE STATION='MSNBC' GROUP BY NGRAM
  UNION ALL
  SELECT NGRAM, 0 AS CNN, 0 AS MSNBC, SUM(COUNT) AS FOXNEWS FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE STATION='FOXNEWS' GROUP BY NGRAM
) GROUP BY NGRAM ORDER BY CNN+MSNBC+FOXNEWS DESC

This yields the results "1grams-cnnmsbcfox-20090702-20200703.csv.gz" (100MB uncompressed) covering the entire July 2, 2009 to present period of the Television News Archive!