Using Television News Ngrams To Expand Query Search Terms & Catch Misspellings

This past April, Tara Reade's name surged into the media as she accused presidential candidate Joe Biden of sexual assault. Spot-checking results and manually testing various misspellings suggested that "Tara Reid" was a common misspelling of her name, leading to a final query of "Tara Reade OR Tara Reid". How might the new Television News Ngram 2.0 Dataset help identify misspellings systematically and expand query terms in general?

Using the bigram version of the dataset in BigQuery, a simple SQL query is all that's required to compile a list of name variants:

SELECT NGRAM, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.iatv_2gramsv2`
WHERE DATE(TIMESTAMP) >= "2020-04-24"
  AND DATE(TIMESTAMP) <= "2020-05-25"
  AND NGRAM LIKE 'tara re%'
GROUP BY NGRAM
ORDER BY cnt DESC

This yields the following list:

Ngram	Count
tara reade	492
tara reid	140
tara reade's	130
tara reid's	16
tara reed	15
tara read	6
tara reader	4
tara reed's	3
tara reads	2
tara rea	1
tara rereade's	1
tara re	1
tara readest	1
tara reade.ourse	1
tara rei	1
tara readeys	1
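
To show how a list like this might be turned into an expanded search query, here is a minimal Python sketch. It is an illustration rather than part of the original workflow: the function name `build_or_query`, the possessive-merging step, and the `min_count` cutoff of 100 are all assumptions chosen for this example. It folds possessive forms ("tara reade's") into their base spelling, drops variants that appear too rarely to be anything but transcription noise, and joins the survivors with OR.

```python
from collections import Counter

# Bigram counts copied from the results table above.
NGRAM_COUNTS = {
    "tara reade": 492,
    "tara reid": 140,
    "tara reade's": 130,
    "tara reid's": 16,
    "tara reed": 15,
    "tara read": 6,
    "tara reader": 4,
    "tara reed's": 3,
    "tara reads": 2,
    "tara rea": 1,
    "tara rereade's": 1,
    "tara re": 1,
    "tara readest": 1,
    "tara reade.ourse": 1,
    "tara rei": 1,
    "tara readeys": 1,
}


def build_or_query(counts, min_count=100):
    """Build an OR query from ngram counts (hypothetical helper).

    min_count is an assumed cutoff, not a value from the original post.
    """
    merged = Counter()
    for ngram, count in counts.items():
        # Fold possessives into the base spelling: "tara reade's" -> "tara reade".
        base = ngram[:-2] if ngram.endswith("'s") else ngram
        merged[base] += count

    # Keep only variants at or above the cutoff, most frequent first.
    kept = [ngram for ngram, total in merged.most_common() if total >= min_count]

    # Capitalize each word and join the variants with OR.
    return " OR ".join(ngram.title() for ngram in kept)


if __name__ == "__main__":
    print(build_or_query(NGRAM_COUNTS))
    print(build_or_query(NGRAM_COUNTS, min_count=10))
```

With the cutoff at 100, the sketch reproduces the "Tara Reade OR Tara Reid" query arrived at by hand. Lowering it to 10 also admits "Tara Reed", so the right threshold depends on how much noise a given search can tolerate.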

The GDELT Project
