This past April, Tara Reade's name surged into the news as she accused presidential candidate Joe Biden of sexual assault. Spot-checking results and testing various misspellings suggested that "Tara Reid" was a common misspelling of her name, leading to a final query of "Tara Reade OR Tara Reid". How might the new Television News Ngram 2.0 Dataset help with systematically identifying misspellings and, more generally, with expanding query terms?
Using the bigram version of the dataset in BigQuery, a simple SQL query is all that's required to compile a list of name variants:
SELECT NGRAM, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.iatv_2gramsv2`
WHERE DATE(TIMESTAMP) >= "2020-04-24"
  AND DATE(TIMESTAMP) <= "2020-05-25"
  AND NGRAM LIKE 'tara re%'
GROUP BY NGRAM
ORDER BY cnt DESC
Yielding the following list:
| Ngram | Count |
| --- | --- |
| tara reade | 492 |
| tara reid | 140 |
| tara reade's | 130 |
| tara reid's | 16 |
| tara reed | 15 |
| tara read | 6 |
| tara reader | 4 |
| tara reed's | 3 |
| tara reads | 2 |
| tara rea | 1 |
| tara rereade's | 1 |
| tara re | 1 |
| tara readest | 1 |
| tara reade.ourse | 1 |
| tara rei | 1 |
| tara readeys | 1 |
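One way a list like the one above could be turned into an expanded query is to fold possessive forms into their base spelling, drop rare variants, and join what remains with OR. The following is only a minimal sketch under stated assumptions: the counts are copied by hand from the list above, the helper names `fold_possessives` and `build_or_query` are hypothetical, and the cutoff of 10 mentions is an arbitrary illustrative threshold rather than anything the dataset prescribes.

```python
from collections import Counter

# Counts copied by hand from the result list above.
ngram_counts = {
    "tara reade": 492,
    "tara reid": 140,
    "tara reade's": 130,
    "tara reid's": 16,
    "tara reed": 15,
    "tara read": 6,
    "tara reader": 4,
    "tara reed's": 3,
    "tara reads": 2,
    "tara rea": 1,
    "tara rereade's": 1,
    "tara re": 1,
    "tara readest": 1,
    "tara reade.ourse": 1,
    "tara rei": 1,
    "tara readeys": 1,
}


def fold_possessives(counts):
    """Merge each "name's" entry into "name" and sum their counts."""
    folded = Counter()
    for ngram, count in counts.items():
        base = ngram[:-2] if ngram.endswith("'s") else ngram
        folded[base] += count
    return folded


def build_or_query(counts, min_count=10):
    """Join variants with at least min_count mentions (assumed cutoff) using OR."""
    folded = fold_possessives(counts)
    # most_common() orders the variants from most to least frequent.
    kept = [ngram for ngram, count in folded.most_common() if count >= min_count]
    return " OR ".join(kept)


expanded_query = build_or_query(ngram_counts)
print(expanded_query)  # tara reade OR tara reid OR tara reed
```

With this threshold, the sketch keeps "tara reed" alongside the two spellings in the original query. The low-count entries, which include truncated fragments such as "tara re", fall below the cutoff. A human review of the surviving variants is still advisable, since a frequent variant could also be a different person's name.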