What are the most common social media handles that have appeared on television news this year? Using a simple regular expression in BigQuery, we can compile a ranked list, by station, of every string of letters, numbers and underscores that follows an "@" symbol (the character rules for a Twitter handle) for the year to date:
SELECT station, REGEXP_EXTRACT(OCRText, r'(@[a-zA-Z0-9_]+)') AS handle, COUNT(1) AS count
FROM `gdelt-bq.gdeltv2.vgegv2_iatv`
WHERE DATE(date) >= "2020-01-01" AND OCRText LIKE '%@%'
GROUP BY station, handle
HAVING handle IS NOT NULL
ORDER BY count DESC
The query yields a final master spreadsheet of handles by station. Some of the top entries are consistent OCR errors rather than distinct handles, such as "@realDonald" in place of "@realDonaldTrump" and "@CN" in place of "@CNN". Because these errors recur, the sheet can also help identify common OCR mistakes.
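One way to surface truncations like these is to look for handles that are strict prefixes of other handles. The Python sketch below is only an illustration and not part of the original workflow: it applies the same pattern as the query to a few invented OCR strings, counts the first handle in each string (mirroring REGEXP_EXTRACT, which returns only the first match), and lists prefix pairs as candidates for review. The function names and the sample data are hypothetical.

```python
import re
from collections import Counter

# Same pattern as the BigQuery query: "@" followed by letters, digits or underscores.
HANDLE_RE = re.compile(r"@[a-zA-Z0-9_]+")


def count_handles(ocr_texts):
    """Count the first handle found in each OCR string (like REGEXP_EXTRACT)."""
    counts = Counter()
    for text in ocr_texts:
        match = HANDLE_RE.search(text)
        if match:
            counts[match.group(0)] += 1
    return counts


def truncation_candidates(counts):
    """Return (short, full) pairs where `short` is a strict prefix of `full`."""
    handles = sorted(counts)
    pairs = []
    for short in handles:
        for full in handles:
            if full != short and full.startswith(short):
                pairs.append((short, full))
    return pairs


# Hypothetical OCR snippets, invented for illustration only.
sample = [
    "BREAKING @realDonaldTrump tweets",
    "@realDonald ump: statement",
    "Follow us @CNN",
    "Follow us @CN",
    "Contact @CNN for more",
]

counts = count_handles(sample)
for short, full in truncation_candidates(counts):
    print(f"{short} ({counts[short]}) may be a truncation of {full} ({counts[full]})")
```

A prefix match is only a hint: a legitimate handle can be a prefix of another legitimate handle, so each pair still needs manual review against the spreadsheet.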