The GDELT Project

Keyword Searching Television News Trump Tweet Appearances

Last week we released an aligned dataset of Trump tweets appearing on television news using the Visual Global Entity Graph 2.0 (VGEG). But what if you want to do more complex filtering? below is a simple PERL script that can be used to quickly generate SQL queries for BigQuery. Save each keyword you wish to search for to a UTF8 ASCII text file, one word per line. Make sure you have all conjugations and forms of the word of interest, one per row. Then run this script on that file and it will generate the proper SQL. You can uncomment the first SQL line and comment out the second if you want LIKE syntax that allows matches of the keywords anywhere in the OCR text, including inside of other words, whereas as-is it will search each keyword as a regular expression requiring a word boundary on either side of each keyword.

#!/usr/bin/perl

if ($ARGV[0] eq '' || (!-e $ARGV[0])) { print "FATAL: No input file!\n"; }

open(FILE, $ARGV[0]); binmode(FILE, ":utf8");
while(<FILE>) {
$_=~s/\s+$//;
#$SQL .= "LOWER(OCRText) like '%" . lc($_) . "%' OR "; #use LIKE syntax...
$_=~s/'/\\'/g; $SQL .= "REGEXP_CONTAINS(LOWER(OCRText), r'\\b" . lc($_) . "\\b') OR "; #use REGEXP syntax...
}
close(FILE);

$SQL=~s/ OR $//;
$SQL = "SELECT date, station, showName, iaShowId, iaClipUrl, OCRText FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) >= \"2020-01-01\" and (LOWER(OCRText) like '%realdonald%' OR LOWER(OCRText) like '%donald j. trump retweeted%') and ($SQL) order by date asc";

open(OUT, ">$ARGV[0].sql"); binmode(OUT, ":utf8");
print OUT $SQL;
close(OUT);

Save the script above into a file called "makebqterms.pl" Then save the following keyword list to a file called "keywords.txt":

 

iran
iranian
iranians
democrats
democrat
covid
coronavirus
pandemic
wuhan

Then run the script over this keyword list:

./makebqterms.pl ./keywords.txt

This will output the necessary SQL query to use with BigQuery:

SELECT date, station, showName, iaShowId, iaClipUrl, OCRText FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) >= "2020-01-01" and (LOWER(OCRText) like '%realdonald%' OR LOWER(OCRText) like '%donald j. trump retweeted%') and (REGEXP_CONTAINS(LOWER(OCRText), r'\biran\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\biranian\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\biranians\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bdemocrats\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bdemocrat\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bcovid\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bcoronavirus\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bpandemic\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bwuhan\b')) order by date asc

We hope this is useful to you in making it easier to search the VGEG 2.0 for tweets!