Last week we released an aligned dataset of Trump tweets appearing on television news using the Visual Global Entity Graph 2.0 (VGEG). But what if you want to do more complex filtering? below is a simple PERL script that can be used to quickly generate SQL queries for BigQuery. Save each keyword you wish to search for to a UTF8 ASCII text file, one word per line. Make sure you have all conjugations and forms of the word of interest, one per row. Then run this script on that file and it will generate the proper SQL. You can uncomment the first SQL line and comment out the second if you want LIKE syntax that allows matches of the keywords anywhere in the OCR text, including inside of other words, whereas as-is it will search each keyword as a regular expression requiring a word boundary on either side of each keyword.
#!/usr/bin/perl if ($ARGV[0] eq '' || (!-e $ARGV[0])) { print "FATAL: No input file!\n"; } open(FILE, $ARGV[0]); binmode(FILE, ":utf8"); while(<FILE>) { $_=~s/\s+$//; #$SQL .= "LOWER(OCRText) like '%" . lc($_) . "%' OR "; #use LIKE syntax... $_=~s/'/\\'/g; $SQL .= "REGEXP_CONTAINS(LOWER(OCRText), r'\\b" . lc($_) . "\\b') OR "; #use REGEXP syntax... } close(FILE); $SQL=~s/ OR $//; $SQL = "SELECT date, station, showName, iaShowId, iaClipUrl, OCRText FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) >= \"2020-01-01\" and (LOWER(OCRText) like '%realdonald%' OR LOWER(OCRText) like '%donald j. trump retweeted%') and ($SQL) order by date asc"; open(OUT, ">$ARGV[0].sql"); binmode(OUT, ":utf8"); print OUT $SQL; close(OUT);
Save the script above into a file called "makebqterms.pl" Then save the following keyword list to a file called "keywords.txt":
iran iranian iranians democrats democrat covid coronavirus pandemic wuhan
Then run the script over this keyword list:
./makebqterms.pl ./keywords.txt
This will output the necessary SQL query to use with BigQuery:
SELECT date, station, showName, iaShowId, iaClipUrl, OCRText FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) >= "2020-01-01" and (LOWER(OCRText) like '%realdonald%' OR LOWER(OCRText) like '%donald j. trump retweeted%') and (REGEXP_CONTAINS(LOWER(OCRText), r'\biran\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\biranian\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\biranians\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bdemocrats\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bdemocrat\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bcovid\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bcoronavirus\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bpandemic\b') OR REGEXP_CONTAINS(LOWER(OCRText), r'\bwuhan\b')) order by date asc
We hope this is useful to you in making it easier to search the VGEG 2.0 for tweets!