Announcing The 2018-2019 Global Frontpage Graph (GFG) Outlet Ngram Dataset

Following in the footsteps of the 2018-2019 Global Frontpage Graph (GFG) Linguistic Ngram Dataset released yesterday, the 2018-2019 Global Frontpage Graph (GFG) Outlet Ngram Dataset offers the same 1-gram/unigram ngram dataset organized by outlet rather than language. This dataset covers all 50,000 outlets, not just those for which CLD2 could identify a primary language, and includes outlets that have changed their URL over time, yielding a total of 1,094,244 distinct homepage URLs. Note that many outlets have appended URL parameters or redirects to their homepages since the GFG launched in March 2018, so you may wish to scan the FromFrontPageURL field to identify all variants of a homepage URL of interest and group them together for comprehensive analyses, as in the sketch below.
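
For instance, here is a minimal sketch of one way to surface those variants: group homepage URLs by hostname using BigQuery's NET.HOST() function (the one-day _PARTITIONTIME window here is only to keep the scan small):

SELECT NET.HOST(FromFrontPageURL) host, FromFrontPageURL, COUNT(1) links
FROM `gdelt-bq.gdeltv2.gfg_partitioned`
WHERE _PARTITIONTIME >= "2019-08-01 00:00:00" AND _PARTITIONTIME < "2019-08-02 00:00:00"
GROUP BY host, FromFrontPageURL
ORDER BY host, links DESC

Grouping by hostname catches parameter and redirect variants of the same domain; an outlet that moved to an entirely new domain would still need to be grouped by hand.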

The final unigram dataset processed all 6.6TB of text in 1,694 seconds (28 minutes), yielding 1,133,186,994 distinct site-unigram records that together record the appearances of 506,275,817,920 total words, of which 111,704,066 were unique (counting a word that appeared on multiple sites only once).

The final bigram dataset processed 8.57TB of text in 8,011 seconds (2.2 hours), yielding 3,545,126,372 distinct site-bigram records that together record the appearances of 398,895,220,934 total bigrams, of which 640,763,846 were unique (counting a bigram that appeared on multiple sites only once).

The final ngram files feature one unigram/bigram per row, with three columns, and are in UTF-8 format. The first column is the URL of the outlet's homepage (the FromFrontPageURL field), the second is the unigram/bigram itself and the third is the number of times it appeared in the dataset.
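
For example, a single row of the unigram files might look like the following (a purely hypothetical row for illustration, shown tab-delimited; the URL, word and count are not actual dataset values):

https://www.example.com/	election	4721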

The datasets are also available as BigQuery tables.

We're tremendously excited to see what you're able to do with these two new datasets!


TECHNICAL DETAILS

For those interested in how to create ngrams at BigQuery scale, here is the query used for the unigram dataset (remove the _PARTITIONTIME clauses to run over the entire dataset).

SELECT FromFrontPageURL, word, COUNT(1) count FROM (
  WITH nested AS (
    SELECT
      FromFrontPageURL,
      SPLIT(
        REPLACE(
          TRIM(REGEXP_REPLACE(REGEXP_REPLACE(
            REGEXP_REPLACE(LOWER(LinkText), r'([^\s])\'([^\s])', r'\1APOS\2')  -- shield intra-word apostrophes from the punctuation pass
            , r'(\pP)', r' \1 ')   -- pad each punctuation mark with spaces so it becomes its own token
            , r'\s+', r' '))       -- collapse runs of whitespace into a single space
          , 'APOS', '\'')          -- restore the shielded apostrophes
        , ' ') words               -- split the normalized text into an array of words
    FROM `gdelt-bq.gdeltv2.gfg_partitioned`
    WHERE _PARTITIONTIME >= "2019-08-01 00:00:00" AND _PARTITIONTIME < "2019-08-02 00:00:00"
  )
  SELECT * FROM nested, UNNEST(words) AS word
) GROUP BY FromFrontPageURL, word
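
Once the output above is materialized to a table, pulling the most common words for a single outlet becomes a simple lookup. A minimal usage sketch, assuming the results were saved to a hypothetical table named `mydataset.gfg_outlet_unigrams` (the outlet URL is likewise a placeholder):

SELECT word, count
FROM `mydataset.gfg_outlet_unigrams`               -- hypothetical destination table
WHERE FromFrontPageURL = 'https://www.example.com/' -- placeholder outlet homepage
ORDER BY count DESC
LIMIT 25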

And here is the query for the bigram dataset (again, remove the _PARTITIONTIME clauses to run over the entire dataset).

SELECT FromFrontPageURL, CONCAT(word, ' ', nextword) ngram, COUNT(1) count FROM (
  SELECT
    FromFrontPageURL, word,
    LEAD(word) OVER (PARTITION BY id ORDER BY pos) nextword  -- the word immediately following, within the same link
  FROM (
    WITH nested AS (
      SELECT
        FromFrontPageURL,
        -- synthesize a per-link id so LEAD() never pairs words across two different links
        FARM_FINGERPRINT(CONCAT(CAST(DATE AS STRING), FromFrontPageURL, CAST(LinkID AS STRING))) id,
        SPLIT(
          REPLACE(
            TRIM(REGEXP_REPLACE(REGEXP_REPLACE(
              REGEXP_REPLACE(LOWER(LinkText), r'([^\s])\'([^\s])', r'\1APOS\2')  -- shield intra-word apostrophes
              , r'(\pP)', r' \1 ')   -- pad punctuation so it tokenizes separately
              , r'\s+', r' '))       -- collapse runs of whitespace
            , 'APOS', '\'')          -- restore the shielded apostrophes
          , ' ') words
      FROM `gdelt-bq.gdeltv2.gfg_partitioned`
      WHERE _PARTITIONTIME >= "2019-08-01 00:00:00" AND _PARTITIONTIME < "2019-08-02 00:00:00"
    )
    -- WITH OFFSET records each word's position in its link, giving LEAD() a deterministic ordering
    SELECT FromFrontPageURL, id, word, pos
    FROM nested, UNNEST(words) AS word WITH OFFSET pos
  )
) WHERE nextword IS NOT NULL GROUP BY FromFrontPageURL, ngram
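
The FARM_FINGERPRINT() id exists purely so that LEAD() pairs words only within a single link and never across the boundary between two different links. As a final usage sketch, assuming the bigram results were saved to a hypothetical table named `mydataset.gfg_outlet_bigrams`, ranking outlets by the breadth of their bigram vocabulary takes one query (each output row above is already a distinct site-bigram pair, so a simple COUNT suffices):

SELECT FromFrontPageURL, COUNT(1) distinct_bigrams  -- rows are already distinct (URL, bigram) pairs
FROM `mydataset.gfg_outlet_bigrams`                 -- hypothetical destination table
GROUP BY FromFrontPageURL
ORDER BY distinct_bigrams DESC
LIMIT 25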