Announcing The 2018-2019 Global Frontpage Graph (GFG) Linguistic Ngram Dataset

On March 2, 2018 we unveiled the GDELT Global Frontpage Graph (GFG), which scans 50,000 news homepages around the world every hour on the hour to catalog their links, offering a unique and powerful glimpse into the stories and narratives news outlets around the world are prioritizing each day. While news outlets can publish an effectively unlimited volume of news each day, their homepages represent limited and precious real estate, resembling the fixed air time of broadcast media, curated by editors who must decide which stories represent the most important events of the moment.

Today this dataset totals more than 134 billion records covering a year and a half of global journalistic priorities. What insights into the structure of language can be gleaned from such a massive dataset?

Given that the GFG records the text of each frontpage link, what would it look like to split each link's text into words and two-word phrases and convert these 134 billion links into a set of language-specific word frequency histograms, offering a glance at the word choices news editors have made on their homepages over the past year and a half?

Using the power of BigQuery, it took just two SQL queries to create these two massive new multilingual ngram datasets, one for 1-grams/unigrams and one for 2-grams/bigrams.

The unigram query processed 6.6TB of text over the 134 billion links in 486 seconds (8.1 minutes), yielding 124,514,722 distinct unigrams codifying 506,552,254,362 total words, of which 111,791,702 were unique (counting words that appear in multiple languages only once).

The bigram query processed 8.58TB of text over the 134 billion links in 3,788 seconds (63 minutes), yielding 687,413,157 distinct bigrams codifying 399,129,977,794 total two-word spans, of which 640,574,861 were unique (counting bigrams that appear in multiple languages only once).

Each link's text was processed to place spaces around all characters found in the Unicode "Punctuation" class, with the sole exception of apostrophes found inside a word. This special handling ensures that languages that rely upon intra-word apostrophes, like French, are handled correctly, along with transliterated names from languages like Arabic and English possessives (apostrophes at the start or end of a word are split off as separate punctuation). The remaining text is tokenized by space into "words," with each word being either a sequence of characters or a single punctuation symbol. Thus, the link text "Wall Street (Today)" would be tokenized into the words "Wall", "Street", "(", "Today" and ")".
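To make this concrete, here is a minimal sketch that applies the same REGEXP_REPLACE/SPLIT chain used in the queries at the end of this post to a sample string (the sample text is purely illustrative); note that the queries also lowercase the text before tokenizing:

SELECT
  SPLIT(
    REPLACE(
      TRIM(REGEXP_REPLACE(REGEXP_REPLACE(
        REGEXP_REPLACE( LOWER("Wall Street (Today): l'économie"), r'([^\s])\'([^\s])', r'\1APOS\2')
      , r'(\pP)', r' \1 '), r'\s+', r' '))
    , 'APOS', '\'')
  , ' ') AS words
-- returns ["wall", "street", "(", "today", ")", ":", "l'économie"]

The intra-word apostrophe in "l'économie" survives as part of its word, while the parentheses and colon become standalone tokens.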

The use of space tokenization obviously yields incorrect results for non-space-delimited languages like Burmese, Chinese, Dzongkha, Japanese, Khmer, Laothian, Thai, Tibetan and Vietnamese, all of which are monitored by GDELT. Since word tokenization of these languages requires specialized language models, proper tokenization was outside the scope of this analysis, and the results for those languages should be understood to represent an incorrect space-delimited tokenization, though they still shed interesting light on the space-delimited blocks of text those languages do contain.

Words were then aggregated by source language by merging against the source language inventory file compiled today, which uses Google's Compact Language Detector 2 (CLD2) to estimate the primary language of each homepage's links. Note that many homepages contain links in other languages, especially in countries where multiple languages are spoken or where an outlet notifies readers that it is also available in other languages; each homepage is cataloged according to the dominant language that encompasses the most text on the homepage over time. Language detection of such sparse and highly disparate text is imperfect at best, so there is a degree of error in these estimates, especially for outlets with very low volumes of text and for low-resource languages.
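For reference, the per-homepage language assignments can be inspected directly in the inventory table's BigQuery copy; the sketch below (with a purely hypothetical URL) looks up the language assigned to a single homepage using the same SiteURL and HumanLang fields the queries below join on:

SELECT SiteURL, HumanLang
FROM `gdelt-bq.gdeltv2.gfg_linguisticinventory_20190901`
WHERE SiteURL = 'https://www.example-news.com/'  -- hypothetical homepage URL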

The final ngram files contain one unigram/bigram per row with three columns and are in UTF-8 format. The first column is the human-readable language name in CLD2 format, the second is the unigram/bigram itself and the third is the number of times it appeared in the dataset.
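To illustrate the schema (with a purely hypothetical table name standing in for wherever you load or materialize the ngrams), a query like the following would pull the most frequent entries for a single language:

SELECT ngram, count
FROM `your-project.your_dataset.gfg_unigrams`  -- hypothetical table holding the HumanLang/ngram/count columns described above
WHERE HumanLang = 'FRENCH'  -- CLD2 human-readable language name
ORDER BY count DESC
LIMIT 10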

The datasets are also available as BigQuery tables:

We're tremendously excited to see what you're able to do with these two new linguistic datasets!

 

TECHNICAL DETAILS

For those interested in how to create ngrams at BigQuery scale, here is the query used for the unigram dataset (remove the _PARTITIONTIME restriction to run it over the entire dataset).

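-- Unigram query: lowercase each link's text, shield intra-word apostrophes with a temporary APOS
-- marker, surround every other punctuation character with spaces, collapse runs of whitespace,
-- split on spaces, then UNNEST the resulting word array and count occurrences per language.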
SELECT HumanLang, ngram, COUNT(1) count FROM (
  WITH nested AS (
    SELECT HumanLang,
      SPLIT(
        REPLACE(
          TRIM(REGEXP_REPLACE(REGEXP_REPLACE(
            REGEXP_REPLACE( LOWER(LinkText), r'([^\s])\'([^\s])', r'\1APOS\2')
          , r'(\pP)', r' \1 '), r'\s+', r' '))
        , 'APOS', '\'')
      , ' ') words
    FROM `gdelt-bq.gdeltv2.gfg_partitioned` AS data
    JOIN `gdelt-bq.gdeltv2.gfg_linguisticinventory_20190901` AS inventory
      ON data.FromFrontPageURL = inventory.SiteURL
    WHERE _PARTITIONTIME >= "2019-08-01 00:00:00" AND _PARTITIONTIME < "2019-08-02 00:00:00"
  )
  SELECT * FROM nested, UNNEST(words) AS ngram
) GROUP BY HumanLang, ngram

And here is the query for the bigram dataset (remove the _PARTITIONTIME restriction to run it over the entire dataset).

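-- Bigram query: tokenize exactly as above, assign each link a unique id via FARM_FINGERPRINT,
-- number its words with ROW_NUMBER, pair each word with the following word via LEAD, and count
-- each resulting (word, nextword) pair per language.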
SELECT HumanLang, CONCAT(word, ' ', nextword) ngram, COUNT(1) count FROM (
  SELECT HumanLang, word, LEAD(word) OVER (PARTITION BY id ORDER BY pos) nextword FROM (
    WITH nested AS (
      SELECT HumanLang, FARM_FINGERPRINT(CONCAT(CAST(DATE AS STRING), FromFrontPageURL, CAST(LinkID AS STRING))) id,
        SPLIT(
          REPLACE(
            TRIM(REGEXP_REPLACE(REGEXP_REPLACE(
              REGEXP_REPLACE( LOWER(LinkText), r'([^\s])\'([^\s])', r'\1APOS\2')
            , r'(\pP)', r' \1 '), r'\s+', r' '))
          , 'APOS', '\'')
        , ' ') words
      FROM `gdelt-bq.gdeltv2.gfg_partitioned` AS data
      JOIN `gdelt-bq.gdeltv2.gfg_linguisticinventory_20190901` AS inventory
        ON data.FromFrontPageURL = inventory.SiteURL
      WHERE _PARTITIONTIME >= "2019-08-01 00:00:00" AND _PARTITIONTIME < "2019-08-02 00:00:00"
    )
    SELECT HumanLang, id, word, ROW_NUMBER() OVER (PARTITION BY id) pos FROM nested, UNNEST(words) AS word
  )
) WHERE nextword IS NOT NULL GROUP BY HumanLang, ngram