The original Global Frontpage Graph (GFG) inventory file was created with just two and a half weeks' worth of data, which made robust language detection difficult for many sites. In addition, the GFG records each snapshot under the final URL it was redirected to. Because some homepages have changed their landing URLs over time and others append various tokens to their URLs, the universe of 50,000 unique monitored sites has ballooned to more than 1 million unique source URLs across the GFG's 130 billion records.
To update the GFG inventory for language analysis of frontpage links, we've created a new Linguistic Inventory. We took all link text in the GFG from its inception on March 2, 2018 through today, September 1, 2019, grouped it by unique source URL, and constructed a sample of up to 100,000 characters for each site. Each sample was then processed using Google's Compact Language Detector 2 (CLD2) to determine the primary language of the site. Sites that mix multiple languages are assigned only their single primary language.
This new inventory sheet makes it possible to perform maximally comprehensive linguistic analysis of the GFG. It also demonstrates how BigQuery can be used for massive linguistic analysis: a single line of SQL that ran in 597 seconds collapsed 6.63TB of link text across 130 billion rows into a 100K sample per site. The results were written into a temporary BigQuery table and exported to a 32-core Compute Engine instance that ran them through CLD2, with the final results compiled and written into GCS.
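For readers who want to reproduce the language-labeling step on their own export, a minimal Python sketch follows. It is not the code used to build the inventory: the function name `label_samples`, the in-memory dictionary of samples and the pluggable `detect` callable are all assumptions made for illustration. The detector is passed in as a parameter so the sketch runs without CLD2 installed; the commented-out wrapper shows how the `pycld2` binding could plausibly be plugged in.

```python
def label_samples(samples, detect):
    """Assign one primary language to each source URL.

    samples: dict mapping source URL -> text sample (hypothetical layout).
    detect:  callable taking text and returning (is_reliable, language_code).
    Returns a dict mapping source URL -> (language_code, is_reliable).
    """
    inventory = {}
    for url, text in samples.items():
        if not text:
            # Nothing to classify; record the gap instead of guessing.
            inventory[url] = (None, False)
            continue
        is_reliable, code = detect(text)
        inventory[url] = (code, is_reliable)
    return inventory


# Assumed wrapper around the pycld2 binding (requires `pip install pycld2`):
#
# import pycld2
#
# def cld2_detect(text):
#     is_reliable, _bytes_found, details = pycld2.detect(text)
#     # details is ordered by share of text; take the top language's code.
#     return is_reliable, details[0][1]
```

Keeping the reliability flag alongside the language code leaves the decision about what to do with low-confidence sites to whoever consumes the inventory.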
- Download The 2018-2019 GFG Linguistic Inventory Sheet. (Updated September 1, 2019).
For those interested in the BigQuery command used to create the per-site textual samples for CLD2 analysis, the final query is below.
SELECT FromFrontPageURL, SUBSTR(ARRAY_TO_STRING(ARRAY_AGG(DISTINCT LinkText IGNORE NULLS), ' '), 0, 100000) FROM `gdelt-bq.gdeltv2.gfg_partitioned` GROUP BY FromFrontPageURL
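To make the logic of the query easier to follow, here is a small Python sketch of the same aggregation run over in-memory rows. It is only an illustration: the function name `build_samples` and the `(url, link_text)` row layout are assumptions, and it keeps distinct link text in first-seen order, whereas BigQuery does not guarantee any particular order for `ARRAY_AGG(DISTINCT ...)`.

```python
def build_samples(rows, max_chars=100_000):
    """Collapse (url, link_text) rows into one text sample per URL.

    Mirrors the query's steps: group by URL, keep distinct non-null
    link text, join with single spaces, truncate to max_chars characters.
    """
    grouped = {}  # url -> dict used as an insertion-ordered set
    for url, link_text in rows:
        bucket = grouped.setdefault(url, {})
        if link_text is not None:      # IGNORE NULLS
            bucket[link_text] = None   # DISTINCT
    # Note: a URL whose link text is all NULL yields "" here,
    # whereas the SQL expression would yield NULL.
    return {url: " ".join(texts)[:max_chars] for url, texts in grouped.items()}


if __name__ == "__main__":
    demo = [
        ("https://example.com/", "Home"),
        ("https://example.com/", "World News"),
        ("https://example.com/", "Home"),   # duplicate, dropped
        ("https://example.com/", None),     # null, ignored
    ]
    print(build_samples(demo))
```

Deduplicating before truncating matters here: navigation links repeat in every snapshot of a homepage, so without `DISTINCT` the 100,000-character budget would largely be spent on the same few menu labels.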