The GDELT Project

Global Frontpage Graph (GFG) Inventory Sheet With Charset And Language Breakdown

NOTE (9/1/2019): A new version of this linguistic inventory sheet is available with better accuracy and coverage for all site changes March 2018 – September 2019. See New Version.

 

The GDELT Global Frontpage Graph (GFG) has generated incredible interest in just its first three weeks of existence. In addition to publishing a first glimpse at some of the statistics you can extract from it, we've compiled an inventory sheet derived from the March 21, 2018 noon UTC snapshot. It lists every unique frontpage we successfully scanned in that snapshot, the native character set it was encoded in (we transcode everything to UTF-8), and the primary language of the page as estimated by Google's Compact Language Detector 2 (CLD2). The inventory can be used to filter the hourly snapshots to specific languages of interest. We are also working on a country lookup that records the country of origin for each outlet to further assist analyses.

We hope these statistics and this new inventory file make it easier for you to navigate the new Global Frontpage Graph!
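
As a rough illustration of the filtering use case described above, the sketch below keeps only the snapshot rows whose frontpage appears in the inventory under a chosen language. The file layouts are assumptions made for the example, not the documented formats: the inventory is taken to be tab-delimited with the frontpage URL, charset, and language in that order, and each snapshot row is taken to be tab-delimited with the frontpage URL in a known column. The function names and URLs are placeholders.

```python
import csv
import io

def load_language_map(inventory_file):
    """Map frontpage URL -> language label.

    Assumes tab-delimited rows of (frontpage URL, charset, language);
    this column order is an assumption, so adjust it to the real file.
    """
    reader = csv.reader(inventory_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[2] for row in reader if len(row) >= 3}

def filter_snapshot(snapshot_file, language_map, wanted, url_column=0):
    """Yield snapshot rows whose frontpage is labeled with a wanted language.

    url_column is a hypothetical position of the frontpage URL in a row.
    Matching is case-insensitive because the labels mix cases
    (for example ENGLISH and Chinese).
    """
    wanted = {name.upper() for name in wanted}
    reader = csv.reader(snapshot_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= url_column:
            continue
        language = language_map.get(row[url_column], "")
        if language.upper() in wanted:
            yield row

# Placeholder data standing in for the real inventory and snapshot files.
inventory = io.StringIO(
    "https://news.example.com/\tUTF-8\tENGLISH\n"
    "https://novosti.example.org/\twindows-1251\tRUSSIAN\n"
    "https://zeitung.example.net/\tiso-8859-1\tGERMAN\n"
)
snapshot = io.StringIO(
    "https://news.example.com/\thttps://news.example.com/story-1\n"
    "https://novosti.example.org/\thttps://novosti.example.org/story-2\n"
    "https://zeitung.example.net/\thttps://zeitung.example.net/story-3\n"
)

language_map = load_language_map(inventory)
russian_rows = list(filter_snapshot(snapshot, language_map, {"Russian"}))
for row in russian_rows:
    print("\t".join(row))
```

Loading the inventory into a dictionary once keeps each per-row lookup cheap, which matters if the same filter is applied to many hourly snapshots.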

 

CHARSETS

Unsurprisingly, UTF-8 is the dominant character set encoding in use on news websites today, accounting for 93.07% of the homepages in the GFG. However, a number of other encodings are in use as well, as seen in the table below (some of the labels listed may not be valid charset names).

Charset Count %
UTF-8 44605 93.07
iso-8859-1 1374 2.87
windows-1251 604 1.26
euc-kr 354 0.74
gb2312 292 0.61
windows-1252 168 0.35
iso-8859-2 166 0.35
windows-1250 91 0.19
windows-1256 45 0.09
gbk 37 0.08
iso-8859-15 32 0.07
iso-8859-9 30 0.06
windows-1254 25 0.05
windows-1255 20 0.04
big5 11 0.02
shift_jis 11 0.02
gb18030 10 0.02
cp1256 9 0.02
iso-8859-7 6 0.01
windows-1253 6 0.01
us-ascii 5 0.01
euc-jp 4 0.01
koi8-r 4 0.01
latin1 3 0.01
windows-1257 3 0.01
cp1251 1 0.00
iso-8850-1 1 0.00
iso-8859 1 0.00
iso8859-2 1 0.00
iso-8859-5 1 0.00
iso-8859-6 1 0.00
iso-8859-8 1 0.00
koi8-u 1 0.00
logical 1 0.00
UTF-16 1 0.00
UTF-ISO-8859-1 1 0.00
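
Several rows in the table are different spellings of the same encoding (for example cp1251 and windows-1251), and a few are not recognizable charset names at all. One way to tidy such labels is sketched below using codecs.lookup from Python's standard library, which resolves the aliases Python knows about and raises LookupError for names it does not. This is an illustration rather than the method used to build the table, and Python's alias list is its own, so a label it rejects may still be accepted by other tools. The helper name canonical_charset is hypothetical, and the counts are a handful of rows copied from the table above.

```python
import codecs
from collections import Counter

def canonical_charset(label):
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None

# A few (label, count) rows copied from the table above.
rows = [
    ("windows-1251", 604),
    ("cp1251", 1),
    ("iso-8859-1", 1374),
    ("latin1", 3),
    ("logical", 1),
    ("UTF-ISO-8859-1", 1),
]

merged = Counter()   # counts summed per canonical codec name
unrecognized = []    # labels Python cannot resolve to a codec
for label, count in rows:
    name = canonical_charset(label)
    if name is None:
        unrecognized.append(label)
    else:
        merged[name] += count

for name, count in merged.most_common():
    print(name, count)
print("unrecognized:", unrecognized)
```

Merging aliases this way shifts the long tail only slightly, since the alternate spellings account for a handful of pages each.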

 

LANGUAGES

In all, CLD2 detected 98 different languages. However, news homepages are collections of disjoint short text snippets with low total text volume, so CLD2 will have an elevated false positive rate on them, and the numbers below should be used only as approximations. Note that homepages may contain text in multiple languages; only the most common language found on the page is recorded. Also note that, in general, language should not be used as a geographic indicator (i.e., do not assume that all German-language outlets are in Germany or that all Russian-language outlets are in Russia), since many outlets in the collection are based in one country but serve a primary audience in a different country, or serve an expat audience that speaks a language other than the most common languages of that country.

Language Count %
ENGLISH 21312 45.60
RUSSIAN 3553 7.60
ITALIAN 2973 6.36
FRENCH 2631 5.63
SPANISH 2563 5.48
GERMAN 1900 4.07
PORTUGUESE 1236 2.64
POLISH 1196 2.56
DUTCH 1059 2.27
TURKISH 964 2.06
ARABIC 944 2.02
Chinese 885 1.89
Korean 688 1.47
SWEDISH 577 1.23
CZECH 456 0.98
GREEK 409 0.88
NORWEGIAN 350 0.75
UKRAINIAN 313 0.67
HUNGARIAN 308 0.66
VIETNAMESE 262 0.56
ChineseT 249 0.53
ROMANIAN 210 0.45
HINDI 184 0.39
HEBREW 154 0.33
Japanese 125 0.27
SERBIAN 111 0.24
INDONESIAN 84 0.18
FINNISH 79 0.17
Unknown 74 0.16
DANISH 67 0.14
BULGARIAN 66 0.14
CROATIAN 62 0.13
ALBANIAN 56 0.12
SLOVAK 55 0.12
NORWEGIAN_N 44 0.09
SLOVENIAN 44 0.09
ESTONIAN 40 0.09
LITHUANIAN 35 0.07
PERSIAN 34 0.07
THAI 33 0.07
ARMENIAN 32 0.07
LATVIAN 26 0.06
BENGALI 22 0.05
BOSNIAN 19 0.04
MACEDONIAN 17 0.04
SINHALESE 16 0.03
TAMIL 16 0.03
MALAY 15 0.03
AZERBAIJANI 14 0.03
CATALAN 12 0.03
MALAYALAM 11 0.02
ICELANDIC 10 0.02
SOMALI 10 0.02
SWAHILI 10 0.02
URDU 10 0.02
NEPALI 9 0.02
LATIN 8 0.02
WARAY_PHILIPPINES 7 0.01
GEORGIAN 6 0.01
KAZAKH 6 0.01
UZBEK 5 0.01
HAUSA 4 0.01
MALTESE 4 0.01
MARATHI 4 0.01
MONGOLIAN 4 0.01
AFRIKAANS 3 0.01
DHIVEHI 3 0.01
GALICIAN 3 0.01
GUJARATI 3 0.01
KINYARWANDA 3 0.01
TAJIK 3 0.01
TELUGU 3 0.01
TIGRINYA 3 0.01
BELARUSIAN 2 0.00
BURMESE 2 0.00
CEBUANO 2 0.00
IRISH 2 0.00
KURDISH 2 0.00
PASHTO 2 0.00
AFAR 1 0.00
AMHARIC 1 0.00
BASQUE 1 0.00
FAROESE 1 0.00
FRISIAN 1 0.00
GANDA 1 0.00
GUARANI 1 0.00
HAITIAN_CREOLE 1 0.00
KANNADA 1 0.00
KYRGYZ 1 0.00
LUXEMBOURGISH 1 0.00
NYANJA 1 0.00
PUNJABI 1 0.00
RHAETO_ROMANCE 1 0.00
TAGALOG 1 0.00
TIBETAN 1 0.00
WELSH 1 0.00
X_PIG_LATIN 1 0.00
XHOSA 1 0.00
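
The rule noted earlier, that only the most common language on a page is recorded, can be sketched as follows. The per-language byte counts and the minimum-text threshold are hypothetical inputs chosen for illustration; they are not CLD2's actual output format or the settings used to build the inventory.

```python
def primary_language(byte_counts, min_total_bytes=200):
    """Pick the language covering the most text on a page.

    byte_counts maps a language label to the bytes of text attributed to it.
    Pages with very little text are labeled "Unknown" because short snippets
    make detection unreliable; min_total_bytes is an arbitrary example value.
    """
    total = sum(byte_counts.values())
    if total < min_total_bytes:
        return "Unknown"
    # max() with a key function returns the label with the largest byte count.
    return max(byte_counts, key=byte_counts.get)

# Hypothetical page mixing two languages: only the larger one is kept.
mixed_page = {"RUSSIAN": 5200, "ENGLISH": 1800}
print(primary_language(mixed_page))

# Hypothetical page with too little text to trust a label.
sparse_page = {"LATIN": 40}
print(primary_language(sparse_page))
```

Collapsing a page to a single label discards its secondary languages, which is one reason the counts above are best read as approximations.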