NOTE (9/1/2019): A new version of this linguistic inventory sheet is available with better accuracy and coverage for all site changes March 2018 – September 2019. See New Version.
The GDELT Global Frontpage Graph (GFG) has generated incredible interest in just its first three weeks of existence. In addition to publishing a first glimpse at some of the statistics you can extract from it, we've compiled an inventory sheet derived from the March 21, 2018 noon UTC snapshot. It lists every unique frontpage we successfully scanned in that snapshot, the native character set it was encoded in (we transcode everything to UTF-8), and the primary language of the page as estimated by Google's Compact Language Detector 2 (CLD2). The sheet can be used to filter the hourly snapshots to just specific languages of interest. We are also working to construct a country lookup that records the country of origin for each outlet to further assist analyses.
- Download The Latest Inventory Sheet. (Compiled from the March 21, 2018 noon UTC snapshot)
We hope these statistics and this new inventory file make it easier for you to navigate the new Global Frontpage Graph!
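As a rough illustration of the language filtering described above, here is a minimal Python sketch. The file layout is an assumption, not a documented format: it supposes the inventory sheet is tab-delimited with the frontpage URL, charset and language in the first three columns, and that each snapshot row carries its frontpage URL in a known column. The function names and the `url_column` parameter are hypothetical.

```python
import csv
import io

def load_inventory(handle):
    """Read an inventory sheet into {frontpage_url: (charset, language)}.

    Assumes tab-delimited rows of: URL, charset, language (hypothetical layout).
    """
    inventory = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        inventory[row[0]] = (row[1], row[2])
    return inventory

def urls_for_languages(inventory, languages):
    """Return the set of frontpage URLs whose language is in `languages`.

    Matching is case-insensitive because the sheet mixes label styles
    (e.g. "ENGLISH" but "Chinese").
    """
    wanted = {lang.upper() for lang in languages}
    return {url for url, (_, lang) in inventory.items() if lang.upper() in wanted}

def filter_snapshot(rows, keep_urls, url_column=0):
    """Yield only the snapshot rows whose frontpage URL is in `keep_urls`.

    `url_column` is an assumed position; adjust it to the real snapshot layout.
    """
    for row in rows:
        if row[url_column] in keep_urls:
            yield row
```

In use, `load_inventory` would be given an open handle on the downloaded sheet, and `filter_snapshot` would be applied to each hourly snapshot as it is read, so that only rows from outlets in the chosen languages are kept.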
CHARSETS
Unsurprisingly, UTF-8 is the dominant character set encoding in use on news websites today, accounting for 93.07% of the homepages in the GFG. However, a number of other encodings are in use as well, as seen in the table below. Not all of the labels listed are necessarily valid encoding names (for example, "iso-8850-1" and "UTF-ISO-8859-1").
Charset | Count | % |
UTF-8 | 44605 | 93.07 |
iso-8859-1 | 1374 | 2.87 |
windows-1251 | 604 | 1.26 |
euc-kr | 354 | 0.74 |
gb2312 | 292 | 0.61 |
windows-1252 | 168 | 0.35 |
iso-8859-2 | 166 | 0.35 |
windows-1250 | 91 | 0.19 |
windows-1256 | 45 | 0.09 |
gbk | 37 | 0.08 |
iso-8859-15 | 32 | 0.07 |
iso-8859-9 | 30 | 0.06 |
windows-1254 | 25 | 0.05 |
windows-1255 | 20 | 0.04 |
big5 | 11 | 0.02 |
shift_jis | 11 | 0.02 |
gb18030 | 10 | 0.02 |
cp1256 | 9 | 0.02 |
iso-8859-7 | 6 | 0.01 |
windows-1253 | 6 | 0.01 |
us-ascii | 5 | 0.01 |
euc-jp | 4 | 0.01 |
koi8-r | 4 | 0.01 |
latin1 | 3 | 0.01 |
windows-1257 | 3 | 0.01 |
cp1251 | 1 | 0.00 |
iso-8850-1 | 1 | 0.00 |
iso-8859 | 1 | 0.00 |
iso8859-2 | 1 | 0.00 |
iso-8859-5 | 1 | 0.00 |
iso-8859-6 | 1 | 0.00 |
iso-8859-8 | 1 | 0.00 |
koi8-u | 1 | 0.00 |
logical | 1 | 0.00 |
UTF-16 | 1 | 0.00 |
UTF-ISO-8859-1 | 1 | 0.00 |
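Several labels in the table above are aliases of one another (for example "latin1" and "iso-8859-1", or "cp1251" and "windows-1251"), and a few are not valid encoding names at all. One possible way to merge aliases and flag invalid labels is to resolve each label through Python's standard `codecs.lookup`, as sketched below. This is only an illustration: Python's codec registry is not the same list of labels that browsers recognize, and the function names are hypothetical.

```python
import codecs
from collections import Counter

def canonical_charset(label):
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None

def merge_charset_counts(counts):
    """Merge counts for labels that resolve to the same codec.

    Returns (merged, unrecognised): a Counter keyed by canonical codec name,
    and a dict of the labels Python could not resolve with their counts.
    """
    merged = Counter()
    unrecognised = {}
    for label, n in counts.items():
        name = canonical_charset(label)
        if name is None:
            unrecognised[label] = n
        else:
            merged[name] += n
    return merged, unrecognised

# A few rows copied from the table above.
table_excerpt = {
    "iso-8859-1": 1374,
    "latin1": 3,
    "windows-1251": 604,
    "cp1251": 1,
    "iso-8850-1": 1,
    "UTF-ISO-8859-1": 1,
}
merged, unrecognised = merge_charset_counts(table_excerpt)
```

Here the "latin1" and "iso-8859-1" rows collapse into one entry, as do "cp1251" and "windows-1251", while "iso-8850-1" and "UTF-ISO-8859-1" end up in `unrecognised`.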
LANGUAGES
In all, CLD2 detected 98 different languages. However, news homepages are collections of disjoint short text snippets with low total text volume, so CLD2 has an elevated false positive rate on them and the numbers below should be used only as approximations. Note that homepages may contain text in multiple languages; only the most common language found on the page is recorded. Also note that language should not, in general, be used as a geographic indicator (i.e., do not assume that all German-language outlets are in Germany or that all Russian-language outlets are in Russia). Many outlets in the collection are based in one country but serve a primary audience in a different country, or serve an expat audience that speaks a language other than the most common languages of that country.
Language | Count | % |
ENGLISH | 21312 | 45.60 |
RUSSIAN | 3553 | 7.60 |
ITALIAN | 2973 | 6.36 |
FRENCH | 2631 | 5.63 |
SPANISH | 2563 | 5.48 |
GERMAN | 1900 | 4.07 |
PORTUGUESE | 1236 | 2.64 |
POLISH | 1196 | 2.56 |
DUTCH | 1059 | 2.27 |
TURKISH | 964 | 2.06 |
ARABIC | 944 | 2.02 |
Chinese | 885 | 1.89 |
Korean | 688 | 1.47 |
SWEDISH | 577 | 1.23 |
CZECH | 456 | 0.98 |
GREEK | 409 | 0.88 |
NORWEGIAN | 350 | 0.75 |
UKRAINIAN | 313 | 0.67 |
HUNGARIAN | 308 | 0.66 |
VIETNAMESE | 262 | 0.56 |
ChineseT | 249 | 0.53 |
ROMANIAN | 210 | 0.45 |
HINDI | 184 | 0.39 |
HEBREW | 154 | 0.33 |
Japanese | 125 | 0.27 |
SERBIAN | 111 | 0.24 |
INDONESIAN | 84 | 0.18 |
FINNISH | 79 | 0.17 |
Unknown | 74 | 0.16 |
DANISH | 67 | 0.14 |
BULGARIAN | 66 | 0.14 |
CROATIAN | 62 | 0.13 |
ALBANIAN | 56 | 0.12 |
SLOVAK | 55 | 0.12 |
NORWEGIAN_N | 44 | 0.09 |
SLOVENIAN | 44 | 0.09 |
ESTONIAN | 40 | 0.09 |
LITHUANIAN | 35 | 0.07 |
PERSIAN | 34 | 0.07 |
THAI | 33 | 0.07 |
ARMENIAN | 32 | 0.07 |
LATVIAN | 26 | 0.06 |
BENGALI | 22 | 0.05 |
BOSNIAN | 19 | 0.04 |
MACEDONIAN | 17 | 0.04 |
SINHALESE | 16 | 0.03 |
TAMIL | 16 | 0.03 |
MALAY | 15 | 0.03 |
AZERBAIJANI | 14 | 0.03 |
CATALAN | 12 | 0.03 |
MALAYALAM | 11 | 0.02 |
ICELANDIC | 10 | 0.02 |
SOMALI | 10 | 0.02 |
SWAHILI | 10 | 0.02 |
URDU | 10 | 0.02 |
NEPALI | 9 | 0.02 |
LATIN | 8 | 0.02 |
WARAY_PHILIPPINES | 7 | 0.01 |
GEORGIAN | 6 | 0.01 |
KAZAKH | 6 | 0.01 |
UZBEK | 5 | 0.01 |
HAUSA | 4 | 0.01 |
MALTESE | 4 | 0.01 |
MARATHI | 4 | 0.01 |
MONGOLIAN | 4 | 0.01 |
AFRIKAANS | 3 | 0.01 |
DHIVEHI | 3 | 0.01 |
GALICIAN | 3 | 0.01 |
GUJARATI | 3 | 0.01 |
KINYARWANDA | 3 | 0.01 |
TAJIK | 3 | 0.01 |
TELUGU | 3 | 0.01 |
TIGRINYA | 3 | 0.01 |
BELARUSIAN | 2 | 0.00 |
BURMESE | 2 | 0.00 |
CEBUANO | 2 | 0.00 |
IRISH | 2 | 0.00 |
KURDISH | 2 | 0.00 |
PASHTO | 2 | 0.00 |
AFAR | 1 | 0.00 |
AMHARIC | 1 | 0.00 |
BASQUE | 1 | 0.00 |
FAROESE | 1 | 0.00 |
FRISIAN | 1 | 0.00 |
GANDA | 1 | 0.00 |
GUARANI | 1 | 0.00 |
HAITIAN_CREOLE | 1 | 0.00 |
KANNADA | 1 | 0.00 |
KYRGYZ | 1 | 0.00 |
LUXEMBOURGISH | 1 | 0.00 |
NYANJA | 1 | 0.00 |
PUNJABI | 1 | 0.00 |
RHAETO_ROMANCE | 1 | 0.00 |
TAGALOG | 1 | 0.00 |
TIBETAN | 1 | 0.00 |
WELSH | 1 | 0.00 |
X_PIG_LATIN | 1 | 0.00 |
XHOSA | 1 | 0.00 |
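To make concrete what "only the most common language found on the page is recorded" can mean for a multilingual homepage, the following Python sketch picks the language that covers the most text. It is an illustration under stated assumptions, not a description of how CLD2 or the GFG pipeline works internally: the input format (pairs of language label and character count) and the function name are hypothetical.

```python
from collections import Counter

def primary_language(snippets):
    """Return the language covering the most text on a page.

    `snippets` is an iterable of (language_label, character_count) pairs,
    one per detected text snippet (a hypothetical input format).
    Returns "Unknown" when there is no text to judge.
    """
    volume = Counter()
    for language, n_chars in snippets:
        volume[language] += n_chars
    if not volume:
        return "Unknown"
    return volume.most_common(1)[0][0]
```

A page with 210 characters of English navigation text and 300 characters of French headlines would be recorded as FRENCH under this rule, and its English content would not appear in the counts at all.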