Global Frontpage Graph (GFG) Inventory Sheet With Charset And Language Breakdown

NOTE (9/1/2019): A new version of this linguistic inventory sheet is available with better accuracy and coverage for all site changes March 2018 – September 2019. See New Version.

The GDELT Global Frontpage Graph (GFG) has generated incredible interest in just its first three weeks of existence. In addition to publishing a first glimpse at some of the statistics you can extract from it, we've compiled an inventory sheet derived from the March 21, 2018 noon UTC snapshot. It lists every unique frontpage we successfully scanned in that snapshot, the native character set it was encoded in (we transcode everything to UTF-8), and the primary language of the page as estimated by Google's Compact Language Detector 2 (CLD2). The sheet can be used to filter the hourly snapshots to just the languages of interest. We are also working on a country lookup that records the country of origin of each outlet to further assist analyses.
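
As a rough illustration of language filtering, here is a minimal Python sketch. It assumes both the inventory sheet and the hourly snapshot are tab-delimited text, that the inventory holds the frontpage URL, charset and language in its first three columns, and that the snapshot holds the frontpage URL in its second column. Those column positions, the function names and the sample rows are assumptions made for this example only, so check them against the actual files before relying on it.

```python
import io

def load_language_lookup(inventory_lines, url_col=0, lang_col=2):
    """Build a frontpage URL -> language label mapping.

    Column positions are assumptions; adjust them to the real sheet.
    """
    lookup = {}
    for line in inventory_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > max(url_col, lang_col):
            lookup[fields[url_col]] = fields[lang_col]
    return lookup

def filter_snapshot(snapshot_lines, lookup, wanted, frontpage_col=1):
    """Yield snapshot rows whose frontpage URL maps to a wanted language."""
    for line in snapshot_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > frontpage_col and lookup.get(fields[frontpage_col]) in wanted:
            yield fields

# Made-up sample rows standing in for the real files.
inventory = io.StringIO(
    "https://news.example.com/\tUTF-8\tENGLISH\n"
    "https://nachrichten.example.org/\tiso-8859-1\tGERMAN\n"
)
snapshot = io.StringIO(
    "20180321120000\thttps://news.example.com/\t1\thttps://news.example.com/a\tStory A\n"
    "20180321120000\thttps://nachrichten.example.org/\t1\thttps://nachrichten.example.org/b\tStory B\n"
)

lookup = load_language_lookup(inventory)
german_rows = list(filter_snapshot(snapshot, lookup, {"GERMAN"}))
print(german_rows)
```

Frontpages that are missing from the inventory are dropped rather than guessed at, which seems the safer default given that the sheet covers only a single snapshot.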

Download The Latest Inventory Sheet. (Compiled from the March 21, 2018 noon UTC snapshot)

We hope these statistics and this new inventory file make it easier for you to navigate the new Global Frontpage Graph!

CHARSETS

Unsurprisingly, UTF-8 is the dominant character set encoding in use on news websites today, accounting for 93.07% of the homepages in the GFG. A number of other encodings are in use as well, as seen in the table below (not all of the listed labels may be valid encoding names).

Charset	Count	%
UTF-8	44605	93.07
iso-8859-1	1374	2.87
windows-1251	604	1.26
euc-kr	354	0.74
gb2312	292	0.61
windows-1252	168	0.35
iso-8859-2	166	0.35
windows-1250	91	0.19
windows-1256	45	0.09
gbk	37	0.08
iso-8859-15	32	0.07
iso-8859-9	30	0.06
windows-1254	25	0.05
windows-1255	20	0.04
big5	11	0.02
shift_jis	11	0.02
gb18030	10	0.02
cp1256	9	0.02
iso-8859-7	6	0.01
windows-1253	6	0.01
us-ascii	5	0.01
euc-jp	4	0.01
koi8-r	4	0.01
latin1	3	0.01
windows-1257	3	0.01
cp1251	1	0.00
iso-8850-1	1	0.00
iso-8859	1	0.00
iso8859-2	1	0.00
iso-8859-5	1	0.00
iso-8859-6	1	0.00
iso-8859-8	1	0.00
koi8-u	1	0.00
logical	1	0.00
UTF-16	1	0.00
UTF-ISO-8859-1	1	0.00
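
Several rows above are aliases of one another (for example `latin1` and `iso-8859-1`, or `cp1251` and `windows-1251`), and a few are not recognizable encoding names at all. One possible way to group aliases and flag unrecognized labels is to look each label up in Python's codec registry, as sketched below. This is only an approximation: Python's registry is not the same as the label list browsers use, so a label it rejects is merely suspect, and the helper name `canonical_charset` is hypothetical.

```python
import codecs

def canonical_charset(label):
    """Return Python's canonical codec name for a label, or None if unknown.

    Uses Python's codec registry as a stand-in for a real charset registry.
    """
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None

# A few labels taken from the table above.
labels = [
    "UTF-8", "iso-8859-1", "latin1", "windows-1251", "cp1251",
    "iso-8850-1", "logical", "UTF-ISO-8859-1",
]

for label in labels:
    print(label, "->", canonical_charset(label))
```

Labels that resolve to the same canonical name can then be merged before counting, and those that resolve to `None` can be reviewed by hand.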

LANGUAGES

In all, CLD2 detected 98 different languages. However, news homepages are collections of disjoint short text snippets with low total text volume, so CLD2 will have an elevated false positive rate on them and the numbers below should be used only as approximations. Note that homepages may contain text in multiple languages; only the most common language found on the page is recorded. Also note that language should not, in general, be used as a geographic indicator (i.e., do not assume that all German-language outlets are in Germany or that all Russian-language outlets are in Russia). Many outlets in the collection are based in one country but serve a primary audience in a different country, or serve an expat audience that speaks a different language than the most common languages of that country.

Language	Count	%
ENGLISH	21312	45.60
RUSSIAN	3553	7.60
ITALIAN	2973	6.36
FRENCH	2631	5.63
SPANISH	2563	5.48
GERMAN	1900	4.07
PORTUGUESE	1236	2.64
POLISH	1196	2.56
DUTCH	1059	2.27
TURKISH	964	2.06
ARABIC	944	2.02
Chinese	885	1.89
Korean	688	1.47
SWEDISH	577	1.23
CZECH	456	0.98
GREEK	409	0.88
NORWEGIAN	350	0.75
UKRAINIAN	313	0.67
HUNGARIAN	308	0.66
VIETNAMESE	262	0.56
ChineseT	249	0.53
ROMANIAN	210	0.45
HINDI	184	0.39
HEBREW	154	0.33
Japanese	125	0.27
SERBIAN	111	0.24
INDONESIAN	84	0.18
FINNISH	79	0.17
Unknown	74	0.16
DANISH	67	0.14
BULGARIAN	66	0.14
CROATIAN	62	0.13
ALBANIAN	56	0.12
SLOVAK	55	0.12
NORWEGIAN_N	44	0.09
SLOVENIAN	44	0.09
ESTONIAN	40	0.09
LITHUANIAN	35	0.07
PERSIAN	34	0.07
THAI	33	0.07
ARMENIAN	32	0.07
LATVIAN	26	0.06
BENGALI	22	0.05
BOSNIAN	19	0.04
MACEDONIAN	17	0.04
SINHALESE	16	0.03
TAMIL	16	0.03
MALAY	15	0.03
AZERBAIJANI	14	0.03
CATALAN	12	0.03
MALAYALAM	11	0.02
ICELANDIC	10	0.02
SOMALI	10	0.02
SWAHILI	10	0.02
URDU	10	0.02
NEPALI	9	0.02
LATIN	8	0.02
WARAY_PHILIPPINES	7	0.01
GEORGIAN	6	0.01
KAZAKH	6	0.01
UZBEK	5	0.01
HAUSA	4	0.01
MALTESE	4	0.01
MARATHI	4	0.01
MONGOLIAN	4	0.01
AFRIKAANS	3	0.01
DHIVEHI	3	0.01
GALICIAN	3	0.01
GUJARATI	3	0.01
KINYARWANDA	3	0.01
TAJIK	3	0.01
TELUGU	3	0.01
TIGRINYA	3	0.01
BELARUSIAN	2	0.00
BURMESE	2	0.00
CEBUANO	2	0.00
IRISH	2	0.00
KURDISH	2	0.00
PASHTO	2	0.00
AFAR	1	0.00
AMHARIC	1	0.00
BASQUE	1	0.00
FAROESE	1	0.00
FRISIAN	1	0.00
GANDA	1	0.00
GUARANI	1	0.00
HAITIAN_CREOLE	1	0.00
KANNADA	1	0.00
KYRGYZ	1	0.00
LUXEMBOURGISH	1	0.00
NYANJA	1	0.00
PUNJABI	1	0.00
RHAETO_ROMANCE	1	0.00
TAGALOG	1	0.00
TIBETAN	1	0.00
WELSH	1	0.00
X_PIG_LATIN	1	0.00
XHOSA	1	0.00

The GDELT Project
