The Languages Of The New Web News Ngram Datasets (WEB-NGRAM)

The table below lists every language represented in the new Web News Ngram (WEB-NGRAM) unigram and bigram datasets, along with the total number of words currently recorded for each language in the unigram dataset. See the WEB-NGRAM Character Ngram list for the breakdown of character-based ngrams.

Language Total Words
ENGLISH 22,313,928,109
SPANISH 3,809,002,145
GERMAN 2,072,061,850
ITALIAN 1,713,394,263
FRENCH 1,445,902,797
RUSSIAN 1,428,440,159
ARABIC 1,294,513,215
PORTUGUESE 1,053,643,488
TURKISH 981,518,721
GREEK 769,717,342
ROMANIAN 612,204,059
INDONESIAN 595,532,469
SWEDISH 455,122,855
Korean 438,716,120
POLISH 418,520,502
DUTCH 264,981,879
HUNGARIAN 259,075,988
BULGARIAN 251,148,518
MACEDONIAN 190,624,093
UKRAINIAN 166,890,540
SERBIAN 151,000,329
CROATIAN 138,519,124
CZECH 135,042,481
ALBANIAN 129,997,712
DANISH 126,606,774
HEBREW 123,175,045
FINNISH 107,517,933
NORWEGIAN 101,457,008
LITHUANIAN 98,489,823
SLOVAK 71,899,349
CATALAN 56,887,994
AZERBAIJANI 53,379,749
URDU 52,571,759
SLOVENIAN 47,516,487
ARMENIAN 39,033,898
LATVIAN 38,776,107
BOSNIAN 38,530,754
ICELANDIC 35,952,061
PERSIAN 34,416,512
ESTONIAN 30,084,016
MALAY 27,703,127
TAMIL 27,014,503
TELUGU 25,693,413
MALAYALAM 25,390,294
MONGOLIAN 23,507,808
SINHALESE 20,583,726
GUJARATI 17,002,368
KANNADA 16,289,048
GALICIAN 12,904,344
SOMALI 10,828,602
LATIN 10,663,428
GEORGIAN 10,280,369
BELARUSIAN 9,616,698
BASQUE 9,443,613
PASHTO 8,383,804
TAGALOG 7,856,876
PUNJABI 7,844,201
KAZAKH 5,178,514
UZBEK 5,045,283
SWAHILI 4,962,490
MALTESE 4,729,809
NORWEGIAN_N 4,633,529
KYRGYZ 4,554,537
AFRIKAANS 4,163,926
FAROESE 3,439,843
TAJIK 2,613,295
DHIVEHI 2,509,464
SINDHI 2,244,081
LUXEMBOURGISH 1,623,118
WELSH 1,492,293
MALAGASY 1,411,887
GANDA 1,051,061
CEBUANO 987,409
RHAETO_ROMANCE 833,559
TATAR 691,541
IRISH 685,004
TONGA 643,026
FRISIAN 511,122
SAMOAN 460,038
BIHARI 371,971
HINDI 344,400
HAUSA 288,989
KINYARWANDA 243,439
SHONA 234,798
TIGRINYA 231,460
NYANJA 217,178
CORSICAN 160,541
ZULU 107,430
WARAY_PHILIPPINES 92,788
MAORI 78,490
HAITIAN_CREOLE 76,988
SANSKRIT 76,927
GUARANI 72,812
KURDISH 63,412
TSWANA 56,132
SESELWA 51,104
MANX 42,509
SCOTS_GAELIC 39,043
KHASI 36,632
MAURITIAN_CREOLE 30,267
SCOTS 29,913
INUKTITUT 28,686
BISLAMA 28,328
SESOTHO 22,991
XHOSA 20,846
OCCITAN 20,162
AFAR 20,128
YORUBA 19,043
WOLOF 18,063
LINGALA 13,577
AMHARIC 11,952
ZHUANG 11,166
VOLAPUK 11,057
BRETON 10,507
FIJIAN 10,035
ORIYA 8,753
AKAN 8,605
INTERLINGUE 8,592
BENGALI 8,475
OROMO 6,921
GREENLANDIC 6,554
AYMARA 5,804
JAVANESE 5,716
QUECHUA 5,567
TURKMEN 5,500
INUPIAK 5,000
IGBO 4,700
HAWAIIAN 4,520
RUNDI 4,221
SISWANT 2,943
SUNDANESE 2,566
HMONG 2,335
VENDA 1,993
MARATHI 1,869
TSONGA 1,537
NAURU 615
YIDDISH 422
NEPALI 274
SANGO 232
ABKHAZIAN 194
UIGHUR 108
ASSAMESE 6

 

TECHNICAL DETAILS

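Here is the SQL query used to generate the table above. It sums the COUNT field for each LANG value in the unigram table and sorts the totals in descending order.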

SELECT LANG, SUM(COUNT) AS TOT
FROM `gdelt-bq.gdeltv2.web_1grams`
GROUP BY LANG
ORDER BY TOT DESC
LIMIT 10000
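
Because the totals span many orders of magnitude, it can help to see each language's share of the whole unigram dataset next to its raw count. The query below is a minimal sketch of one way to do that, not part of the original analysis. It uses the same table and the same LANG and COUNT fields as the query above; the PCT column name is a hypothetical alias introduced here for illustration.

```sql
-- Sketch only: extends the query above with a percentage-of-total column.
-- PCT is an illustrative alias, not a field in the dataset.
SELECT
  LANG,
  SUM(COUNT) AS TOT,
  -- The window function runs over the grouped rows, so SUM(SUM(COUNT)) OVER ()
  -- is the grand total across all languages.
  ROUND(100 * SUM(COUNT) / SUM(SUM(COUNT)) OVER (), 4) AS PCT
FROM `gdelt-bq.gdeltv2.web_1grams`
GROUP BY LANG
ORDER BY TOT DESC
LIMIT 10000
```

The empty OVER () clause keeps everything in a single pass over the table, so no separate subquery is needed to compute the grand total.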