The table below lists every language represented in the new Web News Ngram (WEB-NGRAM) unigram and bigram datasets, along with the total number of words currently recorded for each language in the unigram dataset. Language names are shown as they appear in the dataset's LANG field. See the WEB-NGRAM Character Ngram list for the breakdown of character-based ngrams.
Language | Total Words |
ENGLISH | 22,313,928,109 |
SPANISH | 3,809,002,145 |
GERMAN | 2,072,061,850 |
ITALIAN | 1,713,394,263 |
FRENCH | 1,445,902,797 |
RUSSIAN | 1,428,440,159 |
ARABIC | 1,294,513,215 |
PORTUGUESE | 1,053,643,488 |
TURKISH | 981,518,721 |
GREEK | 769,717,342 |
ROMANIAN | 612,204,059 |
INDONESIAN | 595,532,469 |
SWEDISH | 455,122,855 |
Korean | 438,716,120 |
POLISH | 418,520,502 |
DUTCH | 264,981,879 |
HUNGARIAN | 259,075,988 |
BULGARIAN | 251,148,518 |
MACEDONIAN | 190,624,093 |
UKRAINIAN | 166,890,540 |
SERBIAN | 151,000,329 |
CROATIAN | 138,519,124 |
CZECH | 135,042,481 |
ALBANIAN | 129,997,712 |
DANISH | 126,606,774 |
HEBREW | 123,175,045 |
FINNISH | 107,517,933 |
NORWEGIAN | 101,457,008 |
LITHUANIAN | 98,489,823 |
SLOVAK | 71,899,349 |
CATALAN | 56,887,994 |
AZERBAIJANI | 53,379,749 |
URDU | 52,571,759 |
SLOVENIAN | 47,516,487 |
ARMENIAN | 39,033,898 |
LATVIAN | 38,776,107 |
BOSNIAN | 38,530,754 |
ICELANDIC | 35,952,061 |
PERSIAN | 34,416,512 |
ESTONIAN | 30,084,016 |
MALAY | 27,703,127 |
TAMIL | 27,014,503 |
TELUGU | 25,693,413 |
MALAYALAM | 25,390,294 |
MONGOLIAN | 23,507,808 |
SINHALESE | 20,583,726 |
GUJARATI | 17,002,368 |
KANNADA | 16,289,048 |
GALICIAN | 12,904,344 |
SOMALI | 10,828,602 |
LATIN | 10,663,428 |
GEORGIAN | 10,280,369 |
BELARUSIAN | 9,616,698 |
BASQUE | 9,443,613 |
PASHTO | 8,383,804 |
TAGALOG | 7,856,876 |
PUNJABI | 7,844,201 |
KAZAKH | 5,178,514 |
UZBEK | 5,045,283 |
SWAHILI | 4,962,490 |
MALTESE | 4,729,809 |
NORWEGIAN_N | 4,633,529 |
KYRGYZ | 4,554,537 |
AFRIKAANS | 4,163,926 |
FAROESE | 3,439,843 |
TAJIK | 2,613,295 |
DHIVEHI | 2,509,464 |
SINDHI | 2,244,081 |
LUXEMBOURGISH | 1,623,118 |
WELSH | 1,492,293 |
MALAGASY | 1,411,887 |
GANDA | 1,051,061 |
CEBUANO | 987,409 |
RHAETO_ROMANCE | 833,559 |
TATAR | 691,541 |
IRISH | 685,004 |
TONGA | 643,026 |
FRISIAN | 511,122 |
SAMOAN | 460,038 |
BIHARI | 371,971 |
HINDI | 344,400 |
HAUSA | 288,989 |
KINYARWANDA | 243,439 |
SHONA | 234,798 |
TIGRINYA | 231,460 |
NYANJA | 217,178 |
CORSICAN | 160,541 |
ZULU | 107,430 |
WARAY_PHILIPPINES | 92,788 |
MAORI | 78,490 |
HAITIAN_CREOLE | 76,988 |
SANSKRIT | 76,927 |
GUARANI | 72,812 |
KURDISH | 63,412 |
TSWANA | 56,132 |
SESELWA | 51,104 |
MANX | 42,509 |
SCOTS_GAELIC | 39,043 |
KHASI | 36,632 |
MAURITIAN_CREOLE | 30,267 |
SCOTS | 29,913 |
INUKTITUT | 28,686 |
BISLAMA | 28,328 |
SESOTHO | 22,991 |
XHOSA | 20,846 |
OCCITAN | 20,162 |
AFAR | 20,128 |
YORUBA | 19,043 |
WOLOF | 18,063 |
LINGALA | 13,577 |
AMHARIC | 11,952 |
ZHUANG | 11,166 |
VOLAPUK | 11,057 |
BRETON | 10,507 |
FIJIAN | 10,035 |
ORIYA | 8,753 |
AKAN | 8,605 |
INTERLINGUE | 8,592 |
BENGALI | 8,475 |
OROMO | 6,921 |
GREENLANDIC | 6,554 |
AYMARA | 5,804 |
JAVANESE | 5,716 |
QUECHUA | 5,567 |
TURKMEN | 5,500 |
INUPIAK | 5,000 |
IGBO | 4,700 |
HAWAIIAN | 4,520 |
RUNDI | 4,221 |
SISWANT | 2,943 |
SUNDANESE | 2,566 |
HMONG | 2,335 |
VENDA | 1,993 |
MARATHI | 1,869 |
TSONGA | 1,537 |
NAURU | 615 |
YIDDISH | 422 |
NEPALI | 274 |
SANGO | 232 |
ABKHAZIAN | 194 |
UIGHUR | 108 |
ASSAMESE | 6 |
TECHNICAL DETAILS
Here is the SQL query used to generate the table above.
SELECT LANG, SUM(COUNT) AS TOT
FROM `gdelt-bq.gdeltv2.web_1grams`
GROUP BY LANG
ORDER BY TOT DESC
LIMIT 10000
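
For readers working from the table above rather than from BigQuery, here is a minimal Python sketch of one way to parse rows in the "LANGUAGE | 1,234 |" layout and compute each language's share of the parsed total. The function names `parse_table` and `shares` are hypothetical helpers invented for this illustration and are not part of the dataset or any GDELT tooling. The sample includes only the first three rows, so the printed shares are relative to those rows, not to the full table.

```python
def parse_table(text):
    """Parse 'LANGUAGE | 1,234 |' rows into a {language: word_count} dict."""
    counts = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 2:
            continue  # not a table row
        digits = cells[1].replace(",", "")
        if not digits.isdigit():
            continue  # skips the header row and any non-numeric line
        counts[cells[0]] = int(digits)
    return counts


def shares(counts):
    """Return each language's fraction of the summed word counts."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Only the first three rows of the table, copied verbatim, for brevity.
sample = """Language | Total Words |
ENGLISH | 22,313,928,109 |
SPANISH | 3,809,002,145 |
GERMAN | 2,072,061,850 |"""

counts = parse_table(sample)
for lang, frac in shares(counts).items():
    print(f"{lang}: {frac:.1%} of the sampled rows")
```

Pasting the full table into `sample` would give shares across all listed languages. Keeping the counts as integers avoids rounding until the final division.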