The table below lists every language represented in the new Web News Ngram (WEB-NGRAM) unigram and bigram datasets, along with the number of words currently recorded for each language in the unigram dataset. See the WEB-NGRAM Character Ngram list for the breakdown of character-based ngrams.
| Language | Total Words |
| --- | ---: |
| ENGLISH | 22,313,928,109 |
| SPANISH | 3,809,002,145 |
| GERMAN | 2,072,061,850 |
| ITALIAN | 1,713,394,263 |
| FRENCH | 1,445,902,797 |
| RUSSIAN | 1,428,440,159 |
| ARABIC | 1,294,513,215 |
| PORTUGUESE | 1,053,643,488 |
| TURKISH | 981,518,721 |
| GREEK | 769,717,342 |
| ROMANIAN | 612,204,059 |
| INDONESIAN | 595,532,469 |
| SWEDISH | 455,122,855 |
| Korean | 438,716,120 |
| POLISH | 418,520,502 |
| DUTCH | 264,981,879 |
| HUNGARIAN | 259,075,988 |
| BULGARIAN | 251,148,518 |
| MACEDONIAN | 190,624,093 |
| UKRAINIAN | 166,890,540 |
| SERBIAN | 151,000,329 |
| CROATIAN | 138,519,124 |
| CZECH | 135,042,481 |
| ALBANIAN | 129,997,712 |
| DANISH | 126,606,774 |
| HEBREW | 123,175,045 |
| FINNISH | 107,517,933 |
| NORWEGIAN | 101,457,008 |
| LITHUANIAN | 98,489,823 |
| SLOVAK | 71,899,349 |
| CATALAN | 56,887,994 |
| AZERBAIJANI | 53,379,749 |
| URDU | 52,571,759 |
| SLOVENIAN | 47,516,487 |
| ARMENIAN | 39,033,898 |
| LATVIAN | 38,776,107 |
| BOSNIAN | 38,530,754 |
| ICELANDIC | 35,952,061 |
| PERSIAN | 34,416,512 |
| ESTONIAN | 30,084,016 |
| MALAY | 27,703,127 |
| TAMIL | 27,014,503 |
| TELUGU | 25,693,413 |
| MALAYALAM | 25,390,294 |
| MONGOLIAN | 23,507,808 |
| SINHALESE | 20,583,726 |
| GUJARATI | 17,002,368 |
| KANNADA | 16,289,048 |
| GALICIAN | 12,904,344 |
| SOMALI | 10,828,602 |
| LATIN | 10,663,428 |
| GEORGIAN | 10,280,369 |
| BELARUSIAN | 9,616,698 |
| BASQUE | 9,443,613 |
| PASHTO | 8,383,804 |
| TAGALOG | 7,856,876 |
| PUNJABI | 7,844,201 |
| KAZAKH | 5,178,514 |
| UZBEK | 5,045,283 |
| SWAHILI | 4,962,490 |
| MALTESE | 4,729,809 |
| NORWEGIAN_N | 4,633,529 |
| KYRGYZ | 4,554,537 |
| AFRIKAANS | 4,163,926 |
| FAROESE | 3,439,843 |
| TAJIK | 2,613,295 |
| DHIVEHI | 2,509,464 |
| SINDHI | 2,244,081 |
| LUXEMBOURGISH | 1,623,118 |
| WELSH | 1,492,293 |
| MALAGASY | 1,411,887 |
| GANDA | 1,051,061 |
| CEBUANO | 987,409 |
| RHAETO_ROMANCE | 833,559 |
| TATAR | 691,541 |
| IRISH | 685,004 |
| TONGA | 643,026 |
| FRISIAN | 511,122 |
| SAMOAN | 460,038 |
| BIHARI | 371,971 |
| HINDI | 344,400 |
| HAUSA | 288,989 |
| KINYARWANDA | 243,439 |
| SHONA | 234,798 |
| TIGRINYA | 231,460 |
| NYANJA | 217,178 |
| CORSICAN | 160,541 |
| ZULU | 107,430 |
| WARAY_PHILIPPINES | 92,788 |
| MAORI | 78,490 |
| HAITIAN_CREOLE | 76,988 |
| SANSKRIT | 76,927 |
| GUARANI | 72,812 |
| KURDISH | 63,412 |
| TSWANA | 56,132 |
| SESELWA | 51,104 |
| MANX | 42,509 |
| SCOTS_GAELIC | 39,043 |
| KHASI | 36,632 |
| MAURITIAN_CREOLE | 30,267 |
| SCOTS | 29,913 |
| INUKTITUT | 28,686 |
| BISLAMA | 28,328 |
| SESOTHO | 22,991 |
| XHOSA | 20,846 |
| OCCITAN | 20,162 |
| AFAR | 20,128 |
| YORUBA | 19,043 |
| WOLOF | 18,063 |
| LINGALA | 13,577 |
| AMHARIC | 11,952 |
| ZHUANG | 11,166 |
| VOLAPUK | 11,057 |
| BRETON | 10,507 |
| FIJIAN | 10,035 |
| ORIYA | 8,753 |
| AKAN | 8,605 |
| INTERLINGUE | 8,592 |
| BENGALI | 8,475 |
| OROMO | 6,921 |
| GREENLANDIC | 6,554 |
| AYMARA | 5,804 |
| JAVANESE | 5,716 |
| QUECHUA | 5,567 |
| TURKMEN | 5,500 |
| INUPIAK | 5,000 |
| IGBO | 4,700 |
| HAWAIIAN | 4,520 |
| RUNDI | 4,221 |
| SISWANT | 2,943 |
| SUNDANESE | 2,566 |
| HMONG | 2,335 |
| VENDA | 1,993 |
| MARATHI | 1,869 |
| TSONGA | 1,537 |
| NAURU | 615 |
| YIDDISH | 422 |
| NEPALI | 274 |
| SANGO | 232 |
| ABKHAZIAN | 194 |
| UIGHUR | 108 |
| ASSAMESE | 6 |
TECHNICAL DETAILS
Here is the SQL query used to generate the table above.
SELECT LANG, SUM(COUNT) AS TOT FROM `gdelt-bq.gdeltv2.web_1grams` GROUP BY LANG ORDER BY TOT DESC LIMIT 10000
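To see what this aggregation does without a BigQuery account, the sketch below runs the same GROUP BY and SUM pattern against a small in-memory SQLite table. It is only an illustration: the rows are invented rather than taken from GDELT, the `NGRAM` column name is an assumption about the table layout, and SQLite stands in for BigQuery.

```python
import sqlite3

# Toy stand-in for the BigQuery table. These rows are invented for
# illustration and are NOT GDELT data; the NGRAM column is an assumption.
rows = [
    ("ENGLISH", "the", 50),
    ("ENGLISH", "news", 30),
    ("SPANISH", "el", 20),
    ("SPANISH", "noticias", 5),
    ("GERMAN", "der", 10),
]

conn = sqlite3.connect(":memory:")
# COUNT is quoted so it is always read as a column name, not a function.
conn.execute('CREATE TABLE web_1grams (LANG TEXT, NGRAM TEXT, "COUNT" INTEGER)')
conn.executemany("INSERT INTO web_1grams VALUES (?, ?, ?)", rows)

# Same shape as the query above: sum the per-word counts for each language
# and sort the languages from largest to smallest total.
totals = conn.execute(
    'SELECT LANG, SUM("COUNT") AS TOT FROM web_1grams '
    "GROUP BY LANG ORDER BY TOT DESC"
).fetchall()

# Print the result in the same pipe-delimited layout as the table.
for lang, tot in totals:
    print(f"| {lang} | {tot:,} |")
```

Each language's total is the sum of the counts of all of its individual words, which is why the table reports total words rather than distinct words.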