The GDELT Project

Global Embedded Metadata Graph (GEMG): Cataloging Articles By Self-Reported Language

With the addition of the new HTTP-EQUIV and HTML LANG attributes to the Global Embedded Metadata Graph (GEMG), you can now trivially catalog articles by their self-reported languages.

For example, the following SQL query selects articles whose embedded metadata includes a language code and also returns CLD2's estimate of each page's language for comparison:

SELECT metatag.*, lang AS cldlang, url
FROM `gdelt-bq.gdeltv2.gemg`, UNNEST(metatags) AS metatag
WHERE DATE(date) = "2021-11-12"
  AND (metatag.key = 'lang' OR metatag.key = 'content-language')

This yields a table of entries like the following:

key type value CLDLang URL
lang htmltag hi HINDI https://www.amarujala.com/chandigarh/time-has-come-for-the-accused-of-sacrilege-case-to-go-to-jail-says-sukhjinder-singh-randhawa
lang htmltag it-IT ITALIAN https://www.rollingstone.it/tv/news-tv/lol-chi-ride-e-fuori-stagione-2-svelato-il-cast-ed-e-notevole/596563/
lang htmltag de GERMAN https://de.euronews.com/2021/11/12/klimawandel-bedroht-venedig
lang htmltag sv SWEDISH https://www.st.nu/2021-11-12/nackstamordarna-gripna-efter-11-manader-pa-rymmen
lang htmltag en ENGLISH https://www.reuters.com/world/africa/policeman-arrested-killing-8-year-old-girl-cameroon-2021-11-12/
lang htmltag pt-pt PORTUGUESE https://rr.sapo.pt/noticia/pais/2021/11/12/marta-temido-nao-exclui-futuros-confinamentos-cenarios-tem-de-estar-todos-em-aberto/260570/
lang htmltag en-CA ENGLISH https://www.thelondoner.ca/news/local-news/little-opposition-anticipated-to-heritage-designation-at-old-victoria-hospital-site/wcm/1558d10c-3f43-4408-a543-d21e535a03ba
lang htmltag de-DE ENGLISH https://de.nachrichten.yahoo.com/corona-europa-lockdowns-f%C3%BCr-ungeimpfte-220430497.html
lang htmltag de GERMAN https://www.abendzeitung-muenchen.de/promis/vor-zwei-jahren-paul-walkers-tochter-meadow-hatte-einen-tumor-art-770646
lang htmltag el GREEK https://www.tribune.gr/politics/news/article/784986/diaskepsi-gia-ti-livyi-ekloges-stin-ora-toys-na-fygoyn-toyrkoi-kai-rosoi.html
lang htmltag en-AU ENGLISH https://www.businessinsider.com.au/legal-expert-warns-astroworld-refunds-could-negate-right-to-sue-2021-11
lang htmltag es-AR SPANISH https://www.diarioelnorte.com.ar/el-dolar-blue-bajo-650-y-cerro-en-200/
lang htmltag en-CA ENGLISH https://vancouversun.com/news/local-news/yes-more-rain-atmospheric-river-brings-rain-and-snow-to-southern-b-c
content-language http-equiv ar-kw ARABIC https://www.alanba.com.kw/1083073/
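
One use for the CLD2 column is flagging rows where the self-reported code and the detected language disagree, as in the de-DE row above that CLD2 labels ENGLISH. The following Python sketch is only an illustration of that comparison. The `CLD2_NAMES` mapping is an assumption that covers just the languages in the sample table, and the helper names `primary_subtag` and `disagrees` are hypothetical, not part of GDELT or CLD2.

```python
# Hypothetical mapping from primary language subtags to the CLD2 names
# seen in the sample table; extend it for other languages as needed.
CLD2_NAMES = {
    "hi": "HINDI", "it": "ITALIAN", "de": "GERMAN", "sv": "SWEDISH",
    "en": "ENGLISH", "pt": "PORTUGUESE", "el": "GREEK", "es": "SPANISH",
    "ar": "ARABIC",
}


def primary_subtag(code):
    """Return the lowercase language part of a code such as 'it-IT' or 'pt_pt'."""
    return code.strip().lower().replace("_", "-").split("-")[0]


def disagrees(code, cld_lang):
    """True if the code maps to a known language that differs from CLD2's estimate."""
    expected = CLD2_NAMES.get(primary_subtag(code))
    return expected is not None and expected != cld_lang.upper()


# (value, CLDLang) pairs taken from the sample table
rows = [("it-IT", "ITALIAN"), ("de-DE", "ENGLISH"), ("ar-kw", "ARABIC")]
flagged = [row for row in rows if disagrees(*row)]
```

Codes missing from the mapping are never flagged, so an incomplete mapping produces missed disagreements rather than false alarms.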

Note that the values mix bare language codes (such as "de") with language-plus-country codes (such as "de-DE"), and that letter case varies ("pt-pt" versus "it-IT"). Some pages also include both an HTML LANG attribute and an HTTP-EQUIV META field, and the two can differ in specificity. For example, Alanba.com.kw offers only "ar" in its HTML LANG attribute, but the more specific "ar-kw" (Kuwaiti Arabic) in its HTTP-EQUIV field:

key type value CLDLang URL
content-language http-equiv ar-kw ARABIC https://www.alanba.com.kw/1083073/
lang htmltag ar ARABIC https://www.alanba.com.kw/1083059/
content-language http-equiv ar-kw ARABIC https://www.alanba.com.kw/1083059/
lang htmltag ar ARABIC https://www.alanba.com.kw/1083075/

Thus, you may wish to examine both attributes for each URL and select the more specific of the two.
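
As a minimal sketch of that selection step, the following Python keeps one code per URL and prefers the code with more subtags. It assumes the query results have already been loaded as (key, value, url) tuples. The function names `specificity` and `most_specific` are hypothetical, and counting subtags is just one simple way to define "more specific".

```python
def specificity(code):
    """Count the subtags in a language code: 'ar' -> 1, 'ar-kw' -> 2."""
    return len([part for part in code.replace("_", "-").split("-") if part])


def most_specific(rows):
    """Map each URL to its most specific self-reported language code.

    rows: iterable of (key, value, url) tuples from the query results.
    On a tie, the first code seen for a URL is kept.
    """
    best = {}
    for _key, value, url in rows:
        if url not in best or specificity(value) > specificity(best[url]):
            best[url] = value
    return best


# Rows taken from the Alanba.com.kw sample above
rows = [
    ("content-language", "ar-kw", "https://www.alanba.com.kw/1083073/"),
    ("lang", "ar", "https://www.alanba.com.kw/1083059/"),
    ("content-language", "ar-kw", "https://www.alanba.com.kw/1083059/"),
    ("lang", "ar", "https://www.alanba.com.kw/1083075/"),
]
chosen = most_specific(rows)
```

This sketch does not check that the two codes name the same base language, so a page whose attributes conflict (say, "en" in one and "de-DE" in the other) would need separate handling.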