The GDELT Project

Global Embedded Metadata Graph (GEMG) Adds HTML LANG Attribute

Many websites set a "lang" attribute on the top-level <HTML> tag of each page to declare the language of the page's contents, typically as an ISO 639-1 language code or a language+country code (for example, "en" or "en-US"). This information is not usually repeated in the <META> or JSON-LD fields captured by the Global Embedded Metadata Graph (GEMG), so we have added a new field to its "metatags" block, with "htmltag" as its "type" and "lang" as its "key", that records this value when it is present.
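For readers who want to see what this capture amounts to, here is a minimal sketch in Python using only the standard library. It is an illustration rather than GDELT's actual implementation: the class name HtmlLangParser, the function html_lang_entry, and the "value" field of the returned record are assumptions made for this example. Only the "htmltag" type and "lang" key come from the description above.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> tag encountered."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <HTML LANG="...">
        # and <html lang="..."> are handled identically.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            value = dict(attrs).get("lang")
            # Ignore a missing attribute, a valueless attribute, or an empty string.
            if value and value.strip():
                self.lang = value.strip()


def html_lang_entry(html_text):
    """Return a metatags-style dict for the page's lang attribute, or None."""
    parser = HtmlLangParser()
    parser.feed(html_text)
    if parser.lang is None:
        return None
    # Hypothetical record layout: only "type" and "key" are taken from the post;
    # the "value" field name is an assumption for this sketch.
    return {"type": "htmltag", "key": "lang", "value": parser.lang}


if __name__ == "__main__":
    page = '<!DOCTYPE html><html lang="hr-HR"><head></head><body></body></html>'
    print(html_lang_entry(page))
```

Pages without the attribute simply yield no entry, which matches the "if present" behavior described above.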

The HTML LANG attribute is not found on all sites and, when present, can be set incorrectly. On sites that set it reliably, however, it can be a useful signal for understanding a page's content. It is especially helpful for distinguishing highly similar languages such as Bosnian, Croatian and Serbian written in the Latin character set, which even purpose-built Serbo-Croatian language detectors frequently fail to tell apart.
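As a rough illustration of using the attribute as a hint for these languages, the Python sketch below reduces a lang value to its primary language subtag and checks it against the three languages named above. The names SIMILAR_LANGUAGES, primary_language and declared_similar_language are hypothetical, and because the attribute can be wrong, the result is best treated as one signal to weigh alongside a language detector rather than as ground truth.

```python
# ISO 639-1 codes for the three closely related languages discussed above.
SIMILAR_LANGUAGES = {"bs": "Bosnian", "hr": "Croatian", "sr": "Serbian"}


def primary_language(lang_value):
    """Return the lowercase primary subtag of a lang value, or None."""
    if not lang_value:
        return None
    # Accept both the standard "sr-Latn-RS" form and the nonstandard
    # underscore separator ("sr_RS") that some sites use.
    primary = lang_value.strip().replace("_", "-").split("-")[0].lower()
    return primary or None


def declared_similar_language(lang_value):
    """Return the declared language name if it is Bosnian, Croatian or Serbian."""
    return SIMILAR_LANGUAGES.get(primary_language(lang_value))


if __name__ == "__main__":
    for value in ["hr-HR", "sr-Latn-RS", "bs", "en-US", None]:
        print(repr(value), declared_similar_language(value))
```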
