The GDELT Project

Visual Global Knowledge Graph Adds Image Caption, Page Title And Page Language

We're excited to announce today that the Visual Global Knowledge Graph has added three new fields to the "ImageProperties" object of the RawJSON field.

For the first news article on which a given image is seen, if the image was associated with a visible on-page textual caption or ALT caption, that description will now be included in a field called "ImageCaption". Given that many news images do not include descriptive caption text in their EXIF metadata, this greatly expands the universe of analytic possibilities, including such complex analyses as associating visual labels from Cloud Vision with specific entities found in the image caption to identify dominant themes of visual representation. Note that images that are specified for social sharing and not included in the body of the article (such as OG:IMAGE fields) will not include captioning text, though the page title may be useful in these cases for inferring their likely contents.
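As a rough illustration of the caption-plus-labels analysis described above, the following Python sketch pulls the caption and the Cloud Vision labels out of a single record. Only "ImageProperties" and "ImageCaption" come from this announcement. The sample payload, the top-level placement of "labelAnnotations" (borrowed from the Cloud Vision API response format), the helper name caption_and_labels and the min_score threshold are all assumptions made for the example.

```python
import json

# Hypothetical RawJSON payload. Only "ImageProperties" and "ImageCaption" are
# named in the announcement; the "labelAnnotations" layout is assumed to follow
# the Cloud Vision API response format.
raw_json = json.dumps({
    "labelAnnotations": [
        {"description": "Crowd", "score": 0.93},
        {"description": "Protest", "score": 0.88},
    ],
    "ImageProperties": {
        "ImageCaption": "Demonstrators gather outside parliament on Tuesday.",
    },
})


def caption_and_labels(raw, min_score=0.5):
    """Return (caption, labels) for one RawJSON string.

    The caption is None when the record has no "ImageCaption", which the
    announcement says is the case for social-sharing-only images.
    """
    record = json.loads(raw)
    props = record.get("ImageProperties") or {}
    caption = props.get("ImageCaption")
    labels = [
        annotation["description"]
        for annotation in record.get("labelAnnotations", [])
        if annotation.get("score", 0.0) >= min_score
    ]
    return caption, labels


caption, labels = caption_and_labels(raw_json)
print(caption)
print(labels)
```

From here, an entity extractor of your choice could be run over the caption and its output cross-tabulated against the labels.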

The title of the first news article to include a given image is also now included in a field called "SourceArticleTitle".

Non-ASCII characters in both fields are JSON escaped. While this slightly increases the file size of the records, the increase is negligible compared to the size of the rich Cloud Vision API annotations. It also ensures that existing ingest workflows are unaffected by the expansion to all 152 languages monitored by GDELT today, and avoids problems with more brittle JSON parsers encountering invalid character sequences.
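To make the escaping trade-off concrete, here is a small Python sketch. It assumes the fields use standard JSON \uXXXX escapes, which is what Python's json.dumps emits by default; the sample title is invented for the example and the announcement does not spell out the exact escaping scheme.

```python
import json

# Invented sample title; any non-ASCII text behaves the same way.
title = "Москва: новости дня"

# json.dumps escapes non-ASCII characters as \uXXXX by default
# (ensure_ascii=True), so the serialized record is pure ASCII.
escaped = json.dumps({"SourceArticleTitle": title})
assert escaped.isascii()

# Any standards-compliant JSON parser restores the original text.
decoded = json.loads(escaped)["SourceArticleTitle"]
assert decoded == title

# The escaped form is larger than the same record stored as raw UTF-8.
escaped_size = len(escaped.encode("ascii"))
utf8_size = len(
    json.dumps({"SourceArticleTitle": title}, ensure_ascii=False).encode("utf-8")
)
print(escaped_size, utf8_size)
```

Because the decoding happens inside the JSON parser, downstream code sees ordinary Unicode strings and needs no extra unescaping step.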

The primary language of the first article to include an image is now reported as its CLD2 human language label in both the existing "LangHints" field and a new "SourceArticleLang" field. Note that this label is uppercase for most languages and mixed case for a select few.
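Since the label's casing varies by language, comparisons are safer when made case-insensitively. The sketch below is one way to do that in Python; the helper names normalize_lang and is_language are hypothetical, and the sample label values are illustrative rather than taken from real records.

```python
def normalize_lang(label):
    """Fold a language label to a case-insensitive form, or None if missing."""
    if not label:
        return None
    return label.strip().casefold()


def is_language(image_properties, wanted):
    """Check whether an "ImageProperties" dict reports the wanted language."""
    reported = normalize_lang(image_properties.get("SourceArticleLang"))
    return reported is not None and reported == normalize_lang(wanted)


# Illustrative values only: one all-uppercase label and one mixed-case label.
print(is_language({"SourceArticleLang": "RUSSIAN"}, "Russian"))
print(is_language({"SourceArticleLang": "Japanese"}, "JAPANESE"))
```

Normalizing at ingest time, rather than at each comparison, keeps later grouping and filtering consistent.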

We're enormously excited by the profound new analytic capabilities these fields afford, especially the image captions!