GEG & VGKG: A Multimodal Master List Of 30 Million Visual And Textual Entities

GDELT's Visual Knowledge Graph has annotated more than half a billion global news images through Google's Cloud Vision API since 2017, yielding 22.5 million distinct textually-derived caption-based Web Entities and 16,500 distinct visually-derived Labels. The GDELT Global Entity Graph, in turn, has annotated a sample of 103 million worldwide English language online news articles since 2016 through Google's Cloud Natural Language API, yielding 13.9 million distinct textual entities with MID codes.

How do these parallel universes of visual and textual entities compare and how much overlap is there? Given that the Cloud Vision API and Cloud Natural Language API both use a shared concept graph with a common set of MIDs, the three entity datasets can simply be merged based on their MID codes to yield a single master multimodal entity dataset.
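The merge semantics here are those of a full outer join on the MID code, with COALESCE-style resolution of the entity name from whichever dataset contains it. The actual merge was performed in BigQuery (shown later in this post), but the logic can be sketched in a few lines of Python using hypothetical toy counts rather than the real datasets:

```python
# Minimal sketch of the MID-based merge. The MIDs and counts below are
# hypothetical toy data, not values from the actual GDELT datasets.
# Each source maps a MID code to (entity name, appearance count).
web_entities = {"/m/02mjmr": ("Barack Obama", 120), "/m/05zppz": ("Male", 900)}
labels = {"/m/05zppz": ("Male", 450)}
text_entities = {"/m/02mjmr": ("Barack Obama", 300), "/m/0d05w3": ("China", 80)}

merged = {}
for mid in set(web_entities) | set(labels) | set(text_entities):
    # COALESCE-style name resolution: first dataset containing the MID wins.
    name = (web_entities.get(mid) or labels.get(mid) or text_entities.get(mid))[0]
    wc = web_entities.get(mid, (None, 0))[1]
    lc = labels.get(mid, (None, 0))[1]
    tc = text_entities.get(mid, (None, 0))[1]
    merged[mid] = {"entity": name, "webentitycount": wc, "labelcount": lc,
                   "textcount": tc, "totalcount": wc + lc + tc}
```

Because every MID from any of the three sources survives the outer join, entities appearing in only one modality still make it into the master list, with zero counts in the others.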

The combined multimodal visual and textual entity dataset comprises 30,782,747 distinct MID codes with 24,519,732 distinct associated entity names (in some cases a single textual name can resolve to multiple MID codes based on its different contexts). Of these 30 million entities, 22,504,682 (73%) were returned as visual Web Entities, 13,892,261 (45%) as textual entities and 16,528 (0.05%) as visual Labels. A total of 5,614,794 (18%) were in both Web Entities and text entities, while 15,175 (0.05%) were present in all three datasets.

In all, entities with non-null MID codes collectively made 1,584,550,381 text entity appearances (most textual entities do not have MID codes), 3,279,516,626 Web Entity appearances and 4,314,959,061 Label appearances, yielding a grand total of 5,578,017,892 entity appearances across both modalities.

Correlating Web Entities with text entities yields a Pearson correlation of r=0.289 (n=30782747), while limiting to only those entities that appeared more than 100,000 times as both Web Entities and text entities yields r=0.418 (n=629). Comparing Web Entities with Labels (and thus caption-derived entities with visually-assigned entities) yields a correlation of r=0.522 (n=30782747), while limiting to just those entities that appeared at least once in both datasets yields a correlation of r=0.583 (n=15741).
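The correlations above are ordinary Pearson r values computed over paired per-entity appearance counts. A small Python sketch of that computation, using hypothetical counts (the filtering step mirrors the "appeared at least once in both datasets" restriction described above):

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson correlation over two paired count vectors.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired counts: (Web Entity appearances, text entity appearances).
pairs = [(120, 300), (900, 0), (0, 80), (450, 500)]
web, text = zip(*pairs)
r_all = pearson_r(web, text)

# Restricting to entities present in both modalities, as in the n=15741 comparison.
both = [(w, t) for w, t in pairs if w > 0 and t > 0]
```

Restricting to entities seen in both datasets removes the large mass of zero-count pairs, which is why the restricted correlations run higher than the all-entity figures.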

The final master combined GEG & VGKG textual and visual multimodal entity lookup is available as a UTF-8 JSON file:

We're tremendously excited by the opportunities this new entity dataset offers for multimodal analyses that look across both the textual and visual worlds.


To create this lookup, the existing VGKG and GEG entity lookups were loaded into temporary tables, and the following query was used to merge them together in just 37 seconds.

-- NOTE: `TEMPTABLE-WEBENTITIES` and `TEMPTABLE-TEXTENTITIES` below are placeholder
-- names standing in for the other two temporary tables; only `TEMPTABLE-LABELS`
-- appeared in the original query fragment.
SELECT
  COALESCE(a.entity, b.entity, c.entity) entity,
  COALESCE(a.mid, b.mid, c.mid) mid,
  IF(a.Count > 0, a.Count, 0) imagewebentitycount,
  IF(b.Count > 0, b.Count, 0) imagelabelcount,
  IF(c.Count > 0, c.Count, 0) textcount,
  IF(a.Count > 0, a.Count, 0) + IF(b.Count > 0, b.Count, 0) + IF(c.Count > 0, c.Count, 0) totalcount
FROM (
  SELECT entity, mid, Count FROM `TEMPTABLE-WEBENTITIES`
) a
FULL OUTER JOIN (
  SELECT entity, mid, Count FROM `TEMPTABLE-LABELS`
) b USING(mid)
FULL OUTER JOIN (
  SELECT entity, mid, Count FROM `TEMPTABLE-TEXTENTITIES`
) c USING(mid)
ORDER BY totalcount DESC