Visual Global Knowledge Graph (VGKG) April 2016 Snapshot Dataset

Following in the footsteps of our February snapshot, we're releasing an April 2016 snapshot of the Visual Global Knowledge Graph (VGKG) by popular demand! This snapshot is in CSV format, with one article/image pair per row and the following columns (in order of appearance):

  • DATE. (Timestamp in YYYYMMDDHHMMSS format, rounded to the nearest 15 minutes, of when the article containing the image was monitored)
  • DocumentIdentifier. (The URL of the article the image was found in – note that a URL may appear multiple times if it contained multiple images)
  • ImageURL. (The URL of the image itself – note that a URL may appear multiple times if the image appeared in multiple articles, for example when different news outlets owned by the same publisher share a common CDN)
  • RawJSON. (The raw JSON output returned by the Cloud Vision API for the image – note that some fields may change format and/or new fields may be added over time, since this data spans multiple versions and releases of the Cloud Vision API – this is the JSON results block as-is and can be parsed with any standard JSON parser)
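As a rough illustration of the column layout above, here is a minimal Python sketch that reads rows and parses the DATE and RawJSON columns. It assumes the file is comma-delimited with standard CSV quoting around the JSON field, which should be verified against the actual download. The function names `parse_vgkg_row` and `read_vgkg` and the output keys are hypothetical, chosen only for this example.

```python
import csv
import json
from datetime import datetime


def parse_vgkg_row(row):
    """Turn one 4-column row into a dict with a parsed timestamp and JSON block."""
    date_str, doc_url, image_url, raw_json = row
    return {
        # DATE is YYYYMMDDHHMMSS
        "date": datetime.strptime(date_str, "%Y%m%d%H%M%S"),
        "document": doc_url,
        "image_url": image_url,
        # RawJSON is kept as-is in the file, so parse it here
        "vision": json.loads(raw_json),
    }


def read_vgkg(lines, has_header=True):
    """Yield parsed rows from an iterable of CSV lines (e.g. an open file).

    Assumption: standard comma-delimited CSV with quoted fields.
    """
    # The JSON field can be far larger than the csv module's default field limit.
    csv.field_size_limit(2**31 - 1)
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the column-name row
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not have exactly four columns
        yield parse_vgkg_row(row)
```

Because the JSON block spans multiple Cloud Vision API versions, code that consumes the parsed result should look fields up defensively (for example with `dict.get`) rather than assume every key is present.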

There are 36,769,236 rows, including the header row, totaling 233GB (this extract includes only the newer images for which the full JSON output is available). Given the size of the full dataset, it has been split into six parts, each compressed with GZIP. The first row of Part 1 is the CSV header row with the column names. To load the data into a database, download and gunzip all six parts and then concatenate them in order (Part 1 first, so the header row stays at the top) into a single file.
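The decompress-and-concatenate step can be sketched in Python as follows. This is one possible approach, not the only one; the function name `concat_gzip_parts` and the file names in the usage comment are placeholders, not the actual names of the downloads.

```python
import gzip
import shutil


def concat_gzip_parts(part_paths, out_path):
    """Decompress each gzipped part in the given order and append it to out_path.

    Only Part 1 carries the header row, so plain concatenation is enough
    as long as Part 1 comes first in part_paths.
    """
    with open(out_path, "wb") as out:
        for path in part_paths:
            with gzip.open(path, "rb") as part:
                # Stream in chunks so no part has to fit in memory.
                shutil.copyfileobj(part, out, length=16 * 1024 * 1024)


# Hypothetical usage; substitute the real file names of the six parts:
# concat_gzip_parts(
#     ["part1.csv.gz", "part2.csv.gz", "part3.csv.gz",
#      "part4.csv.gz", "part5.csv.gz", "part6.csv.gz"],
#     "vgkg_april2016.csv",
# )
```

Streaming each part straight into the output file avoids holding the 233GB dataset in memory, though the destination disk still needs room for the full uncompressed file.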