A Planetary Scale Open Dataset: Just How Big Is GDELT As Of 2021?

Just how large is GDELT as of 2021? Back in July 2018 we showed that GDELT at the time encompassed more than 3.2 trillion datapoints. How much has it grown in the nearly three years since? Below are the sizes of some of our largest datasets. These totals don't include the myriad specialty datasets and extracts we routinely post on our blog.

  • Visual Global Knowledge Graph (VGKG). The VGKG contains 630,131,403 records times 12 fields per record, yielding 7,561,576,836 total fields. Yet one of those fields, the RawJSON field, contains the complete annotation set as generated by the Cloud Vision API. We can approximate the number of distinct values inside this field for a given day without actually parsing the JSON using the query "SELECT SUM(ARRAY_LENGTH(split(RawJSON, ': '))) FROM `gdelt-bq.gdeltv2.cloudvision_partitioned` WHERE DATE(_PARTITIONTIME) = '2021-03-17'". This yields 570,388,131,331 distinct datapoints.
    • Total: 577,949,708,167 datapoints
  • Global Frontpage Graph (GFG). The GFG contains 265,632,114,441 records times 6 fields per record, yielding 1,593,792,686,646 distinct datapoints.
    • Total: 1,593,792,686,646 datapoints
  • Global Quotation Graph (GQG). The GQG contains 175,889,625 records with 4 fixed and 3 repeated fields. Using the query "SELECT count(1), SUM(ARRAY_LENGTH(quotes)) FROM `gdelt-bq.gdeltv2.gqg`" we can get the total count of this repeated structure, 1,136,908,105 times 3 fields in each struct. Combined with the 703,558,500 fixed datapoints, this yields a grand total of 4,114,282,815 distinct datapoints.
    • Total: 4,114,282,815 datapoints
  • Radio News NGrams. The Radio News NGrams dataset contains 115,382,303,734 total records times 7 fields, yielding 807,676,126,138 datapoints.
    • Total: 807,676,126,138 datapoints
  • Television News NGrams. The Television News NGrams dataset contains 15,354,334,394 total records times 7 fields, yielding 107,480,340,758 datapoints.
    • Total: 107,480,340,758 datapoints
  • Web News NGrams. The Web News NGrams dataset contains 72,455,977,934 records times 4 fields, yielding 289,823,911,736 total datapoints.
    • Total: 289,823,911,736 datapoints
  • Web Part Of Speech Dataset. The Web POS dataset contains 23,411,266,087 records with 18 fixed and 2 repeated fields. Using the query "SELECT count(1), sum(num_changes) FROM `gdelt-bq.gdeltv2.gdg_partitioned`" we can get the total count of this repeated structure, 34,488,799,253 times 2 fields in each struct, yielding 68,977,598,506 datapoints. Combined with the 421,402,789,566 fixed datapoints, there are 490,380,388,072 datapoints.
    • Total: 490,380,388,072 datapoints
  • Advertising Inventory Files (AIF) Captioning Time Dataset. The AIF Captioning Time dataset consists of 2,377,142,765 records times 6 fields, yielding 14,262,856,590 total datapoints.
    • Total: 14,262,856,590 datapoints
  • Advertising Inventory Files (AIF) Video Time Dataset. The AIF Video Time dataset consists of 326,646,781 records times 6 fields, yielding 1,959,880,686 total datapoints.
    • Total: 1,959,880,686 datapoints
  • Events 2.0. The Events 2.0 dataset consists of 563,735,007 records times 61 fields in the Events table, yielding 34,387,835,427 datapoints, plus 1,733,311,961 records times 16 fields in the EventMentions table, yielding 27,732,991,376 datapoints. Combined, it has 62,120,826,803 datapoints.
    • Total: 62,120,826,803 datapoints
  • Global Difference Graph (GDG). The GDG contains 1,700,968,506 records with 20 fixed and 4 repeated fields. Using the query "SELECT count(1), sum(num_changes) FROM `gdelt-bq.gdeltv2.gdg_partitioned`" we can get the total count of this repeated structure, 5,270,291,447 times 4 fields in each struct, yielding 21,081,165,788 datapoints. Combined with the 34,019,370,120 fixed datapoints, there are 55,100,535,908 datapoints.
    • Total: 55,100,535,908 datapoints
  • Global Geographic Graph (GGG). The GGG contains 1,978,632,488 records times 16 fields, yielding 31,658,119,808 datapoints.
    • Total: 31,658,119,808 datapoints
  • Global Relationship Graph (GRG). The GRG contains 1,220,945,927 records with 4 fixed and 2 repeated fields. Using the query "SELECT count(1), SUM(ARRAY_LENGTH(urls)) FROM `gdelt-bq.gdeltv2.grg_vcn`" we can get the total count of this repeated structure, 1,461,397,069 times 2 fields, for a combined total of 7,806,577,846 datapoints.
    • Total: 7,806,577,846 datapoints
  • Visual Global Entity Graph (VGEG). The VGEG contains 369,893,769 records with 49 fixed fields and two separate repeated fields, each with 6 fields. Using the query "SELECT count(1), sum(numDistinctEntities), sum(numDistinctPresenceEntities) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`", we find there are a total of 98,862,965,259 datapoints.
    • Total: 98,862,965,259 datapoints
  • Global Entity Graph (GEG) G1. The GEG G1 contains 183,691,666 records with 3 fixed and 7 repeated fields. Using the query "SELECT count(1), SUM(ARRAY_LENGTH(entities)) FROM `gdelt-bq.gdeltv2.geg_g1`" we find there are a combined total of 132,082,233,890 datapoints.
    • Total: 132,082,233,890 datapoints
  • Global Entity Graph (GEG) GCNLPAPI. The GEG GCNLPAPI contains 188,684,288 records with 6 fixed and 6 repeated fields. Using the query "SELECT count(1), SUM(ARRAY_LENGTH(entities)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`" we find there are a combined total of 119,561,070,030 datapoints.
    • Total: 119,561,070,030 datapoints
  • Global Entity Graph (GEG) TV. The GEG TV contains 70,245,330 records with 6 fixed and 6 repeated fields. Using the query "SELECT count(1), SUM(ARRAY_LENGTH(entities)) FROM `gdelt-bq.gdeltv2.gegv2_iatv`" we find there are a combined total of 3,599,739,846 datapoints.
    • Total: 3,599,739,846 datapoints
  • Global Embedded Metadata Graph (GEMG). The GEMG contains 725,902,037 records with 5 fixed and 3 repeated fields. Using the query "SELECT count(1), SUM(ARRAY_LENGTH(metatags)) FROM `gdelt-bq.gdeltv2.gemg`" we find there are a total of 54,224,798,163 datapoints in the repeated fields plus the 3,629,510,185 fixed datapoints. The "jsonld" field contains JSON data with 12,201,123,815 total fields. Combined, the GEMG has 70,055,432,163 datapoints.
    • Total: 70,055,432,163 datapoints
  • Global Knowledge Graph 2.0 (GKG). The GKG has 1,181,195,967 records times 27 fields, totaling 31,892,291,109 datapoints. The GCAM field alone consists of 3,395,938,405,125 distinct datapoints. The V2Tone field contains 7 fields, yielding 8,268,371,769 datapoints, while the remaining delimited fields combined yield another 238,380,875,061 datapoints using the query "SELECT SUM(ARRAY_LENGTH(split(SocialVideoEmbeds, ';'))) + SUM(ARRAY_LENGTH(split(SocialImageEmbeds, ';'))) + SUM(ARRAY_LENGTH(split(RelatedImages, ';'))) + SUM(ARRAY_LENGTH(split(Quotations, '#')))*4 + SUM(ARRAY_LENGTH(split(Amounts, ';')))*3 + SUM(ARRAY_LENGTH(split(AllNames, ';')))*2 + SUM(ARRAY_LENGTH(split(V2Organizations, ';')))*2 + SUM(ARRAY_LENGTH(split(V2Persons, ';')))*2 + SUM(ARRAY_LENGTH(split(V2Themes, ';')))*2 + SUM(ARRAY_LENGTH(split(V2Counts, ';')))*11 + SUM(ARRAY_LENGTH(split(V2Locations, ';')))*9 FROM `gdelt-bq.gdeltv2.gkg_partitioned`". Combined, the entire GKG 2.0 table yields 3,674,479,943,064 datapoints.
    • Total: 3,674,479,943,064 datapoints
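
Most of the per-dataset arithmetic in the list above follows one pattern: one datapoint per fixed field per record, plus one datapoint per field of every repeated struct instance. The following is a minimal Python sketch of that pattern, using record and struct counts copied from the list. The helper name `datapoints` is purely illustrative and is not part of any GDELT tooling.

```python
def datapoints(records, fixed_fields, repeated=()):
    """Count datapoints for one table.

    records:      number of rows in the table
    fixed_fields: number of non-repeated fields per row
    repeated:     (struct_count, fields_per_struct) pairs, one per
                  repeated structure, where struct_count is the total
                  number of struct instances across all rows
    """
    total = records * fixed_fields
    for struct_count, fields_per_struct in repeated:
        total += struct_count * fields_per_struct
    return total


# Counts copied from the list above (illustrative selection).
gqg = datapoints(175_889_625, 4, [(1_136_908_105, 3)])
gdg = datapoints(1_700_968_506, 20, [(5_270_291_447, 4)])
grg = datapoints(1_220_945_927, 4, [(1_461_397_069, 2)])
# Events 2.0 is two flat tables with no repeated structures.
events = datapoints(563_735_007, 61) + datapoints(1_733_311_961, 16)

for name, value in [("GQG", gqg), ("GDG", gdg), ("GRG", grg), ("Events 2.0", events)]:
    print(f"{name}: {value:,} datapoints")
```

Each value printed matches the corresponding "Total" line in the list, which makes this a quick way to re-check a figure after refreshing the record counts.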

Thus, across just the datasets above, GDELT contains a total of 8.14 trillion datapoints!
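
The headline figure can be reproduced by adding up the "Total" lines from the list. This is a small Python sketch under the assumption that the per-dataset totals are exactly as printed above. The dictionary keys are shorthand labels chosen here for readability.

```python
# Per-dataset totals copied from the "Total" lines in the list.
totals = {
    "VGKG": 577_949_708_167,
    "GFG": 1_593_792_686_646,
    "GQG": 4_114_282_815,
    "Radio News NGrams": 807_676_126_138,
    "Television News NGrams": 107_480_340_758,
    "Web News NGrams": 289_823_911_736,
    "Web POS": 490_380_388_072,
    "AIF Captioning Time": 14_262_856_590,
    "AIF Video Time": 1_959_880_686,
    "Events 2.0": 62_120_826_803,
    "GDG": 55_100_535_908,
    "GGG": 31_658_119_808,
    "GRG": 7_806_577_846,
    "VGEG": 98_862_965_259,
    "GEG G1": 132_082_233_890,
    "GEG GCNLPAPI": 119_561_070_030,
    "GEG TV": 3_599_739_846,
    "GEMG": 70_055_432_163,
    "GKG 2.0": 3_674_479_943_064,
}

grand_total = sum(totals.values())
print(f"{grand_total / 1e12:.2f} trillion datapoints")  # 8.14 trillion datapoints

# Largest contributors first, as a share of the grand total.
for name, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{name}: {value / grand_total:.1%}")
```

Sorting by size shows that the GKG 2.0, GFG and Radio News NGrams datasets account for most of the total.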