Constructing Trillion-Cooccurrence Graphs In BigQuery With A Single SQL Query

Earlier today we demonstrated the construction of a semantic graph representing 24.5 billion entity co-occurrences from the 11 billion entity annotations created by Google's Cloud Natural Language API on 103 million English-language articles. In that case, we limited the graph to only those entities for which the Natural Language API returned a non-null MID. This is a relatively small subset of the total entity population, but it reflects the most semantically meaningful entities.

This raised the question of just how large the co-occurrence graph would be if built from the complete 11-billion-entity annotation dataset.

The answer: running the same SQL query without the "where entities.mid is not null" filter yields a graph encoding 1,070,854,203,247 (1.1 trillion) total co-occurrences. BigQuery took 21,755.5 seconds (just over 6 hours) and 135GB to construct it.
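To make the effect of that filter concrete, here is a toy sketch of the self-join pattern behind an entity co-occurrence graph. It is not the production query, which is not reproduced in this post. It runs in SQLite through Python's standard library rather than in BigQuery, and the table name `annotations`, its columns, the sample rows and the placeholder MID strings are all assumptions made for illustration. It also assumes one row per entity per article.

```python
import sqlite3

# Hypothetical toy data: (article URL, entity name, MID or None).
# The MID strings are placeholders, not real identifiers.
TOY_ROWS = [
    ("a1", "Paris", "/m/toy1"),
    ("a1", "France", "/m/toy2"),
    ("a1", "mayor", None),      # entity with no MID
    ("a2", "Paris", "/m/toy1"),
    ("a2", "France", "/m/toy2"),
]

def cooccurrence_counts(rows, require_mid):
    """Return (entity_a, entity_b, count) for entities sharing an article."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotations (url TEXT, name TEXT, mid TEXT)")
    conn.executemany("INSERT INTO annotations VALUES (?, ?, ?)", rows)
    # Dropping this clause is the toy analogue of removing the
    # "where entities.mid is not null" filter.
    mid_filter = (
        "AND a.mid IS NOT NULL AND b.mid IS NOT NULL" if require_mid else ""
    )
    query = f"""
        SELECT a.name, b.name, COUNT(*) AS cnt
        FROM annotations AS a
        JOIN annotations AS b
          ON a.url = b.url        -- same article
         AND a.name < b.name      -- count each unordered pair once
        WHERE 1 = 1 {mid_filter}
        GROUP BY a.name, b.name
        ORDER BY a.name, b.name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

with_mid = cooccurrence_counts(TOY_ROWS, require_mid=True)
without_mid = cooccurrence_counts(TOY_ROWS, require_mid=False)
print(sum(cnt for _, _, cnt in with_mid), "co-occurrences with the MID filter")
print(sum(cnt for _, _, cnt in without_mid), "co-occurrences without it")
```

Even in this tiny example, a single entity without a MID adds two new edges once the filter is removed. Because the number of pairs in an article grows roughly with the square of the number of entities it mentions, the same effect at full scale is what moves the graph from billions to trillions of co-occurrences.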

With just over 6 hours of compute time, BigQuery was able to pair up 11 billion entities across 103 million articles into a graph encoding 1.1 trillion co-occurrences, all with a single SQL query.
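For a sense of scale, the figures quoted above can be turned into an average throughput and a growth factor with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the numbers in this post. The variable names are illustrative, and the per-second rate is a simple average over the elapsed time, not a measured BigQuery metric.

```python
# Figures quoted in this post.
total_cooccurrences = 1_070_854_203_247   # full graph, no MID filter
elapsed_seconds = 21_755.5                # reported query time
mid_only_cooccurrences = 24.5e9           # earlier MID-only graph

hours = elapsed_seconds / 3600
rate_per_second = total_cooccurrences / elapsed_seconds
growth_factor = total_cooccurrences / mid_only_cooccurrences

print(f"Elapsed time: {hours:.2f} hours")
print(f"Average rate: {rate_per_second / 1e6:.1f} million co-occurrences per second")
print(f"Growth over the MID-only graph: {growth_factor:.1f}x")
```

By this estimate the query averaged on the order of 49 million co-occurrences per second, and the unfiltered graph is roughly 44 times larger than the MID-only one.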

That BigQuery can today construct and analyze a 1.1-trillion-co-occurrence graph in just over six hours with a single SQL query stands as a testament to the power of the modern cloud.

The GDELT Project
