Yesterday we released a massive new dataset comprising the 1.78-billion-URL outlink graph computed over GDELT's Global Knowledge Graph from 2016 to 2018. Even though this new dataset is collapsed to the domain level, it is still nearly 1 GB in size and contains more than 30 million rows.
A simple application of this new dataset is to compile a list of the top news outlets worldwide that link to a given outlet in their own coverage. In other words, what are the top 30 news outlets around the world that link the most to CNN's coverage in their own articles?
Today we are releasing just such a summary dataset. It was built by processing the full 30 million record outlink graph: for each news outlet to which GDELT saw inlinks on at least 30 different days (a threshold chosen to ensure sufficient data), it lists the top 30 other news outlets that linked most often to that outlet's coverage from 2016 to 2018.
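To make the selection rule concrete, here is a minimal sketch of how such a summary could be computed. It is not the code used to build the release. The input layout (one tuple of day, linking domain, linked-to domain, and link count per record) is an assumption made for illustration and may not match the columns of the actual outlink graph, and the function name `top_inlinkers` is hypothetical.

```python
from collections import defaultdict


def top_inlinkers(edges, min_days=30, top_n=30):
    """Return {outlet: [top linking outlets]} for well-covered outlets.

    `edges` is an iterable of (day, source_domain, target_domain, link_count)
    tuples. This record layout is a hypothetical one chosen for the sketch.
    """
    days_seen = defaultdict(set)                          # target -> days with an inlink
    link_totals = defaultdict(lambda: defaultdict(int))   # target -> source -> total links

    for day, source, target, count in edges:
        if source == target:
            continue  # only *other* outlets count as inlinkers
        days_seen[target].add(day)
        link_totals[target][source] += count

    summary = {}
    for target, sources in link_totals.items():
        if len(days_seen[target]) < min_days:
            continue  # too few days of inlinks to be included
        # Rank by total links (descending), breaking ties alphabetically.
        ranked = sorted(sources.items(), key=lambda kv: (-kv[1], kv[0]))
        summary[target] = [source for source, _ in ranked[:top_n]]
    return summary
```

The alphabetical tie-break is a choice made here only so the output is deterministic; the release does not say how ties were handled.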
Each row represents a single news outlet: the first column is the outlet, and the second column contains a comma-separated list of the top 30 news outlets that linked most heavily to it from 2016 to 2018. The TSV version separates the two columns with a tab (for importing into applications where you wish to keep the "top 30" domain list as a single column), while the CSV version uses a comma instead, making it easier to import the outlet and each of the top 30 entries into their own columns in Excel and similar tools.
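As a rough illustration of the two layouts, the sketch below reads either version into a dictionary mapping each outlet to its list of linking outlets. The function names are hypothetical, the sample domains are invented placeholders rather than rows from the dataset, and the code assumes the files have no header row, which the description above does not state either way.

```python
import csv
import io


def read_tsv(handle):
    """Parse the TSV layout: outlet, a tab, then a comma-separated list."""
    result = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        outlet, _, linkers = line.partition("\t")
        result[outlet] = [domain for domain in linkers.split(",") if domain]
    return result


def read_csv(handle):
    """Parse the CSV layout: outlet first, then one linking outlet per column."""
    result = {}
    for row in csv.reader(handle):
        if row:
            result[row[0]] = [domain for domain in row[1:] if domain]
    return result


# Invented sample rows, for illustration only.
tsv_sample = io.StringIO("news.example\tone.example,two.example\n")
csv_sample = io.StringIO("news.example,one.example,two.example\n")

# Both layouts yield the same mapping.
assert read_tsv(tsv_sample) == read_csv(csv_sample)
```

To read a downloaded file, pass an open file object in place of the `io.StringIO` samples, for example `open(path, newline="")` for the CSV version.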
Note that because this dataset is limited to news outlets that other news outlets linked to on at least 30 different days, it excludes many of the smaller sites present in the full 30 million record dataset. That restriction makes it better suited to assessing inlink structure as a measure of context.
By releasing this simple summarized version, we hope to make it easier to work with this massive new dataset and to create exciting new applications with it!