The GDELT Project

Which SSL Certificate Providers Do News Outlets Use? A Global Inventory.

As online news outlets around the world have increasingly made the transition to HTTPS, which SSL certificate providers do they use? To explore this question in more detail, we looked at news outlets in the 65 languages represented in the GKG 2.0 from 2015 to present from which we have monitored at least 50 articles. This threshold eliminates the smallest outlets and leaves us with around 60,000 outlets. The actual number of outlets monitored by GDELT across the full 152 languages it covers is much larger, but focusing on just the GKG's 65 languages still captures a majority cross-section of global news. Note that this list includes not just traditional journalism outlets, but also government agencies that produce news and press release content, citizen journalism outlets and other sites that produce content a nation's citizenry may turn to as "news."

Of the final 52,615 domains for which we were able to retrieve SSL certificate information using the workflow below, 20,563 (39%) use Let's Encrypt, showing its widespread adoption in the media industry. Cloudflare is also prominent, accounting for 10,758 (20%) of domains. It is particularly fascinating to look globally at the regional and local issuers preferred in various parts of the world and at whether outlets in a given country prefer local issuers to global ones.

Using BigQuery's NET.REG_DOMAIN() function, it is trivial to extract the registrable domain for each article URL and collapse the results into a list of unique domains:

SELECT NET.REG_DOMAIN(DocumentIdentifier) domain, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
GROUP BY domain
HAVING cnt > 50

We download the results as a JSON file, parse it into a list of domains, and then randomize the ordering to reduce the likelihood of crawling multiple subdomains at the same time:

apt-get -y install jq
jq -r '.domain' ./BQDOWNLOADFILE.json > DOMAINS.TXT.TMP
shuf DOMAINS.TXT.TMP > DOMAINS.TXT
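
Depending on the export, the domain list may contain a few entries that are not worth crawling. A row whose domain is NULL would be printed by jq -r as the literal string "null", and mixed-case variants of the same domain could appear as separate rows. The following is a minimal sketch of an optional cleanup pass that could run before shuffling; the function name clean_domains is hypothetical and not part of the original workflow.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original workflow): reads one domain
# per line on stdin, lowercases it, drops empty lines and the literal "null"
# that jq -r prints for missing values, and removes duplicates.
clean_domains() {
  tr 'A-Z' 'a-z' | grep -v -x -e 'null' -e '' | sort -u
}
```

Under these assumptions, the shuffle step would become: clean_domains < DOMAINS.TXT.TMP | shuf > DOMAINS.TXT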

Then, using the workflow outlined in this ServerFault post, we use openssl to fetch the SSL certificate information for each domain:

mkdir CACHE
time cat DOMAINS.TXT | parallel --eta -j 120 'openssl s_client -connect {}:443 -servername {} </dev/null 2>/dev/null | openssl x509 -inform pem -text > ./CACHE/{}.sslinfo'
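
Because the output redirect creates each .sslinfo file before openssl x509 writes anything, a domain whose connection failed should be left with a zero-byte file in CACHE. A minimal sketch for listing those domains, for example to retry them or to exclude them from the final counts, might look like the following; the function name failed_domains is an assumption introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the domains whose cached certificate file is
# empty, meaning the fetch produced no certificate text.
# $1 = cache directory containing <domain>.sslinfo files
failed_domains() {
  find "$1" -type f -name '*.sslinfo' -size 0 |
    sed -e 's|.*/||' -e 's|\.sslinfo$||' |
    sort
}
```

For example, failed_domains ./CACHE > RETRY.TXT would write the list of domains to retry, where RETRY.TXT is an arbitrary file name.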

While the resulting files contain a wealth of information, we've extracted the following fields (if a field is present multiple times, only the first value is extracted):

The results have been compiled into a tab-delimited file, one domain per row, with the fields above as the columns:
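
The "first value only" extraction rule described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script actually used: the function name first_field is hypothetical, and "Issuer" is used only as an example label that appears in the openssl x509 -text output, not as a statement of which fields were extracted.

```shell
#!/bin/sh
# Hypothetical helper: read `openssl x509 -text` output on stdin and print
# the value of the first line that starts with "<label>:", ignoring leading
# indentation. Later occurrences of the same label are skipped.
# $1 = field label, e.g. "Issuer"
first_field() {
  awk -v label="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)          # strip indentation
    }
    index(line, label ":") == 1 {
      value = substr(line, length(label) + 2)
      sub(/^[ \t]+/, "", value)         # strip space after the colon
      print value
      exit                              # keep only the first occurrence
    }'
}
```

A tab-delimited row for one domain could then be produced with something like printf '%s\t%s\n' "$domain" "$(first_field Issuer < "./CACHE/$domain.sslinfo")", where $domain is a shell variable holding the domain name.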