As the global online news landscape has increasingly transitioned to HTTPS, which SSL certificate providers do news outlets use? To explore this question in more detail, we looked at news outlets in the 65 languages represented in the GKG 2.0 from 2015 to present from which we have monitored at least 50 articles. Using a threshold of 50 articles eliminates the smallest outlets and leaves us with around 60,000 outlets. The actual number of outlets monitored by GDELT across the full 152 languages it monitors is much larger, but focusing on just the GKG's 65 languages captures a majority cross-section of global news. Note that this list includes not just traditional journalism outlets, but also government agencies that produce news and press release content, citizen journalism outlets and other sites that produce content a nation's citizenry may turn to as "news."
Of the final 52,615 domains for which we were able to retrieve SSL certificate information using the workflow below, 20,563 (39%) use Let's Encrypt, showing its widespread adoption in the media industry. Cloudflare is also prominent, accounting for 10,758 entries (20%). It is particularly fascinating to look globally at the regional and local issuers preferred in various parts of the world and whether outlets in a given country prefer local issuers to global ones.
Using BigQuery's NET.REG_DOMAIN() function, it is trivial to extract the registerable domain for each article URL and collapse into a list of unique domains:
SELECT NET.REG_DOMAIN(DocumentIdentifier) domain, COUNT(1) cnt
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
GROUP BY domain
HAVING cnt > 50
We download this as a JSON file and parse into a list of domains, then randomize the ordering to reduce the likelihood of crawling multiple subdomains at the same time:
apt-get -y install jq
jq -r '.domain' ./BQDOWNLOADFILE.json > DOMAINS.TXT.TMP
shuf DOMAINS.TXT.TMP > DOMAINS.TXT
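For readers without jq or shuf at hand, the same step can be roughly sketched in Python. This is only an illustration, not the workflow used here: it assumes the downloaded file holds one JSON object per line with a "domain" key (which is what the jq filter above implies), and the sample records and counts are made up.

```python
import json
import random

def load_domains(ndjson_text):
    """Return the "domain" value from each non-empty JSON line."""
    domains = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if line:
            domains.append(json.loads(line)["domain"])
    return domains

# Hypothetical sample records standing in for the downloaded file.
sample = '{"domain": "cnn.com", "cnt": "1200"}\n{"domain": "bbc.co.uk", "cnt": "900"}\n'

domains = load_domains(sample)
random.shuffle(domains)  # randomize crawl order, as shuf does above
print("\n".join(domains))
```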
Then, using the workflow outlined in this ServerFault post, we use openssl to fetch the SSL certificate information for each domain:
mkdir CACHE
time cat DOMAINS.TXT | parallel --eta -j 120 'openssl s_client -connect {}:443 -servername {} </dev/null 2>/dev/null | openssl x509 -inform pem -text > ./CACHE/{}.sslinfo'
While the resulting files contain a wealth of information, we've extracted the following fields (if a field is present multiple times, only the first value is extracted):
- Domain. The registerable domain as returned by BigQuery's NET.REG_DOMAIN().
- Start Date. The contents of the "Not Before" field, giving the human-formatted start date on which the certificate becomes valid. The value for cnn.com at present is "Apr 20 19:10:07 2021 GMT".
- End Date. The contents of the "Not After" field, giving the human-formatted end date on which the certificate ceases to be valid. The value for cnn.com at present is "May 22 19:10:06 2022 GMT".
- Start Date Unix. The Start Date converted into a Unix timestamp.
- End Date Unix. The End Date converted into a Unix timestamp.
- Duration In Seconds. The total seconds of duration in which the certificate is valid (End Date Unix – Start Date Unix). Divide this by 86,400 to get the estimated number of days the certificate is valid. This can be used to bin certificates by lifespan, from the 3-month validity typically used by issuers like Let's Encrypt to multi-year validity.
- Issuer. The contents of the "Issuer:" field. The value for cnn.com at present is "C = BE, O = GlobalSign nv-sa, CN = GlobalSign Atlas R3 DV TLS CA 2020".
- Subject. The contents of the "Subject:" field. The value for cnn.com at present is "CN = *.api.cnn.com".
- DNS. The contents of the "DNS:" field. The value for cnn.com at present is "DNS:*.api.cnn.com, DNS:*.api.cnn.io, DNS:*.api.electiontracker.cnn.com, DNS:*.api.platform.cnn.com, DNS:*.api.warnermedialabs.com, DNS:*.arabic.cnn.com, DNS:*.artemis.turner.com, DNS:*.beta.next.cnn.com, DNS:*.blogs.cnn.com, DNS:*.client.appletv.cnn.com, DNS:*.cnn.com, DNS:*.cnn.io, DNS:*.cnnarabic.com, DNS:*.cnnlabs.com, DNS:*.cnnmoney.com, DNS:*.cnnmoneystream.com, DNS:*.cnnpolitics.com, DNS:*.config.outturner.com, DNS:*.corporatemobile.outturner.com, DNS:*.cronkite.cnn.com, DNS:*.data.api.cnn.io, DNS:*.edition.cnn.com, DNS:*.edition.i.cdn.cnn.com, DNS:*.edition.stage.next.cnn.com, DNS:*.edition.stage2.next.cnn.com, DNS:*.edition.stage3.next.cnn.com, DNS:*.elections.cnn.com, DNS:*.electiontracker.cnn.com, DNS:*.go.cnn.com, DNS:*.greatbig.com, DNS:*.greatbigstory.com, DNS:*.greatbigstory.se, DNS:*.i.cdn.cnn.com, DNS:*.markets.money.cnn.io, DNS:*.money.cnn.com, DNS:*.moneystream.cnn.com, DNS:*.next.cnn.com, DNS:*.odm.platform.cnn.com, DNS:*.outturner.com, DNS:*.platform.cnn.com, DNS:*.section-content.money.cnn.com, DNS:*.stage.next.cnn.com, DNS:*.stage2.next.cnn.com, DNS:*.stage3.next.cnn.com, DNS:*.stellar.cnn.com, DNS:*.terra.next.cnn.com, DNS:*.travel.cnn.com, DNS:*.warnermedialabs.com, DNS:*.www.i.cdn.cnn.com, DNS:api.electiontracker.cnn.com, DNS:api.etp.cnn.com, DNS:api.platform.cnn.com, DNS:app.cnn.io, DNS:arabic.cnn.com, DNS:client.appletv.cnn.com, DNS:cnn.com, DNS:cnn.io, DNS:cnnarabic.com, DNS:cnnlabs.com, DNS:cnnmoney.com, DNS:cnnpolitics.com, DNS:compositor.api.cnn.com, DNS:cronkite.cnn.com, DNS:dcfandome.com, DNS:dev.client.appletv.cnn.com, DNS:dev.hypatia.api.cnn.io, DNS:dev.money.cnn.com, DNS:edition-m.cnn.com, DNS:eightiesyourself.cnn.com, DNS:graphql.verticals.api.cnn.io, DNS:hypatia.api.cnn.io, DNS:i.cdn.travel.cnn.com, DNS:lite.cnn.com, DNS:markets.money.cnn.io, DNS:money.cnn.com, DNS:preview.dev.money.cnn.com, DNS:preview.money.cnn.com, DNS:preview.qa.money.cnn.com, DNS:preview.ref.money.cnn.com, DNS:preview.train.money.cnn.com, DNS:preview2.ref.money.cnn.com, DNS:qa.money.cnn.com, DNS:ref.hypatia.api.cnn.io, DNS:ref.money.cnn.com, DNS:ref2.money.cnn.com, DNS:stage.edition-m.cnn.com, DNS:stage.money.cnn.com, DNS:stage.us-m.cnn.com, DNS:stage.www-m.cnn.com, DNS:train.money.cnn.com, DNS:underscored.com, DNS:us-m.cnn.com, DNS:www-m.cnn.com".
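To make the field definitions above concrete, here is a minimal Python sketch of how they could be pulled out of one of the saved openssl text dumps. It is not the extraction code used for this analysis: the function names, column names and regular expressions are our own assumptions, based on the labels that `openssl x509 -text` typically prints, and the sample text is an abbreviated stand-in built from the cnn.com values quoted above.

```python
import re
from calendar import timegm
from datetime import datetime

# Labels as typically printed by `openssl x509 -text`; only the first match is kept.
PATTERNS = {
    "start_date": re.compile(r"^[ \t]*Not Before[ \t]*:[ \t]*(.+?)[ \t]*$", re.M),
    "end_date": re.compile(r"^[ \t]*Not After[ \t]*:[ \t]*(.+?)[ \t]*$", re.M),
    "issuer": re.compile(r"^[ \t]*Issuer:[ \t]*(.+?)[ \t]*$", re.M),
    "subject": re.compile(r"^[ \t]*Subject:[ \t]*(.+?)[ \t]*$", re.M),
    "dns": re.compile(r"^[ \t]*(DNS:.+?)[ \t]*$", re.M),
}
# Hypothetical column names, in the order of the field list above.
COLUMNS = ["domain", "start_date", "end_date", "start_unix", "end_unix",
           "duration_seconds", "issuer", "subject", "dns"]

def to_unix(stamp):
    """Convert an openssl date such as 'Apr 20 19:10:07 2021 GMT' to Unix time."""
    return timegm(datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT").timetuple())

def parse_sslinfo(domain, text):
    """Build one output row from the text dump of a single certificate."""
    row = {name: "" for name in COLUMNS}
    row["domain"] = domain
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            row[name] = match.group(1)
    if row["start_date"] and row["end_date"]:
        row["start_unix"] = to_unix(row["start_date"])
        row["end_unix"] = to_unix(row["end_date"])
        row["duration_seconds"] = row["end_unix"] - row["start_unix"]
    return row

# Abbreviated sample using the cnn.com values quoted above.
SAMPLE = """\
        Issuer: C = BE, O = GlobalSign nv-sa, CN = GlobalSign Atlas R3 DV TLS CA 2020
        Validity
            Not Before: Apr 20 19:10:07 2021 GMT
            Not After : May 22 19:10:06 2022 GMT
        Subject: CN = *.api.cnn.com
        Subject Public Key Info:
            X509v3 Subject Alternative Name:
                DNS:*.api.cnn.com, DNS:*.api.cnn.io
"""

row = parse_sslinfo("cnn.com", SAMPLE)
print("\t".join(str(row[name]) for name in COLUMNS))
print(round(row["duration_seconds"] / 86400, 1), "days")
```

For the cnn.com certificate the duration works out to one second short of 397 days. Domains whose certificate could not be fetched would simply produce rows with empty fields under this sketch.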
The results have been compiled into a tab-delimited file, one domain per row, with the fields above as the columns: