New GKG 2.0 Article Metadata Fields

Kalev Leetaru

9 years ago

With the debut this week of our massive new document extraction platform, we are tremendously excited to unveil several new metadata fields added to the Global Knowledge Graph (GKG 2.0). As of yesterday, all GKG 2.0 records include several new extracted fields in their "XMLExtras" field (the last column of each GKG record). Note that because these new metadata fields are extracted rather than computed, not every document will include every field. We believe these new metadata fields will open up a number of new analytic and interface capabilities and look forward to what you are able to accomplish with them!

Article Hyperlinks. Increasingly, modern online news articles include hyperlinks in the body of the article to related news articles and other online content that provides context and background to the piece. GDELT now captures these links inside a "<PAGE_LINKS>…</PAGE_LINKS>" block, with links separated by semicolons. As this feature is brand new, we are still fine-tuning the algorithms that determine the relevancy of each link to the body text, so that only links in the core body text itself are recorded. Approximately 55% of articles contain extracted links.
Article Authors. GDELT now attempts to determine the author of each news article, which it is currently able to extract for 53% of articles. Multiple authors are separated by semicolons. Note that author names in non-ASCII character sets are HTML UTF8 escaped. We are constantly expanding and fine-tuning these algorithms, so the percentage of articles with extracted authors will continue to grow over time. Note, however, that many news outlets do not list an author, or list only a corporate author (the news outlet itself), due to tradition, policy or journalistic norms in a given region. Authors are stored in a "<PAGE_AUTHORS>…</PAGE_AUTHORS>" block.
Article Precise Publication Timestamp. GDELT operates on a 15 minute heartbeat and monitors the majority of news articles shortly after they are published. However, many factors can delay the moment GDELT first sees an article, from minutes up to several hours in very rare circumstances. For those needing publication timestamps precise down to the second, GDELT now attempts to determine the high-precision publication timestamp of each article. As of this writing, approximately 33% of articles provide sufficient information to determine a high resolution publication timestamp. This is stored as "<PAGE_PRECISEPUBTIMESTAMP>YYYYMMDDHHMMSS</PAGE_PRECISEPUBTIMESTAMP>" in year-month-day-hour-minute-second format. Timestamps are automatically converted from the local time zone of the news outlet to UTC. Note that some news outlets may report high precision timestamps that claim publication at midnight, or at other times such that elements of the HHMMSS portion of the date are all zeros. We are currently working on additional heuristics and algorithmic refinements to detect these relatively rare boundary cases. When both a high resolution publication timestamp and a modification timestamp are provided, we use the publication timestamp.
Article Accelerated Mobile Pages (AMP) URL. GDELT now automatically scans each article for an alternative Accelerated Mobile Pages (AMP) version of the article. AMP is an open technology developed by Google to provide fast, responsive, mobile-optimized web content. Applications supporting mobile users may wish to forward their users to this URL instead of the main article URL when connecting users to a given article. As of this writing, 36% of all articles offer an AMP version. The AMP URL is stored as "<PAGE_ALTURL_AMP>…</PAGE_ALTURL_AMP>".
Mobile Optimized URL. In addition to scanning each article for alternative AMP versions, GDELT also scans each page for a mobile optimized version of any kind. Some AMP pages may be reported in this field, but in general the majority of URLs recorded in this field are non-AMP mobile-optimized pages using a range of mobile optimization techniques and technologies. Applications using GDELT data to support mobile use cases should first check for an AMP URL and then fall back to the contents of this field if no AMP URL is available. As of this writing, 8% of all articles offer a mobile version and when both AMP and mobile URLs are considered, just under 40% of all articles include one or the other. The Mobile Optimized URL is stored as "<PAGE_ALTURL_MOBILE>…</PAGE_ALTURL_MOBILE>".
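For those consuming the raw GKG files rather than querying through BigQuery, the blocks described above could be pulled out of the Extras column along the following lines. This is only a minimal Python sketch under stated assumptions: the function names and dictionary keys are hypothetical, each block is assumed to appear at most once per record, and author escaping is assumed to use numeric HTML character references (such as "&#233;"), which is why authors are split with an entity-aware pattern rather than a plain split on semicolons. The "best_mobile_url" key follows the AMP-first, mobile-second fallback recommended above.

```python
import html
import re
from datetime import datetime, timezone


def _block(extras, tag):
    """Return the text inside <tag>...</tag>, or None if the block is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", extras, re.DOTALL)
    return match.group(1).strip() if match else None


def _split_links(value):
    """Split a semicolon-delimited link block into a list of URLs."""
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]


def _split_authors(value):
    """Split authors on semicolons without breaking numeric HTML references.

    Assumption: escaped names look like 'Jos&#233;', so a semicolon that
    terminates '&#NNN;' or '&#xHH;' is not treated as a separator.
    """
    if not value:
        return []
    parts = re.findall(r"(?:&#[0-9]+;|&#[xX][0-9a-fA-F]+;|[^;])+", value)
    return [html.unescape(part).strip() for part in parts if part.strip()]


def _parse_timestamp(value):
    """Parse YYYYMMDDHHMMSS (already UTC per the field description)."""
    if not value:
        return None
    try:
        parsed = datetime.strptime(value, "%Y%m%d%H%M%S")
    except ValueError:
        return None  # malformed value: treat as missing
    return parsed.replace(tzinfo=timezone.utc)


def parse_extras(extras):
    """Collect the new metadata fields from one GKG Extras string."""
    amp_url = _block(extras, "PAGE_ALTURL_AMP")
    mobile_url = _block(extras, "PAGE_ALTURL_MOBILE")
    return {
        "links": _split_links(_block(extras, "PAGE_LINKS")),
        "authors": _split_authors(_block(extras, "PAGE_AUTHORS")),
        "published": _parse_timestamp(_block(extras, "PAGE_PRECISEPUBTIMESTAMP")),
        "amp_url": amp_url,
        "mobile_url": mobile_url,
        # Prefer AMP, then fall back to the generic mobile URL.
        "best_mobile_url": amp_url or mobile_url,
    }
```

Because the fields are extracted rather than computed, every key should be treated as optional: missing blocks come back as an empty list or None rather than raising an error.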

EXTRACTING FIELDS USING BIGQUERY

To extract one of these new fields using BigQuery, use REGEXP_EXTRACT. For example, to retrieve the AMP URL for articles with an AMP version available, use the following Legacy SQL query:

SELECT REGEXP_EXTRACT(Extras, r'<PAGE_ALTURL_AMP>(.*?)</PAGE_ALTURL_AMP>') FROM [gdeltv2.gkg] WHERE Extras LIKE '%<PAGE_ALTURL_AMP>%' ORDER BY DATE DESC

Using Standard SQL, the query for links would be:

SELECT DATE, DocumentIdentifier, REGEXP_EXTRACT(Extras, r'<PAGE_LINKS>(.*?)</PAGE_LINKS>') links FROM `gdelt-bq.gdeltv2.gkg_partitioned` WHERE _PARTITIONTIME >= "2019-09-08 00:00:00" AND _PARTITIONTIME < "2019-09-09 00:00:00" AND Extras LIKE '%<PAGE_LINKS>%' LIMIT 10000
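Since the same Standard SQL pattern applies to any of the new blocks, the query text could be generated rather than hand-edited. The sketch below is only an illustration: build_extras_query and its parameters are hypothetical names, the "value" column alias is an arbitrary choice, and the table name and one-day partition window are copied from the example query above. The tag is validated against a strict pattern because it is interpolated directly into the SQL text.

```python
import re
from datetime import date, timedelta


def build_extras_query(tag, day, limit=10000):
    """Build a Standard SQL query extracting one Extras block for one day.

    tag   -- block name such as "PAGE_LINKS" or "PAGE_AUTHORS"
    day   -- datetime.date selecting a single daily partition
    limit -- maximum number of rows to return
    """
    # Only uppercase letters and underscores are allowed, since the tag is
    # pasted into the query string as-is.
    if not re.fullmatch(r"[A-Z_]+", tag):
        raise ValueError(f"unexpected tag name: {tag!r}")
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return (
        "SELECT DATE, DocumentIdentifier, "
        f"REGEXP_EXTRACT(Extras, r'<{tag}>(.*?)</{tag}>') AS value "
        "FROM `gdelt-bq.gdeltv2.gkg_partitioned` "
        f'WHERE _PARTITIONTIME >= "{start} 00:00:00" '
        f'AND _PARTITIONTIME < "{end} 00:00:00" '
        f"AND Extras LIKE '%<{tag}>%' "
        f"LIMIT {int(limit)}"
    )


# Example: the same one-day window as the links query above, for authors.
query = build_extras_query("PAGE_AUTHORS", date(2019, 9, 8))
```

The resulting string can be pasted into the BigQuery console or handed to whichever client library you already use.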