Today we announce the alpha release of a powerful new tool for understanding both global events and the narratives that shape how we understand them. The new GDELT 3.0 Global Frontpage Graph (GFG) is a prototype experiment to explore how we can better understand which of the myriad news stories each day are considered the most "important" by the global media ecosystem.
One of the most basic measures of how a given news outlet perceives the "importance" of a story is the positioning of that story on its website. Those stories afforded precious space on the frontpage of a news outlet's website are those it considers to be the most important at the moment. A story's presence on the frontpage, its position there, how long it remains, and how its position changes over time are all indicators of how the outlet's editors view the story. Thus, a news outlet might publish a steady stream of articles about Syria, but if none of those are featured on its homepage, that suggests it views them as less important than its other coverage. Similarly, out of all of the coverage of Syria that an outlet publishes today, which (if any) of those articles are featured on the frontpage? Are the frontpage selections fundamentally "different" from the rest of the outlet's Syria coverage, perhaps emphasizing a particular framing or emotional tenor?
Of course, frontpage placement is not the only indicator of how an outlet perceives a given story, but it offers a powerful and globally consistent filter to help surface the stories that each outlet believes are the most important at that moment to its readership. By aggregating these signals geographically, topically, and so on, we can begin to gain a coarse understanding of the priorities of different clusters beyond simply their overall publication volume, especially in the online world in which outlets can only prioritize a small fraction of the totality of their daily output. In short, while online news outlets can publish an unlimited volume of coverage each day, the limited space of the frontpage enforces the kind of editorial selection and displacement that makes broadcast media so valuable as a barometer of media attention and agenda setting.
To make it possible to incorporate this basic "importance" metric into analyses, we are launching today the alpha release of the new GDELT 3.0 Global Frontpage Graph (GFG). Every hour we scan a list of around 50,000 news website homepages from across the world and compile a list of all HTTP/HTTPS hyperlinks contained within and the order in which they appeared in the HTML of the page (we exclude all other kinds of links like email, telephone, WhatsApp, Telegram, etc). This is compiled into a single master hourly tab-delimited file that essentially catalogs all of the hyperlinks appearing on the world's news outlet homepages. All 50,000 sites are rescanned every hour, meaning you can trace at hourly resolution how stories spread through the global media ecosystem, where on the frontpage they debuted (from above the fold to buried at the bottom), how their position on the frontpage changed through the course of the day and when they finally disappeared from the frontpage.
Thus, the final format of the GFG hourly file is a giant gzip'd tab-delimited file in UTF-8, one row per hyperlink found on a homepage (typically around 10M+ links per hour), with six columns: DATE, FromFrontPageURL, LinkID, LinkPercentMaxID, ToLinkURL, and LinkText. For each homepage, the extracted links are written to the GFG file in the order they appear on the page, and LinkID records that order (this makes it easier to perform analyses and assess just how far down or up a page a given link moved from the previous hour(s)). The fields in more detail:
- DATE. This is the date of the snapshot in YYYYMMDDHHMMSS format. This field is the same for all entries in a given file and matches the filename of the file. It is included in the data to make it easy to load the hourly files directly into a database system.
- FromFrontPageURL. This is the URL of the homepage on which the link was found.
- LinkID. All links in a homepage are numbered sequentially starting from 0 in the order they appear on the page, making it easy to examine links by their position on the homepage. A news site may move a given link around on its homepage over time, so its LinkID may change between each snapshot.
- LinkPercentMaxID. This is simply the LinkID of the current link divided by the max LinkID for this homepage and multiplied by 100, allowing you to compare rough relative link positioning across sites. In other words, knowing that a given link has LinkID 100 is not enough to compare its position on two different sites, since one site might only have 100 links (meaning it is the last link on the site), while the other site might have 1000 links, meaning it is in the top 10% of the page. We recommend using this field when comparing across sites.
- ToLinkURL. The URL extracted from the homepage. Non-ASCII URLs that are already encoded in the original document (such as Punycode and percent-escaping) are preserved as-is, otherwise they are automatically escaped.
- LinkText. Up to the first 100 characters of the link text (links longer than this are truncated and "…" is appended to the end). Non-ASCII characters across all characterset encodings are transcoded to UTF-8.
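Putting the field list above together, reading an hourly file is straightforward. The sketch below assumes the file is plain tab-delimited UTF-8 with no quoting, exactly as described above; the column names are taken from the field list.

```python
import csv
import gzip

# Column layout of each GFG hourly file, per the field list above.
GFG_COLUMNS = ["DATE", "FromFrontPageURL", "LinkID",
               "LinkPercentMaxID", "ToLinkURL", "LinkText"]

def read_gfg(path):
    """Yield one dict per hyperlink row in a gzip'd GFG hourly file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            rec = dict(zip(GFG_COLUMNS, row))
            # LinkID and LinkPercentMaxID are numeric positional fields.
            rec["LinkID"] = int(rec["LinkID"])
            rec["LinkPercentMaxID"] = float(rec["LinkPercentMaxID"])
            yield rec
```

Since each hourly file can exceed 10M rows, the reader yields rows lazily rather than loading the whole file into memory.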
For this alpha release, we include every single HTTP/HTTPS link found in an <A HREF> tag in the HTML of the page and record it in the order it appears in the HTML. This initial set of around 50,000 websites was provided by a tremendous number of different organizations and researchers from across the world and is currently separate from the main GDELT monitoring catalog (though we will be integrating them in the release of GDELT 3.0 this spring). Thus, not all websites on this list are monitored by GDELT at the moment and not all sites monitored by GDELT are on this list. Given the immense and growing interest in understanding how narratives are both unifying and dividing societies throughout the world, we include a wide range of outlets: traditional mainstream general news sites, topical and specialty outlets, select governmental and NGO news rooms, high profile citizen media sites, and a growing collection of partisan, satirical, "fake news" and divisive outlets, as part of our efforts to help understand the broader contours of the narratives that are shaping our societies. The presence or absence of a given website on this list does not indicate any editorial statement regarding its stature or status as a "news" outlet. If there are sites that you would like added to this list, please email us; we're eager to rapidly grow this list to encompass the broadest possible contours of the global media ecosystem.
For technical users interested in the methodology, we use two different scanning methodologies in parallel. Through extensive testing we identified the majority of sites whose hyperlinks are largely expressed in their static HTML as delivered from the web server, rather than generated dynamically at runtime via JavaScript executing inside the browser. Most news websites today contain some amount of dynamic content generated via JavaScript, but that content typically emphasizes advertisements and other non-news-related sections.
For sites in which the majority of hyperlinks are contained in the static HTML, we fetch the HTML and extract all <A HREF> hyperlinks in order of their appearance in the HTML. Absolute hyperlinks are extracted as-is, while relative links are derelativized. Links that include non-HTTP/HTTPS protocols (mailto:, tel:, whatsapp:, tg:, etc) are excluded. Links are written to the GFG file in the order they appear in the HTML. Given the complex CSS styling used on some news websites, the order links appear in the HTML does not always perfectly match the order they appear on the page (in particular, headers/footers and insets may sometimes appear later or earlier in the HTML compared to where they appear on the page), but the ordering tends to match overall. Thus, links found earlier in the HTML tend to be those that appear towards the top of the frontpage, while those found towards the end of the HTML tend to be footers and other end-of-page links. Horizontal ordering is also typically reflected in the order links appear in the HTML, with sequences of links in the HTML appearing from top to bottom and left to right on the page. The same link may appear multiple times on a frontpage in different sections, which is often important information, so we do not deduplicate links – we output them as-is in the order they appear on the page. Some sites will double-link all stories by including a link in both the illustrative image for a story and a "read more" link beneath the image or headline, so when incorporating the number of times an article is linked from a homepage into your analyses, keep this in mind.
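The static-HTML extraction pipeline described above can be sketched as follows. This is a minimal illustration using Python's standard library, not our production code: it collects <A HREF> links in document order, derelativizes relative links against the homepage URL, filters out non-HTTP/HTTPS protocols, keeps duplicates, and computes the LinkID and LinkPercentMaxID fields.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class FrontpageLinkExtractor(HTMLParser):
    """Collect HTTP/HTTPS <a href> links in document order, without deduplication."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # derelativize relative links
        # Exclude mailto:, tel:, whatsapp:, tg:, and other non-web protocols.
        if urlparse(absolute).scheme in ("http", "https"):
            self.links.append(absolute)

def extract_links(base_url, html):
    """Return (LinkID, LinkPercentMaxID, ToLinkURL) tuples in page order."""
    parser = FrontpageLinkExtractor(base_url)
    parser.feed(html)
    max_id = max(len(parser.links) - 1, 1)  # guard against single-link pages
    return [(i, round(100 * i / max_id), url)
            for i, url in enumerate(parser.links)]
```

Note that a URL linked twice on the page (for instance, once from a story image and once from a "read more" link) appears twice in the output, matching the no-deduplication behavior described above.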
For the small handful of sites like CNN that incorporate such extensive dynamic content generation that the majority of the homepage is not accessible via the original HTML, we fully render the page using the latest version of Headless Chrome in desktop mode, using dynamic "adaptive behavioral scrolling" to mimic how a human scrolls through a page, including pausing to skim content sections as they dynamically load in and jumping through the page. Traditional Headless Chrome rendering typically misses the later content sections of such dynamic sites due to the way they implement their lazy loading triggers and thus the behavioral scrolling we use ensures the entire page loads. For infinite scrolling homepages we load either the portion of the page available in static HTML format or, for dynamic-only infinite scrolling sites, we load up to the first 16,000 pixels of height in Headless Chrome, though the final content section on the page may only be partially rendered and thus missing a few links. The final fully rendered HTML is then used to extract the <A HREF> links exactly as we do for static HTML pages.
The final GFG file is typically completed and ready for download around 30 minutes after each hour. The first available GFG file is http://data.gdeltproject.org/gdeltv3/gfg/alpha/20180302020000.LINKS.TXT.gz and the filename follows the format YYYYMMDDHH0000.LINKS.TXT.gz in the UTC timezone. For those wishing to download the GFG file hourly as it becomes available, check http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt at 30-40 minutes after each hour to be alerted when the latest file is available for download.
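Because the filenames follow a fixed UTC pattern, the download URL for any hour can be constructed directly. A minimal sketch (the base URL and filename pattern are taken from above; checking lastupdate.txt remains the recommended way to confirm a file is actually ready):

```python
from datetime import timezone

BASE = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/"

def gfg_url_for(dt):
    """URL of the GFG hourly file covering the hour containing dt.

    dt must be a timezone-aware datetime; filenames are keyed to UTC hours.
    """
    hour = dt.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return BASE + hour.strftime("%Y%m%d%H") + "0000.LINKS.TXT.gz"
```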
The full dataset is also available in Google BigQuery in a partitioned table, as gdelt-bq:gdeltv2.gfg_partitioned.
Note that in this alpha release we output the complete list of every URL found in each homepage. Future versions may filter this list to only record URLs that have been added, removed, or changed position on the page, so we're very interested in your feedback in just what kind of filtering or format is most useful to your work. Please get in touch and let us know of sites you want added, filtering that would be useful to you, etc, as we build upon this alpha release. Happy analyzing!
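As one hypothetical example of the kind of added/removed/moved filtering discussed above, the sketch below diffs two consecutive hourly snapshots of a single homepage. It is an illustration only, and it keys each link by URL, so a URL that appears multiple times on the page collapses to its last occurrence here.

```python
def diff_frontpage(prev_rows, curr_rows):
    """Compare two hourly snapshots of one homepage's links.

    Each row is a (LinkID, ToLinkURL) pair. Returns URLs that were added,
    removed, or changed position between the two snapshots. Duplicate URLs
    collapse to their last occurrence in this simplified sketch.
    """
    prev = {url: link_id for link_id, url in prev_rows}
    curr = {url: link_id for link_id, url in curr_rows}
    added = sorted(set(curr) - set(prev))
    removed = sorted(set(prev) - set(curr))
    moved = sorted(u for u in set(prev) & set(curr) if prev[u] != curr[u])
    return added, removed, moved
```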
DOWNLOAD
- http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt (Check 30 minutes after each hour for the latest file URL).
- Google BigQuery gdelt-bq:gdeltv2.gfg_partitioned.