Building upon the GDELT 2.0 DOC and GEO APIs that debuted earlier this summer, we’re incredibly excited to announce today the debut of the new GDELT 2.0 TV API! The newest addition to our rapidly growing collection of APIs, the GDELT 2.0 TV API takes the pioneering functionality of the Television Explorer and makes it available in API format with semantics identical to the 2.0 DOC API. Using data from the Internet Archive’s Television News Archive, the new 2.0 TV API allows you to explore, analyze and visualize just over 8 years of national and local US television news coverage.
Functionally, the API searches the same database as the Television Explorer. This means you are searching the raw, uncorrected closed captioning transcript of each broadcast. If your search combines multiple terms, they must all appear within four sentences of each other in the transcript (roughly 30 seconds of airtime).
Search 2 Million Hours Of TV from 150 Stations 2009-Present
More than two million hours of television news totaling more than 5.7 billion words from over 150 distinct stations spanning July 2009 to present are available, from major national stations like CNN, MSNBC and Bloomberg to local affiliates of networks like CBS, NBC and ABC (though not all stations are covered over the entire time span). BBC News London is also available from January 2017 to present.
Human + Machine Interfaces
The API is designed to generate both machine-friendly CSV and JSON output, suitable for analysis in any major platform, and beautiful visualizations for human consumption, optimized for embedding on your own website. The API is available over both HTTP and HTTPS endpoints, so it can be embedded in an iframe on any website.
QUICK START EXAMPLES
Here are some really simple examples to get you started with the API!
- Zoomable Normalized Volume Timeline of National Coverage of “Trump”. This returns a browser-based interactive timeline visualization of the percentage of sentences monitored from each of the major national networks that contained the word “trump” from 2009 to present, including partial results from the last 24 hours. You can click and drag a portion of the timeline to automatically zoom into that portion. Functionally this display is identical to the timeline in the Television Explorer.
- CNN Volume Timeline of Trump. This timeline is identical to the one above, but narrows the display to show only coverage from CNN instead of all national networks.
- CNN Top Words WordCloud of Trump. This displays the top terms that appear most commonly within four sentences of mentions of Donald Trump’s name since 2009 on CNN.
- CNN Versus Fox News Timeline of "Puerto Rico" 8/29/2017 to 10/7/2017. This compares the relative percent of sentences on CNN and Fox News over the period 8/29/2017 through 10/7/2017 that mentioned the phrase "puerto rico".
- Smoothed CNN Versus Fox News Timeline of "Puerto Rico" 8/29/2017 to 10/7/2017. This is identical to the timeline above, but uses 5 day smoothing to more clearly show the macro-level coverage patterns. Note that since this uses rolling window smoothing, the precise dates of each portion of the timeline will be skewed by 5 days.
- Last Two Weeks National Top Clips Of Trump. This displays top matching clips that mention Donald Trump that aired in the last two weeks on national television networks.
- Russia + Syria Timeline. This example shows how to combine two different OR blocks for more powerful searches. Here, the final search is "(kremlin OR russia OR putin) (syria OR syrian OR assad)". The API automatically ANDs all query statements and blocks together, so in essence there is an invisible AND between the two OR blocks.
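The Russia + Syria search above can be assembled programmatically. The following is only a minimal sketch: the base URL, the lowercase parameter names and the lowercase mode value are assumptions that do not appear in the text above, and the helpers `or_block` and `build_query` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint: it is not stated in the text above, so substitute the
# real one before use.
BASE_URL = "https://api.gdeltproject.org/api/v2/tv/tv"


def or_block(terms):
    """Join terms into one parenthesized OR block, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*blocks):
    """Separate blocks with spaces; the API ANDs them together implicitly."""
    return " ".join(blocks)


query = build_query(
    or_block(["kremlin", "russia", "putin"]),
    or_block(["syria", "syrian", "assad"]),
)

# urlencode percent-encodes the parentheses and turns spaces into "+".
url = BASE_URL + "?" + urlencode({"query": query, "mode": "timelinevol"})

print(query)  # (kremlin OR russia OR putin) (syria OR syrian OR assad)
print(url)
```

Because OR blocks cannot be nested, `or_block` deliberately accepts only plain words and phrases rather than other blocks.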
The GDELT 2.0 TV API is accessed via a simple URL with the following parameters. Under each parameter is the list of operators that can be used as the value of that parameter.
- QUERY. This contains your search query and supports keyword and keyphrase searches, OR statements and a variety of advanced operators. NOTE – all of the operators below must be used as part of the value of the QUERY field, separated by spaces, and cannot be used as URL parameters on their own.
- "". Anything found inside of quote marks is treated as an exact phrase search. Thus, you can search for "Donald Trump" to find all matches of his name.
- "donald trump"
- (a OR b). You can specify a list of keywords to be boolean OR'd together by enclosing them in parentheses and placing the capitalized word "OR" between each keyword or phrase. Boolean OR blocks cannot be nested at this time. For example, to search for mentions of Clinton, Sanders or Trump, you would use "(clinton OR sanders OR trump)".
- (clinton OR sanders OR trump)
- -. You can place a minus sign in front of any operator, word or phrase to exclude it. For example, "-sanders" would exclude any results that contained "sanders".
- Market. This narrows your search to a particular geographic market. The list of available markets can be found via the Station Details mode (look for the city name in the description of local stations). Example markets include "San Francisco" and "Philadelphia". The market name must be enclosed in quote marks. You can also use the special reserved market "National" to search the major national networks together.
- market:"San Francisco"
- Network. This narrows your search to a particular television network. The list of available networks can be found via the Station Details mode (look for the network name in the description of local stations). Example networks include "CBS" and "NBC". The network name must be enclosed in quote marks.
- Show. This narrows your search to a particular television show. This must be the complete show name as returned by the TV API. To find a particular show, search the API and use the "clipgallery" mode to display matching clips and their source show. For example, to limit your search to the show Hardball With Chris Matthews, you'd search for "show:"Hardball With Chris Matthews"". Note that you must surround the show name with quote marks. Remember that the TV API only searches shows monitored by the Internet Archive's Television News Archive, which may not include all shows.
- show:"Hardball With Chris Matthews"
- Station. This narrows your search to a particular television station. Remember that the TV API only searches stations monitored by the Internet Archive's Television News Archive and not all of those stations have been monitored for the entire 2009-present time period. Do not use quote marks around the name of the station. To find the Station ID of a particular station, use the Station Details mode.
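Since every operator above lives inside the value of QUERY, separated by spaces, a query can be composed as a list of small pieces. This is a rough sketch only: the `market:` form is taken from the example above, while the `station:` prefix is assumed by analogy and should be checked against the API. All helper names are hypothetical.

```python
def phrase(text):
    """Wrap text in quote marks for an exact phrase search."""
    return f'"{text}"'


def exclude(term):
    """Prefix a word, phrase or operator with a minus sign to exclude it."""
    return "-" + term


def quoted_operator(name, value):
    """Operators whose value must be quoted, e.g. market:"San Francisco"."""
    return f'{name}:"{value}"'


def station(station_id):
    """Station IDs take no quote marks (the 'station:' prefix is an assumption)."""
    return f"station:{station_id}"


parts = [
    phrase("donald trump"),
    exclude("sanders"),
    quoted_operator("market", "San Francisco"),
]

# The pieces are joined with spaces into the single QUERY value.
query = " ".join(parts)
print(query)  # "donald trump" -sanders market:"San Francisco"
```

The resulting string still has to be URL-encoded before it is placed in a request URL.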
- MODE. This specifies the specific output you would like from the API, ranging from timelines to word clouds to clip galleries.
- ClipGallery. This displays up to the top 50 most relevant clips matching your search. Each returned clip includes the name of the source show and station, the time the clip aired, a thumbnail, the actual text of the snippet and a link to view the full one minute clip on the Internet Archive's website. This allows you to see what kinds of clips are matching and view the full clip to gain context on your search results. In HTML output, this mode displays a "high design" visual layout suitable for creating magazine-style collages of matching coverage. When embedded as an iframe, the API uses the same postMessage resize model as the DOC 2.0 API.
- StationChart. This compares how many results your search generates from each of the selected stations over the selected time period, allowing you to assess the relative attention each is paying to the topic. Using the DATANORM parameter you can control whether this reports results as raw clip counts or as normalized percentages of all coverage (the most robust way of comparing stations). Note that in HTML mode, you can use the button at the top right of the display to save it as a static image or save its underlying data.
- StationDetails. This is a special mode that outputs the complete list of all stations that are available for searching, along with the start and end date of their monitoring. Some stations may simply no longer exist (such as Aljazeera America), while others were monitored by the Internet Archive for a brief period of time around a major event like an election. Stations with end dates within the last 24 hours are considered active stations that are currently being monitored. Note that this mode is only available with JSON/JSONP output.
- TimelineVol. This tracks how many results your search generates by day/hour over the selected time period, allowing you to assess the relative attention each station is paying to the topic and how that attention has varied over time. Using the DATANORM parameter you can control whether this reports results as raw clip counts or as normalized percentages of all coverage (the most robust way of comparing stations). By default, the timeline will not display the most recent 24 hours, since those results are still being generated (it can take 2-12 hours for a show to be processed by the Internet Archive and ready for analysis), but you can include those if needed via the LAST24 option. You can also smooth the timeline using the TIMELINESMOOTH option and combine all selected stations into a single time series using the DATACOMB option. Note that in HTML mode, you can toggle the station legend using the button that appears at the top right of the display or export the timeline as a static image or save its underlying data.
- TrendingTopics. This is a special mode that returns the most common trending topics, trending keywords/phrases and top keywords, overall and for the national stations. The results here offer a powerful summary of the major topics and memes dominating television news coverage at the moment. Results are updated every 15 minutes. Note that this mode is only available with JSON/JSONP output.
- WordCloud. This mode returns the top words that appear most frequently in clips matching your search. It takes the 150 most relevant clips that match your search and displays a word cloud of up to the top 200 most frequent words that appeared in those clips (common stop words are automatically removed). This is a powerful way of understanding the topics and words dominating the relevant coverage and suggesting additional contextual search terms to narrow or evolve your search. Note that if there are too few matching clips for your query, the word cloud may be blank. Note that in HTML mode, you can use the options at the bottom right of the display to save it as a static image or save its underlying data.
- FORMAT. This controls what file format the results are displayed in. Not all formats are available for all modes. To assist with website embedding, the CORS ACAO header for all output of the API is set to the wildcard "*", permitting universal embedding.
- HTML. This is the default mode and returns a browser-based visualization or display. Some displays, such as word clouds, are static images, some, like the timeline modes, result in interactive clickable visualizations, and some result in simple HTML lists of images or articles. The specific output varies by mode, but all are intended to be displayed directly in the browser in a user-friendly intuitive display and are designed to be easily embedded in any page via an iframe.
- CSV. This returns the requested data in comma-delimited (CSV) format. The specific set of columns varies based on the requested output mode. Note that since some modes return multilingual content, the CSV is encoded as UTF8 and includes the UTF8 BOM to work around Microsoft Excel limitations handling UTF8 CSV files.
- JSON. This returns the requested data in UTF8 encoded JSON. The specific fields vary by output mode.
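Because the CSV output starts with a UTF8 BOM, a naive decode leaves an invisible character glued to the first column name. A minimal sketch of handling this in Python follows; the payload and its column names are invented stand-ins for a real response and will differ by mode.

```python
import csv
import io

# Hypothetical payload standing in for a CSV response body. The column
# names are invented for illustration only.
raw = "\ufeffDate,Station,Value\n2017-09-20,CNN,1.25\n".encode("utf-8")

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
text = raw.decode("utf-8-sig")
rows = list(csv.DictReader(io.StringIO(text)))

print(rows[0]["Date"], rows[0]["Station"], rows[0]["Value"])  # 2017-09-20 CNN 1.25
```

Decoding with plain "utf-8" instead would leave the BOM attached to the first header, so a lookup of `rows[0]["Date"]` would fail.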
- DATACOMB. By default, both timeline and station chart modes separate out each matching station's data to make it possible to compare the relative attention paid to a given topic by each station. Sometimes, however, the interest is in overall media attention, rather than specific per-station differences in that coverage. Setting this parameter to "combined" will collapse all matching data into a single "Combined" synthetic station and make it much easier to understand macro-level patterns.
- DATANORM. Most media researchers are accustomed to working with raw result counts, such as the absolute number of sentences that matched a given query. Such raw counts, while useful as quick "sanity checks", are inappropriate for production analysis, especially when looking over time or across stations. Different television shows and stations have slightly different rates of speech, meaning that one show might favor relatively slow utterance rates with brief statements containing minimal words and long pauses between each. Other shows may favor rapid-fire high rate speech with lengthy sentences. Comparing the absolute word or sentence count of a topic across two such stations will skew the results to such a degree as to render any comparisons meaningless. Instead, by default the API normalizes timeline and station chart modes by dividing the number of matching clips in each time interval by the total number of all monitored clips from that station or set of stations over that time interval. For example, if searching for "trump" on CNN over the past 2 years in timeline mode, each day will be displayed as the percent of all sentences monitored from CNN that day that contained the word "trump". This transforms result counts into normalized density measurements that allow you to directly compare different periods of time or stations in a meaningful way. We strongly recommend that users use only normalized result counts (the default), but for cases where you need raw counts, you can set this parameter to "raw" to display raw counts.
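The normalization described above is a simple ratio, and a small worked example shows why it matters. The counts below are invented purely for illustration, and `normalized_percent` is a hypothetical helper rather than anything the API exposes.

```python
def normalized_percent(matching, total):
    """Percent of all monitored units in an interval that matched the query."""
    if total == 0:
        return 0.0  # nothing was monitored in this interval
    return 100.0 * matching / total


# Invented one-day counts: (matching, total monitored) per station.
daily = {
    "Station A": (120, 6000),
    "Station B": (150, 15000),
}

for name, (matching, total) in daily.items():
    print(name, matching, f"{normalized_percent(matching, total):.1f}%")
# Station A 120 2.0%
# Station B 150 1.0%
```

Station B has the larger raw count, yet Station A devoted twice the share of its monitored airtime to the topic, which is the comparison the default normalized mode is meant to surface.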
- LAST24. It can take the Internet Archive 2-12 hours to process a given television broadcast once it concludes and occasionally up to 24 hours for particularly long shows. Thus, by default the TV API does not return results from the most recent 24 hours to ensure that analyses are not skewed by partial results. However, when tracking breaking news events, it may be desirable to view partial results with the understanding that any time or station-based trends may not accurately reflect the totality of their coverage. In such cases, ClipGallery and WordCloud modes may be particularly insightful. To include results from the most recent 24 hours, set this URL parameter to "yes".
- MAXRECORDS. This option only applies to ClipGallery mode. By default 32 clips are displayed in HTML mode and this option can increase that to a maximum of 50 results. In JSON, JSONP or CSV formats, up to 3,000 clips can be returned.
- SORT. By default results are sorted by relevance to your query. Sometimes you may wish to sort by date instead.
- DateDesc. Sorts matching clips by broadcast date, displaying the most recent clips first.
- DateAsc. Sorts matching clips by broadcast date, displaying the oldest clips first.
- STARTDATETIME/ENDDATETIME. These parameters allow you to specify the precise start and end date/times to search, instead of using an offset like with TIMESPAN.
- STARTDATETIME. Specify the precise date/time in YYYYMMDDHHMMSS format to begin the search; only clips that aired after this date/time stamp will be considered. The earliest available date is July 2, 2009. If you do not specify an ENDDATETIME, the API will search from STARTDATETIME through the present date/time.
- ENDDATETIME. Specify the precise date/time in YYYYMMDDHHMMSS format to end the search; only clips that aired before this date/time stamp will be considered. The earliest available date is July 2, 2009. If you do not specify a STARTDATETIME, the API will search from July 2, 2009 through the specified ENDDATETIME.
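The YYYYMMDDHHMMSS stamps can be produced from ordinary date objects. The sketch below makes a few assumptions: the lowercase parameter names are not confirmed by the text above, the time zone the API expects is not stated here, and `to_stamp` is a hypothetical helper.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"    # YYYYMMDDHHMMSS
EARLIEST = datetime(2009, 7, 2)  # earliest available date per the text above


def to_stamp(moment):
    """Format a datetime as YYYYMMDDHHMMSS, rejecting dates before coverage."""
    if moment < EARLIEST:
        raise ValueError("date precedes the earliest available date")
    return moment.strftime(STAMP_FORMAT)


# The Puerto Rico comparison window from the quick start examples.
params = {
    "startdatetime": to_stamp(datetime(2017, 8, 29)),
    "enddatetime": to_stamp(datetime(2017, 10, 7, 23, 59, 59)),
}

print(params["startdatetime"])  # 20170829000000
print(params["enddatetime"])    # 20171007235959
```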
- TIMELINESMOOTH. This option is only available in timeline mode and performs moving window smoothing over the specified number of time steps, up to a maximum of 5. Timeline displays can sometimes capture too much of the chaotic, noisy information environment that is the television landscape, resulting in jagged displays. Smoothing with a moving average makes the macro-level patterns easier to see.
- TIMESPAN. By default the TV API searches the entirety of the Internet Archive's Television News Archive's holdings, which extend back to July 2009 for some stations. You can narrow this range by using this option to specify the number of years, months, weeks, days or hours (minimum of 1 hour). The API then only searches airtime within the specified timespan backwards from the present time. If you would like to specify the precise start/end time of the search instead of an offset from the present time, use the STARTDATETIME/ENDDATETIME parameters.
- Hours. Specify a number followed by "h" or "hours" to provide the timespan in hours.
- Days. Specify a number followed by "d" or "days" to provide the timespan in days.
- Weeks. Specify a number followed by "w" or "weeks" to provide the timespan in weeks.
- Months. Specify a number followed by "m" or "months" to provide the timespan in months.
- Years. Specify a number followed by "y" or "years" to provide the timespan in years.
- TIMEZOOM. This option is only available for timeline modes in HTML format output and enables interactive zooming of the timeline using the browser-based visualization. Set to "yes" to enable and set to "no" or do not include the parameter, to disable. By default, the browser-based timeline display allows interactive examination and export of the timeline data, but does not allow the user to rezoom the display to a more narrow time span. If enabled, the user can click-drag horizontally in the graph to select a specific time period. If the visualization is being displayed directly by itself (it is the "parent" page), it will automatically refresh the page to display the revised time span. If the visualization is being embedded in another page via iframe, it will use postMessage to send the new timespan to the parent page with parameters "startdate" and "enddate" in the format needed by the STARTDATETIME and ENDDATETIME API parameters. The parent page can then use these parameters to rewrite the URLs of any API visualizations embedded in the page and reload each of them. This allows the creation of dashboard-like displays that contain multiple TV API visualizations where the user can zoom the timeline graph at the top and have all of the other displays automatically refresh to narrow their coverage to that revised time frame.
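The URL rewriting step that TIMEZOOM expects of the parent page can be sketched as follows. In a browser this logic would live in the page's own script; it is shown in Python here only to illustrate the transformation, for example for a dashboard that regenerates its iframe URLs on the server. The host in the example URL is a placeholder, the lowercase parameter names are assumptions, and `apply_zoom` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that define a time range and must be replaced on zoom.
RANGE_PARAMS = {"timespan", "startdatetime", "enddatetime"}


def apply_zoom(url, startdate, enddate):
    """Return url with its time range replaced by the zoomed start/end stamps."""
    parts = urlsplit(url)
    # Keep every parameter except existing range settings (case-insensitive).
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in RANGE_PARAMS
    ]
    kept += [("startdatetime", startdate), ("enddatetime", enddate)]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Placeholder URL for one embedded visualization on a dashboard page.
embedded = "https://example.org/tv?query=trump&mode=wordcloud&timespan=2w"

# Values as they would arrive in the "startdate" and "enddate" fields.
print(apply_zoom(embedded, "20170829000000", "20171007000000"))
# https://example.org/tv?query=trump&mode=wordcloud&startdatetime=20170829000000&enddatetime=20171007000000
```

Applying the same function to every embedded URL on the page keeps all displays on the time frame the user selected in the timeline.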