We are extraordinarily excited to announce today the public debut of what has been perhaps the most requested feature of 2015: full text search, through the new GDELT Full Text Search API! The API allows you to search the full text of all monitored coverage from the last 24 hours and return any of the following: a list of matching articles sorted by relevance, date, or even sentiment; a timeline of the volume of media coverage; a timeline of the tone of that coverage; or a word cloud of the top words appearing in matching coverage (using either the coverage’s original native language or its English translation).
Because this is an alpha release, you may encounter a few bugs as you use the new GDELT Full Text Search API. Please bear with us as we work to constantly improve and enhance the new API, and let us know about any particularly significant errors you encounter.
SEARCHING ACROSS THE WORLD’S LANGUAGES
Perhaps most powerfully, and utterly unlike any other news search system today, when you search using the GDELT Full Text Search API you are not just searching English news coverage: you are searching the English translations of coverage from 65 languages. Search for “genocide” and you see not just English-language Western news coverage, but rather perspectives from outlets across the entire globe in the world’s languages. GDELT today operates one of the largest streaming machine translation deployments in the world, live translating every monitored article in realtime from 65 different languages into English. Using these translations, the GDELT Full Text Search API is able to seamlessly and completely transparently search across languages, breaking down the language barrier to accessing the world’s events, narratives, and perspectives. With a traditional news search engine, you would have to manually translate your search term into all 65 languages using various dictionaries and translation tools, and then conduct 65 separate searches and merge all of the results together. In addition, few search engines have comprehensive catalogs of the non-Western and non-English news landscape, meaning even if you did all that, you would get back only a very limited non-Western perspective.
With the new GDELT Full Text Search API, a single English keyword transparently searches across the world’s languages. Currently 65 languages are live translated by GDELT: Afrikaans, Albanian, Arabic (MSA and many common dialects), Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian (Bokmal), Norwegian (Nynorsk), Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi, Romanian, Russian, Serbian, Sinhalese, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, and Vietnamese. Or, if there is a particular word or phrase you want to search for in another language, you can search for that word natively in any of the languages above.
Best of all, this new capability is designed as an embeddable machine-friendly API, meaning you can use it to generate a live CSV-formatted list of URLs to cross reference against GDELT 2.0’s Global Knowledge Graph or Event Database, or output article lists, timelines, or word cloud visualizations that you can embed right on your own website!
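As a rough illustration of the cross-referencing workflow, here is a minimal Python sketch that parses URL List output into records you could match against your own GKG or Event extracts. It assumes the CSV has three columns in the order date/time, language, URL and no header row; the exact layout and date format are assumptions, and the sample rows are invented purely for illustration.

```python
import csv
import io


def parse_url_list(csv_text):
    """Parse URL List CSV text into a list of dicts.

    Assumes three columns per row (date/time, language, URL) and no
    header row. Both are assumptions, not confirmed API behavior.
    """
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        records.append({"datetime": row[0], "language": row[1], "url": row[2]})
    return records


# Hypothetical sample rows, invented for illustration only.
sample = (
    "2015-01-01 12:00:00,English,http://example.com/a\n"
    "2015-01-01 12:15:00,French,http://example.com/b\n"
)
records = parse_url_list(sample)

# A set of URLs makes it cheap to test membership while scanning another dataset.
urls = {record["url"] for record in records}
```

The date/time value is kept as an opaque string here, since its format should be checked against real output before parsing it.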
AVAILABLE VISUALIZATIONS AND OUTPUTS
The GDELT Full Text Search API currently offers a number of different output formats:
- URL List. This is the most basic output mode and is designed for machine consumption by automated scripts that use the URLs to cross reference against the GDELT GKG and Event databases. It outputs a CSV file with the date/time, language, and URL of each matching article.
- Text Only Article List. This is the most basic human-friendly output mode and is a simple textual list of all matching articles. For each article, the title, source outlet, date/time converted to the user’s time zone, article language, and source outlet location are displayed.
- Image Only Article List. This is identical to the Text Only Article List above, but only displays articles that provide a featured social sharing header image.
- Mixed Article List. This offers a mixture of the two article modes above, displaying all articles regardless of whether they provide a social sharing image and displaying such images where they are provided.
- Image Collage List. Similar to Image Only Article List above, this displays only articles that provide a featured social sharing header image, but displays only the image itself as a thumbnail link to the full article. It is designed for high-density visual treatments.
- Volume Timeline. This is a simple timeline displaying the total volume of matching articles in 15 minute increments. Since news volume naturally varies through the course of the day, the Y axis is normalized by dividing the number of matching articles in a given 15 minute increment by the total global volume of coverage monitored by GDELT in that same interval, yielding what amounts to a “global news volume intensity” indicator.
- Sentiment Timeline. This is a simple timeline that shows the average sentiment/tone (from “very happy” to “very sad”) of all matching coverage in 15 minute increments, allowing you to rapidly spot major changes in the tone of coverage of interest.
- English Word Cloud. This is a simple browser-based word cloud that shows the top words appearing in matching coverage. To minimize computing resources, it currently uses only the first 20 matching results (you can control whether these are the most recent 20 or the most relevant 20 by using the various sort and recency search options). This visualization offers a rapid triage of the major narratives of the coverage of interest. Because it operates on the English translation of matching coverage, it offers a global perspective.
- Native Word Cloud. This word cloud visualization is identical to the one above, but operates on the original native language material. Thus, coverage with mixtures of Arabic, Japanese, and French articles will yield a word cloud with multiple languages (you can control this using the language selection search options). This is especially useful to understand how native domestic language coverage is discussing a particular event.
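The normalization behind the Volume Timeline can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above (matching articles divided by total monitored articles in each 15 minute interval), not the API's actual implementation. The counts are hypothetical, and returning 0.0 for an interval with no monitored coverage is an assumption.

```python
def volume_intensity(matching_counts, total_counts):
    """Return matching/total for each 15 minute interval.

    Intervals with zero total volume yield 0.0 (an assumption made here
    to avoid dividing by zero).
    """
    if len(matching_counts) != len(total_counts):
        raise ValueError("both series must cover the same intervals")
    return [m / t if t else 0.0 for m, t in zip(matching_counts, total_counts)]


# Hypothetical counts for four consecutive 15 minute intervals.
matching = [5, 10, 0, 8]
total = [1000, 4000, 500, 0]
intensity = volume_intensity(matching, total)
```

Note that the second interval has twice as many matching articles as the first but a lower intensity, because total news volume in that interval was four times higher.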
EMBEDDING THE API
You can embed any of the visualizations/outputs above into your own website, using the options in the next two sections to configure its display.
For example, to embed a live word cloud of the top words appearing in Nigerian news media in your own website, use a simple iframe with the HTML code below:
<iframe src="http://api.gdeltproject.org/api/v1/search_ftxtsearch/search_ftxtsearch?query=sourcecountry:nigeria&output=wordcloud&sort=desc" height="500" scrolling="no" width="500"></iframe>
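If you generate embeds programmatically, a small helper along these lines may be useful. It is a minimal sketch: the function name `iframe_embed` and its defaults are hypothetical, and it assumes the API accepts standard percent-encoded query strings (for example, ":" encoded as "%3A").

```python
from html import escape
from urllib.parse import urlencode

BASE_URL = "http://api.gdeltproject.org/api/v1/search_ftxtsearch/search_ftxtsearch"


def iframe_embed(query, output, width=500, height=500, **extra):
    """Build an iframe tag for the given query and output mode.

    `extra` carries any additional URL parameters (e.g. outputtype).
    """
    params = {"query": query, "output": output, **extra}
    src = BASE_URL + "?" + urlencode(params)
    # Escape the URL so that "&" is valid inside an HTML attribute.
    return '<iframe src="{}" width="{}" height="{}" scrolling="no"></iframe>'.format(
        escape(src, quote=True), width, height
    )


snippet = iframe_embed("sourcecountry:nigeria", "wordcloud")
print(snippet)
```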
The API currently recognizes the following query commands that you can pass in as part of the query itself (all commands MUST be lowercase and all commands are AND’d together):
- sortby: This controls how matching results are sorted.
  - sortby:date This sorts results by date, with the most recent results first.
  - sortby:rel This sorts by relevance for keyword or theme searches and is ignored for other types of searches (they revert to date).
  - sortby:toneasc This sorts by the sentiment/tone of each article, with the most negative coverage first. (Larger positive scores indicate “happier” coverage and larger negative scores indicate “sadder” coverage.) This uses the base GDELT Tone score, which offers a general purpose tonal indicator.
  - sortby:tonedesc This sorts by the sentiment/tone of each article, with the most positive coverage first. (Larger positive scores indicate “happier” coverage and larger negative scores indicate “sadder” coverage.) This uses the base GDELT Tone score, which offers a general purpose tonal indicator.
- domain: This allows you to restrict your search to a particular news outlet. Specifying an outlet like “cnn.com” will return only matching coverage from CNN. You can combine this with a keyword or other query to limit results to just the given outlet, or you can specify it as the sole query to search for all coverage from that domain (such as to make a word cloud of all coverage from a given outlet). For outlets that have multiple subdomains, specifying just the root domain returns coverage from all subdomains (“domain:cnn.com” will search both “cnn.com” and “arabic.cnn.com”), while specifying a full subdomain limits the search to that subdomain only.
- sourcelang: This narrows your search to only coverage published in the given language. Your search terms must still be in English, since you will be searching the English translations of all coverage, but this allows you to narrow your search to just a particular language. If you include this command multiple times in your query, the subsequent mentions will be ignored.
- sourcelangnot: This is the opposite of “sourcelang:” and allows you to exclude coverage in a particular language. This is useful for searching for coverage about a particular region while excluding material written in that region’s primary language. This option can be repeated multiple times in a single query to exclude multiple languages.
- searchlang: Unlike “sourcelang:”, which still searches the English translations of articles, this option instructs the API to search the raw native language text of the original articles, rather than the English translations. This is only relevant with a keyword search and expects the keyword to be in the requested language. For example, to search for a particular Arabic phrase for the Islamic State, you would search for “searchlang:arabic الدولة الإسلامية”. This allows you to use the API like a traditional news search engine and search for very specific words and phrases. NOTE that at this time for queries in Chinese, Japanese, Thai, and Vietnamese, you must split your query into individual words with spaces in between each word due to limitations in our current search interface.
- sourcecountry: This allows you to restrict your search to only news outlets believed to be located in the specified country. Multiword country names should be collapsed to a single word like “sourcecountry:saudiarabia”. This is especially powerful in allowing you to peer within a particular country to see how it is contextualizing a given event, or to search only local coverage about a specific event.
- sourcecountrynot: This is the opposite of “sourcecountry:” and allows you to exclude coverage from a given country. For example, you might want to search for all coverage about a country, but exclude its own domestic press and the press of its immediate neighbors. This option can be repeated to exclude multiple countries.
- lastminutes: By default the API searches the last 24 hours of monitored coverage. You can use this option to restrict the search to the last X minutes (in multiples of 15 minutes). For example, “lastminutes:60” searches just the last 60 minutes (one hour), while “lastminutes:360” searches the last six hours of coverage. Note that there is a rolling 15-30 minute delay in the search interface.
- tonemorethan: This returns only coverage with a sentiment/tone score greater than (happier than) the given score. This uses the base GDELT Tone score, which offers a general purpose tonal indicator that ranges from -100 (extremely sad) to 100 (extremely happy). In practice, most tone scores range between -10 and 10, with numbers closer to 0 being more neutral. You may find that you must adjust this value for different searches to get the best results.
- tonelessthan: This returns only coverage with a sentiment/tone score less than (sadder than) the given score. This uses the base GDELT Tone score, which offers a general purpose tonal indicator that ranges from -100 (extremely sad) to 100 (extremely happy). In practice, most tone scores range between -10 and 10, with numbers closer to 0 being more neutral. You may find that you must adjust this value for different searches to get the best results.
- tonemorethanabs: Similar to “tonemorethan” except that it uses the absolute value of the tone score (which does not distinguish between positive or negative), meaning that it is useful for distinguishing the magnitude of the tone score, rather than whether it is happy or sad. Put another way, this is useful for filtering for articles that are either extremely happy or extremely sad, returning both kinds, whereas “tonemorethan” will only return very happy articles and not very negative articles.
- tonelessthanabs: Same as “tonemorethanabs,” but returns only articles with an absolute value tone score of less than the specified value.
- All remaining words and phrases are treated as either full text keywords/keyphrases, or GDELT Theme searches. Any word in all capitals will be treated as a GDELT Theme search. Any phrase in quotation marks will be treated as an exact phrase search. All remaining words will be treated as individual search words.
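To keep these rules straight when assembling queries in code, you might use a small helper like the following Python sketch. The function `build_query` is hypothetical and simply applies the conventions listed above: commands are lowercased, exact phrases are wrapped in quotation marks, themes are written in all capitals, and plain keywords are lowercased so they are not mistaken for themes. It also assumes that the order of terms within the query does not matter.

```python
def build_query(keywords=(), phrases=(), themes=(), **commands):
    """Assemble a query string from keywords, phrases, themes, and commands.

    A command value may be a single value or a list/tuple, for commands
    that can be repeated (such as sourcecountrynot).
    """
    parts = [k.lower() for k in keywords]           # plain search words
    parts.extend('"{}"'.format(p) for p in phrases)  # exact phrase searches
    parts.extend(t.upper() for t in themes)          # all caps marks a theme
    for name, value in commands.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if name == "lastminutes" and int(v) % 15 != 0:
                raise ValueError("lastminutes must be a multiple of 15")
            parts.append("{}:{}".format(name.lower(), v))
    return " ".join(parts)


# Illustrative values only.
query = build_query(
    keywords=["Election"],
    phrases=["voter turnout"],
    sourcecountrynot=["nigeria", "niger"],
    lastminutes=60,
    sortby="date",
)
print(query)
```

The resulting string is what you would pass as the value of the "query" URL parameter, after URL encoding.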
There are also several configuration options that must be passed as separate URL parameters, not as part of the query string itself:
- query This is the actual query string to be searched on, including all keywords and commands from the section above.
- output This specifies the type of output that should be produced (see earlier in this blog post for a description of each kind of output). Some outputs accept an additional “outputtype” parameter that further configures the output format.
  - output=urllist Generates a standard machine-friendly list of URLs optimized for ingesting into automated processing workflows and cross referencing with the GDELT GKG and GDELT Event datasets.
  - output=artlist This produces an HTML formatted text-only article list suitable for iframe embedding.
  - output=artimgonlylist This produces an HTML formatted article list suitable for iframe embedding that includes ONLY those articles that include a featured social sharing image.
  - output=artimglist This produces an HTML formatted article list suitable for iframe embedding that includes both image and non-image matching articles, displaying images for articles that include them and a blank square for articles that do not include them.
  - output=artimgonlycollage This produces an HTML formatted article list suitable for iframe embedding that includes ONLY those articles that include a featured social sharing image and displays each article as a simple thumbnail image linked to the article. Thumbnails automatically tile to fill the available display space by rows.
  - output=timeline This produces an interactive browser-based timeline visualization.
    - output=timeline&outputtype=volume This is the default timeline type and displays the global media volume intensity (number of matching articles every 15 minutes divided by the total volume of all monitored coverage in that 15 minute interval).
    - output=timeline&outputtype=tone Instead of volume of news coverage, this timeline displays the average tone of matching coverage.
  - output=timelinecsv Same as output=timeline, but produces CSV output suitable for using with your own alternative visualization tools or ingesting into automated workflows. Supports the same “outputtype” options.
  - output=wordcloud&outputtype=theme Displays a word cloud of the most popular GDELT Themes that appear in matching coverage. Size of each theme is based on the number of articles it appeared in. Position and orientation is not semantically meaningful and is set by the layout algorithm to optimize appearance and fill.
  - output=wordcloud&outputtype=native Same as above, but displays the top native language words appearing in matching coverage. Automatically excludes all native English coverage. Size is based on total number of occurrences. Common stop word dictionaries are used to filter the results for most languages.
  - output=wordcloud&outputtype=english Same as above, but displays the top words appearing in English translations of matching coverage. Size is based on total number of occurrences.
  - output=wordcloudcsv Same as output=wordcloud, but produces CSV output suitable for using with your own alternative visualization tools or ingesting into automated workflows. Supports the same “outputtype” options.
- maxrows Configures how many records should be returned. For article list outputs, this controls how many articles are shown in the results. For word cloud output, controls how many articles are processed to generate the word cloud, up to a maximum of 20. Set to a small value to create an article list of just a few matching records, or set to a larger value to create a full-screen scrolling list of coverage. NOTE that at present the article list outputs do NOT support paging, so all results will be returned as a single HTML page.
- timelinesmooth This option only applies to timeline output and allows for moving average smoothing, setting the window size of the rolling average up to five 15 minute increments (an hour and 15 minutes). This helps to smooth out the jagged up/down curve in timelines.
- dropdup Performs very basic deduplication in an attempt to filter out duplicate articles, such as wire stories that were republished by large numbers of news outlets. Normally disabled for maximum search speed, but can be enabled if your search frequently generates large numbers of duplicative results. NOTE that this currently uses a very basic deduplication algorithm to ensure that similar, but not truly identical articles are not filtered out. NOTE that sometimes you will see the exact same article appear twice in your search results. This can be caused by a timing inconsistency in the pipeline that feeds the full text search index and we are working to resolve this.
- trans=googtrans Given that GDELT currently monitors and translates news from 65 languages in realtime, you will frequently be presented with search results in a wide array of languages that you yourself may not be able to read. To help with this, you can add "&trans=googtrans" to the HTML article list output formats ("output=artlist" and "output=artimgonlylist") which will embed a Google Translate widget on the page to translate all of the headlines to your own language in realtime. A dropdown will appear at the top of the results page the first time you view a results page and after you have selected your desired language it will remember your primary language and automatically translate all future results pages in realtime.
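If you pull raw timeline values via the CSV output and want to apply your own smoothing, the kind of moving average that “timelinesmooth” describes can be sketched as follows. This is only an illustration: it assumes a trailing window (each point averaged with the points before it), which may differ from how the API itself aligns its window, and the input values are hypothetical.

```python
def smooth_timeline(values, window):
    """Trailing moving average over `window` 15 minute increments.

    The first few points are averaged over however many values are
    available so far. The window is capped at five increments, matching
    the limit described for timelinesmooth.
    """
    if not 1 <= window <= 5:
        raise ValueError("window must be between 1 and 5 increments")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical timeline values with a single sharp spike.
raw = [0, 3, 6, 3, 0]
smoothed = smooth_timeline(raw, window=3)
```

With a window of 3, the spike at the third point is spread across its neighbors, which is the flattening effect the option is meant to provide.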
Here are some examples to help you get started using the new API:
- Simple text-only article list of the latest coverage containing the keyword "nigeria"
- Image-only article list of the latest coverage containing the keyword "nigeria" (only lists articles that include a featured image)
- Mixed article list of the latest coverage containing the keyword "nigeria" that includes both articles that include a featured image and those that do not and displays both, showing images for articles for which they are available.
- Mixed article list for Arabic language coverage mentioning the Arabic phrase “الدولة الإسلامية” (in the resulting URL, the Arabic script appears as a string of hexadecimal codes and percent signs, which is how browsers encode it)
- Timeline of the normalized volume of media coverage containing the keyword "nigeria" (smoothed with moving average)
- Timeline of the average “tone” of media coverage containing the keyword "nigeria" (smoothed with moving average)
- Word cloud of the top words appearing in the latest CNN coverage
- Word cloud of the top words appearing in the English translations of the latest French language coverage
- Word cloud of the top words appearing in the latest French language coverage
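As a starting point, the Python sketch below assembles request URLs for the examples above from the documented query commands and URL parameters. The parameter sets are reconstructions based on the descriptions in this post, not the exact links, and details such as passing a number of increments to "timelinesmooth" are assumptions. It also shows how non-Latin text such as Arabic ends up percent-encoded in the URL and decodes back to the original phrase.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://api.gdeltproject.org/api/v1/search_ftxtsearch/search_ftxtsearch"

# Reconstructed parameter sets (assumptions), keyed by a hypothetical label.
EXAMPLES = {
    "text_list": {"query": "nigeria", "output": "artlist"},
    "image_only_list": {"query": "nigeria", "output": "artimgonlylist"},
    "mixed_list": {"query": "nigeria", "output": "artimglist"},
    "arabic_mixed_list": {
        "query": "searchlang:arabic الدولة الإسلامية",
        "output": "artimglist",
    },
    "volume_timeline": {
        "query": "nigeria", "output": "timeline",
        "outputtype": "volume", "timelinesmooth": 5,
    },
    "tone_timeline": {
        "query": "nigeria", "output": "timeline",
        "outputtype": "tone", "timelinesmooth": 5,
    },
    "cnn_wordcloud": {
        "query": "domain:cnn.com", "output": "wordcloud", "outputtype": "english",
    },
    "french_english_wordcloud": {
        "query": "sourcelang:french", "output": "wordcloud", "outputtype": "english",
    },
    "french_native_wordcloud": {
        "query": "sourcelang:french", "output": "wordcloud", "outputtype": "native",
    },
}


def example_url(params):
    """Join the base URL with URL-encoded parameters."""
    return BASE_URL + "?" + urlencode(params)


urls = {name: example_url(params) for name, params in EXAMPLES.items()}

# The Arabic phrase is sent as percent-encoded UTF-8 bytes and
# decodes back to the original text on the receiving side.
arabic_url = urls["arabic_mixed_list"]
decoded = parse_qs(urlsplit(arabic_url).query)["query"][0]

for name, url in urls.items():
    print(name, url)
```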