Today we are immensely excited to announce a transformative new dataset for linguistic analysis: more than 101 billion tokens (words, word parts and punctuation) drawn from a daily random sample of GDELT's total monitoring volume, covering 100 million primarily English-language online news articles from across the world from July 2016 through January 2020. Each token is annotated with its part of speech information (tag, aspect, case, form, gender, mood, number, person, proper, reciprocity, tense and voice) and its dependency parse label, along with example snippets of each construction, all machine computed by Google's Cloud Natural Language API.
This past Fall we announced a massive new online news ngrams dataset computing chargrams, unigrams and bigrams for 152 languages across worldwide online news each day at 15 minute resolution. Ngrams offer an incredibly powerful lens through which to track the relative popularity of specific words and phrases, but they are limited by their inability to examine the context and usage of those terms. How often is the word "can" used as a noun describing "a can of soup" versus a verb arguing that "he can run very fast"? How often is the word "dogged" used as an adjective as in "dogged detective work" or "dogged determination" versus as a verb as in "has been dogged by" or "that dogged the start"? What are the various sentential roles a word takes as determined by its dependency tree parse? Such questions lie at the heart of linguistic analysis and are critical to everything from improving deep learning text understanding systems to cataloging the evolution of language itself.
To help answer these questions, today we are releasing this immense new dataset compiled using Google's Cloud Natural Language API.
Every 15 minutes a small random sample of primarily English worldwide online news articles is selected from all of the online coverage GDELT monitored over the previous 15 minutes and annotated through the API, totaling around 70,000-100,000 articles per day. The API performs automatic language detection on each article and processes the entire document under the recognized language, so mixed-language articles will typically be processed using the grammatical rules of the dominant language.
Once all articles from a given 15 minute interval have been processed by the API, the results are aggregated into a token-centric dataset. Each document token within the 15 minute period is concatenated into a hash key consisting of the case-sensitive token and all of its computed attributes. Token case is preserved so that "Can" and "can" are treated differently, in order to support linguistic assessments of capitalization rates and rules. All appearances of the token with the same set of attributes in that given 15 minutes are collapsed into a single record, with a "count" field recording how many times that given token construct appeared and an "examples" field offering up to five example snippets with their URL citations showing the word being used that way, each snippet up to five tokens in length with the given token at center. Even if a given construct appears multiple times in a given URL, at most one example of each construct is selected from each article. Thus, an article that contains "can" as a noun multiple times will yield just a single example snippet of its noun usage, but if "can" appears multiple times as both a noun and a verb, one example of each of the two usages will be included.
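The aggregation logic above can be sketched in Python. This is a minimal illustration of the described behavior, not GDELT's actual pipeline code; the input record shape (`token`, `attributes`, `url`, `context`) is an assumption for the sketch:

```python
def aggregate_tokens(annotations):
    """Collapse annotated tokens from one 15-minute window into
    (token + attributes) records with counts and example snippets.

    `annotations` is an iterable of dicts, each carrying a case-sensitive
    'token', an 'attributes' dict (posTag, dependencyLabel, ...), the
    source 'url' and a 'context' snippet."""
    records = {}
    for ann in annotations:
        # Case-sensitive key of the token plus every computed attribute,
        # so "Can"/NOUN and "can"/VERB aggregate into separate records.
        key = (ann["token"],) + tuple(sorted(ann["attributes"].items()))
        rec = records.setdefault(key, {"count": 0, "examples": [], "seen_urls": set()})
        rec["count"] += 1
        # At most one example per URL, and at most five examples per record.
        if ann["url"] not in rec["seen_urls"] and len(rec["examples"]) < 5:
            rec["seen_urls"].add(ann["url"])
            rec["examples"].append({"url": ann["url"], "context": ann["context"]})
    return records
```

Note how repeated appearances of the same construct in one article increase the count but contribute only a single example snippet.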
All attributes for each token for which the Cloud Natural Language API returned a value other than "UNKNOWN" are included. Note that not all attributes are applicable to all languages or forms.
Remember that these results are 100% machine assigned, with no human review or correction. There will inevitably be some degree of error in these assignments, especially around rare and edge-case grammatical constructs. If you come across particularly noteworthy errors, please let us know! Due to the way in which this dataset was constructed, there will be some level of duplication in which examples from the same URL appear in multiple timeslots; the underlying processing workflow will shortly be revised to eliminate those duplicates.
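In the meantime, those cross-timeslot duplicates can be filtered client-side. Here is a minimal Python sketch, assuming records already loaded from the JSON files; for simplicity it keys only on token, posTag, dependencyLabel and URL rather than the full attribute set:

```python
def dedup_examples(records):
    """Remove repeated example snippets citing the same URL for the same
    token construct across different timeslot records (simplified key:
    token, posTag, dependencyLabel and URL only)."""
    seen = set()
    for rec in records:
        kept = []
        for ex in rec["examples"]:
            key = (rec["token"], rec.get("posTag"), rec.get("dependencyLabel"), ex["url"])
            if key not in seen:
                seen.add(key)      # first citation of this URL wins
                kept.append(ex)
        rec["examples"] = kept
    return records
```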
What are the kinds of questions you can ask of the data?
Here's a trivial example of asking what parts of speech the word "dogged" appeared as on January 1, 2020:
SELECT token, posTag FROM `gdelt-bq.gdeltv2.web_pos` WHERE DATE(dateTime) = "2020-01-01" AND token='dogged' GROUP BY token, posTag
It appears that on that day, the word "dogged" appeared as both an adjective and a verb.
What about its dependency roles that day?
SELECT token, dependencyLabel FROM `gdelt-bq.gdeltv2.web_pos` WHERE DATE(dateTime) = "2020-01-01" AND token='dogged' GROUP BY token, dependencyLabel
The result (with descriptions added from their definitions in the Cloud Natural Language API documentation):
| token | dependencyLabel | Description |
|---|---|---|
| dogged | ADVCL | Adverbial clause modifier |
| dogged | AMOD | Adjectival modifier of an NP |
| dogged | CCOMP | Clausal complement of a verb or adjective |
| dogged | PCOMP | The complement of a preposition is a clause |
| dogged | POBJ | Object of a preposition |
| dogged | RCMOD | Relative clause modifier |
| dogged | XCOMP | Open clausal complement |
Of course, the true power of this new dataset lies in being able to see the examples in situ with all of their associated attributes.
Here's a sample of adjective usage:
SELECT * FROM `gdelt-bq.gdeltv2.web_pos` WHERE DATE(dateTime) = "2020-01-01" AND token='dogged' AND posTag='ADJ' LIMIT 1000
| dateTime | count | token | lang | lemma | posTag | posTense | dependencyLabel | url | context |
|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 12:45:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | AMOD | http://www.thenewsherald.com/news/lincoln-park-man-has-family-ties-to-historic-mayflower-ship/article_14cdf9ce-2995-11ea-bfdb-ab1437f7e94b.html | determination and dogged detective work |
| 2020-01-01 20:30:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | CONJ | https://calgaryherald.com/entertainment/the-great-majority-of-our-new-years-resolutions-will-fail-and-yet-we-continue-to-make-them-nonetheless/wcm/4b0650b6-4990-47f7-8ab4-710bcf4cb8ee | large so dogged , so |
| 2020-01-01 20:30:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | POBJ | https://calgaryherald.com/entertainment/the-great-majority-of-our-new-years-resolutions-will-fail-and-yet-we-continue-to-make-them-nonetheless/wcm/4b0650b6-4990-47f7-8ab4-710bcf4cb8ee | large so dogged , so |
| 2020-01-01 20:30:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | ACOMP | https://calgaryherald.com/entertainment/the-great-majority-of-our-new-years-resolutions-will-fail-and-yet-we-continue-to-make-them-nonetheless/wcm/4b0650b6-4990-47f7-8ab4-710bcf4cb8ee | evidently not dogged or tenacious |
| 2020-01-01 22:45:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | AMOD | https://indianexpress.com/article/opinion/columns/citizenship-amendment-act-nrc-mohan-bhagwat-6195387/ | underlying the dogged and insidious |
| 2020-01-01 22:30:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | AMOD | https://news.sky.com/story/labour-leadership-sir-keir-starmer-takes-lead-in-race-to-replace-corbyn-poll-11899154 | Starmer 's dogged determination not |
| 2020-01-01 15:00:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | AMOD | https://wusfnews.wusf.usf.edu/post/sunshine-savior-retiring-barbara-petersen-floridas-first-amendment-champion | for her dogged pursuit of |
| 2020-01-01 01:45:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | AMOD | https://www.niemanlab.org/2019/12/a-smarter-conversation-about-how-and-why-fact-checking-matters/ | , where dogged reporters expose |
| 2020-01-01 13:30:00 UTC | 1 | dogged | en | dogged | ADJ | PAST | AMOD | https://www.sthelensstar.co.uk/news/18130841.highlights-challenges-saints-2020/ | shown some dogged determination , |
Requesting its verb usage is equally trivial:
SELECT * FROM `gdelt-bq.gdeltv2.web_pos` WHERE DATE(dateTime) = "2020-01-01" AND token='dogged' AND posTag='VERB' LIMIT 1000
| dateTime | count | token | lang | lemma | posTag | posMood | posTense | posVoice | dependencyLabel | url | context |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 18:30:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | ROOT | https://sputniknews.com/latam/202001011077917429-panama-canal-reportedly-suffering-from-major-water-shortage-lacks-over-40-of-needed-volume/ | supply has dogged the Panama |
| 2020-01-01 23:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | CONJ | https://www.sott.net/article/426632-Is-the-shale-boom-running-on-fumes | long been dogged by a |
| 2020-01-01 13:15:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | ROOT | http://www.theprogressnews.com/news/state/editorial-roundup-pennsylvania/article_419717a7-f124-544b-a53e-93e7f8b8539d.html | are n't dogged by the |
| 2020-01-01 16:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | CONJ | https://www.the42.ie/jack-grealish-heel-var-villa-burnley-4950998-Jan2020/ | year was dogged by another |
| 2020-01-01 03:45:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | AMOD | https://www.thisdaylive.com/index.php/2020/01/01/natures-gift-to-the-nation/ | with the dogged determination of |
| 2020-01-01 03:15:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | ROOT | https://tucson.com/news/state-and-regional/lawyers-want-state-to-cover-costs-of-monitoring-inmate-care/article_667da9a3-d5a5-5f39-89cf-6055ea73c274.html | has been dogged for several |
| 2020-01-01 14:45:00 UTC | 2 | dogged | en | dog | VERB | | PAST | | RCMOD | https://www.middleeastmonitor.com/20200101-embracing-palestine-how-to-combat-israels-misuse-of-antisemitism/ | which have dogged his leadership |
| 2020-01-01 12:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | ROOT | http://www.daijiworld.com/news/newsDisplay.aspx?newsID=658952 | is still dogged with massive |
| 2020-01-01 11:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | PCOMP | https://futaa.com/article/199551/arsenal-vs-manchester-united-here-s-where-to-place-your-bets | sides being dogged by inconsistency |
| 2020-01-01 04:15:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | ROOT | https://www.pensionplanpuppets.com/2019/12/31/21044975/recap-maple-leafs-win-a-wild-new-years-eve-game-minnesota | could have dogged it a |
| 2020-01-01 15:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | RCMOD | https://www.staradvertiser.com/2020/01/01/hawaii-news/ahead-in-2020-first-rail-segment-to-open-by-the-end-of-year/?HSA=3a53de7e21313deb43ec33ba0d2c5e21537960ec | has been dogged by years |
| 2020-01-01 02:15:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | AMOD | https://www.havasunews.com/opinion/our-view-s-created-exciting-vision-for-the-s/article_bd8e3902-2c29-11ea-9fdb-47dc65f3c52b.html | thanks to dogged determination of |
| 2020-01-01 03:15:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | ROOT | http://www.xinhuanet.com/english/2020-01/01/c_138670076.htm | divide has dogged British politics |
| 2020-01-01 07:45:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | ROOT | https://www.yenisafak.com/en/world/trumps-2019-slaps-sanctions-on-iran-turkey-and-venezuela-and-supports-israel-3508633 | has been dogged by militant |
| 2020-01-01 12:45:00 UTC | 1 | dogged | en | dog | VERB | | PAST | PASSIVE | ROOT | https://www.scotsman.com/arts-and-culture/edinburgh-festivals/edinburgh-council-leader-suggests-hogmanay-party-could-be-scaled-down-from-global-bucket-list-status-in-future-1-5069031 | have been dogged by controversies |
| 2020-01-01 16:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | AMOD | https://www.breitbart.com/national-security/2020/01/01/latin-america-socialists-regain-momentum-in-2019-despite-removal-of-bolivias-morales/ | him of dogged loyalty to |
| 2020-01-01 16:00:00 UTC | 1 | dogged | en | dog | VERB | | PAST | | ROOT | http://morungexpress.com/avian-world-record-india | is still dogged with massive |
| 2020-01-01 08:00:00 UTC | 1 | dogged | en | dog | VERB | INDICATIVE | PAST | | RCMOD | https://www.mirror.co.uk/3am/celebrity-news/cody-simpson-declares-love-miley-21193109 | rumours that dogged the start |
These are trivial examples of how the new dataset can be used!
The complete dataset is available as a set of gzipped newline-delimited JSON files and in BigQuery.
You can download all of the JSON files from:
The table is also available in Google's BigQuery:
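If you prefer to work with the downloaded files directly, each one is a gzipped stream of newline-delimited JSON records. A minimal Python sketch for streaming one file and tallying part-of-speech tags for a given token (the filename and per-record fields follow the schema described below; treat it as illustrative rather than official tooling):

```python
import gzip
import json

def read_records(path):
    """Stream records from one gzipped newline-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def pos_counts(path, token):
    """Tally the posTag values for one token across a file,
    weighted by each record's 15-minute 'count' field."""
    counts = {}
    for rec in read_records(path):
        if rec.get("token") == token:
            tag = rec.get("posTag", "UNKNOWN")
            counts[tag] = counts.get(tag, 0) + rec.get("count", 1)
    return counts
```

This reproduces in plain Python the same kind of question as the first BigQuery example above, without needing a BigQuery project.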
The fields within are as follows:
- dateTime. The date and time (currently rounded to the nearest 15 minutes, though future releases may use higher precision) at which GDELT processed the article through Google's Cloud Natural Language API. Typically this is within 30 minutes of the article being published, but it can sometimes take hours to days.
- count. The number of times this exact token + attribute combination appeared during the given 15 minute window. (As described earlier, each document token is concatenated into a case-sensitive hash key consisting of the token and all of its attribute fields; identical keys are collapsed into a single record with this count.)
- token. The actual token itself as determined by the API. This is typically a word or punctuation mark.
- lang. The Google-assigned language code for the article this word came from, representing the linguistic rules under which the word was processed. The API performs automatic language detection on each article; articles that contain words in multiple languages are assigned the single detected language recorded here.
- lemma. The token's lemma.
- posTag. The token's part of speech.
- posAspect. The token's grammatical aspect.
- posCase. The token's grammatical case.
- posForm. The token's grammatical form.
- posGender. The token's grammatical gender.
- posMood. The token's grammatical mood.
- posNumber. The token's grammatical number.
- posPerson. The token's grammatical person.
- posProper. The token's grammatical properness.
- posReciprocity. The token's grammatical reciprocity.
- posTense. The token's grammatical tense.
- posVoice. The token's grammatical voice.
- dependencyLabel. The token's dependency parse label.
- examples. An array of up to five example snippets from the given 15 minute period showing the word being used in the specified context. Each snippet is up to five tokens long, with the given token at the center, flanked by the two preceding and two following words. Each unique token+dateTime+attributes combination will feature only a single example from a given URL, even if the given usage appears multiple times in that article, in order to maximize the number of distinct source examples.
- url. The URL from which the example snippet comes.
- context. A snippet of up to five tokens with the token in question at the center.
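The five-token context window described in the examples field amounts to a simple slice around the token. A minimal sketch, assuming the document has already been split into tokens (in the actual dataset that tokenization is performed by the API):

```python
def context_snippet(tokens, index, window=2):
    """Return a snippet of up to five tokens centered on tokens[index]:
    the two preceding tokens, the token itself, and the two following.
    Shorter snippets are produced near document boundaries."""
    start = max(0, index - window)
    return " ".join(tokens[start:index + window + 1])
```

For example, `context_snippet(["determination", "and", "dogged", "detective", "work"], 2)` yields the snippet seen in the first adjective result above.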
We're enormously excited to see what you can do with this immense new dataset!