What Can 101 Billion Words, The NLP API And BigQuery Tell Us About The Part Of Speech Patterns Of The English Language?

The massive new 101-billion-token Web Part of Speech dataset, annotated by Google's Cloud Natural Language API and released yesterday, offers unprecedented insight into how language has been used and has evolved globally over the past third of a decade.

Given that the entire dataset is available in BigQuery, asking language-scale questions is as simple as a single line of SQL.

How many distinct case-insensitive tokens and lemmas are there in the English portion of the dataset?

SELECT COUNT(DISTINCT LOWER(token)), COUNT(DISTINCT LOWER(lemma)) FROM `gdelt-bq.gdeltv2.web_pos` WHERE lang='en'

The answer is 53,437,014 distinct tokens and 52,906,365 distinct lemmas.

How many of those lemmas are different from their token? In other words, how often does lemmatization change the word?

SELECT LOWER(token), LOWER(lemma), COUNT(1) count FROM `gdelt-bq.gdeltv2.web_pos` WHERE lang='en' AND token!=lemma GROUP BY LOWER(token), LOWER(lemma) ORDER BY count DESC

In all, 2,007,675 distinct tokens differ from their lemma. This is a surprisingly large number, suggesting that many of these may be misspellings and rare words. What if we limit the results to just those token-lemma pairs seen in at least 11 pairings (each of which can be a grammatical context or a 15-minute interval)?

SELECT LOWER(token), LOWER(lemma), COUNT(1) count FROM `gdelt-bq.gdeltv2.web_pos` WHERE lang='en' AND token!=lemma GROUP BY LOWER(token), LOWER(lemma) HAVING count>10 ORDER BY count DESC

This reduces the total count to 240,020 distinct tokens with lemmas different from the original token.
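To make the effect of the HAVING threshold concrete, here is a minimal sketch that runs the same shape of query against a tiny in-memory table. It assumes Python's built-in sqlite3 module as a stand-in for BigQuery, and the rows, the three-column table layout and the helper name changed_pairs are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical toy rows (token, lemma, lang); the real table is far larger.
rows = [
    ("running", "run", "en"),
    ("running", "run", "en"),
    ("Running", "run", "en"),
    ("ran", "run", "en"),
    ("teh", "the", "en"),     # a one-off misspelling
    ("cats", "cat", "en"),
    ("cats", "cat", "en"),
    ("cat", "cat", "en"),     # token equals lemma, so it is filtered out
    ("chats", "chat", "fr"),  # not English, so it is filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_pos (token TEXT, lemma TEXT, lang TEXT)")
conn.executemany("INSERT INTO web_pos VALUES (?, ?, ?)", rows)


def changed_pairs(min_rows):
    """Return (token, lemma, n) for English pairs seen in at least min_rows rows."""
    query = """
        SELECT LOWER(token) AS tok, LOWER(lemma) AS lem, COUNT(*) AS n
        FROM web_pos
        WHERE lang = 'en' AND token != lemma
        GROUP BY LOWER(token), LOWER(lemma)
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, tok
    """
    return conn.execute(query, (min_rows,)).fetchall()


all_pairs = changed_pairs(1)       # every pair where the lemma differs
frequent_pairs = changed_pairs(2)  # only pairs seen in two or more rows

print(all_pairs)
print(frequent_pairs)
```

In this toy table the threshold drops the one-off pairs, including the misspelling, while the repeated pairs survive. This is the same filtering that the count>10 condition applies at full scale.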

What is the part of speech breakdown of the dataset? How many more nouns do we see compared to verbs?

SELECT posTag, SUM(count) count FROM `gdelt-bq.gdeltv2.web_pos` WHERE lang='en' GROUP BY posTag ORDER BY count DESC

This yields the table below (see the Cloud Natural Language API documentation for definitions):

posTag	Count	%
NOUN	30,339,745,788	29.99
PUNCT	14,668,416,836	14.5
VERB	14,219,986,924	14.06
ADP	10,692,080,137	10.57
DET	8,562,611,500	8.46
ADJ	5,840,252,133	5.77
PRON	4,202,292,240	4.15
ADV	3,661,411,543	3.62
NUM	2,861,812,273	2.83
CONJ	2,788,993,340	2.76
PRT	2,369,605,987	2.34
X	956,943,528	0.95

Interestingly, despite its ubiquity, punctuation accounts for just 14.5% of the total token volume, a little more than verbs, while nouns account for almost 30%, more than twice the share of verbs. The label "X" refers to "Other: foreign words, typos, abbreviations", meaning that 99.05% of tokens in the dataset have an assigned linguistic role.
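The percentage column and the noun-to-verb comparison can be reproduced from the raw counts with a few lines of arithmetic. The sketch below copies the counts from the part of speech table above. The variable names are our own, and we assume each percentage is simply the tag's count divided by the sum of all twelve rows.

```python
# Counts copied from the part of speech table above.
pos_counts = {
    "NOUN": 30_339_745_788,
    "PUNCT": 14_668_416_836,
    "VERB": 14_219_986_924,
    "ADP": 10_692_080_137,
    "DET": 8_562_611_500,
    "ADJ": 5_840_252_133,
    "PRON": 4_202_292_240,
    "ADV": 3_661_411_543,
    "NUM": 2_861_812_273,
    "CONJ": 2_788_993_340,
    "PRT": 2_369_605_987,
    "X": 956_943_528,
}

# Total English token volume across all tags.
total = sum(pos_counts.values())

# Share of each tag as a percentage of the total.
shares = {tag: 100 * n / total for tag, n in pos_counts.items()}

# How many nouns appear for every verb.
noun_verb_ratio = pos_counts["NOUN"] / pos_counts["VERB"]

# Share of tokens that received a tag other than the catch-all "X".
labeled_share = 100 - shares["X"]

for tag, pct in shares.items():
    print(f"{tag:5s} {pct:6.2f}%")
print(f"total tokens: {total:,}")
print(f"nouns per verb: {noun_verb_ratio:.2f}")
print(f"tokens tagged other than X: {labeled_share:.2f}%")
```

Summing the rows gives a little over 101 billion tokens, in line with the headline figure, and the ratio works out to slightly more than two nouns for every verb.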

What about their dependency roles? Once again, a single SQL query is all that's required.

SELECT dependencyLabel, SUM(count) count FROM `gdelt-bq.gdeltv2.web_pos` WHERE lang='en' GROUP BY dependencyLabel ORDER BY count DESC

This yields the table below (see the Cloud Natural Language API documentation for definitions):

dependencyLabel	Count	%
P	14,978,583,119	14.81
PREP	9,400,425,051	9.29
POBJ	8,900,253,352	8.8
NN	8,246,920,859	8.15
DET	7,483,773,285	7.4
NSUBJ	6,521,929,575	6.45
AMOD	5,011,074,619	4.95
DOBJ	4,141,047,892	4.09
ROOT	3,774,814,457	3.73
AUX	3,375,326,375	3.34
CONJ	3,223,887,459	3.19
ADVMOD	3,164,811,153	3.13
CC	2,789,876,948	2.76
NUM	1,865,646,399	1.84
POSS	1,759,028,293	1.74
DEP	1,677,733,243	1.66
APPOS	1,522,666,088	1.51
CCOMP	1,462,459,573	1.45
XCOMP	1,214,964,517	1.2
MARK	1,032,397,231	1.02
RCMOD	890,104,048	0.88
VMOD	864,783,810	0.85
AUXPASS	776,299,987	0.77
ADVCL	749,694,374	0.74
ACOMP	718,181,564	0.71
NSUBJPASS	670,925,459	0.66
ATTR	667,006,502	0.66
PS	664,259,826	0.66
PCOMP	630,466,510	0.62
NEG	549,964,461	0.54
TMOD	393,526,973	0.39
PRT	332,119,940	0.33
PARATAXIS	297,485,305	0.29
TITLE	281,195,129	0.28
NPADVMOD	225,647,728	0.22
NUMBER	189,552,961	0.19
EXPL	120,628,958	0.12
QUANTMOD	94,776,418	0.09
MWE	92,056,954	0.09
UNKNOWN	86,213,529	0.09
IOBJ	70,724,484	0.07
CSUBJ	66,722,115	0.07
SUFFIX	49,822,065	0.05
DISCOURSE	49,479,219	0.05
PREDET	47,201,936	0.05
PRECONJ	36,121,954	0.04
CSUBJPASS	977,932	0
GOESWITH	589,888	0
REPARANDUM	2,676	0
PARTMOD	26	0
VOCATIVE	10	0
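The queries above return only the summed counts, so the percentage column has to be computed separately. One way to fold it into the query itself is a scalar subquery for the grand total, sketched here against a toy table. It assumes Python's built-in sqlite3 module as a stand-in for BigQuery, and the rows and the pared-down three-column layout are hypothetical. BigQuery syntax may differ in details.

```python
import sqlite3

# Hypothetical toy rows (posTag, lang, count). "count" is quoted in the SQL
# because it is also the name of an aggregate function.
rows = [
    ("NOUN", "en", 60),
    ("NOUN", "en", 40),
    ("VERB", "en", 50),
    ("PUNCT", "en", 50),
    ("NOUN", "de", 999),  # not English, so it is excluded from both sums
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE web_pos (posTag TEXT, lang TEXT, "count" INTEGER)')
conn.executemany("INSERT INTO web_pos VALUES (?, ?, ?)", rows)

# Per-tag sum, plus that sum as a percentage of the English grand total.
query = """
    SELECT posTag,
           SUM("count") AS tokens,
           ROUND(100.0 * SUM("count") /
                 (SELECT SUM("count") FROM web_pos WHERE lang = 'en'), 2) AS pct
    FROM web_pos
    WHERE lang = 'en'
    GROUP BY posTag
    ORDER BY tokens DESC, posTag
"""
breakdown = conn.execute(query).fetchall()

for tag, tokens, pct in breakdown:
    print(f"{tag}\t{tokens}\t{pct}")
```

The same pattern applies to the dependencyLabel column by swapping the grouped field.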

BigQuery's immense scalability makes it a perfect lens through which to understand the macro-scale patterns of our world using Google's AI offerings like the Cloud Natural Language API.

The GDELT Project
