What Can 101 Billion Words, The NLP API And BigQuery Tell Us About The Part Of Speech Patterns Of The English Language?

The massive new 101-billion-token Web Part of Speech dataset, annotated by Google's Cloud Natural Language API and released yesterday, offers unprecedented insights into the evolution and usage of language globally over the past third of a decade.

Given that the entire dataset is available in BigQuery, asking language-scale questions is as simple as a single line of SQL.

How many distinct case-insensitive tokens and lemmas are there in the English portion of the dataset?

SELECT count(distinct LOWER(token)), count(distinct LOWER(lemma)) FROM `gdelt-bq.gdeltv2.web_pos` where lang='en'

The answer is 53,437,014 distinct tokens and 52,906,365 distinct lemmas.
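Exact distinct counts over a table of this size can be costly to compute. As a rough sketch, BigQuery's APPROX_COUNT_DISTINCT function can stand in for count(distinct ...) when an estimate is enough. The column aliases below are illustrative, and the results will only approximate the exact figures reported above.

```sql
-- Sketch: approximate distinct counts, trading exactness for speed.
-- approx_tokens and approx_lemmas are illustrative alias names.
SELECT
  APPROX_COUNT_DISTINCT(LOWER(token)) AS approx_tokens,
  APPROX_COUNT_DISTINCT(LOWER(lemma)) AS approx_lemmas
FROM `gdelt-bq.gdeltv2.web_pos`
WHERE lang = 'en'
```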

How many of those lemmas are different from their token? In other words, how often does lemmatization change the word?

SELECT LOWER(token), LOWER(lemma), count(1) count FROM `gdelt-bq.gdeltv2.web_pos` where lang='en' and token!=lemma group by LOWER(token), LOWER(lemma) order by count desc

In all, 2,007,675 distinct tokens are different from their lemma. This is a surprisingly large number, suggesting that many of these may be misspellings or rare words. What if we limit the results to just those token-lemma pairs seen in at least 11 pairings (each pairing can be a grammatical context or a 15-minute interval)?

SELECT LOWER(token), LOWER(lemma), count(1) count FROM `gdelt-bq.gdeltv2.web_pos` where lang='en' and token!=lemma group by LOWER(token), LOWER(lemma) having count>10 order by count desc

This reduces the total count to 240,020 distinct tokens with lemmas different from the original token.
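The two figures above correspond to the number of rows each query returns, that is, the number of distinct lowercased token-lemma pairs. A minimal sketch of how both could be obtained in a single pass is shown below. It keeps the original filter (token!=lemma, which is a case-sensitive comparison) and uses illustrative alias names; it has not been run against the dataset here.

```sql
-- Sketch: count distinct lowercased token-lemma pairs directly,
-- instead of reading the row count off each result set.
-- Alias names (token_lc, lemma_lc, pairings, ...) are illustrative.
SELECT
  COUNT(1) AS all_pairs,
  COUNTIF(pairings > 10) AS pairs_seen_at_least_11_times
FROM (
  SELECT
    LOWER(token) AS token_lc,
    LOWER(lemma) AS lemma_lc,
    COUNT(1) AS pairings
  FROM `gdelt-bq.gdeltv2.web_pos`
  WHERE lang = 'en' AND token != lemma
  GROUP BY token_lc, lemma_lc
)
```

Replacing the filter with LOWER(token) != LOWER(lemma) would exclude pairs that differ only by capitalization, which would likely yield smaller counts than those reported above.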

What is the part of speech breakdown of the dataset? How many more nouns do we see compared to verbs?

select posTag, sum(count) count FROM `gdelt-bq.gdeltv2.web_pos` where lang='en' group by posTag order by count desc

This yields the table below (see the Cloud Natural Language API documentation for definitions):

posTag Count %
NOUN 30,339,745,788 29.99
PUNCT 14,668,416,836 14.50
VERB 14,219,986,924 14.06
ADP 10,692,080,137 10.57
DET 8,562,611,500 8.46
ADJ 5,840,252,133 5.77
PRON 4,202,292,240 4.15
ADV 3,661,411,543 3.62
NUM 2,861,812,273 2.83
CONJ 2,788,993,340 2.76
PRT 2,369,605,987 2.34
X 956,943,528 0.95

Interestingly, despite its ubiquity, punctuation accounts for just 14.5% of the total token volume, slightly more than verbs, while nouns account for almost 30%, roughly 2.1 times the volume of verbs. The label "X" refers to "Other: foreign words, typos, abbreviations" meaning that 99.05% of tokens in the dataset have an assigned linguistic role.
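The percentage column can also be computed inside BigQuery rather than afterwards. The sketch below assumes the same columns as the query above and uses an analytic SUM over the grouped totals. The aliases tokens and pct are illustrative.

```sql
-- Sketch: per-tag token totals plus each tag's share of all English tokens.
-- SUM(SUM(count)) OVER () is the grand total across all posTag groups.
SELECT
  posTag,
  SUM(count) AS tokens,
  ROUND(100 * SUM(count) / SUM(SUM(count)) OVER (), 2) AS pct
FROM `gdelt-bq.gdeltv2.web_pos`
WHERE lang = 'en'
GROUP BY posTag
ORDER BY tokens DESC
```

If the table is unchanged since the figures above were produced, the pct column should reproduce the percentages in the table.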

What about their dependency roles? Once again, a single SQL query is all that's required.

select dependencyLabel, sum(count) count FROM `gdelt-bq.gdeltv2.web_pos` where lang='en' group by dependencyLabel order by count desc

This yields the table below (see the Cloud Natural Language API documentation for definitions):

dependencyLabel Count %
P 14,978,583,119 14.81
PREP 9,400,425,051 9.29
POBJ 8,900,253,352 8.80
NN 8,246,920,859 8.15
DET 7,483,773,285 7.40
NSUBJ 6,521,929,575 6.45
AMOD 5,011,074,619 4.95
DOBJ 4,141,047,892 4.09
ROOT 3,774,814,457 3.73
AUX 3,375,326,375 3.34
CONJ 3,223,887,459 3.19
ADVMOD 3,164,811,153 3.13
CC 2,789,876,948 2.76
NUM 1,865,646,399 1.84
POSS 1,759,028,293 1.74
DEP 1,677,733,243 1.66
APPOS 1,522,666,088 1.51
CCOMP 1,462,459,573 1.45
XCOMP 1,214,964,517 1.20
MARK 1,032,397,231 1.02
RCMOD 890,104,048 0.88
VMOD 864,783,810 0.85
AUXPASS 776,299,987 0.77
ADVCL 749,694,374 0.74
ACOMP 718,181,564 0.71
NSUBJPASS 670,925,459 0.66
ATTR 667,006,502 0.66
PS 664,259,826 0.66
PCOMP 630,466,510 0.62
NEG 549,964,461 0.54
TMOD 393,526,973 0.39
PRT 332,119,940 0.33
PARATAXIS 297,485,305 0.29
TITLE 281,195,129 0.28
NPADVMOD 225,647,728 0.22
NUMBER 189,552,961 0.19
EXPL 120,628,958 0.12
QUANTMOD 94,776,418 0.09
MWE 92,056,954 0.09
UNKNOWN 86,213,529 0.09
IOBJ 70,724,484 0.07
CSUBJ 66,722,115 0.07
SUFFIX 49,822,065 0.05
DISCOURSE 49,479,219 0.05
PREDET 47,201,936 0.05
PRECONJ 36,121,954 0.04
CSUBJPASS 977,932 0.00
GOESWITH 589,888 0.00
REPARANDUM 2,676 0.00
PARTMOD 26 0.00
VOCATIVE 10 0.00
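Because part of speech tags and dependency labels live in the same table, the two breakdowns can be combined. As a sketch, the query below lists the most common dependency roles filled by nouns. It assumes that each row carries both a posTag and a dependencyLabel and that tags are stored as the uppercase strings shown in the tables above. The LIMIT of 10 is arbitrary.

```sql
-- Sketch: the ten most common dependency roles among tokens tagged NOUN.
SELECT
  dependencyLabel,
  SUM(count) AS tokens
FROM `gdelt-bq.gdeltv2.web_pos`
WHERE lang = 'en' AND posTag = 'NOUN'
GROUP BY dependencyLabel
ORDER BY tokens DESC
LIMIT 10
```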

BigQuery's immense scalability makes it a perfect lens through which to understand the macro-scale patterns of our world using Google's AI offerings such as the Cloud Natural Language API.