Global Embedded Metadata Graph (GEMG): Keywords & Google News Keywords In META Tags

The Global Embedded Metadata Graph (GEMG) records all HTML META tags found in each article. What percentage of articles include a META tag that lists publisher-provided descriptive keywords for the article, either a general "keywords" field or the Google News-specific "news_keywords" field?

Exploring this question is as simple as:

SELECT count(distinct(url)) FROM `gdelt-bq.gdeltv2.gemg`, unnest(metatags) metatag WHERE (key='news_keywords' OR key='keywords') and DATE(date) >= '2021-11-01' and DATE(date) <= '2021-11-30'

For November 2021, this yields 7,059,333 articles out of a total of 12,520,264. Thus, just over 56% of all articles include some form of keyword field in their META tags.
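The query above returns only the numerator. As a minimal sketch, the variant below computes the article total and the keyword count together, so the percentage comes out of a single query. It uses only the columns already referenced above (url, date, metatags and its key field); the CTE name and the output column names are illustrative, and the total it returns may not match the figure above exactly if that figure was derived differently.

```sql
-- Sketch: total articles and articles with a keyword META tag in one pass.
-- "articles", "has_keywords" and the output column names are illustrative.
WITH articles AS (
  SELECT
    url,
    -- TRUE when at least one META tag on the page is a keyword field
    EXISTS (
      SELECT 1
      FROM UNNEST(metatags) AS metatag
      WHERE metatag.key IN ('news_keywords', 'keywords')
    ) AS has_keywords
  FROM `gdelt-bq.gdeltv2.gemg`
  WHERE DATE(date) >= '2021-11-01' AND DATE(date) <= '2021-11-30'
)
SELECT
  COUNT(DISTINCT url) AS total_articles,
  COUNT(DISTINCT IF(has_keywords, url, NULL)) AS keyword_articles,
  -- SAFE_DIVIDE returns NULL instead of failing if the date range is empty
  ROUND(100 * SAFE_DIVIDE(
    COUNT(DISTINCT IF(has_keywords, url, NULL)),
    COUNT(DISTINCT url)), 1) AS pct_with_keywords
FROM articles
```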

Keywords can also be found in JSON-LD blocks. Thus, a better estimate combines keywords found in both META tags and JSON-LD blocks:

SELECT COUNT(distinct(url)) from (
SELECT url FROM `gdelt-bq.gdeltv2.gemg`, unnest(metatags) metatag WHERE (key='news_keywords' OR key='keywords') and DATE(date) >= '2021-11-01' and DATE(date) <= '2021-11-30'
UNION ALL
SELECT url FROM `gdelt-bq.gdeltv2.gemg`, unnest(jsonld) block WHERE (block like '%"keywords"%') and DATE(date) >= '2021-11-01' and DATE(date) <= '2021-11-30'
)

This returns 8,141,062 articles, or 65% of the 12,520,264 total.
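The figures above cover the whole month. To see how the share varies from day to day, one possible approach is to group the same test by date. The sketch below assumes the same columns used in the queries above; the CTE name, the has_* flags and the output column names are illustrative. An article whose URL appears on more than one day is counted once per day.

```sql
-- Sketch: daily share of articles with keywords in META tags or JSON-LD.
WITH articles AS (
  SELECT
    DATE(date) AS article_date,
    url,
    -- Keyword field present among the page's META tags
    EXISTS (
      SELECT 1
      FROM UNNEST(metatags) AS metatag
      WHERE metatag.key IN ('news_keywords', 'keywords')
    ) AS has_meta_keywords,
    -- "keywords" property present in any JSON-LD block (same LIKE test as above)
    EXISTS (
      SELECT 1
      FROM UNNEST(jsonld) AS block
      WHERE block LIKE '%"keywords"%'
    ) AS has_jsonld_keywords
  FROM `gdelt-bq.gdeltv2.gemg`
  WHERE DATE(date) >= '2021-11-01' AND DATE(date) <= '2021-11-30'
)
SELECT
  article_date,
  COUNT(DISTINCT url) AS total_articles,
  COUNT(DISTINCT IF(has_meta_keywords OR has_jsonld_keywords, url, NULL)) AS keyword_articles,
  ROUND(100 * SAFE_DIVIDE(
    COUNT(DISTINCT IF(has_meta_keywords OR has_jsonld_keywords, url, NULL)),
    COUNT(DISTINCT url)), 1) AS pct_with_keywords
FROM articles
GROUP BY article_date
ORDER BY article_date
```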

What kinds of keywords are embedded in pages? Not all are useful: some entries are generic site-wide keywords repeated for every article, some simply repeat the title or description text, and some are a simple word histogram that includes stopwords. Entries that do contain valid keywords typically list major entities and topics:

RSS activist murder case, RSS, BJP, Kerala Govt, V Muraleedharan, आरएसएस कार्यकर्ता, आरएसएस कार्यकर्ता हत्या मामले, आरएसएस, भाजपा, केरल सरकार, माकपा, इस्लामी आतंकवाद
PM Modi, Narendra Modi, Modi Govt, Farm Laws, Farmer Protests, new delhi, farmers, MSP, Rahul Gandhi, Mamata Banerjee, Amrinder Singh, Navjot Singh Sidhu, नरेंद्र मोदी, मोदी सरकार, कृषि कानून, किसान आंदोलन, एमएसपी, किसान, कैप्टन अमरिंदर सिंह, नवजोत सिंह सिद्धू, नरेंद्र सिंह तोमर, राकेश टिकैत, संयुक्त किसान मोर्चा, राहुल गांधी, मल्लिकार्जुन खड़गे, किसान संगठन, ममता बनर्जी
Madhyapradesh,bhopal,krishi kanon,pm,pm modi,prime minister, narendra modi,narendra singh tomar,bjp, congress,mp bjp,mp congress, madhyapradesh bjp, madhyapradesh congress,farmer,farmbill,farmers,kakkaji,madhyapradesh news, bhopal news,मध्यप्रदेश, भोपाल,मध्यप्रदेश न्यूज़, भोपाल न्यूज़, मध्यप्रदेश की खबर,भोपाल की खबर
kolkata-politics,news,state,Farmar, MSP law, Mamata Banerjee on Farm Laws Repeal, Rakesh Tikait, Rakesh Tikait, Kisan Agitation At UP Border, Delhi Farmers Protest, delhi politics, कोलकाता न्यूज, हिंदी न्यूज, HPCommonManIssues, Agricultural laws, Mamta Banerjee, victory of farmers, कृषि कानूनों की वापसी, भाजपा, किसान,Bengal Politics, TMC, Bengal by elections, Bengal by elections 2021, Bengal election news, West Bengal Politics, Kolkata Politics, BJP, संयुक्त किसान मोर्चा, एसकेएम नेता, माकपा पोलि…[TRUNCATED ORIGCHARLEN=545]
Narcotics Anonymous, Alcoholics Anonymous, Yellow House, Open meeting, Christian Businessmen's Committee, New Life Fellowship Church, Daily Calendar
Joe Biden, White House, Peanut Butter, the White House, Thanksgiving dinner, Peanut Butter and Jelly, White House, turkeys, Peanut Butter and Jelly, Jayson Lusk, turkey, Thanksgiving
President Biden, Stan Greenberg, House Democrats, Dave Wasserman, House seats, midterm elections, Democratic administration, GOP, Donald Trump, Jackie Speier, Daily Beast, President Obama, Tony Fabrizio
Kim Kardashian, Pete Davidson, relationship, holding hands, Travis Barker
Nancy Pelosi, Kevin McCarthy, House Democrats, Magic Minute
wheat production, Chicago wheat, Corn futures, wheat stocks
Matt Schlapp, Sesame Street, Ernie and Bert, Muppet characters, Asian-American
Japan, Scotland, Murrayfield, the World Cup
Township Council, designated areas, cannabis products, transfer tax, NORTH BRUNSWICK, cannabis delivery
Robert Brady, fire district, fire commissioners, EMS service, aid service
Greene Township, Guilford Township, Antrim Township, Washington Township, Hamilton Township, Greene Township, Antrim Township, CHAMBERSBURG, Southampton Township, Chambersburg, St. Thomas Township, Jeffrey Sams
Cathy Hutchinson, Brian Hutchinson, Holly Danca, Hutchinson, brain injuries, Kathy Whipple, TAUNTON
public sector banks, public finances, public sector, Office for National Statistics, public debt
Guadeloupe, violent demonstrations, government ministers, Minister Sebastien Lecornu, Gerald Darmanin, overseas territory, France, police, Sudip Kar-Gupta
peace poster contest, district competition, peace poster, International competition
TVA, natural gas, Chattanooga Gas, Chattanooga, natural gas prices, residential customers, Brian Child, residential customer, fuel costs, fuel cost
Asheville Police Department, Jesse William Allison, Asheville, ice cream, Highland Brewing, open containers, Buncombe County Sheriff
Congressman Chris Smith, water infrastructure, President Biden, Chamber of Commerce
Chris Stapleton, Ron DeSantis, Brandon Brown, Let’s go Brandon, Country Music Association Awards
Facebook, Nir Barkat, Facebook posts, inauthentic, fake accounts, fake accounts, Israeli politician, Jerusalem, Jerusalem, Achiya Schatz, Lindsey Graham, Jerusalem
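To separate generic site-wide keywords from article-specific ones like those listed above, one option is to split each keyword field on commas and count how many articles carry each term. The sketch below assumes that each metatags entry exposes its content in a field named value; only key appears in the queries above, so verify the field name against the table schema before running. The aliases raw_keyword, keyword and articles are illustrative.

```sql
-- Sketch: most frequent individual keywords across keyword META tags.
-- ASSUMPTION: the META tag content is stored in metatag.value.
SELECT
  LOWER(TRIM(raw_keyword)) AS keyword,
  COUNT(DISTINCT url) AS articles
FROM `gdelt-bq.gdeltv2.gemg`,
  UNNEST(metatags) AS metatag,
  UNNEST(SPLIT(metatag.value, ',')) AS raw_keyword
WHERE metatag.key IN ('news_keywords', 'keywords')
  AND DATE(date) >= '2021-11-01' AND DATE(date) <= '2021-11-30'
  AND TRIM(raw_keyword) != ''
GROUP BY keyword
ORDER BY articles DESC
LIMIT 100
```

Terms at the top of this ranking are likely to be boilerplate such as "news" or a site name, while terms that appear on only a handful of articles are more likely to be genuine entities and topics.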