Unescaping Article Titles In The GKG 2.0

Since September 2019 the GKG 2.0 has included the title of each article in a "<PAGE_TITLE></PAGE_TITLE>" block in the "Extras" field. In keeping with the GKG 2.0's legacy ASCII format, every non-ASCII character in a title is HTML-escaped as "&#x", followed by the character's hexadecimal code point, followed by ";". Thus, you will see strings like "&#x627;&#x6CC;&#x645; &#x6A9;&#x6CC;&#x648; &#x627;&#x6CC;&#x645; &#x6A9;&#x648;" in place of "ایم کیو ایم کو".

This can present a challenge when working with the GKG 2.0 in BigQuery, since the platform does not provide a native HTML entity unescape function. Thankfully, BigQuery's UDF functionality allows us to trivially add this capability.
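
To make the escaping format concrete, here is a minimal JavaScript sketch of how text in this form could be produced. The function name "escapeNonAscii" is hypothetical and this is not GDELT's actual code; it only assumes the format seen in the records, namely uppercase hexadecimal code points with no zero padding.

```javascript
// Hypothetical helper, for illustration only (not GDELT's actual code).
// Replaces every non-ASCII character with a hexadecimal character reference.
function escapeNonAscii(text) {
  let out = "";
  // for...of iterates by code point, so characters outside the Basic
  // Multilingual Plane are not split into surrogate halves.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f
      ? "&#x" + codePoint.toString(16).toUpperCase() + ";"
      : ch;
  }
  return out;
}

console.log(escapeNonAscii("Noorte süstimine")); // prints: Noorte s&#xFC;stimine
```

ASCII characters pass through untouched, which is why titles in Latin-script languages remain mostly readable even before unescaping.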

To extract the title from post-September 2019 GKG 2.0 records, use "REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>')". To unescape it with our UDF, wrap that expression in a call to "titleunescape" and paste the UDF definition at the top of the query, as follows:

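CREATE TEMPORARY FUNCTION titleunescape(title STRING) RETURNS STRING LANGUAGE js AS '''
  // REGEXP_EXTRACT returns NULL when a record has no title block, so pass NULL through.
  if (title === null) return null;
  // Replace each hexadecimal character reference (such as &#x627;) with its character.
  return title.replace(/&#x([0-9a-fA-F]+);/g, function (whole, hex) {
    return String.fromCodePoint(parseInt(hex, 16));
  });
''';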

SELECT
  DocumentIdentifier,
  REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>') title,
  titleunescape(REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>')) titleunescaped,
  REGEXP_EXTRACT(TranslationInfo, r'srclc:(.*?);') lang
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE(_PARTITIONTIME) = "2021-08-04"
  AND (V2Themes LIKE '%VACCIN%' OR V2Themes LIKE '%IMMUNIZATION%')
  AND Extras LIKE '%&#x%'

You can see the results below, with the "title" column showing the raw HTML-escaped title as it appears in the GKG 2.0 record and the "titleunescaped" column containing the unescaped original Unicode version.

Each result is listed one field per line, in this order: Row, DocumentIdentifier, title, titleunescaped, lang.
1
https://www.livehindustan.com/uttar-pradesh/story-corona-vaccination-up-make-big-record-more-than-5-crore-people-got-jab-in-one-day-cm-yogi-says-there-will-no-let-up-in-availability-of-vaccines-4290986.html
corona vaccination up make big record more than 5 crore people got jab in one day cm yogi says there will no let up in availability of vaccines – &#x915;&#x94B;&#x930;&#x94B;&#x928;&#x93E; &#x91F;&#x940;&#x915;&#x93E;&#x915;&#x930;&#x923; &#x92E;&#x947;&#x902; &#x92F;&#x942;&#x92A;&#x940; &#x928;&#x947; &#x92C;&#x928;&#x93E;&#x92F;&#x93E; &#x90F;&#x915; &#x914;&#x930; &#x915;&#x940;&#x930;&#x94D;&#x924;&#x93F;&#x92E;&#x93E;&#x928;, 5 &#x915;&#x930;&#x94B;&#x921;&#x93C; &#x938;&#x947; &#x91C;&#x94D;&#x92F;&#x93E;&#x926;&#x93E; &#x932;&#x94B;&#x917;&#x94B;&#x902; &#x915;&#x94B; &#x932;&#x917;&#x93E; &#x91F;&#x940;&#x915;&#x93E;, &#x938;&#x940;&#x90F;&#x92E; &#x928;&#x947; &#x915;&#x939;&#x940; &#x92F;&#x947; &#x92C;&#x93E;&#x924;
corona vaccination up make big record more than 5 crore people got jab in one day cm yogi says there will no let up in availability of vaccines – कोरोना टीकाकरण में यूपी ने बनाया एक और कीर्तिमान, 5 करोड़ से ज्यादा लोगों को लगा टीका, सीएम ने कही ये बात
hin
2
https://www.livehindustan.com/national/story-r-value-on-rise-in-eight-states-second-wave-is-still-a-tension-cautions-government-india-hindi-news-4290990.html
R Value on rise in eight states second wave is still a tension cautions Government – India Hindi News – &#x916;&#x924;&#x94D;&#x92E; &#x928;&#x939;&#x940;&#x902; &#x939;&#x941;&#x908; &#x939;&#x948; &#x915;&#x94B;&#x930;&#x94B;&#x928;&#x93E; &#x915;&#x940; &#x926;&#x942;&#x938;&#x930;&#x940; &#x932;&#x939;&#x930;, 8 &#x930;&#x93E;&#x91C;&#x94D;&#x92F;&#x94B;&#x902; &#x92E;&#x947;&#x902; &#x905;&#x92D;&#x940; &#x92D;&#x940; R-&#x935;&#x948;&#x932;&#x94D;&#x92F;&#x942; &#x91C;&#x94D;&#x92F;&#x93E;&#x926;&#x93E;, &#x938;&#x930;&#x915;&#x93E;&#x930; &#x928;&#x947; &#x91A;&#x947;&#x924;&#x93E;&#x92F;&#x93E;
R Value on rise in eight states second wave is still a tension cautions Government – India Hindi News – खत्म नहीं हुई है कोरोना की दूसरी लहर, 8 राज्यों में अभी भी R-वैल्यू ज्यादा, सरकार ने चेताया
hin
3
https://meiemaa.ee/2021/08/04/noorte-sustimine-ootab-labimurret-maalapsed-ei-tohi-kaitseta-jaada/
Noorte s&#xFC;stimine ootab l&#xE4;bimurret, maalapsed ei tohi kaitseta j&#xE4;&#xE4;da
Noorte süstimine ootab läbimurret, maalapsed ei tohi kaitseta jääda
est
4
https://gorkhapatraonline.com/opinion/2021-08-04-43596
&#x92E;&#x939;&#x93E;&#x92E;&#x93E;&#x930;&#x940; &#x928;&#x93F;&#x92F;&#x928;&#x94D;&#x924;&#x94D;&#x930;&#x923; &#x915;&#x93E;&#x930;&#x94D;&#x92F;&#x92F;&#x94B;&#x91C;&#x928;&#x93E; (&#x938;&#x92E;&#x94D;&#x92A;&#x93E;&#x926;&#x915;&#x940;&#x92F;)
महामारी नियन्त्रण कार्ययोजना (सम्पादकीय)
nep
5
https://www.rtvslo.si/svet/v-vuhanu-bodo-po-sedmih-lokalno-prenesenih-okuzbah-testirali-vseh-11-milijonov-prebivalcev/589687
V Vuhanu bodo po sedmih lokalno prenesenih oku&#x17E;bah testirali vseh 11 milijonov prebivalcev
V Vuhanu bodo po sedmih lokalno prenesenih okužbah testirali vseh 11 milijonov prebivalcev
slv
6
https://sakala.postimees.ee/7307463/kristina-kallas-uus-vaktsineerimisplaan-ei-kolba-kuhugi
Kristina Kallas: uus vaktsineerimisplaan ei k&#xF5;lba kuhugi
Kristina Kallas: uus vaktsineerimisplaan ei kõlba kuhugi
est
7
https://www.balatarin.com/permlink/2021/8/3/5633288
&#x628;&#x627;&#x644;&#x627;&#x62A;&#x631;&#x6CC;&#x646;: &#x6F5; &#x62A;&#x641;&#x627;&#x648;&#x62A; &#x645;&#x647;&#x645; (&#x639;&#x628;&#x627;&#x633; &#x639;&#x628;&#x62F;&#x6CC;)
بالاترین: ۵ تفاوت مهم (عباس عبدی)
fas
8
https://borsen.dk:443/nyheder/generelt/israel-genindforer-coronarestriktioner-efter-smittestigning1
Israel genindf&#xF8;rer coronarestriktioner efter smittestigning
Israel genindfører coronarestriktioner efter smittestigning
dan
9
https://sakala.postimees.ee/7307462/kalvi-kova-maalapsed-ei-tohi-kaitsesustita-jaada
Kalvi K&#xF5;va: Maalapsed ei tohi kaitses&#xFC;stita j&#xE4;&#xE4;da
Kalvi Kõva: Maalapsed ei tohi kaitsesüstita jääda
est
10
https://www.divyabhaskar.co.in/local/gujarat/ahmedabad/news/of-cities-in-vaccination-in-the-128774222.html
Villages ahead of cities in vaccination in the gujarat | &#xAB0;&#xABE;&#xA9C;&#xACD;&#xAAF;&#xAAE;&#xABE;&#xA82; &#xAB5;&#xAC7;&#xA95;&#xACD;&#xAB8;&#xABF;&#xAA8; &#xAB2;&#xAC7;&#xAB5;&#xABE; &#xAAE;&#xABE;&#xAAE;&#xAB2;&#xAC7; &#xAB6;&#xAB9;&#xAC7;&#xAB0;&#xACB; &#xA95;&#xAB0;&#xAA4;&#xABE; &#xA97;&#xABE;&#xAAE;&#xAA1;&#xABE;&#xA82;&#xA93; &#xAB5;&#xAA7;&#xAC1; &#xA86;&#xA97;&#xAB3;, 16 &#xAA6;&#xABF;&#xAB5;&#xAB8;&#xAAE;&#xABE;&#xA82; &#xAB6;&#xAB9;&#xAC7;&#xAB0;&#xACB;&#xAAE;&#xABE;&#xA82; 19 &#xAB2;&#xABE;&#xA96; &#xAA1;&#xACB;&#xA9D; &#xAB8;&#xABE;&#xAAE;&#xAC7; &#xA97;&#xABE;&#xAAE;&#xAA1;&#xABE;&#xA82;&#xA93;&#xAAE;&#xABE;&#xA82; 26 &#xAB2;&#xABE;&#xA96; &#xAA1;&#xACB;&#xA9D; &#xA85;&#xAAA;&#xABE;&#xAAF;&#xABE;
Villages ahead of cities in vaccination in the gujarat | રાજ્યમાં વેક્સિન લેવા મામલે શહેરો કરતા ગામડાંઓ વધુ આગળ, 16 દિવસમાં શહેરોમાં 19 લાખ ડોઝ સામે ગામડાંઓમાં 26 લાખ ડોઝ અપાયા
guj
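
To check the logic outside BigQuery, the same kind of replacement can be run as plain JavaScript. The sketch below is a local test harness only: "titleUnescape" and "extractTitle" are illustrative names, the "extras" value is a made-up string built around one of the titles above, and the pattern is assumed to accept hexadecimal digits only.

```javascript
// Local stand-in for the UDF body: decodes hexadecimal character references.
function titleUnescape(title) {
  // Mirror SQL NULL handling: no title in, no title out.
  if (title === null || title === undefined) return null;
  return title.replace(/&#x([0-9a-fA-F]+);/g, (whole, hex) =>
    String.fromCodePoint(parseInt(hex, 16))
  );
}

// Local stand-in for REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>').
function extractTitle(extras) {
  const match = /<PAGE_TITLE>(.*?)<\/PAGE_TITLE>/.exec(extras);
  return match ? match[1] : null;
}

// Made-up Extras value wrapping a title taken from the results above.
const extras =
  "<PAGE_TITLE>Kristina Kallas: uus vaktsineerimisplaan ei k&#xF5;lba kuhugi</PAGE_TITLE>";

console.log(titleUnescape(extractTitle(extras)));
// prints: Kristina Kallas: uus vaktsineerimisplaan ei kõlba kuhugi
```

Only hexadecimal references of the form seen in the GKG are handled here; named entities such as "&amp;" and decimal references are left as they are.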

Once again, the power of UDFs makes tasks trivial in BigQuery!