Since September 2019 the GKG 2.0 has included the title of each article in a "<PAGE_TITLE></PAGE_TITLE>" block in the "Extras" field. In keeping with the GKG 2.0's legacy ASCII format, every non-ASCII character in a title is HTML-escaped as a hexadecimal numeric character reference of the form "&#xHHHH;". Thus, a title such as "ایم کیو ایم کو" appears in the record as a run of these references, one per character, rather than as the original text.
This can present a challenge when working with the GKG 2.0 in BigQuery, since the platform does not provide a native HTML entity unescape function. Thankfully, BigQuery's UDF functionality allows us to trivially add this capability.
To extract the title for post-September 2019 GKG 2.0 records, use "REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>')". To unescape the title with our UDF, wrap that expression in a call to "titleunescape" and paste the UDF definition at the top of the query, as follows:
    CREATE TEMPORARY FUNCTION titleunescape(title STRING)
    RETURNS STRING
    LANGUAGE js AS '''
      // REGEXP_EXTRACT yields NULL when a record has no title block.
      if (title === null) return null;
      // Replace each hexadecimal character reference (&#xHHHH;) with its character.
      return title.replace(/&#x([0-9a-fA-F]+);/gu, function (whole, hex) {
        return String.fromCodePoint(parseInt(hex, 16));
      });
    ''';

    SELECT
      DocumentIdentifier,
      REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>') title,
      titleunescape(REGEXP_EXTRACT(Extras, r'<PAGE_TITLE>(.*?)<\/PAGE_TITLE>')) titleunescaped,
      REGEXP_EXTRACT(TranslationInfo, r'srclc:(.*?);') lang
    FROM `gdelt-bq.gdeltv2.gkg_partitioned`
    WHERE DATE(_PARTITIONTIME) = "2021-08-04"
      AND (V2Themes LIKE '%VACCIN%' OR V2Themes LIKE '%IMMUNIZATION%')
      AND Extras LIKE '%&#x%'
You can see the results below, with the "title" column showing the raw HTML-escaped title as it appears in the GKG 2.0 record and the "titleunescaped" column containing the unescaped original Unicode version.
Row | DocumentIdentifier | title | titleunescaped | lang
---|---|---|---|---
1 | https://www.livehindustan.com/uttar-pradesh/story-corona-vaccination-up-make-big-record-more-than-5-crore-people-got-jab-in-one-day-cm-yogi-says-there-will-no-let-up-in-availability-of-vaccines-4290986.html | corona vaccination up make big record more than 5 crore people got jab in one day cm yogi says there will no let up in availability of vaccines – कोरोना टीकाकरण में यूपी ने बनाया एक और कीर्तिमान, 5 करोड़ से ज्यादा लोगों को लगा टीका, सीएम ने कही ये बात | corona vaccination up make big record more than 5 crore people got jab in one day cm yogi says there will no let up in availability of vaccines – कोरोना टीकाकरण में यूपी ने बनाया एक और कीर्तिमान, 5 करोड़ से ज्यादा लोगों को लगा टीका, सीएम ने कही ये बात | hin
2 | https://www.livehindustan.com/national/story-r-value-on-rise-in-eight-states-second-wave-is-still-a-tension-cautions-government-india-hindi-news-4290990.html | R Value on rise in eight states second wave is still a tension cautions Government – India Hindi News – खत्म नहीं हुई है कोरोना की दूसरी लहर, 8 राज्यों में अभी भी R-वैल्यू ज्यादा, सरकार ने चेताया | R Value on rise in eight states second wave is still a tension cautions Government – India Hindi News – खत्म नहीं हुई है कोरोना की दूसरी लहर, 8 राज्यों में अभी भी R-वैल्यू ज्यादा, सरकार ने चेताया | hin
3 | https://meiemaa.ee/2021/08/04/noorte-sustimine-ootab-labimurret-maalapsed-ei-tohi-kaitseta-jaada/ | Noorte süstimine ootab läbimurret, maalapsed ei tohi kaitseta jääda | Noorte süstimine ootab läbimurret, maalapsed ei tohi kaitseta jääda | est
4 | https://gorkhapatraonline.com/opinion/2021-08-04-43596 | महामारी नियन्त्रण कार्ययोजना (सम्पादकीय) | महामारी नियन्त्रण कार्ययोजना (सम्पादकीय) | nep
5 | https://www.rtvslo.si/svet/v-vuhanu-bodo-po-sedmih-lokalno-prenesenih-okuzbah-testirali-vseh-11-milijonov-prebivalcev/589687 | V Vuhanu bodo po sedmih lokalno prenesenih okužbah testirali vseh 11 milijonov prebivalcev | V Vuhanu bodo po sedmih lokalno prenesenih okužbah testirali vseh 11 milijonov prebivalcev | slv
6 | https://sakala.postimees.ee/7307463/kristina-kallas-uus-vaktsineerimisplaan-ei-kolba-kuhugi | Kristina Kallas: uus vaktsineerimisplaan ei kõlba kuhugi | Kristina Kallas: uus vaktsineerimisplaan ei kõlba kuhugi | est
7 | https://www.balatarin.com/permlink/2021/8/3/5633288 | بالاترین: ۵ تفاوت مهم (عباس عبدی) | بالاترین: ۵ تفاوت مهم (عباس عبدی) | fas
8 | https://borsen.dk:443/nyheder/generelt/israel-genindforer-coronarestriktioner-efter-smittestigning1 | Israel genindfører coronarestriktioner efter smittestigning | Israel genindfører coronarestriktioner efter smittestigning | dan
9 | https://sakala.postimees.ee/7307462/kalvi-kova-maalapsed-ei-tohi-kaitsesustita-jaada | Kalvi Kõva: Maalapsed ei tohi kaitsesüstita jääda | Kalvi Kõva: Maalapsed ei tohi kaitsesüstita jääda | est
10 | https://www.divyabhaskar.co.in/local/gujarat/ahmedabad/news/of-cities-in-vaccination-in-the-128774222.html | Villages ahead of cities in vaccination in the gujarat \| રાજ્યમાં વેક્સિન લેવા મામલે શહેરો કરતા ગામડાંઓ વધુ આગળ, 16 દિવસમાં શહેરોમાં 19 લાખ ડોઝ સામે ગામડાંઓમાં 26 લાખ ડોઝ અપાયા | Villages ahead of cities in vaccination in the gujarat \| રાજ્યમાં વેક્સિન લેવા મામલે શહેરો કરતા ગામડાંઓ વધુ આગળ, 16 દિવસમાં શહેરોમાં 19 લાખ ડોઝ સામે ગામડાંઓમાં 26 લાખ ડોઝ અપાયા | guj
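
Because the body of a BigQuery JavaScript UDF is ordinary JavaScript, the unescaping logic can be tried locally (for example in Node.js) before it is pasted into a query. The following is a minimal sketch, not the UDF used above: the name "titleUnescape" and the sample inputs are hypothetical, and the handling of decimal references ("&#NNNN;") is an added assumption, since the GKG 2.0 titles described here use the hexadecimal "&#x" form.

```javascript
// Hypothetical local variant of the unescape logic, for illustration only.
// Assumption: besides hexadecimal references (&#xHHHH;) it also accepts
// decimal ones (&#NNNN;), which the query above does not need.
function titleUnescape(title) {
  // Mirror SQL NULL handling: a missing title stays missing.
  if (title === null || title === undefined) return null;

  return title.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (whole, body) => {
    const isHex = body[0] === "x" || body[0] === "X";
    const codePoint = isHex
      ? parseInt(body.slice(1), 16)
      : parseInt(body, 10);

    // String.fromCodePoint throws a RangeError above U+10FFFF,
    // so leave such references untouched instead.
    if (codePoint > 0x10ffff) return whole;

    return String.fromCodePoint(codePoint);
  });
}

// Made-up sample input: "ü" is U+00FC.
console.log(titleUnescape("Noorte s&#xFC;stimine")); // Noorte süstimine
```

Named entities such as "&amp;" are deliberately left alone in this sketch; it only decodes numeric character references.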
Once again, the power of UDFs makes tasks trivial in BigQuery!