Announcing The Global Numeric Graph

Kalev Leetaru

5 years ago

We are tremendously excited to announce today the debut of the GDELT Global Numeric Graph (GNG), which compiles appearances of numeric statements across worldwide online news coverage in 152 languages. Each article monitored by GDELT is scanned for all appearances of numbers, whether written in the numeric characters of the article's language (supported for all 152 languages) or spelled out as words such as "one million" or "fifty" in English (supported for around 100 languages and growing). Each appearance is compiled along with a brief context showing how the number was used and the articles in which that specific number-in-context was seen.

This inaugural release compiles nearly 3.8 billion numeric references across 152 languages dating back to January 1, 2020.

Why compile numbers? Numeric statements reflect one of the most basic forms of factual world knowledge that can be used to reason about global events. Because they express a precise count, they can be verified and used to identify contested narratives, such as conflicting casualty counts from a natural disaster, or to enrich event records, such as recording the estimated size of a protest. The Covid-19 pandemic has brought to the forefront the need for precise numeric codification, such as recording public reporting of infections, casualties and vaccinations, as well as comparing public reports with primary sources to identify misreporting or conflicting information. Such insights are especially powerful with respect to combatting misinformation. Internally, GDELT has long used the density of precise numeric statements in an article as one signal of its "verifiability." An article that states "the stimulus bill is too expensive" represents an unevaluatable statement of opinion, whereas "the stimulus bill is $1.9 trillion" can be definitively fact checked. Coverage with a higher density of precise numeric statements may still be incorrect, but it represents a distinct kind of coverage from articles that have high levels of emotion and low levels of numeric clauses.

GDELT has long cataloged a variety of numeric information about the coverage it monitors. The Global Knowledge Graph includes the "Counts" and "V2.1Counts" fields that record numeric statements associated with specific events such as protests or deaths. The GKG's "V2.1Amounts" field records any numeric appearance, but like the Counts and V2.1Counts fields, relies on a language model tuned for English. Similarly, the Global Entity Graph G1 baseline identifies all numeric clauses in English coverage, while the Part of Speech dataset identifies numbers in the 11 languages supported by Google Cloud's Natural Language API. In contrast, the new Global Numeric Graph covers all 152 languages GDELT monitors, providing one of the richest glimpses at how numbers are expressed in realtime across the world's news.

This is a highly experimental dataset and relies heavily on information like the Unicode tables and various language resources to identify numeric characters and spelled words in each language. We welcome contributions and corrections for different languages. Note that to avoid biasing our results using a language model (since all models have built-in bias that would then be represented in the number contexts we identify), we use a context-free model to extract numbers. This means you will see the full range in which numbers are used across a language. For example, in English you might see examples like "one could do so if one so chose" ("one" refers to a person not specifically a count of 1) or "second, i really believe" ("second" indicating an extended list of arguments of which this is the second), and so on. The Global Numeric Graph's ability to capture this rich diversity of numeric expression across the world's languages yields a treasure trove of how we reason about the world through counts.
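To give a feel for the Unicode-driven side of this, here is a minimal sketch in Python, not GDELT's actual extractor. It assumes a "numeric character" is any code point in the Unicode decimal-digit category (Nd) and ignores spelled-out numbers entirely. The function name `find_digit_runs` is hypothetical.

```python
import unicodedata


def find_digit_runs(text):
    """Return (position, raw_text, integer_value) for each run of Unicode decimal digits.

    Assumption: a numeric character is any code point with general category
    "Nd". Spelled-out numbers ("fifty") are out of scope for this sketch.
    """
    runs = []
    start = None
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Nd":
            if start is None:
                start = i  # a new run of digits begins here
        elif start is not None:
            runs.append((start, text[start:i]))
            start = None
    if start is not None:  # the text ended while still inside a run
        runs.append((start, text[start:]))
    # Python's int() accepts decimal digits from any script.
    return [(pos, raw, int(raw)) for pos, raw in runs]


# Snippets borrowed from the example rows in this post.
arabic = find_digit_runs("وفرت ١٢٥ مليون جنيه")
japanese = find_digit_runs("菅義偉首相は７日、新型コロナウイルス")
```

Because the scan is context-free, it reports every digit run it sees, including dates, temperatures and list numbering, which mirrors the full-range behavior described above.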

The table below shows a few examples from the dataset, showing its incredible richness and topical reach:

date	context	lang	urls.url	urls.title
2021-05-07 23:47:00 UTC	المغرب وكانت محافظة سوهاج قد وفرت ١٢٥ مليون جنيه لتوصيل الكهرباء للمصانع،	ARABIC	https://akhbarelyom.com/news/newdetails/3356048/1/%D8%A7%D8%B3%D8%AA%D8%AC%D8%A7%D8%A8%D8%A9-%D9%84%D9%80-%D8%A8%D9%88%D8%A7%D8%A8%D8%A9-%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1-%D8%A7%D9%84%D9%8A%D9%88%D9%85–%D9%88%D8%B2%D8%A7%D8%B1%D8%A9-%D8%A7%D9%84%D9%83%D9%87	استجابة لـ«بوابة أخبار اليوم».. وزارة الكهرباء تحل مشكلة محول بسوهاج
2021-05-07 10:32:00 UTC	sõnul kõlas eile õhtu jooksul kokku kaheksa lasku, millest esimesed tehti kella kümne	ESTONIAN	https://www.delfi.ee/artikkel/93360375/parnus-avas-64-aastane-mees-vastasmaja-pihta-tule-pealtnagija-politsei-keelas-inimestel-kodudest-valjuda	Pärnus avas 64-aastane mees vastasmaja pihta tule. Pealtnägija: politsei keelas inimestel kodudest väljuda
2021-05-07 07:18:00 UTC	की संख्या कम होने के बाद भी रिपोर्ट तीन से चार दिन बाद मिल रही है। इससे कोरोना	HINDI	https://www.jagran.com/uttar-pradesh/agra-city-agra-coronavirus-news-update-number-of-corona-infected-in-agra-is-22853-and-289-death-toll-from-corona-21622281.html	AGRA CoronaVirus News Update Number of Corona infected in Agra is 23051 and 292 death toll from corona
2021-05-07 10:19:00 UTC	作用可使圈舍温度下降2—4℃。 2.在圈	Chinese	http://shuju.aweb.com.cn/technology/2011/0805/3114085234500.shtml	盛夏鹅舍巧降温-实用技术-农博数据中心
2021-05-07 08:19:00 UTC	市场的趋势就是两种：一是低保费换取用户规模，	Chinese	https://www.huxiu.com/article/426299.html	花100万彻底清除癌细胞，你愿意买单吗？
2021-05-07 20:16:00 UTC	菅義偉首相は７日、新型コロナウイルス	Japanese	https://www.oita-press.co.jp/1002000000/2021/05/07/NP2021050701001601	緊急事態、６都府県に拡大 – 大分のニュースなら大分合同新聞プレミアムオンライン Gate
2021-05-07 00:01:00 UTC	၆ ရက်နေ့က ပရဟိတကယ်ဆယ်ရေးလုပ်နေသူ လူငယ် ၁ ဦးကို အကြမ်းဖက် စစ်ကောင်စီလက်ပါးစေ လက်နက	BURMESE	http://burmese.dvb.no/archives/462477	ထားဝယ်မြို့တွင် ပရဟိတလုပ်နေသည့် အင်ဂျင်နီယာကျောင်းသား ၁ ဦး ဖမ်းဆီးခံရ
2021-05-07 18:03:00 UTC	ທານຊີ້ນຳ ກອງປະຊຸມຂັ້ນລັດຖະມົນຕີ ໃນວັນທີ 8 ເມສາ ໂດຍການເຂົ້າຮ່ວມຂອງ ເລຂາທິການໃຫຍ່ ສປ	LAOTHIAN	https://vietnam.vnanet.vn/lao/%E0%BA%82%E0%BA%B5%E0%BA%94%E0%BB%9D%E0%BA%B2%E0%BA%8D-%E0%BB%81%E0%BA%A5%E0%BA%B0/485633.html	ຂີດໝາຍ ແລະ ສານຂອງ ຫວຽດນາມບົນເວທີສາກົນ – ຂ່າວພາບຫວຽດນາມ
2021-05-07 20:16:00 UTC	El nuevo recinto es el hogar de 6 ejemplares de wallabies adultos	SPANISH	http://viajes.elpais.com.uy/2011/09/24/wallabies-temaiken-a-los-saltos/	Wallabies, Temaiken a los saltos
2021-05-07 20:16:00 UTC	nos diagnósticos confirmados, cinco estados e o Distrito Federal ficaram	PORTUGUESE	https://www.otempo.com.br/brasil/curva-de-novos-casos-de-covid-19-aumenta-e-mortes-apresentam-leve-queda-1.2482324	Curva de novos casos de Covid-19 aumenta, e mortes apresentam leve queda
2021-05-07 13:16:00 UTC	again. Currently, the SNP holds 61 of the 129 seats. Sturgeon is hoping	ENGLISH	http://english.sina.com/world/e/2021-05-07/detail-ikmyaawc3962021.shtml	World Insights: Scotland's election has far-reaching impact on UK's future, says expert – World News
2021-05-07 20:16:00 UTC	и западными медалями и орденами, более 2 тыс. стали Героями Советского Союза, из	RUSSIAN	https://racurs.ua/n154338-8-maya-otmechaut-den-pamyati-i-primireniya.html	День памяти и примирения отмечают в Украине — праздник завтра
2021-05-07 16:18:00 UTC	lasoo saaro ay noqon lahayd lambar saddex. Shidaalka waxaa noo dheer waxa la yiraahdo	SOMALI	https://www.bbc.com/somali/war-57001361	Seddax qodob oo xasaasi ah oo wasiirka arrimaha dibadda Soomaaliya uu uga hadlay Clubhuse
2021-05-07 11:02:00 UTC	uzalishaji wa zana za nyuklia. Awamu tatu zilizopita za mazungumzo zilitajwa na	SWAHILI	https://www.dw.com/sw/duru-ya-nne-ya-mazungumzo-ya-mpango-wa-nyuklia-kuanza-vienna/a-57458951	Duru ya nne ya mazungumzo ya mpango wa nyuklia kuanza Vienna | Matukio ya Kisiasa | DW
2021-05-07 05:02:00 UTC	ותיק התקשורת לקארין אלהרר מיש עתיד. שני תיקים הנחשבים בינוניים אך חשובים מאוד	HEBREW	https://www.israelhayom.co.il/news/politics/article/626606	מחלקים את התיקים: בנט ולפיד במאמץ לילי לסגור ממשלה
2021-05-07 07:47:00 UTC	ਯੋਗ ਆਬਾਦੀ ਨੂੰ ਕਰੋਨਾ ਵੈਕਸੀਨ ਦੀ ਘੱਟੋ-ਘੱਟ ਇਕ ਖੁਰਾਕ ਦਿੱਤੀ ਜਾ ਚੁੱਕੀ ਹੈ। ਪੀਐੱਮਓ ਨੇ ਕਿਹਾ	PUNJABI	https://www.punjabitribuneonline.com/news/nation/states-should-not-slow-down-vaccination-modi-70521	ਸੂਬੇ ਟੀਕਾਕਰਨ ਦੀ ਰਫ਼ਤਾਰ ਨੂੰ ਘੱਟ ਨਾ ਹੋਣ ਦੇਣ: ਮੋਦੀ

Each unique numeric context within a given minute appears on its own row, with an array of URLs that contained that context. This way numeric contexts that appeared across multiple articles seen by GDELT in a given minute will be grouped together. This also makes it trivial to look across larger time horizons such as hours, days, and eventually weeks and months to identify counts that are widely covered and/or that appear over a long time period versus those that receive little coverage or only brief bursts of coverage. It also makes it possible to rapidly identify high-velocity counts that are quickly going viral. Note that where many numbers appear in close proximity to one another, a sliding window (sized according to the characteristics of numeric expression in each language) limits matches: a numeric match is excluded if it appears within 30 characters of a previous match (typically 5 characters for ideographic languages), effectively limiting matches to one per window.
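The sliding-window rule can be sketched as follows. This is an illustration only: it assumes the distance is measured from the last match that was kept, and that a match exactly one window-width away is allowed. The post does not specify either detail. The function name `suppress_nearby` is hypothetical.

```python
def suppress_nearby(positions, window=30):
    """Keep a match only if it starts at least `window` characters after the
    last kept match.

    Assumptions (not confirmed by the dataset documentation): distances are
    measured from the last *kept* match, and a gap of exactly `window`
    characters is far enough to keep the match.
    """
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= window:
            kept.append(pos)
    return kept


# Character offsets of candidate numeric matches in one article.
default_window = suppress_nearby([0, 10, 29, 30, 45, 61])
# A narrower window, like the roughly 5 characters mentioned for ideographic languages.
narrow_window = suppress_nearby([0, 3, 5, 9, 10], window=5)
```

With these inputs, the first call keeps offsets 0, 30 and 61, and the second keeps 0, 5 and 10.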

The final UTF8 newline delimited JSON file format is as follows, with each row being a unique numeric context:

date. The date GDELT saw the numeric context, rounded to the nearest minute.
context. The numeric expression in context as it appeared in the articles in a brief snippet.
lang. The language of the first article the context was seen in.
urls. A JSON array of all of the articles GDELT found within that given minute that contained this numeric context.
- url. The URL of the article.
- title. The title of the article.
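As a rough sketch of working with this layout, the Python below reads newline-delimited JSON rows and, for each context, counts the distinct article URLs and the distinct minutes in which it was seen. The sample rows are invented for illustration, including the `example.com` URLs and the date string format, which is an assumption. The sketch treats `date` as an opaque string for that reason. The function name `summarize_contexts` is hypothetical.

```python
import json
from collections import defaultdict


def summarize_contexts(lines):
    """Map each context to (distinct_article_urls, distinct_minutes_seen)."""
    stats = defaultdict(lambda: {"urls": set(), "minutes": set()})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        entry = stats[row["context"]]
        entry["minutes"].add(row["date"])  # one row per context per minute
        for article in row.get("urls", []):
            entry["urls"].add(article["url"])
    return {ctx: (len(v["urls"]), len(v["minutes"])) for ctx, v in stats.items()}


# Hypothetical rows following the field list above; the values are made up.
SAMPLE_ROWS = [
    {"date": "2021-05-07T13:16:00Z", "context": "the SNP holds 61 of the 129 seats",
     "lang": "ENGLISH",
     "urls": [{"url": "https://example.com/a", "title": "A"},
              {"url": "https://example.com/b", "title": "B"}]},
    {"date": "2021-05-07T13:17:00Z", "context": "the SNP holds 61 of the 129 seats",
     "lang": "ENGLISH",
     "urls": [{"url": "https://example.com/b", "title": "B"},
              {"url": "https://example.com/c", "title": "C"}]},
    {"date": "2021-05-07T13:16:00Z", "context": "home to 6 adult wallabies",
     "lang": "ENGLISH",
     "urls": [{"url": "https://example.com/a", "title": "A"}]},
]
SAMPLE_LINES = [json.dumps(row, ensure_ascii=False) for row in SAMPLE_ROWS]

summary = summarize_contexts(SAMPLE_LINES)
```

Contexts with many URLs over few minutes are candidates for fast-spreading counts, while those seen across many minutes are the long-running ones.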

You can download the dataset directly as per-minute UTF8 JSON-NL files beginning with "20200101000000" as the earliest file:

http://data.gdeltproject.org/gdeltv3/gng/YYYYMMDDHHMMSS.gng.json.gz
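A minimal sketch of fetching one of these files with Python's standard library follows. It assumes the timestamp in the filename is a UTC minute boundary and that a file exists for the requested minute, neither of which is guaranteed here. The helper names (`gng_url`, `parse_gng_bytes`, `fetch_minute`) are hypothetical.

```python
import gzip
import json
from datetime import datetime

BASE_URL = "http://data.gdeltproject.org/gdeltv3/gng/"


def gng_url(ts):
    """Build the file URL for a minute, following the YYYYMMDDHHMMSS pattern above."""
    return f"{BASE_URL}{ts:%Y%m%d%H%M%S}.gng.json.gz"


def parse_gng_bytes(payload):
    """Decompress a gzipped JSON-NL payload and return its rows as dicts."""
    text = gzip.decompress(payload).decode("utf-8")
    # Split on "\n" only: str.splitlines() would also break on Unicode line
    # separators that may legally appear inside JSON strings.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def fetch_minute(ts):
    """Download and parse one per-minute file (performs a network request)."""
    from urllib.request import urlopen

    with urlopen(gng_url(ts)) as response:
        return parse_gng_bytes(response.read())


# The earliest file named in this post.
first_url = gng_url(datetime(2020, 1, 1, 0, 0, 0))
```

Calling `fetch_minute(datetime(2020, 1, 1, 0, 0, 0))` would attempt the download. Not every minute is assumed to have a file, so callers may want to handle `urllib.error.HTTPError`.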

Remember that because GDELT currently operates on a 15-minute heartbeat, most articles are clustered in the 4-5 minutes after each quarter-hour; this will even out as GDELT 3.0 launches.

The dataset is available as a BigQuery table:

gdelt-bq.gdeltv2.gng
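As one hedged example of querying the table from Python, the sketch below builds a query that ranks contexts by the number of article references on a single day. It assumes `date` is a TIMESTAMP column and `urls` is a repeated record, as the example table's `urls.url` and `urls.title` columns suggest; check the table schema before relying on it. Running the query requires the third-party `google-cloud-bigquery` package and Google Cloud credentials, and the helper names are hypothetical.

```python
import datetime

GNG_TABLE = "gdelt-bq.gdeltv2.gng"


def top_contexts_sql(limit=20):
    """Return SQL ranking contexts by article references on the day bound to @day.

    Assumptions: `date` is a TIMESTAMP and `urls` is a REPEATED RECORD.
    """
    return f"""
    SELECT context, lang, SUM(ARRAY_LENGTH(urls)) AS article_refs
    FROM `{GNG_TABLE}`
    WHERE DATE(date) = @day
    GROUP BY context, lang
    ORDER BY article_refs DESC
    LIMIT {int(limit)}
    """


def run_top_contexts(day, limit=20):
    """Execute the query; needs google-cloud-bigquery and valid credentials."""
    from google.cloud import bigquery  # third-party, imported lazily

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("day", "DATE", day)]
    )
    return list(client.query(top_contexts_sql(limit), job_config=config).result())


sql_preview = top_contexts_sql(limit=5)
example_day = datetime.date(2021, 5, 7)  # pass to run_top_contexts(example_day)
```

Since the table holds billions of rows, it is worth checking the estimated bytes scanned in the BigQuery console before running queries like this one.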

Note that this is a pilot dataset, meaning we may actively change it based on feedback moving forward, including improving the accuracy of extraction for different languages. Please let us know how you use it so we can keep you updated and gather feedback as we evaluate potential changes.

We're tremendously excited to see what you're able to do with this incredible and unique dataset!