Topic Mining The World’s News with GDELT with NGrams – Felipe Hoffa

Felipe Hoffa is a Developer Advocate for big data at Google, whose mission is "to inspire developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they could never before." In this Reddit post, Felipe shows how to perform basic but exceptionally powerful ngram-based topic modeling on each day's world news by analyzing the SOURCEURL column of the GDELT Event Database with Google BigQuery. He includes all of the BigQuery SQL needed to generate a list of the top trending terms for a given day, automatically comparing that day's terms against those of the previous two weeks to identify terms that are uniquely trending on that day.

GDELT does not make available the underlying text of the news articles it processes. Instead, it applies sophisticated natural language algorithms to identify and compile a list of events from around the world, and their underpinnings, each day. Felipe's approach of using the URLs of the articles the events came from is therefore a creative way of lending further context to the day's events. It leverages the fact that between 60% and 70% of news articles published on the web each day use descriptive keywords or part of the article title as their URL. Comparing the results of this approach with a traditional ngram analysis of the full text of the articles from a given day yields a correlation of around r=0.30 to r=0.41. This is not a perfect correlation, but it does capture the overall "gist" of major events each day. By narrowing down to a particular event type or country, you can even create type-specific and/or country-specific topical ngram lists. The two-week rolling filter, which compares the top words on a given day against their popularity over the previous two weeks, does an excellent job of surfacing the major trends and patterns each day.
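The general idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Felipe's BigQuery SQL: the function names `url_words` and `trending_terms`, the `example.com` URLs, the minimum word length, and the simple ratio score are all assumptions made for this sketch. It splits each URL path into words, counts how many of the day's URLs contain each word, and ranks words by how much more common they are today than on an average day in the baseline period.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_words(url):
    """Return lowercase alphabetic tokens (3+ letters) from the URL path."""
    path = urlparse(url).path.lower()
    return [w for w in re.split(r"[^a-z]+", path) if len(w) > 2]


def trending_terms(today_urls, baseline_urls_by_day, top_n=10, min_count=2):
    """Rank today's URL words by how unusual they are versus the baseline.

    baseline_urls_by_day is a list with one list of URLs per earlier day
    (for example, the previous 14 days). The score is a hypothetical
    choice: today's count divided by (average daily baseline count + 1).
    """
    # Count each word once per URL so one long URL cannot dominate.
    today = Counter(w for u in today_urls for w in set(url_words(u)))

    baseline = Counter()
    for urls in baseline_urls_by_day:
        for u in urls:
            baseline.update(set(url_words(u)))
    n_days = max(len(baseline_urls_by_day), 1)

    scores = {}
    for word, count in today.items():
        if count < min_count:
            continue  # ignore words seen too rarely today
        avg_per_day = baseline[word] / n_days
        scores[word] = count / (avg_per_day + 1.0)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    today = [
        "http://example.com/news/2014/08/12/ferguson-police-protest.html",
        "http://example.org/world/ferguson-michael-brown",
        "http://example.net/news/iraq-convoy",
    ]
    earlier_days = [
        ["http://example.com/news/weather-report.html"],
        ["http://example.com/news/sports-report.html"],
    ]
    print(trending_terms(today, earlier_days))
```

Words such as "news" that appear every day are pushed down by the baseline term, which is the same role the two-week comparison plays in the original query. Restricting `today_urls` to events of one type or one country would give the type-specific or country-specific lists described above.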

Here are the results of his query for several days in August 2014, showing the power of this approach:

date	words
2014-08-19	foley,assault,fatal,team,unrest,pak,record,safety,night,coal
2014-08-18	assange,embassy,julian,soon,london,six,sites,shows,wikileaks,pay
2014-08-17	retake,mosul,liberia,nuclear,fighter,syrian,shoot,ukrainian,fears,jet
2014-08-16	dam,indictment,curfew,massacre,container,german,church,singapore,me,again
2014-08-15	perry,indicted,robbery,vehicles,steps,name,gives,speech,destroyed,girls
2014-08-14	independence,level,visits,protesters,francis,fires,common,wife,tear,louis
2014-08-13	refugees,candidate,journalist,jobs,daughter,wont,study,brazil,area,fall
2014-08-12	convoy,brown,michael,ferguson,drugs,sends,experimental,cross,crimes,office
2014-08-11	maliki,williams,robin,jail,job,mission,red,actor,arms,iraqs
2014-08-10	erdogan,missouri,yazidis,tony,crashes,stewart,accept,turkeys,prime,continue
2014-08-09	kurds,shah,iraqis,ready,thousands,asean,bombs,amit,fair,turkish

Read Felipe's Full Reddit Post.

Connect with Felipe on Twitter.

The GDELT Project
