Looking Across Languages: The Dangers Of Agglomerative Storyline Clustering

One of the great promises of storyline clustering lies in its ability to draw the myriad threads of a complex story together into singular coherent narratives. For example, storyline clustering can tease apart the primary contextualizations of the Biden-Putin meeting and the dominant ways it was framed throughout the world's media. The starting point is pairwise similarity scoring, which compares each article to every other article and yields a collection of similarity scores from which storyline clusters are built. The thresholds and clustering behavior used to group these pairs into larger clusters have an outsized impact on the storylines that result.
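As a rough illustration of pairwise scoring, the sketch below compares every article with every other article using cosine similarity. The toy two-dimensional vectors, the article ids, and the function names `cosine` and `pairwise_scores` are assumptions made for this example; a real pipeline would likely use multilingual embeddings with far more dimensions.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pairwise_scores(vectors):
    """Score every unordered pair of articles exactly once."""
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }


# Hypothetical article embeddings, keyed by article id.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
scores = pairwise_scores(vectors)

for pair, score in scores.items():
    print(pair, round(score, 3))
```

Because every pair is scored, the number of comparisons grows quadratically with the number of articles, so the choice of what to do with those scores matters a great deal.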

For example, a common approach is to use agglomerative clustering, in which articles whose similarity exceeds a certain threshold are grouped together. What happens when an article is highly similar to multiple clusters, forming a "bridge" between them? If similarity thresholds are not sufficiently stringent, an agglomerative approach can quickly connect large numbers of independent storylines together through these bridge articles. However, the more restrictive the threshold, the more genuine clusters are missed, especially across languages.
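The bridging effect can be reproduced with a small sketch. It assumes the simplest form of threshold-based agglomeration, where any pair scoring at or above the threshold is merged (single linkage), implemented here with union-find. The article ids, scores, and thresholds are invented for illustration and are not taken from any real dataset.

```python
def single_link_clusters(items, scores, threshold):
    """Merge any two articles whose similarity meets the threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: two storylines (s*, e*) plus one bridge article "b".
articles = ["s1", "s2", "e1", "e2", "b"]
scores = {
    ("s1", "s2"): 0.90,  # strong pair inside storyline S
    ("e1", "e2"): 0.88,  # strong pair inside storyline E
    ("s1", "b"): 0.62,   # bridge is moderately similar to S
    ("b", "e1"): 0.60,   # ... and moderately similar to E
}

print(single_link_clusters(articles, scores, threshold=0.55))
# [['b', 'e1', 'e2', 's1', 's2']]  (the bridge fuses both storylines)
print(single_link_clusters(articles, scores, threshold=0.80))
# [['b'], ['e1', 'e2'], ['s1', 's2']]  (storylines stay apart, bridge is lost)
```

With the looser threshold a single bridge article collapses two storylines into one; with the stricter threshold the storylines survive but the bridge article is dropped entirely, which is the trade-off described above.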

One solution is to treat weak bridge articles differently from other articles for the purposes of cluster expansion, allowing them to be added to a cluster, but not to trigger cluster formation or connect clusters on their own.
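One possible reading of this idea is a two-threshold scheme, sketched below. It is an interpretation, not a description of any particular production system: the `strong` and `weak` thresholds, the function name `cluster_with_weak_bridges`, the rule of attaching a weak article to its single best-matching cluster, and all of the data are assumptions. Strong links form and merge clusters; weak links can only attach a leftover article to a cluster that already exists.

```python
def cluster_with_weak_bridges(items, scores, strong, weak):
    """Strong links build clusters; weak links may only join an existing one."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pass 1: only strong links are allowed to form or connect clusters.
    for (a, b), score in scores.items():
        if score >= strong:
            parent[find(a)] = find(b)

    cores = {}
    for x in items:
        cores.setdefault(find(x), set()).add(x)
    clusters = {root: set(m) for root, m in cores.items() if len(m) > 1}
    singletons = [x for x in items if len(cores[find(x)]) == 1]

    # Pass 2: a leftover article joins its best weakly matching cluster.
    # It never merges two clusters and never starts a new one.
    leftovers = []
    for x in singletons:
        best_root, best_score = None, -1.0
        for (a, b), score in scores.items():
            if x not in (a, b) or score < weak:
                continue
            root = find(b if a == x else a)
            if root in clusters and score > best_score:
                best_root, best_score = root, score
        if best_root is None:
            leftovers.append(x)
        else:
            clusters[best_root].add(x)

    return sorted(sorted(c) for c in clusters.values()), sorted(leftovers)


# Hypothetical data: bridge "b" touches both storylines; x1 and x2 are only
# weakly similar to each other and to nothing else.
articles = ["s1", "s2", "e1", "e2", "b", "x1", "x2"]
scores = {
    ("s1", "s2"): 0.90,
    ("e1", "e2"): 0.88,
    ("s1", "b"): 0.62,
    ("b", "e1"): 0.60,
    ("x1", "x2"): 0.60,
}

clusters, leftovers = cluster_with_weak_bridges(articles, scores, strong=0.80, weak=0.55)
print(clusters)   # [['b', 's1', 's2'], ['e1', 'e2']]
print(leftovers)  # ['x1', 'x2']
```

In this toy run the bridge article is kept, attached to the storyline it matches best, while the two storylines remain separate and the weakly linked pair does not become a cluster of its own. Attaching the bridge to every cluster it weakly matches, rather than only the best one, would be an equally reasonable variant.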