The Difficulties Of Fuzzy Grouping Of Related News Coverage: Clustering Versus Entity Linking

Many news analysis use cases require the ability to group related coverage together in order to either collapse coverage of the same story to a single record or look across the myriad perspectives on a single story. The two most common approaches used to group stories are clustering and entity grouping.

Story clustering typically involves projecting textual stories into a numeric vector space and applying anything from K-means to HDBSCAN to any of myriad other approaches to collapse the resulting sparse matrix into a set of core components. Most approaches today further reduce the vector space using embeddings, ranging from word2vec to domain-adapted models. The challenge with story clustering is that the resulting clusters reflect the underlying linguistic and semantic cleavages of the analyzed collection, which may or may not match how a human would see the collection. For example, all coverage mentioning Joe Biden might be grouped together, collapsing both domestic and foreign policy stories under a single heading and perhaps also pulling in coverage of other global heads of state based on similar language and discussions.

Applied to news coverage, such clustering often results in seemingly random and uninterpretable clusters, such as one that groups Covid-19 vaccines with the Belarus dissident arrest, an announcement for Douyin Pay and a discussion of a new lithium battery technology. Presentation to decision makers typically requires human intervention to devise elaborate and contorted explanations for each cluster. Such clustering also typically collapses individual storylines together, defeating the very purpose of story grouping. Many clustering techniques use random sampling to remain tractable on large datasets, which makes the resulting clusters unstable, and special approaches are needed for realtime temporal clustering. Clustering is also computationally expensive and typically cannot be applied to large datasets without some form of sampling and dimensionality reduction.
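The vector-space step that precedes any clustering algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the three headlines are invented, the whitespace tokenizer is deliberately crude, and raw term counts stand in for whatever weighting or embedding a real system would use. It shows why a domestic and a foreign policy story that both mention Biden land close together in the vector space, while an unrelated story sits far away.

```python
import math
from collections import Counter

def term_vector(text):
    """Lowercased, whitespace-tokenized term counts (a deliberately crude vectorizer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical headlines, invented purely for illustration.
docs = {
    "domestic": "president biden says infrastructure bill talks continue",
    "foreign": "president biden says summit talks with putin continue",
    "battery": "new lithium battery technology promises faster charging",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

# The two Biden stories share most of their vocabulary despite covering
# different storylines, so a distance-based clusterer would tend to merge them.
sim_domestic_foreign = cosine(vectors["domestic"], vectors["foreign"])

# The battery story shares no terms with the domestic story.
sim_domestic_battery = cosine(vectors["domestic"], vectors["battery"])

print(f"domestic vs foreign: {sim_domestic_foreign:.2f}")
print(f"domestic vs battery: {sim_domestic_battery:.2f}")
```

Any clustering method applied on top of these vectors inherits this geometry: the shared framing language, not the storyline, drives which documents end up together.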

In contrast, entity grouping reduces each document to a small list of the most salient entities it discusses and groups together articles containing those same entities. This dramatically reduces computational complexity and largely prevents the grouping of entirely unrelated stories, since articles must share their most salient entities in order to be grouped together. Many large consumer-facing news services use this approach. However, as with clustering, this approach typically conflates related storylines, for example grouping Joe Biden's domestic agenda with his G-7 and Putin meetings abroad this week and lumping them all under the same heading. Worse, entities that are less common in the news will see all of their coverage grouped together. On one recent day, a major consumer news service grouped all coverage of former US president Barack Obama under a single heading, lumping together a collection of wildly different topics on the basis that they all mentioned him.
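A rough sketch of entity grouping, again in Python, makes the trade-off concrete. Everything here is an assumption for illustration: the article identifiers, entity names and salience scores are invented, the function name `group_by_salient_entities` is hypothetical, and a real system would obtain salience from an entity extraction service rather than a hand-written dictionary. Each article is reduced to its top-N entities and articles with the same entity set are grouped.

```python
from collections import defaultdict

def group_by_salient_entities(articles, top_n):
    """Group article ids by the set of their top_n most salient entities.

    `articles` maps article id -> {entity name: salience score}.
    Ties in salience are broken alphabetically so results are deterministic.
    """
    groups = defaultdict(list)
    for article_id, scored in articles.items():
        ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
        key = frozenset(name for name, _ in ranked[:top_n])
        groups[key].append(article_id)
    return dict(groups)

# Hypothetical articles with made-up salience scores.
articles = {
    "a1": {"Joe Biden": 0.9, "Congress": 0.5},        # domestic agenda
    "a2": {"Joe Biden": 0.8, "Vladimir Putin": 0.7},  # foreign meeting
    "a3": {"Douyin": 0.9, "ByteDance": 0.6},          # payments launch
}

# Keying on only the single most salient entity conflates the two Biden stories.
groups_top1 = group_by_salient_entities(articles, top_n=1)

# Keying on the top two entities separates them, but would also split
# articles about the same story whose entity lists differ slightly.
groups_top2 = group_by_salient_entities(articles, top_n=2)

print(groups_top1)
print(groups_top2)
```

The choice of how many entities to key on is the crux: too few and distinct storylines merge under one name, too many and coverage of the same story fragments.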

The complexity of storyline grouping can be seen in the simple example of the announcement of Douyin Pay this past January. An ideal system would group together the following three Chinese-language articles as all covering the same announcement. The first two largely share the same title, other than the additional text at the front of the second link, while the third title is entirely different. Looking at their first few paragraphs of text, the first two mention Kuaishou and Bilibili, which are missing from the third. Both stories emphasize a set of financial institutions, but their lists do not entirely overlap. Thus, from an entity grouping standpoint, the third link would likely not be grouped with the first two, especially when incorporating entity salience as a filter. At the same time, pure entity grouping would have linked these two Douyin Pay announcements with a number of other Chinese financial tech industry stories of the day that mentioned these same entities in the context of regulation, international video and payments platforms and so on. While in theory such a grouping could be interesting to an analyst as a source of contextual indicators, those storylines are distinct from the announcement story. The first two stories are nearly identical: similar enough that entity grouping would connect them, but different enough that exact match detection would not.
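The gap between exact match detection and fuzzy matching in this example can be illustrated with a short Python sketch. The titles below are invented English stand-ins, not the actual article titles, and character-trigram Jaccard similarity is simply one common near-duplicate measure chosen for illustration (character shingles also happen to work for languages written without spaces between words). The helper names are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Exact-match fingerprint: any difference in the text changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping character k-grams from the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented titles mirroring the pattern described above: the second adds
# text at the front of the first, the third is worded entirely differently.
title_1 = "Douyin Pay launches as new payment option"
title_2 = "Breaking: Douyin Pay launches as new payment option"
title_3 = "Short video app adds its own wallet"

# Exact match detection treats the first two titles as different.
exact_match_1_2 = fingerprint(title_1) == fingerprint(title_2)

# Shingle overlap recognizes the first two as near-duplicates...
sim_1_2 = jaccard(shingles(title_1), shingles(title_2))

# ...but finds nothing in common with the differently worded third title,
# even though it covers the same announcement.
sim_1_3 = jaccard(shingles(title_1), shingles(title_3))

print(exact_match_1_2, round(sim_1_2, 2), round(sim_1_3, 2))
```

Surface similarity of this kind recovers the first pair but not the third article, which is why neither exact matching nor title-level fuzzy matching alone solves the grouping problem described here.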

Thus, even from this simple example, the difficulties of production story grouping are clear.