Yesterday we explored how applying UMAP and then t-SNE dimensionality reduction in sequence prior to HDBSCAN clustering yielded much finer-grained clusters and allowed t-SNE to shift from diffuse point clouds towards recovering intricate tendril-like microscale structure. One interesting byproduct of this new workflow is that it dramatically reduces the percentage of articles that HDBSCAN is unable to cluster and thus marks as "noise." For example, with UMAP projection to 5 dimensions prior to HDBSCAN, 142,834 of the 350,000 articles (41%) were marked as noise, yielding 10,185 clusters. Applying UMAP projection to 10 dimensions and then further collapsing through t-SNE to 2 dimensions prior to HDBSCAN clustering yields less than half as many noise articles, with 59,039 (17%) classified as noise, and boosts the cluster count to 16,097.
Let's drill down to how three different languages are impacted. For English, UMAP->HDBSCAN yields a 36% noise rate, while UMAP->t-SNE->HDBSCAN yields just 16% noise. For Spanish the rate drops from 52% to 23%, and for Estonian it drops from 69% to 24%. Clearly, passing the UMAP reduction through t-SNE down to 2 dimensions allows HDBSCAN to extract much more of the structure of the graph, with greater benefits coming to rarer languages.
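To make the comparison concrete, the short Python sketch below recomputes the noise rates from the raw counts listed in the grep output at the end of this post. The dictionary layout and the function names are illustrative choices, not part of the original workflow.

```python
# Counts copied from the grep output at the end of this post:
# (total matching articles, articles labeled -1) for each workflow.
# The totals differ slightly between the two files, as reported there.
counts = {
    "English":  {"umap": (152262, 54433), "umap_tsne": (149360, 23605)},
    "Spanish":  {"umap": (30496, 15927),  "umap_tsne": (31000, 6980)},
    "Estonian": {"umap": (205, 141),      "umap_tsne": (204, 48)},
}


def noise_rate(total, noise):
    """Fraction of articles that HDBSCAN left unclustered."""
    return noise / total


def summarize(counts):
    """Return {language: (rate_before, rate_after, absolute_drop)}."""
    rows = {}
    for language, runs in counts.items():
        before = noise_rate(*runs["umap"])
        after = noise_rate(*runs["umap_tsne"])
        rows[language] = (before, after, before - after)
    return rows


for language, (before, after, drop) in summarize(counts).items():
    print(f"{language:<9} {before:6.1%} -> {after:6.1%}  "
          f"(drop of {drop * 100:.1f} points)")
```

Viewed this way, the absolute drop in noise rate grows from roughly 20 percentage points for English to roughly 30 for Spanish and 45 for Estonian, which is the pattern behind the observation that rarer languages benefit most.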
At the same time, further work is required to see whether articles that move from noise into a cluster are being legitimately grouped or whether they would be better left as noise. Would running HDBSCAN for each language individually yield better results, especially given the higher noise rates for rarer languages? Or would alternative parameter selection for HDBSCAN address these issues? These questions remain to be explored.
You can see the commands used to compute these counts, along with their output, below:
#grep Estonian 2023-10-oct20gsg-350k-umaptosne10hdbscan-tsnareduced10paired.tsv | wc -l
204
#grep Estonian 2023-10-oct20gsg-350k-umaptosne10hdbscan-tsnareduced10paired.tsv | grep '\-1' | wc -l
48
24%

#grep Estonian output-350k-umaphdbscan-tsnafull.tsv | wc -l
205
#grep Estonian output-350k-umaphdbscan-tsnafull.tsv | grep '\-1' | wc -l
141
69%

#grep Spanish 2023-10-oct20gsg-350k-umaptosne10hdbscan-tsnareduced10paired.tsv | wc -l
31000
#grep Spanish 2023-10-oct20gsg-350k-umaptosne10hdbscan-tsnareduced10paired.tsv | grep '\-1' | wc -l
6980
23%

#grep Spanish output-350k-umaphdbscan-tsnafull.tsv | wc -l
30496
#grep Spanish output-350k-umaphdbscan-tsnafull.tsv | grep '\-1' | wc -l
15927
52%

#grep English 2023-10-oct20gsg-350k-umaptosne10hdbscan-tsnareduced10paired.tsv | wc -l
149360
#grep English 2023-10-oct20gsg-350k-umaptosne10hdbscan-tsnareduced10paired.tsv | grep '\-1' | wc -l
23605
16%

#grep English output-350k-umaphdbscan-tsnafull.tsv | wc -l
152262
#grep English output-350k-umaphdbscan-tsnafull.tsv | grep '\-1' | wc -l
54433
36%
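One caveat with the commands above is that grep matches substrings anywhere on a line, so the pattern '\-1' would also match any other field that happens to contain "-1", and a language name could match inside a title. Whether that affects these counts depends on the file layout, which is not shown in this post. As a rough cross-check, here is a minimal Python sketch that compares whole fields instead. It assumes a tab-separated file with the language and the HDBSCAN label in separate columns; the function name and the column indices are hypothetical and would need to be adjusted to the real files.

```python
import csv
from collections import Counter


def noise_by_language(path, lang_col, label_col, noise_label="-1"):
    """Return {language: (total, noise, noise_rate)} for a tab-separated file.

    lang_col and label_col are zero-based column indices. They are
    hypothetical here and must be set to match the real file layout.
    """
    totals, noise = Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps stray quote characters in titles from
        # confusing the parser.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= max(lang_col, label_col):
                continue  # skip short or malformed rows
            language = row[lang_col]
            totals[language] += 1
            # Compare the whole field so that values such as "-12" or
            # "2023-10-20" are not mistaken for the noise label.
            if row[label_col].strip() == noise_label:
                noise[language] += 1
    return {
        lang: (totals[lang], noise[lang], noise[lang] / totals[lang])
        for lang in totals
    }
```

Calling something like noise_by_language("output-350k-umaphdbscan-tsnafull.tsv", lang_col=1, label_col=2), with the indices replaced by the real column positions, would return the per-language totals, noise counts and noise rates in a single pass over the file.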