Visualizing An Entire Day Of Global News Coverage: Technical Experiments: PCA vs UMAP For HDBSCAN & t-SNE Dimensionality Reduction

This week's visualizations of an entire day of global news coverage were limited by one key factor: computational tractability. Even running on a 64-core GCE N1 VM with 400GB of RAM, each rendering took a considerable amount of time, and limitations in the underlying algorithm implementations capped the final renderings at a 350K-article random sample rather than the full dataset. How might one major set of optimizations help accelerate these visualizations, and what tradeoffs might it present?

Our visualizations yesterday made use of UMAP dimensionality reduction to reduce the 512-dimension USEv4 embedding vectors down to 10 dimensions to make HDBSCAN clustering computationally feasible. How does UMAP compare with PCA in terms of the impact they have on the resulting cluster formation and visualizations? In yesterday's visualizations, we used UMAP only for HDBSCAN and then reverted to using the full-resolution 512-dimension vectors for our t-SNE layout. To what degree would using the reduced embeddings accelerate the t-SNE layout and where are we spending the majority of our time?
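
For reference, the basic pipeline being varied throughout this post looks roughly like the sketch below. This is a condensed, illustrative version of the full script at the end of this post: the reducer used in the first step and the array handed to t-SNE are the two knobs being compared, and the random matrix is simply a stand-in for the real USEv4 vectors.

import numpy as np
import umap
import hdbscan
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the real (n_articles, 512) USEv4 embedding matrix loaded by the full script.
embeddings = np.random.rand(10000, 512).astype(np.float32)

# Step 1: reduce 512 dimensions down to 10 for HDBSCAN (swap in the PCA line to compare).
embeddings_reduced = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10).fit_transform(embeddings)
#embeddings_reduced = PCA(n_components=10, random_state=42).fit_transform(embeddings)

# Step 2: cluster on the reduced vectors.
clusters = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embeddings_reduced)

# Step 3: 2D layout for visualization, either from the full-resolution vectors...
layout = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
# ...or from the same reduced vectors used for clustering.
#layout = TSNE(n_components=2, random_state=42).fit_transform(embeddings_reduced)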

The end result is that, contrary to the widely-touted recommendation in some quarters that PCA executes many times faster than UMAP, in our real-world experiments below we find the gains, at least for USEv4's 512-dimension vectors, to be far more modest, saving just a few minutes in most cases, though with greater impact for larger datasets. Worse, the PCA-reduced results are nearly meaningless: PCA's global-structure focus causes HDBSCAN to assign nearly the entire graph to the noise category. In contrast, UMAP yields strong results, as we found in our original experiment.

In terms of accelerating the t-SNE layout process, using the UMAP-reduced embeddings, rather than the original full-resolution embeddings, yields modest speed improvements, but has a fascinating side effect: it causes t-SNE to replace the diffuse point clouds seen in our original graphs with intricate, tightly-woven tendrils of interconnected clusters, even for articles marked as "noise" by HDBSCAN. The full-resolution and reduced-resolution t-SNE layouts take roughly similar execution times, but the reduced layouts produce vastly more understandable structure that tightly organizes the global media landscape. Whether this additional structure is meaningful, and whether it yields better or worse understanding than the diffuse point clouds (in particular, whether the underlying semantic structure in those areas is legitimately more diffuse or more structured), is an area for additional research.

Comparing different levels of dimensionality reduction, the macro-level visualizations change little across dimensionality levels, with the exception of the UMAP-based 2D projection, suggesting that execution time considerations and available CPU resources (due to the different levels of multicore parallelism available in each stage) may play a larger role in selecting the ideal settings, though key questions remain about how these choices impact downstream cluster utility.

You can see the timing breakdowns below. Note that we use wallclock time for the overall runtime and Python's "time.process_time()" for the task-level timings; process_time() sums CPU time across all cores, which is why on this 64-core VM the task-level times are substantially higher than the wallclock time.
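
As a concrete illustration of the difference between the two measurements, the minimal self-contained snippet below reports both, using a multithreaded NumPy matrix multiply as a stand-in workload; on a many-core machine the CPU-time figure can be many times larger than the wallclock figure, just as in the task-level timings that follow.

import time
import numpy as np

wall_start = time.time()           # wallclock
cpu_start = time.process_time()    # CPU time of this process, summed across its cores/threads

# Stand-in workload: a large matrix multiply that NumPy's BLAS backend spreads across
# many cores, much as UMAP, HDBSCAN and t-SNE do in the runs below.
x = np.random.rand(4000, 4000)
y = x @ x

print(f"Wallclock: {time.time() - wall_start:.1f} seconds")
print(f"CPU time (all cores): {time.process_time() - cpu_start:.1f} seconds")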

Here are the results for using PCA dimensionality reduction to 10 dimensions and t-SNE layout using the full-resolution 512-dimension vectors. We can immediately see that the largest amount of time is spent in the t-SNE layout stage and that it leverages all 64 cores to achieve even these lengthy runtimes:

time python3 ./runn.py ./output-500-pcahdbscan-tsnafull ./MASTER.sample.json.500
PCA: : 2.428847918999999 seconds
HDBSCAN: : 3.195304661999998 seconds
Num Clusters: 3
Noise Articles: 220
Layout (TSNE): 5021.147535501999 seconds
Render: 1.1501990320002733 seconds
#1m27.544s

time python3 ./runn.py ./output-10k-pcahdbscan-tsnafull ./MASTER.sample.json.10k
PCA: 11.469120534999998 seconds
HDBSCAN: : 6.248807965999994 seconds
Num Clusters: 86
Noise Articles: 8266
Layout (TSNE): 10239.753743214 seconds
Render: 0.9250311149990011 seconds
#2m56.447s

time python3 ./runn.py ./output-100k-pcahdbscan-tsnafull ./MASTER.sample.json.100k
PCA: : 88.013860348 seconds
HDBSCAN: : 53.17140504599999 seconds
Num Clusters: 1539
Noise Articles: 82202
Layout (TSNE): 13126.031354806999 seconds
Render: 1.1888463730010699 seconds
#8m3.757s

time python3 ./runn.py ./output-200k-pcahdbscan-tsnafull ./MASTER.sample.json.200k
PCA: : 138.27123590800002 seconds
HDBSCAN: : 204.663445835 seconds
Num Clusters: 3561
Noise Articles: 155658
Layout (TSNE): 18653.367705542 seconds
Render: 0.8991585709991341 seconds
#16m54.485s

time python3 ./runn.py ./output-350k-pcahdbscan-tsnafull ./MASTER.sample.json.350k
PCA: : 185.823394168 seconds
HDBSCAN: : 437.51286312400003 seconds
Num Clusters: 6668
Noise Articles: 261038
Layout (TSNE): 31658.361666378998 seconds
Render: 0.9239981770006125 seconds
#27m35.886s

Here are the results for using PCA dimensionality reduction to 10 dimensions and t-SNE layout using the same reduced 10-dimension vectors. We can see that reducing the dimensionality of the vectors for t-SNE speeds things up by only a few minutes, except for the largest dataset, where the impact is more measurable:

time python3 ./runn.py ./output-500-pcahdbscan-tsnareduced10 ./MASTER.sample.json.500
PCA: : 2.3914137330000003 seconds
HDBSCAN: : 2.2154759669999997 seconds
Num Clusters: 3
Noise Articles: 220
Wrote Clusters
Layout (TSNE): 3143.357393109 seconds
Render: 1.2123616369999581 seconds
#0m57.908s

time python3 ./runn.py ./output-10k-pcahdbscan-tsnareduced10 ./MASTER.sample.json.10k
PCA: : 12.670889239000001 seconds
HDBSCAN: : 6.556422175000002 seconds
Num Clusters: 86
Noise Articles: 8266
Wrote Clusters
Layout (TSNE): 9172.155809331001 seconds
Render: 1.1619373419998738 seconds
#2m44.259s

time python3 ./runn.py ./output-100k-pcahdbscan-tsnareduced10 ./MASTER.sample.json.100k
PCA: : 82.800410946 seconds
HDBSCAN: : 53.497904571999996 seconds
Num Clusters: 1539
Noise Articles: 82202
Wrote Clusters
Layout (TSNE): 12909.733578517 seconds
Render: 1.1690794980004284 seconds
#9m8.552s

time python3 ./runn.py ./output-200k-pcahdbscan-tsnareduced10 ./MASTER.sample.json.200k
PCA: : 137.806066569 seconds
HDBSCAN: : 204.25508139299998 seconds
Num Clusters: 3561
Noise Articles: 155658
Wrote Clusters
Layout (TSNE): 15502.679567828001 seconds
Render: 1.2023969459987711 seconds
#20m29.470s

time python3 ./runn.py ./output-350k-pcahdbscan-tsnareduced10 ./MASTER.sample.json.350k
PCA: : 188.180185 seconds
HDBSCAN: : 430.095274092 seconds
Num Clusters: 6668
Noise Articles: 261038
Wrote Clusters
Layout (TSNE): 21291.865324438997 seconds
Render: 0.9988922129996354 seconds
#41m17.613s

What about repeating our exact workflow from yesterday? This uses UMAP dimensionality reduction to 10 dimensions and t-SNE layout using the raw 512-dimension vectors. This is only slightly slower overall than using PCA, suggesting that, contrary to some widely-touted recommendations that PCA is several times faster than UMAP, the end-to-end speed improvements from PCA are far more modest:

time python3 ./runn.py ./output-500-umaphdbscan-tsnafull ./MASTER.sample.json.500
UMAP: : 368.454166479 seconds
HDBSCAN: : 0.8235538530000213 seconds
Num Clusters: 16
Noise Articles: 172
Layout (TSNE): 4124.352307461 seconds
Render: 1.1703988250001203 seconds
#1m28.217s

time python3 ./runn.py ./output-10k-umaphdbscan-tsnafull ./MASTER.sample.json.10k
UMAP: : 1232.297609941 seconds
HDBSCAN: : 0.4298013449999871 seconds
Num Clusters: 289
Noise Articles: 4406
Layout (TSNE): 10092.327194671 seconds
Render: 1.2001219960002345 seconds
#3m40.956s

time python3 ./runn.py ./output-100k-umaphdbscan-tsnafull ./MASTER.sample.json.100k
UMAP: : 5882.439154004 seconds
HDBSCAN: : 13.227559089999886 seconds
Num Clusters: 2860
Noise Articles: 44973
Layout (TSNE): 13197.688294931002 seconds
Render: 0.9783284889999777 seconds
#9m59.227s

time python3 ./runn.py ./output-200k-umaphdbscan-tsnafull ./MASTER.sample.json.200k
UMAP: : 18389.744759906 seconds
HDBSCAN: : 35.15340755699799 seconds
Num Clusters: 5765
Noise Articles: 84240
Layout (TSNE): 18575.747295277004 seconds
Render: 1.0938486810046015 seconds
#20m8.909s

time python3 ./runn.py ./output-350k-umaphdbscan-tsnafull ./MASTER.sample.json.350k
UMAP: : 43686.995292346 seconds
HDBSCAN: : 77.35459611599799 seconds
Num Clusters: 10268
Noise Articles: 142510
Layout (TSNE): 31135.895989659002 seconds
Render: 1.388311809001607 seconds
#40m37.998s

What if we modify yesterday's workflow to use the UMAP-reduced 10-dimension vectors for the t-SNE layout as well as for HDBSCAN? As before, we see that the speedup is fairly modest:

time python3 ./runn.py ./output-500-umaphdbscan-tsnareduced10 ./MASTER.sample.json.500
UMAP: : 598.419491917 seconds
HDBSCAN: : 0.7507073090000631 seconds
Num Clusters: 16
Noise Articles: 183
Wrote Clusters
Layout (TSNE): 1266.0424988309999 seconds
Render: 1.3197430190000432 seconds
#0m46.687s

time python3 ./runn.py ./output-10k-umaphdbscan-tsnareduced10 ./MASTER.sample.json.10k
UMAP: : 1209.028011256 seconds
HDBSCAN: : 0.4505578350001542 seconds
Num Clusters: 276
Noise Articles: 4072
Layout (TSNE): 10004.430830412 seconds
Render: 1.187103238000418 seconds
#3m39.816s

time python3 ./runn.py ./output-100k-umaphdbscan-tsnareduced10 ./MASTER.sample.json.100k
UMAP: : 5870.5331127419995 seconds
HDBSCAN: : 13.17594723900038 seconds
Num Clusters: 2872
Noise Articles: 44164
Layout (TSNE): 12285.250289804 seconds
Render: 1.1833040329984215 seconds
#9m25.074s

time python3 ./runn.py ./output-200k-umaphdbscan-tsnareduced10 ./MASTER.sample.json.200k
UMAP: : 16131.221890108 seconds
HDBSCAN: : 37.84411430199907 seconds
Num Clusters: 5755
Noise Articles: 84963
Layout (TSNE): 14462.390557632001 seconds
Render: 1.2095746349987166 seconds
#18m45.838s

time python3 ./runn.py ./output-350k-umaphdbscan-tsnareduced10 ./MASTER.sample.json.350k
UMAP: : 37198.735697563 seconds
HDBSCAN: : 72.86624144199595 seconds
Num Clusters: 10341
Noise Articles: 144567
Layout (TSNE): 19286.276156513 seconds
Render: 1.0964710260013817 seconds
#36m37.211s

Next, what if we halve the final reduced dimensions from 10 to 5? We'll use UMAP to reduce the embeddings to 5 dimensions instead of 10 for HDBSCAN, while the t-SNE layout still uses the original full-resolution embeddings. This does accelerate the HDBSCAN processing, but the speedup is less noticeable given that the majority of the runtime is spent in the t-SNE layout:

time python3 ./runn.py ./output-500-umap5hdbscan-tsnafull ./MASTER.sample.json.500
UMAP: : 470.496131752 seconds
HDBSCAN: : 0.013146705999986352 seconds
Num Clusters: 17
Noise Articles: 185
Layout (TSNE): 2542.394735153 seconds
Render: 1.2131023789997926 seconds
#1m4.738s

time python3 ./runn.py ./output-10k-umap5hdbscan-tsnafull ./MASTER.sample.json.10k
UMAP: : 1523.339983019 seconds
HDBSCAN: : 0.3251564800000324 seconds
Num Clusters: 280
Noise Articles: 3777
Layout (TSNE): 10200.13727831 seconds
Render: 1.1970147199990606 seconds
#3m48.241s

time python3 ./runn.py ./output-100k-umap5hdbscan-tsnafull ./MASTER.sample.json.100k
UMAP: : 6030.816045222 seconds
HDBSCAN: : 8.934514333000152 seconds
Num Clusters: 2791
Noise Articles: 43781
Layout (TSNE): 13110.000106106001 seconds
Render: 1.209525507001672 seconds
#9m55.496s

time python3 ./runn.py ./output-200k-umap5hdbscan-tsnafull ./MASTER.sample.json.200k
UMAP: : 18515.067846399 seconds
HDBSCAN: : 18.549581690000196 seconds
Num Clusters: 5645
Noise Articles: 83175
Layout (TSNE): 18684.747698655996 seconds
Render: 1.2559453029971337 seconds
#20m1.666s

time python3 ./runn.py ./output-350k-umap5hdbscan-tsnafull ./MASTER.sample.json.350k
UMAP: : 54159.975082827994 seconds
HDBSCAN: : 36.07115693599917 seconds
Num Clusters: 10179
Noise Articles: 140631
Layout (TSNE): 31377.275541920004 seconds
Render: 1.1267633289971855 seconds
#43m17.834s

Let's repeat the process by using the UMAP-reduced 5-dimension embeddings for both HDBSCAN and the t-SNE layout. This yields a measurable speedup across the board.

time python3 ./runn.py ./output-500-umap5hdbscan-tsnareduced5 ./MASTER.sample.json.500
UMAP: : 70.59044030000001 seconds
HDBSCAN: : 0.9316715310000063 seconds
Num Clusters: 16
Noise Articles: 171
Wrote Clusters
Layout (TSNE): 80.408629557 seconds
Render: 1.1585667530000023 seconds
#0m18.249s

time python3 ./runn.py ./output-10k-umap5hdbscan-tsnareduced5 ./MASTER.sample.json.10k
UMAP: : 872.813494192 seconds
HDBSCAN: : 0.31622454399996514 seconds
Num Clusters: 305
Noise Articles: 4202
Wrote Clusters
Layout (TSNE): 1104.881352882 seconds
Render: 1.1129478450000079 seconds
#1m12.285s

time python3 ./runn.py ./output-100k-umap5hdbscan-tsnareduced5 ./MASTER.sample.json.100k
UMAP: : 4547.908635771 seconds
HDBSCAN: : 8.437193822000154 seconds
Num Clusters: 2815
Noise Articles: 43241
Wrote Clusters
Layout (TSNE): 3174.993998594 seconds
Render: 1.0414020039997922 seconds
#6m16.179s

time python3 ./runn.py ./output-200k-umap5hdbscan-tsnareduced5 ./MASTER.sample.json.200k
UMAP: : 17211.637760808 seconds
HDBSCAN: : 16.488291566001863 seconds
Num Clusters: 5748
Noise Articles: 84971
Wrote Clusters
Layout (TSNE): 5928.9593736940005 seconds
Render: 1.1873650999987149 seconds
#15m1.042s

time python3 ./runn.py ./output-350k-umap5hdbscan-tsnareduced5 ./MASTER.sample.json.350k
UMAP: : 48287.342618541996 seconds
HDBSCAN: : 33.55415799700131 seconds
Num Clusters: 10185
Noise Articles: 142834
Wrote Clusters
Layout (TSNE): 10488.731985585 seconds
Render: 1.1234332350068144 seconds
#33m57.061s

What if we increase the number of dimensions up to 25? We'll use UMAP to reduce the 512-dimension USEv4 embeddings down to 25 and use the reduced vectors for both HDBSCAN clustering and t-SNE layout. Surprisingly, the runtime is indistinguishable from our 5-dimension version above:

time python3 ./runn.py ./output-350k-umap25hdbscan-tsnareduced25 ./MASTER.sample.json.350k
UMAP: : 43326.55674977801 seconds
HDBSCAN: : 163.0063349569973 seconds
Num Clusters: 10336
Noise Articles: 144698
Wrote Clusters
Layout (TSNE): 11761.454847369998 seconds
Render: 1.098095370005467 seconds
#34m25.904s

How about 50? This time we see a substantial increase in the HDBSCAN runtime and a slight increase in t-SNE:

time python3 ./runn50.py ./output-350k-umap50hdbscan-tsnareduced50 ./MASTER.sample.json.350k
UMAP: : 57178.297876740995 seconds
HDBSCAN: : 424.89538593400357 seconds
Num Clusters: 10330
Noise Articles: 143211
Wrote Clusters
Layout (TSNE): 12054.680588453994 seconds
Render: 1.1105074740044074 seconds
#42m58.218s

How about 100? Runtime increases 5.6x to more than 4 hours on a 64-core VM, though much of it is spent in the single-threaded hot regions of the underlying algorithms, offering limited possibilities for acceleration.

time python3 ./runn.py ./output-350k-umap100hdbscan-tsnareduced100 ./MASTER.sample.json.350k
UMAP: : 76983.088959628 seconds
HDBSCAN: : 12140.335241405992 seconds
Num Clusters: 10303
Noise Articles: 143785
Layout (TSNE): 12813.938087203001 seconds
Render: 1.2041103609954007 seconds
#243m8.815s

And what about the other direction: just 2 dimensions, to maximally reduce the computational load for the downstream analyses? Strangely, despite reducing the dimensionality considered by HDBSCAN and t-SNE, the total runtime increases measurably. UMAP's execution time increases substantially, eclipsing the decrease in HDBSCAN's runtime, while the t-SNE layout does not improve substantially. Clearly t-SNE's scaling is governed more by the total number of embeddings than by their dimensionality, meaning dimensionality reduction will have only limited ability to overcome scaling limits when processing larger article volumes.

time python3 ./runn.py ./output-350k-umap2hdbscan-tsnareduced2 ./MASTER.sample.json.350k
UMAP: : 112810.46946521799 seconds
HDBSCAN: : 22.663263166003162 seconds
Num Clusters: 11688
Noise Articles: 123621
Layout (TSNE): 10213.084812364003 seconds
Render: 1.2965627959929407 seconds
#50m19.769s
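
For anyone wanting to reproduce this dimensionality sweep, the runs above differ only in the n_components value passed to UMAP, with the same reduced array then handed to HDBSCAN and t-SNE. A minimal sketch of how such a sweep might be scripted (again using a random stand-in for the real embedding matrix):

import time
import numpy as np
import umap
import hdbscan
from sklearn.manifold import TSNE

# Stand-in for the real (n_articles, 512) USEv4 embedding matrix.
embeddings = np.random.rand(10000, 512).astype(np.float32)

for n_dims in [100, 50, 25, 10, 5, 2]:
    t0 = time.process_time()
    reduced = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=n_dims).fit_transform(embeddings)
    print(f"UMAP ({n_dims}d): {time.process_time() - t0} seconds")

    t0 = time.process_time()
    clusters = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
    print(f"HDBSCAN ({n_dims}d): {time.process_time() - t0} seconds")

    t0 = time.process_time()
    layout = TSNE(n_components=2, random_state=42).fit_transform(reduced)
    print(f"TSNE ({n_dims}d): {time.process_time() - t0} seconds")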

Finally, what about one last workflow: using UMAP to reduce dimensionality down to 10, then using t-SNE to reduce to 2 dimensions and using this final 2D embedding dataset for both HDBSCAN and visualization? Prior to this we have only used t-SNE for visualization, not to assist HDBSCAN clustering. In theory, using the exact same t-SNE-produced 2D dataset should allow HDBSCAN to capture the tendril-like fine structural clustering of t-SNE's layout and preserve that in its produced clustering, making the visual structure accessible in machine-readable format through the HDBSCAN-assigned cluster IDs. This runs at around the same speed as our earlier run that did not use t-SNE for HDBSCAN.

time python3 ./runn.py ./output-350k-umaptosne10hdbscan-tsnareduced10paired ./MASTER.sample.json.350k
Dimensionality Reduction: : 53249.703994692994 seconds
HDBSCAN: : 26.11638962500001 seconds
Num Clusters: 16097
Noise Articles: 59039
Layout (TSNE): 6.258199573494494e-05 seconds
Render: 0.9565012719976949 seconds
#32m0.990s
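
For clarity, the paired workflow from this last run can be sketched as follows (an illustrative condensation, again using a random stand-in for the real embedding matrix): UMAP reduces to 10 dimensions, t-SNE collapses those to 2, and the same 2D coordinates are then used for both HDBSCAN clustering and the rendered layout.

import numpy as np
import umap
import hdbscan
from sklearn.manifold import TSNE

# Stand-in for the real (n_articles, 512) USEv4 embedding matrix.
embeddings = np.random.rand(10000, 512).astype(np.float32)

# UMAP: 512 -> 10 dimensions.
reduced10 = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10).fit_transform(embeddings)

# t-SNE: 10 -> 2 dimensions; this array doubles as the final plot coordinates.
layout2d = TSNE(n_components=2, random_state=42).fit_transform(reduced10)

# HDBSCAN clusters on the exact 2D coordinates being plotted, so the assigned cluster IDs
# track the visible tendril structure and the coloration follows the layout.
clusters = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(layout2d)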

What do the results actually look like? First, let's look at the PCA results. Here is PCA-reduced (to 10 dimensions) HDBSCAN with full resolution t-SNE at 350K articles. The clustering results are largely useless, with nearly the entire dataset reduced to noise (purplish red):

Below is PCA-reduced HDBSCAN with reduced t-SNE. This is effectively useless, with PCA's globalization of the structure preventing t-SNE from resolving any meaningful structure. Clearly PCA is not a desirable reduction algorithm here: it is only marginally faster than UMAP and yields terrible results:

What about the same UMAP reduction for HDBSCAN (to 10 dimensions) and full-resolution (512 dimensions) t-SNE that we used yesterday? For 350K articles, we get exactly what we got yesterday:

Things get really interesting, however, when we use those same UMAP-reduced embeddings for the t-SNE layout as well. In other words, we use UMAP to reduce our 512-dimension vectors down to 10 dimensions and then use those same reduced 10-dimension vectors for both HDBSCAN clustering and t-SNE layout. All of the diffuse areas above become tendril-like tight interlinked clusters below. This version, while marginally faster, is actually more understandable in many ways, exposing much finer-grained structure. Fascinatingly, the noise articles that form many of the diffuse regions above suddenly become tightly clustered meaningful organizations of articles below.

Looking at the reduced 200K sampled version we can see these results even more clearly. Here is the original UMAP-reduced HDBSCAN + full-resolution t-SNE layout:

And here is the version using the same UMAP-reduced embeddings (10 dimensions) for t-SNE layout, showing just how strongly it clusters the results:

Let's look at the impact of UMAP dimensionality reduction as we sequentially reduce the number of dimensions considered by t-SNE and HDBSCAN. We'll start with using 10 dimensions for HDBSCAN and the full 512-dimension embeddings for t-SNE layout. This yields large diffuse clusters:

Now we'll use 100 dimensions for both HDBSCAN and t-SNE layout. This results in a drastic change, with the diffuse point clouds that define our full-resolution t-SNE layouts replaced with long interconnected tendrils:

Moving to 50 dimensions for both HDBSCAN and t-SNE layout has only a minimal impact:

Then 25 dimensions for both HDBSCAN and t-SNE layout. This looks nearly identical to 50 dimensions.

Then 10 dimensions for both HDBSCAN and t-SNE layout. Again, few changes.

Then 5 dimensions for both HDBSCAN and t-SNE layout. The results are nearly identical, though perhaps slightly more tendril-like fine structure.

And finally, 2 dimensions for both HDBSCAN and t-SNE layout (with t-SNE then performing 2D layout based on the 2D collapse performed by UMAP). Unlike the previous reductions, which yielded little difference from step to step, this produces a wholesale transformation, organizing the day exclusively into long tendril-like curved structures. Diffuse areas become reorganized into many curved interconnected tendrils. Does this additional structure capture enhanced semantic meaning from the underlying texts, organizing them into meaningful microstructures or does the additional structure actually harm understanding by grouping together unrelated clusters of articles?

Finally, how about our last workflow, where we used UMAP to reduce to 10 dimensions, then t-SNE to collapse to 2D embeddings, and used the t-SNE-produced 2D embeddings for both HDBSCAN clustering and visualization? This time we can see much stronger alignment between the coloration and clustering, as HDBSCAN is able to capture the t-SNE-constructed structure, quantifying it into machine-friendly groupings – effectively transforming the visual representation into a machine-understandable representation and synchronizing the coloration and layout.

The final Python script used for the analyses here can be found below. Save to "runn.py" and selectively edit accordingly:

#########################
#LOAD THE EMBEDDINGS...

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import jsonlines
import multiprocessing
import sys

# Load the JSON file containing embedding vectors
def load_json_embeddings(filename):
    embeddings = []
    titles = []
    langs = []
    with jsonlines.open(filename) as reader:
        for line in reader:
            embeddings.append(line["embed"])
            titles.append(line["title"])
            langs.append(line["lang"])
    return np.array(embeddings), titles, langs

#json_file = 'MASTER.sample.json.500'
json_file = sys.argv[2]
embeddings, titles, langs = load_json_embeddings(json_file)
len(embeddings)

import hdbscan
import umap
from sklearn.decomposition import PCA
import time
clusters = []
embeddings_reduced = []
def cluster_embeddings(embeddings, min_cluster_size=5):
    global clusters, embeddings_reduced
    start_time = time.process_time()
    #embeddings_reduced = PCA(n_components=10, random_state=42).fit_transform(embeddings)
    embeddings_reduced = umap.UMAP(n_neighbors=30,min_dist=0.0,n_components=10,verbose=True).fit_transform(embeddings)
    #embeddings_reduced = umap.UMAP(n_neighbors=30,min_dist=0.0,n_components=5,verbose=True).fit_transform(embeddings)
    end_time = time.process_time()
    elapsed_time = end_time - start_time
    print(f"Dimensionality Reduction: : {elapsed_time} seconds")
    start_time = time.process_time()
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, core_dist_n_jobs = multiprocessing.cpu_count())
    clusters = clusterer.fit_predict(embeddings_reduced)
    end_time = time.process_time()
    elapsed_time = end_time - start_time
    print(f"HDBSCAN: : {elapsed_time} seconds")
    #return clusters

#cluster...
cluster_embeddings(embeddings, 5)

#output some basic statistics...
unique_clusters = np.unique(clusters)
print("Clusters: " + str(unique_clusters))
print("Num Clusters: " + str(len(unique_clusters)))
cnt_noise = np.count_nonzero(clusters == -1)
print("Noise Articles: " + str(cnt_noise))

#write the output...
with open(sys.argv[1] + '.tsv', 'w', encoding='utf-8') as tsv_file:
    for cluster_id, title, lang in zip(clusters, titles, langs):
        tsv_file.write(f"{cluster_id}\t{title}\t{lang}\n")
print("Wrote Clusters")

#########################
#VISUALIZE...

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import plotly.graph_objs as graph
import plotly.io as pio

def plotPointCloud(embeddings, clusters, titles, langs, title, algorithm):
    if (algorithm == 'PCA'):
        start_time = time.process_time()
        pca = PCA(n_components=2, random_state=42)
        embeds = pca.fit_transform(embeddings)
        end_time = time.process_time()
        elapsed_time = end_time - start_time
        print(f"Layout (PCA): {elapsed_time} seconds")
    if (algorithm == 'TSNE'):
        start_time = time.process_time()
        tsne = TSNE(n_components=2, random_state=42)
        embeds = tsne.fit_transform(embeddings)
        #embeds = tsne.fit_transform(embeddings_reduced)
        end_time = time.process_time()
        elapsed_time = end_time - start_time
        print(f"Layout (TSNE): {elapsed_time} seconds")
    
    start_time = time.process_time()
    trace = graph.Scatter(
        x=embeds[:, 0],
        y=embeds[:, 1],
        marker=dict(
            size=6,
            #color=np.arange(len(embeds)), #color randomly
            color=clusters, #color via HDBSCAN clusters
            colorscale='Rainbow',
            opacity=0.8
        ),
        #text = langs,
        #text = titles,
        #text = [title[:20] for title in titles],
        hoverinfo='text',
        #mode='markers+text',
        mode='markers',
        textposition='bottom right'
    )
    
    layout = graph.Layout(
        #title=title,
        xaxis=dict(title=''),
        yaxis=dict(title=''),
        height=6000,
        width=6000,
        plot_bgcolor='rgba(255,255,255,255)',
        hovermode='closest',
        hoverlabel=dict(bgcolor="black", font_size=14)
    )
    
    fig = graph.Figure(data=[trace], layout=layout)
    fig.update_layout(autosize=True)
    
    # Save the figure as a PNG image
    pio.write_image(fig, sys.argv[1] + '.png', format='png')
    end_time = time.process_time()
    elapsed_time = end_time - start_time
    print(f"Render: {elapsed_time} seconds")

# Call the function with your data
#clusters = []
plotPointCloud(embeddings, clusters, titles, langs, 'Title', 'TSNE')
#########################
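
As a usage note, the script takes two command-line arguments: an output filename prefix (used for both the ".tsv" cluster listing and the ".png" rendering) and the input JSON file of embeddings, exactly as in the "time python3 ./runn.py ./output-350k-umaphdbscan-tsnafull ./MASTER.sample.json.350k" invocations shown throughout this post.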

For those interested in downloading the underlying TSV files for further analysis to see how various methods split the groupings different ways, you can download them via the URLs below:

https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-100k-pcahdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-100k-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-100k-umap5hdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-100k-umap5hdbscan-tsnareduced5.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-100k-umaphdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-100k-umaphdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-10k-pcahdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-10k-pcahdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-10k-umap5hdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-10k-umap5hdbscan-tsnareduced5.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-10k-umaphdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-10k-umaphdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-200k-pcahdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-200k-pcahdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-200k-umap5hdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-200k-umap5hdbscan-tsnareduced5.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-200k-umaphdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-200k-umaphdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-pcahdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-pcahdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umap100hdbscan-tsnareduced100.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umap25hdbscan-tsnareduced25.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umap2hdbscan-tsnareduced2.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umap50hdbscan-tsnareduced50.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umap5hdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umap5hdbscan-tsnareduced5.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umaphdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umaphdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-350k-umaptosne10hdbscan-tsnareduced10paired.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-500-pcahdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-500-pcahdbscan-tsnareduced10.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-500-umap5hdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-500-umap5hdbscan-tsnareduced5.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-500-umaphdbscan-tsnafull.tsv
https://storage.googleapis.com/data.gdeltproject.org/blog/2023-atscaleembeddingclusteringexperiments/output-500-umaphdbscan-tsnareduced10.tsv