As we continue our embedding series, we've demonstrated that the length of the input text can affect the resulting embeddings, with a downstream impact on similarity scores that ranges from severe to minor. One limitation of our previous tests was their reliance on artificial text (a single word repeated) or very short (single-sentence) content. To truly test the performance of embedding models under real-world conditions, we need a greater variety of content lengths, styles and levels of detail. While we could simply excerpt paragraphs from real news articles, we want to conduct a controlled experiment with precise manipulation of detail level and style, which would be difficult to do by searching the open web for sample texts. We also want a more generalized framework that allows rapid iteration and testing across any use case. To demonstrate this, we're going to use an LLM to generate human-like synthetic text, with prompts that guide the text towards specific lengths, styles, detail levels and entities. Using this workflow as a template, you can modify the LLM prompts below to create testing data for any use case, including in other languages, making it possible to generate small to medium-sized, fully customized testing datasets in seconds.
We'll then use our embedding visualization template to cluster the generated passages using the same set of models as before: the English-only USEv4, the larger English-only USEv5-Large, the USEv3-Multilingual and larger USEv3-Multilingual-Large models (both supporting 16 languages: Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian), the 100-language LaBSEv2 model optimized for translation-pair scoring, and the Vertex AI Embeddings for Text API.
Overall, these findings reinforce the impact of textual length on embedding similarity, with similar passages being stratified by length, though it is unclear to what degree the additional text may be legitimately nudging the passages apart by making their topical spaces more diffuse and mismatched. Regardless, all of the embedding models tested here exhibit sensitivity to length, though further experiments will be required to control for the increased topical coverage of that additional text.
Below is the set of prompts we used with a major commercial LLM to generate our benchmark dataset. We test the impact of length (20-, 50-, 100- and 200-word passages), style (statistics, medical professionals, "highly technical medical terminology") and entities ("summarizes Dr. Fauci's advice"):
"Write a 100-word paragraph about social distancing that includes some statistics."
"Write a 100-word paragraph about social distancing that quotes medical professionals."
"Write a 100-word paragraph about social distancing that gives the range of safe distances."
"Write a 100-word paragraph about social distancing that also mentions vaccines and masks."
"Write a 200-word paragraph about social distancing that includes some statistics."
"Write a 20-word paragraph about social distancing that includes some statistics."
"Write a 200-word paragraph about social distancing written for doctors using highly technical medical terminology."
"Write a 200-word paragraph about social distancing that summarizes Dr. Fauci's advice."
"Write a 50-word paragraph about social distancing that summarizes Dr. Fauci's advice."
To test how each model might cluster the resulting passages against user queries, we created a set of four example user queries and added them to the list. The final list is below:
sentences = [ "what are the safest distances for social distancing?", "how much of a decrease in number of infections does social distancing lead to?", "minimum safe distance for respiratory droplet protection?", "what did fauci say about social distancing?", "what is the latest scientific evidence about social distancing, especially the recommended safe distances to stand apart and the underlying science about respiratory droplets and protective barriers as strategies?", "Social distancing has emerged as a crucial measure to mitigate the spread of infectious diseases, as evidenced by recent global events. With the aim of reducing person-to-person contact, this practice has proven effective in curbing the transmission of viruses. According to a study conducted by renowned epidemiologists, maintaining a distance of at least six feet (approximately two meters) from others can significantly decrease the risk of infection by up to 80%. Furthermore, statistical analysis reveals that areas implementing rigorous social distancing measures have experienced a notable decline in the rate of new cases, showcasing the direct correlation between adherence to these guidelines and overall public health outcomes.", "Social distancing has emerged as a crucial measure in combating the spread of contagious diseases, with its impact being undeniable. Studies indicate that maintaining a distance of at least six feet (two meters) from others can significantly reduce the transmission of respiratory droplets containing pathogens. In fact, a recent analysis revealed that implementing social distancing measures led to a 23% decrease in the incidence of influenza-like illnesses during the previous flu season. 
Furthermore, data from a nationwide survey indicated that 80% of individuals practiced social distancing, emphasizing its widespread adoption as a vital strategy in safeguarding public health and curbing the spread of infectious diseases.", "Social distancing is a crucial measure implemented to curb the spread of contagious diseases, such as COVID-19. By maintaining a safe physical distance from others, we can minimize the risk of transmission. Recent statistics reveal the significant impact of social distancing in combating the pandemic. A study found that regions with strict social distancing measures experienced a 30% reduction in infection rates compared to those without such measures. Additionally, research suggests that if 80% of the population adheres to social distancing guidelines, it could potentially reduce COVID-19 cases by more than 90%. These numbers underscore the importance of social distancing as an effective tool to protect public health and prevent the spread of infectious diseases.", "Medical professionals worldwide have emphasized the critical role of social distancing in preventing the spread of infectious diseases. Dr. Emily Smith, an epidemiologist, states, 'Maintaining physical distance is essential to minimize close contact, which is a primary mode of transmission for viruses like COVID-19.' Dr. Michael Johnson, a public health expert, adds, 'Social distancing acts as a protective barrier, reducing the chances of respiratory droplets reaching others and interrupting the chain of transmission.' Dr. Sarah Patel, an infectious disease specialist, emphasizes, 'By following social distancing guidelines, we can protect ourselves, our loved ones, and vulnerable populations from the devastating effects of viral outbreaks.' 
Their expert opinions highlight the consensus among medical professionals regarding the crucial role of social distancing in safeguarding public health.", "According to medical professionals, social distancing is an essential practice to mitigate the transmission of infectious diseases. Dr. Emily Johnson, an epidemiologist, emphasizes, 'Maintaining a distance of at least six feet from others is crucial to minimize the risk of respiratory droplets containing viruses reaching you.' Dr. Michael Ramirez, a renowned infectious disease specialist, states, 'Social distancing is a powerful tool that can help break the chain of transmission and prevent overwhelming healthcare systems.' These experts agree that following social distancing guidelines, along with other preventive measures, is crucial in protecting ourselves and our communities from the spread of contagious diseases.", "Social distancing is a vital practice that involves maintaining a safe physical distance from others to reduce the risk of disease transmission. Health authorities recommend a range of safe distances to adhere to in different settings. In general, a distance of at least six feet (two meters) is widely recommended as a safe distance between individuals in public spaces. However, in certain situations, such as crowded areas or when interacting with vulnerable populations, a greater distance may be advisable. Ultimately, the specific safe distance may vary depending on local guidelines and the nature of the disease. Adhering to these recommended distances is crucial in safeguarding public health and minimizing the spread of contagious diseases.", "Social distancing guidelines recommend maintaining a safe distance from others to reduce the risk of transmitting infectious diseases. The range of safe distances varies depending on the context and specific recommendations from health authorities. 
In general, a common guideline suggests maintaining a minimum distance of about six feet or two meters from individuals outside our immediate household. However, it is essential to note that different situations may call for adjusting this range. For instance, in settings where respiratory droplets can spread more easily, such as healthcare facilities or crowded public spaces, a greater distance might be advisable. Adhering to the recommended safe distances plays a crucial role in safeguarding public health and minimizing the spread of contagious illnesses.", "Social distancing, combined with vaccines and mask-wearing, forms a robust strategy to combat the spread of infectious diseases. While vaccines provide protection against the virus, social distancing remains vital to reduce transmission, especially in areas with low vaccination rates. Dr. Sarah Thompson, an immunologist, explains, 'Vaccines greatly reduce the risk of severe illness, but they may not prevent asymptomatic transmission entirely. Social distancing helps limit close contact, minimizing the chance of viral spread.' Additionally, wearing masks acts as an additional barrier by preventing respiratory droplets from being released into the air. By practicing all three measures diligently, we can collectively contribute to the collective effort in safeguarding public health and curbing the spread of contagious diseases.", "Social distancing, a crucial measure implemented to curb the spread of contagious diseases, has shown significant impact in reducing transmission rates. Recent statistics highlight its effectiveness and underline the importance of adhering to this practice. A study conducted by the World Health Organization (WHO) revealed that regions implementing strict social distancing measures experienced a 30% reduction in infection rates compared to areas without such measures. 
Furthermore, research conducted by epidemiologists at Johns Hopkins University demonstrated that maintaining a physical distance of six feet or more can reduce the risk of viral transmission by up to 80%. The impact of social distancing becomes even more pronounced when a substantial portion of the population adheres to these guidelines. According to a modeling study published in the journal Science, if 80% of the population follows social distancing protocols, it could potentially reduce COVID-19 cases by more than 90%. This staggering statistic highlights the effectiveness of social distancing in mitigating the spread of infectious diseases. Beyond COVID-19, social distancing has also shown efficacy in curbing the transmission of other respiratory illnesses. A study published in The Lancet estimated that implementing social distancing measures during the 2009 H1N1 influenza pandemic reduced the rate of infection by 29% to 37%. These figures emphasize the broader applicability and long-term significance of social distancing as a preventive strategy. In conclusion, social distancing, supported by robust statistical evidence, remains an indispensable tool in safeguarding public health and preventing the rapid spread of contagious diseases. By adhering to recommended guidelines and maintaining a safe physical distance, we can collectively contribute to breaking the chain of transmission and protecting vulnerable populations.", "Social distancing is a crucial measure implemented to curb the spread of contagious diseases, such as COVID-19. By maintaining a safe physical distance from others, we can minimize the risk of transmission. Recent statistics reveal the significant impact of social distancing in combating the pandemic. According to a study conducted by the World Health Organization (WHO), areas that implemented strict social distancing measures experienced a substantial reduction in infection rates. 
In fact, regions with social distancing guidelines in place observed a 30% decrease in new cases compared to those without such measures. Furthermore, research suggests that the effectiveness of social distancing is directly linked to compliance rates. A study published in The Lancet found that if 80% of the population adheres to social distancing guidelines, it could potentially reduce COVID-19 cases by more than 90%. These numbers underscore the importance of widespread adherence to social distancing practices. Additionally, data from contact tracing efforts have demonstrated the effectiveness of maintaining distance. A study conducted by the Centers for Disease Control and Prevention (CDC) found that close contacts of COVID-19 cases were less likely to contract the virus when they maintained a distance of at least six feet from the infected individual. These statistics highlight the critical role social distancing plays in mitigating the spread of infectious diseases. Adhering to recommended physical distancing guidelines, alongside other preventive measures like mask-wearing and vaccination, can significantly reduce the risk of transmission and protect public health.", "Social distancing has been shown to reduce COVID-19 infection rates by 30% in areas with strict measures implemented.", "Social distancing reduces infection rates by about 30% and could lead to over 90% reduction in COVID-19 cases with high compliance.", "Social distancing, a crucial measure during the pandemic, reduced COVID-19 transmission by up to 80% according to studies.", "Social distancing, also referred to as physical distancing, is a fundamental preventive measure implemented to mitigate the transmission of infectious diseases within populations. It involves maintaining a safe physical distance between individuals to minimize the potential for close contact and the subsequent spread of pathogens. 
By adhering to social distancing guidelines, individuals aim to reduce the risk of respiratory droplet transmission, which is a common mode of transmission for respiratory infections. Various studies have demonstrated the efficacy of social distancing in controlling disease outbreaks. For instance, research has shown that implementing strict social distancing measures can result in a significant decline in infection rates. In certain contexts, maintaining a distance of approximately six feet or two meters is recommended to ensure adequate protection. Adhering to this recommended distance decreases the likelihood of respiratory droplets containing infectious agents reaching susceptible individuals. It is important to note that the effectiveness of social distancing is contingent upon widespread compliance. By diligently practicing social distancing alongside other preventive measures, such as mask-wearing and vaccination, healthcare systems and communities can better manage the spread of contagious diseases and safeguard public health.", "Social distancing, known as physical distancing in medical parlance, constitutes a fundamental prophylactic modality harnessed to mitigate the transference of infectious pathogenic agents amidst populations. It encompasses the deliberate maintenance of a safe interindividual physical space to minimize the potential for proximal contact and subsequent contagion propagation. By adhering diligently to the tenets of social distancing, practitioners strive to curtail the risk of transmission via respiratory droplets, which typically serves as a predominant mode of respiratory infection dissemination. Empirical investigations have consistently demonstrated the efficacy of social distancing in exerting control over disease outbreaks. Notably, imposing stringent social distancing measures has evinced substantial declines in infection rates across various scenarios. 
The convention of sustaining an approximate distance of six feet or two meters serves as a prevalent recommendation, thereby enabling an appropriate preventive envelope. By meticulously adhering to this stipulated spatial expanse, the prospects of respiratory droplets carrying infectious agents accessing vulnerable subjects undergo marked reduction. It is of paramount importance to underscore the pivotal reliance on extensive adherence to social distancing protocols. Concomitant implementation of this practice in tandem with adjunctive preventive modalities, including mask utilization and vaccination, shall augment the manifold endeavors to effectively manage contagion propagation and safeguard the public welfare.", "Social distancing, known as spatial segregation, entails the intentional maintenance of an appropriate physical distance between individuals, thereby impeding the potential for close-range contact and subsequent transmission of communicable pathogens. This preventive measure, predominantly applied in the context of infectious disease control, aims to curtail the spread of infectious agents through respiratory droplets or aerosols. Empirical evidence corroborates the efficacy of spatial segregation in mitigating disease outbreaks, with robust studies elucidating a notable reduction in infection rates upon strict implementation. The recommended minimum distance for optimal protection is commonly approximated at six feet or two meters, based on established scientific rationale for curtailing the dissemination of respiratory droplets laden with pathogenic agents. It is imperative to underscore that the effectiveness of spatial segregation is predicated on broad compliance across the populace. 
Concurrent integration of complementary preventative measures, such as facial barrier usage and widespread immunization, augments the impact of social distancing in fostering an environment that efficiently manages the transmission of contagious diseases and bolsters public health resilience. Attentive adherence to these multifaceted strategies is pivotal for clinicians to mitigate disease transmission and uphold their commitment to patient welfare.", "Dr. Anthony Fauci, a renowned infectious disease expert, has consistently emphasized the importance of social distancing in mitigating the spread of infectious diseases. His advice underscores the significance of maintaining a safe physical distance from others, particularly during outbreaks or pandemics. Dr. Fauci stresses that adhering to social distancing guidelines, such as maintaining at least six feet of separation, is crucial in reducing the risk of person-to-person transmission, especially when respiratory droplets are a primary mode of infection. He also highlights that social distancing is not limited to public spaces but should be practiced in all settings, including homes and workplaces. Dr. Fauci emphasizes that social distancing, when combined with other preventive measures like mask-wearing and hand hygiene, plays a vital role in breaking the chain of transmission. Furthermore, he encourages individuals to follow local health guidelines and recommendations from public health authorities to ensure the most effective implementation of social distancing measures. Dr. Fauci's advice underscores the importance of individual responsibility in safeguarding public health and underscores the significant impact that collective adherence to social distancing can have in controlling the spread of contagious diseases.", "Dr. Fauci, a renowned infectious disease expert, emphasizes the importance of social distancing to curb the spread of COVID-19. 
His advice includes maintaining at least six feet of distance, avoiding crowded places, wearing masks, and practicing good hand hygiene. These measures help protect individuals and communities from the virus.", ]
Despite LLMs being touted as highly creative, we can see here that many of the passages are extremely similar, with the phrase "crucial measure" appearing in several of them and the majority beginning the passage with "social distancing". It is unclear whether this reflects a more limited range of language use around social distancing found in the LLM's training data or whether this topic triggered guardrails that artificially constrained the final output. Rerunning the prompt many times yielded highly similar results every time, suggesting this behavior is innate. In this case we chose to leave the passages as-is to test how well the models are able to abstract beyond exact versus similar word choices (especially for LaBSE).
It is worth noting that the prompts above could be modified to more strenuously test the ability of models to abstract beyond word choice. For example, a prompt could be worded like "rewrite the passage below to say exactly the same thing using different words" or similar to generate texts with known similarity scores that differ only in the words they use. Of course, LLM rewording would have to be evaluated for the impact of hallucination and inadvertent connotative or contextual shifts, but this offers the ability to generate large sample datasets with known similarity scores between passages in order to test how well the embedding models cluster them.
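Because the prompts in this post follow a fixed template, both the original prompts and the rewording variant suggested above can be generated programmatically. The following is only a minimal sketch of that idea: the function names (build_prompt, build_prompt_grid, build_rewrite_prompt, pair_with_rewrites) are hypothetical helpers invented for illustration, the call to the LLM itself is deliberately left out, and the sample rewrite is a hand-written placeholder rather than real model output.

```python
from itertools import product

# Wording taken from the rewording prompt suggested in the text
# (capitalized and punctuated here).
REWRITE_INSTRUCTION = (
    "Rewrite the passage below to say exactly the same thing using different words."
)


def build_prompt(word_count, focus, topic="social distancing"):
    """Fill the 'Write a N-word paragraph about TOPIC FOCUS.' template."""
    return f"Write a {word_count}-word paragraph about {topic} {focus}."


def build_prompt_grid(word_counts, focuses, topic="social distancing"):
    """Return one prompt for every (word count, focus) combination."""
    return [build_prompt(n, f, topic) for n, f in product(word_counts, focuses)]


def build_rewrite_prompt(passage):
    """Wrap a passage in the rewording instruction."""
    return f"{REWRITE_INSTRUCTION}\n\n{passage}"


def pair_with_rewrites(originals, rewrites):
    """Pair each original with its rewrite, tagged as expected to be equivalent."""
    if len(originals) != len(rewrites):
        raise ValueError("originals and rewrites must have the same length")
    return [
        {"original": o, "rewrite": r, "expected": "equivalent"}
        for o, r in zip(originals, rewrites)
    ]


# Full grid over the four lengths and two of the focus clauses used in this post.
prompts = build_prompt_grid(
    [20, 50, 100, 200],
    ["that includes some statistics", "that summarizes Dr. Fauci's advice"],
)
for prompt in prompts:
    print(prompt)

originals = ["Social distancing reduces the spread of infectious diseases."]
# Hand-written placeholder standing in for an LLM rewrite (illustration only).
rewrites = ["Keeping apart from others limits how contagious illnesses spread."]
pairs = pair_with_rewrites(originals, rewrites)
print(build_rewrite_prompt(originals[0]))
```

Note that the experiment in this post used a hand-picked subset of length and focus combinations rather than a full grid. The "expected" tag is only a bookkeeping label; any rewrite would still need to be checked for hallucination or shifts in meaning before being treated as a known-equivalent pair.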
We generated embeddings for each passage through each of the models and then visualized them via 2D and 3D PCA projection as before. One change from our previous visualization code is that, since textual length is so central to our analysis here, we updated it to add two new fields in parentheses to the start of each label. The first is the array ID of the sentence (0-offset) and the second is the length of the passage in characters. Thus, "(7) (774) Social distancing is a crucial measure implemented to curb the spread of co" tells us that this is sentence ID #7 (since arrays are 0-offset, this is the eighth sentence down in the list) and that it is 774 characters in length, even though the displayed text is truncated.
You can see the updated code for the 2D and 3D PCA visualizers at the end of this page and can simply swap out the two previous functions in your Colab notebooks with these versions.
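As a rough illustration of the label format described above (not the actual visualizer code, which appears at the end of this page), a label can be built from the array index, the full character length and a truncated prefix of the text. The helper name make_label is hypothetical, and the 75-character cutoff is an assumption inferred from the example label quoted above.

```python
def make_label(index, text, max_chars=75):
    """Build a '(index) (length) truncated-text' label.

    max_chars is an assumed cutoff, inferred from the example label in the text.
    The length field always reports the full, untruncated length.
    """
    return f"({index}) ({len(text)}) {text[:max_chars]}"


def make_labels(sentences, max_chars=75):
    """Label every sentence with its 0-offset array ID and character length."""
    return [make_label(i, s, max_chars) for i, s in enumerate(sentences)]


# Short demo using one query from the list and the opening of passage #7.
demo = [
    "what did fauci say about social distancing?",
    "Social distancing is a crucial measure implemented to curb the spread of "
    "contagious diseases, such as COVID-19.",
]
labels = make_labels(demo)
for label in labels:
    print(label)
```

Because the length is computed before truncation, the label still exposes how long each passage is even when only the first few words are displayed in the plot.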
Vertex AI
Here we can see the impact of input length on similarity score. Other than for passages where the LLM generated extremely similar text, passages appear to be clustered fairly consistently by length: passages of similar length cluster more closely together, while the shortest passages are scattered as isolates. Importantly, all four user queries are scored fairly distantly from the response passages, despite being specifically crafted to have very high semantic overlap with selected passages. On the one hand, this intuitively makes sense: the search queries are narrowly crafted and emphasize a single topic, yielding extremely focused embeddings, whereas the textual passages cover a broader range of topics, yielding more diffuse embeddings. On the other hand, this makes matching more difficult. A top-X similarity search would, in theory, still return the texts that are most similar. The caveat is that a long, diffuse text that contains the exact answer buried within a larger volume of unrelated text will likely score as less similar than a text that is topically better aligned with the query but lacks the exact answer.
The complete table is below:
[[0.99999932 0.75640763 0.79578782 0.70370049 0.77587715 0.73800688 0.73085578 0.7326817 0.69009474 0.72974972 0.79437397 0.79367713 0.69037738 0.72230964 0.73790276 0.72569559 0.71480263 0.71721094 0.74981202 0.75681271 0.75472359 0.71617591 0.71956724] [0.75640763 0.99999945 0.65086565 0.7064419 0.70722183 0.77867459 0.76963518 0.76741915 0.71044168 0.74035332 0.72754948 0.70782002 0.71590521 0.76881251 0.76385197 0.80588834 0.8254248 0.7805652 0.76236676 0.76613193 0.76277602 0.71969737 0.70921654] [0.79578782 0.65086565 0.9999995 0.61930836 0.72618023 0.68023953 0.702165 0.67412352 0.65608079 0.69412266 0.73795628 0.74384889 0.65514577 0.6745711 0.67820364 0.62401222 0.60656656 0.6391596 0.72686826 0.74097827 0.73965038 0.67789669 0.66819945] [0.70370049 0.7064419 0.61930836 0.99999954 0.73445195 0.69941023 0.69165063 0.68569251 0.71391254 0.71286797 0.69271484 0.66491468 0.67953071 0.68200901 0.67713103 0.71205488 0.70308313 0.69877213 0.68070964 0.70609219 0.71012701 0.78148148 0.81692109] [0.77587715 0.70722183 0.72618023 0.73445195 0.99999836 0.79874962 0.78886723 0.7764799 0.77200224 0.7822815 0.79160625 0.77828884 0.74094443 0.8056407 0.78711498 0.75770567 0.71901239 0.73851951 0.80745873 0.81041473 0.81252432 0.74434334 0.74032125] [0.73800688 0.77867459 0.68023953 0.69941023 0.79874962 0.99999638 0.95613061 0.93113507 0.89098462 0.92660224 0.92491794 0.89999915 0.85744791 0.9410216 0.94121636 0.85657038 0.8298646 0.85625482 0.94102176 0.94168995 0.92830242 0.88276134 0.84411868] [0.73085578 0.76963518 0.702165 0.69165063 0.78886723 0.95613061 0.99999646 0.94632592 0.87192345 0.91177165 0.91433123 0.88792871 0.83404058 0.94766728 0.94121991 0.86190119 0.83167576 0.87406844 0.93834104 0.93615331 0.92075173 0.87333782 0.82387259] [0.7326817 0.76741915 0.67412352 0.68569251 0.7764799 0.93113507 0.94632592 0.99999677 0.88438853 0.90427072 0.89640269 0.86113696 0.84778162 0.96943681 0.98149224 0.88389493 0.85689062 0.88212769 0.91891143 0.91341418 0.8951122 
0.86699629 0.83348649] [0.69009474 0.71044168 0.65608079 0.71391254 0.77200224 0.89098462 0.87192345 0.88438853 0.99999647 0.95201607 0.87529587 0.85076136 0.84570558 0.88645231 0.89653718 0.78773609 0.7793516 0.8088402 0.87915983 0.89211136 0.87843569 0.88670846 0.8692154 ] [0.72974972 0.74035332 0.69412266 0.71286797 0.7822815 0.92660224 0.91177165 0.90427072 0.95201607 0.99999634 0.93050639 0.90029678 0.866869 0.90212227 0.91191914 0.81236158 0.80347379 0.81889751 0.92210796 0.928811 0.91442707 0.90069992 0.87613971] [0.79437397 0.72754948 0.73795628 0.69271484 0.79160625 0.92491794 0.91433123 0.89640269 0.87529587 0.93050639 0.99999598 0.95362409 0.83093996 0.8850091 0.90757552 0.79225627 0.77577795 0.7953918 0.92217531 0.9260056 0.91483345 0.8815107 0.84776616] [0.79367713 0.70782002 0.74384889 0.66491468 0.77828884 0.89999915 0.88792871 0.86113696 0.85076136 0.90029678 0.95362409 0.9999962 0.8090337 0.85787662 0.87843887 0.76175765 0.75576089 0.75613102 0.90615102 0.90852232 0.89756951 0.86103946 0.81795731] [0.69037738 0.71590521 0.65514577 0.67953071 0.74094443 0.85744791 0.83404058 0.84778162 0.84570558 0.866869 0.83093996 0.8090337 0.99999615 0.86427993 0.86125227 0.79210618 0.78387054 0.77403187 0.87261424 0.87689713 0.87441998 0.84148056 0.81324903] [0.72230964 0.76881251 0.6745711 0.68200901 0.8056407 0.9410216 0.94766728 0.96943681 0.88645231 0.90212227 0.8850091 0.85787662 0.86427993 0.99999554 0.98210167 0.88434284 0.84905505 0.87417175 0.93524359 0.92495239 0.91252936 0.86339896 0.82309012] [0.73790276 0.76385197 0.67820364 0.67713103 0.78711498 0.94121636 0.94121991 0.98149224 0.89653718 0.91191914 0.90757552 0.87843887 0.86125227 0.98210167 0.99999522 0.86630004 0.85228513 0.86599666 0.93933021 0.92983147 0.91351446 0.87691613 0.83749483] [0.72569559 0.80588834 0.62401222 0.71205488 0.75770567 0.85657038 0.86190119 0.88389493 0.78773609 0.81236158 0.79225627 0.76175765 0.79210618 0.88434284 0.86630004 0.9999989 0.93593243 0.88646356 0.8372476 
0.84414558 0.84112251 0.76867304 0.77579582] [0.71480263 0.8254248 0.60656656 0.70308313 0.71901239 0.8298646 0.83167576 0.85689062 0.7793516 0.80347379 0.77577795 0.75576089 0.78387054 0.84905505 0.85228513 0.93593243 0.99999866 0.85232414 0.82985803 0.82968568 0.8289763 0.76636772 0.76259105] [0.71721094 0.7805652 0.6391596 0.69877213 0.73851951 0.85625482 0.87406844 0.88212769 0.8088402 0.81889751 0.7953918 0.75613102 0.77403187 0.87417175 0.86599666 0.88646356 0.85232414 0.99999851 0.82680487 0.82899441 0.81434729 0.78209668 0.76848237] [0.74981202 0.76236676 0.72686826 0.68070964 0.80745873 0.94102176 0.93834104 0.91891143 0.87915983 0.92210796 0.92217531 0.90615102 0.87261424 0.93524359 0.93933021 0.8372476 0.82985803 0.82680487 0.99999581 0.97712022 0.9653548 0.87401131 0.82510612] [0.75681271 0.76613193 0.74097827 0.70609219 0.81041473 0.94168995 0.93615331 0.91341418 0.89211136 0.928811 0.9260056 0.90852232 0.87689713 0.92495239 0.92983147 0.84414558 0.82968568 0.82899441 0.97712022 0.99999441 0.96967893 0.88629102 0.84421449] [0.75472359 0.76277602 0.73965038 0.71012701 0.81252432 0.92830242 0.92075173 0.8951122 0.87843569 0.91442707 0.91483345 0.89756951 0.87441998 0.91252936 0.91351446 0.84112251 0.8289763 0.81434729 0.9653548 0.96967893 0.99999486 0.87483606 0.8362413 ] [0.71617591 0.71969737 0.67789669 0.78148148 0.74434334 0.88276134 0.87333782 0.86699629 0.88670846 0.90069992 0.8815107 0.86103946 0.84148056 0.86339896 0.87691613 0.76867304 0.76636772 0.78209668 0.87401131 0.88629102 0.87483606 0.9999956 0.95466715] [0.71956724 0.70921654 0.66819945 0.81692109 0.74032125 0.84411868 0.82387259 0.83348649 0.8692154 0.87613971 0.84776616 0.81795731 0.81324903 0.82309012 0.83749483 0.77579582 0.76259105 0.76848237 0.82510612 0.84421449 0.8362413 0.95466715 0.99999728]]
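To make the top-X similarity search mentioned above concrete, the sketch below ranks the passages for a single query using one row of the table. The values are copied from the first row of the Vertex AI table above (query ID 0, "what are the safest distances for social distancing?"). The helper name rank_candidates is hypothetical, and restricting the candidates to IDs 5 through 22 reflects the layout of the sentences list, where the queries come first and the generated passages follow.

```python
def rank_candidates(similarity_row, candidate_ids, top_x=3):
    """Return the top_x (id, score) pairs among candidate_ids, highest score first."""
    scored = [(i, similarity_row[i]) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_x]


# First row of the Vertex AI table: similarity of query ID 0 to all 23 entries.
QUERY_0_ROW = [
    0.99999932, 0.75640763, 0.79578782, 0.70370049, 0.77587715, 0.73800688,
    0.73085578, 0.7326817, 0.69009474, 0.72974972, 0.79437397, 0.79367713,
    0.69037738, 0.72230964, 0.73790276, 0.72569559, 0.71480263, 0.71721094,
    0.74981202, 0.75681271, 0.75472359, 0.71617591, 0.71956724,
]

# IDs 0-4 are the user queries; IDs 5-22 are the generated passages.
PASSAGE_IDS = range(5, 23)

top = rank_candidates(QUERY_0_ROW, PASSAGE_IDS, top_x=3)
for passage_id, score in top:
    print(passage_id, round(score, 4))
```

For this row, the two highest-scoring passages are IDs 10 and 11, both of which discuss the range of safe distances, even though their absolute scores (about 0.79) are well below the 0.9+ scores seen between many of the passages themselves.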
Universal Sentence Encoder
USE yields results largely in line with those of Vertex:
[[ 0.9999997 0.45169628 0.3214715 0.46479958 0.5095659 0.2226175 0.26745355 0.37339768 0.30133083 0.25290817 0.40963453 0.33633476 0.22104517 0.34778708 0.36587605 0.22627635 0.266127 0.26101875 0.33185133 0.25373524 0.25351807 0.30065644 0.25055906] [ 0.45169628 0.99999994 0.08252919 0.45862046 0.23053509 0.24939737 0.2764443 0.3782475 0.27403048 0.21788454 0.15733229 0.12955561 0.22767998 0.3378724 0.34611446 0.32869703 0.49409437 0.3447796 0.32340765 0.25267404 0.20973432 0.28540358 0.20290762] [ 0.3214715 0.08252919 1. -0.03758498 0.35107782 0.1468137 0.18500352 0.17115237 0.25791422 0.22715083 0.39208266 0.42995125 0.23284176 0.16794202 0.18690154 0.01417665 0.13144253 0.05467898 0.28496885 0.17736961 0.27196634 0.19104493 0.2503485 ] [ 0.46479958 0.45862046 -0.03758498 0.99999976 0.26714092 0.17381531 0.24841383 0.29038057 0.23044193 0.20203613 0.10347656 0.07181363 0.13317598 0.30307364 0.2981696 0.15747505 0.17933142 0.17351006 0.21629968 0.19599286 0.17190611 0.32304174 0.22202998] [ 0.5095659 0.23053509 0.35107782 0.26714092 0.9999995 0.48652858 0.49076158 0.47908142 0.4728686 0.4360386 0.5137739 0.5377501 0.46368456 0.4865601 0.5131484 0.22219795 0.2901019 0.304316 0.5473639 0.52787566 0.56573045 0.4602453 0.42896903] [ 0.2226175 0.24939737 0.1468137 0.17381531 0.48652858 0.99999976 0.84192103 0.8166665 0.57681274 0.6336639 0.60866815 0.6220866 0.6317493 0.8292406 0.84906816 0.44656646 0.47578394 0.47372347 0.7553511 0.7091935 0.7217746 0.6971116 0.6058929 ] [ 0.26745355 0.2764443 0.18500352 0.24841383 0.49076158 0.84192103 0.99999994 0.83989024 0.689969 0.7376478 0.6495908 0.64682376 0.73420084 0.87083554 0.8424083 0.32802498 0.35740292 0.4724066 0.8095901 0.73782444 0.76554215 0.7818186 0.66194 ] [ 0.37339768 0.3782475 0.17115237 0.29038057 0.47908142 0.8166665 0.83989024 0.9999994 0.6828223 0.7227336 0.65497476 0.65283024 0.6856675 0.9108255 0.91996145 0.46891177 0.5280026 0.5446104 0.82764816 0.71706855 0.7245451 0.7888931 0.65001005] [ 0.30133083 
0.27403048 0.25791422 0.23044193 0.4728686 0.57681274 0.689969 0.6828223 1. 0.8494664 0.6313499 0.6202539 0.68574446 0.70034957 0.68611443 0.11566064 0.286282 0.27340177 0.6876683 0.60595053 0.61351115 0.78625137 0.63001305] [ 0.25290817 0.21788454 0.22715083 0.20203613 0.4360386 0.6336639 0.7376478 0.7227336 0.8494664 0.9999999 0.6535007 0.65249085 0.6805792 0.6976787 0.69981444 0.15908068 0.28184524 0.30156082 0.7676524 0.67332006 0.6540196 0.82366437 0.71006435] [ 0.40963453 0.15733229 0.39208266 0.10347656 0.5137739 0.60866815 0.6495908 0.65497476 0.6313499 0.6535007 0.9999999 0.9020388 0.6367414 0.6072844 0.663584 0.12127694 0.25134817 0.25293207 0.73634195 0.5772892 0.62240386 0.64408386 0.6138238 ] [ 0.33633476 0.12955561 0.42995125 0.07181363 0.5377501 0.6220866 0.64682376 0.65283024 0.6202539 0.65249085 0.9020388 0.99999976 0.6315836 0.59184194 0.6634756 0.14456005 0.32308656 0.23573536 0.74628246 0.60584724 0.6568146 0.6704532 0.58761406] [ 0.22104517 0.22767998 0.23284176 0.13317598 0.46368456 0.6317493 0.73420084 0.6856675 0.68574446 0.6805792 0.6367414 0.6315836 0.99999964 0.6742992 0.69698405 0.20063505 0.36605716 0.31794894 0.7858997 0.7262095 0.72442865 0.65613025 0.6048504 ] [ 0.34778708 0.3378724 0.16794202 0.30307364 0.4865601 0.8292406 0.87083554 0.9108255 0.70034957 0.6976787 0.6072844 0.59184194 0.6742992 0.9999998 0.921976 0.41307753 0.44932872 0.5125053 0.7899685 0.72043014 0.74549735 0.77743375 0.6186968 ] [ 0.36587605 0.34611446 0.18690154 0.2981696 0.5131484 0.84906816 0.8424083 0.91996145 0.68611443 0.69981444 0.663584 0.6634756 0.69698405 0.921976 1. 
0.4154992 0.49387312 0.48420984 0.8480766 0.7393793 0.75769067 0.79736936 0.6293061 ] [ 0.22627635 0.32869703 0.01417665 0.15747505 0.22219795 0.44656646 0.32802498 0.46891177 0.11566064 0.15908068 0.12127694 0.14456005 0.20063505 0.41307753 0.4154992 0.9999999 0.57290304 0.5062063 0.31199777 0.3153113 0.31177878 0.2287296 0.21594447] [ 0.266127 0.49409437 0.13144253 0.17933142 0.2901019 0.47578394 0.35740292 0.5280026 0.286282 0.28184524 0.25134817 0.32308656 0.36605716 0.44932872 0.49387312 0.57290304 1. 0.44014525 0.42907634 0.37666798 0.34029108 0.34790146 0.25012302] [ 0.26101875 0.3447796 0.05467898 0.17351006 0.304316 0.47372347 0.4724066 0.5446104 0.27340177 0.30156082 0.25293207 0.23573536 0.31794894 0.5125053 0.48420984 0.5062063 0.44014525 1. 0.3862613 0.38397783 0.35948735 0.34097832 0.29812837] [ 0.33185133 0.32340765 0.28496885 0.21629968 0.5473639 0.7553511 0.8095901 0.82764816 0.6876683 0.7676524 0.73634195 0.74628246 0.7858997 0.7899685 0.8480766 0.31199777 0.42907634 0.3862613 0.9999999 0.8385879 0.8389827 0.7703891 0.6249066 ] [ 0.25373524 0.25267404 0.17736961 0.19599286 0.52787566 0.7091935 0.73782444 0.71706855 0.60595053 0.67332006 0.5772892 0.60584724 0.7262095 0.72043014 0.7393793 0.3153113 0.37666798 0.38397783 0.8385879 1.0000002 0.871482 0.71733737 0.60900784] [ 0.25351807 0.20973432 0.27196634 0.17190611 0.56573045 0.7217746 0.76554215 0.7245451 0.61351115 0.6540196 0.62240386 0.6568146 0.72442865 0.74549735 0.75769067 0.31177878 0.34029108 0.35948735 0.8389827 0.871482 1. 
0.70722353 0.5995029 ] [ 0.30065644 0.28540358 0.19104493 0.32304174 0.4602453 0.6971116 0.7818186 0.7888931 0.78625137 0.82366437 0.64408386 0.6704532 0.65613025 0.77743375 0.79736936 0.2287296 0.34790146 0.34097832 0.7703891 0.71733737 0.70722353 1.0000001 0.74188215] [ 0.25055906 0.20290762 0.2503485 0.22202998 0.42896903 0.6058929 0.66194 0.65001005 0.63001305 0.71006435 0.6138238 0.58761406 0.6048504 0.6186968 0.6293061 0.21594447 0.25012302 0.29812837 0.6249066 0.60900784 0.5995029 0.74188215 1.0000002 ]]
Universal Sentence Encoder Large
As does USE Large:
[[0.99999964 0.4463734 0.40780455 0.39272273 0.54640687 0.42159796 0.40383247 0.49330556 0.33892357 0.44181323 0.66375697 0.58904314 0.3033753 0.24376011 0.35680312 0.25017816 0.16070944 0.20250306 0.4344272 0.26446712 0.37428427 0.24736184 0.2419214 ] [0.4463734 0.99999976 0.13150972 0.43378308 0.39179468 0.5360008 0.4892948 0.5851512 0.42892277 0.47727162 0.46758315 0.3981697 0.40011215 0.31285906 0.4257425 0.39509982 0.4277062 0.33632874 0.4532978 0.33495766 0.2957852 0.2686839 0.2731736 ] [0.40780455 0.13150972 0.9999998 0.07459432 0.4710376 0.21703565 0.2634836 0.14604716 0.21989065 0.27545786 0.3360262 0.48385108 0.17467687 0.19592585 0.19565064 0.17182067 0.06810345 0.11315325 0.3118395 0.35973096 0.30570543 0.21658538 0.24800998] [0.39272273 0.43378308 0.07459432 1. 0.38586536 0.32104293 0.37619966 0.46572506 0.33027148 0.39907175 0.36331552 0.28965247 0.3188285 0.24600121 0.34505802 0.21730635 0.15918246 0.26004243 0.37517104 0.23324412 0.2430815 0.5083943 0.47137177] [0.54640687 0.39179468 0.4710376 0.38586536 1. 0.45612186 0.49553412 0.45025688 0.41594368 0.48880318 0.52721 0.53866273 0.40830538 0.33525544 0.36525953 0.2984953 0.1906932 0.28638887 0.47537935 0.43295628 0.4220109 0.3173539 0.38958645] [0.42159796 0.5360008 0.21703565 0.32104293 0.45612186 0.9999999 0.8668403 0.81076896 0.7870767 0.815562 0.74399316 0.73492134 0.7033038 0.68924385 0.7558068 0.4818167 0.40124303 0.4752658 0.768189 0.6787615 0.65254766 0.5885624 0.56753606] [0.40383247 0.4892948 0.2634836 0.37619966 0.49553412 0.8668403 1. 
0.8281997 0.75234747 0.81141686 0.7428409 0.731025 0.7124259 0.6863092 0.7340695 0.4584852 0.33829212 0.46563548 0.800025 0.7065996 0.63295805 0.60991454 0.57745886] [0.49330556 0.5851512 0.14604716 0.46572506 0.45025688 0.81076896 0.8281997 0.99999976 0.7396761 0.8096036 0.8139217 0.6960589 0.7099719 0.6150996 0.7837118 0.50997704 0.4486102 0.5130646 0.8140043 0.61254114 0.6103731 0.60054433 0.54025763] [0.33892357 0.42892277 0.21989065 0.33027148 0.41594368 0.7870767 0.75234747 0.7396761 0.9999999 0.8636663 0.66701746 0.6744478 0.7671553 0.6668061 0.7140342 0.34408593 0.2549841 0.3625769 0.73098934 0.6709546 0.5805844 0.7031447 0.62083685] [0.44181323 0.47727162 0.27545786 0.39907175 0.48880318 0.815562 0.81141686 0.8096036 0.8636663 1.0000001 0.7646835 0.7412759 0.73157585 0.61813796 0.69842416 0.37736872 0.29062653 0.38779742 0.79557705 0.6651474 0.60484505 0.6599877 0.64470744] [0.66375697 0.46758315 0.3360262 0.36331552 0.52721 0.74399316 0.7428409 0.8139217 0.66701746 0.7646835 0.9999999 0.8817135 0.6054075 0.47325373 0.62679315 0.38927385 0.27118498 0.29913983 0.74013 0.5383884 0.6460553 0.5310042 0.5503017 ] [0.58904314 0.3981697 0.48385108 0.28965247 0.53866273 0.73492134 0.731025 0.6960589 0.6744478 0.7412759 0.8817135 0.99999964 0.5963837 0.521129 0.6191448 0.3703172 0.23301551 0.27503797 0.7307216 0.62098396 0.6618993 0.5887923 0.57531524] [0.3033753 0.40011215 0.17467687 0.3188285 0.40830538 0.7033038 0.7124259 0.7099719 0.7671553 0.73157585 0.6054075 0.5963837 1. 
0.6430376 0.71542144 0.3090133 0.26042166 0.33218506 0.713047 0.6479206 0.58321255 0.6288029 0.5283266 ] [0.24376011 0.31285906 0.19592585 0.24600121 0.33525544 0.68924385 0.6863092 0.6150996 0.6668061 0.61813796 0.47325373 0.521129 0.6430376 0.99999994 0.8912434 0.38154072 0.31259578 0.4611349 0.7280953 0.78523076 0.5885006 0.66748023 0.4153235 ] [0.35680312 0.4257425 0.19565064 0.34505802 0.36525953 0.7558068 0.7340695 0.7837118 0.7140342 0.69842416 0.62679315 0.6191448 0.71542144 0.8912434 0.99999994 0.4702752 0.40132666 0.48747897 0.84633315 0.806998 0.6605449 0.7328238 0.47240683] [0.25017816 0.39509982 0.17182067 0.21730635 0.2984953 0.4818167 0.4584852 0.50997704 0.34408593 0.37736872 0.38927385 0.3703172 0.3090133 0.38154072 0.4702752 0.99999964 0.6851031 0.4433123 0.424443 0.38154015 0.3360422 0.25401583 0.3055519 ] [0.16070944 0.4277062 0.06810345 0.15918246 0.1906932 0.40124303 0.33829212 0.4486102 0.2549841 0.29062653 0.27118498 0.23301551 0.26042166 0.31259578 0.40132666 0.6851031 1. 0.41160944 0.3165993 0.28578848 0.21253854 0.19239895 0.21984509] [0.20250306 0.33632874 0.11315325 0.26004243 0.28638887 0.4752658 0.46563548 0.5130646 0.3625769 0.38779742 0.29913983 0.27503797 0.33218506 0.4611349 0.48747897 0.4433123 0.41160944 1.0000002 0.34750068 0.31796148 0.25111696 0.28571406 0.31290707] [0.4344272 0.4532978 0.3118395 0.37517104 0.47537935 0.768189 0.800025 0.8140043 0.73098934 0.79557705 0.74013 0.7307216 0.713047 0.7280953 0.84633315 0.424443 0.3165993 0.34750068 1.0000001 0.8474816 0.7723906 0.69467723 0.4835025 ] [0.26446712 0.33495766 0.35973096 0.23324412 0.43295628 0.6787615 0.7065996 0.61254114 0.6709546 0.6651474 0.5383884 0.62098396 0.6479206 0.78523076 0.806998 0.38154015 0.28578848 0.31796148 0.8474816 1. 
0.7517723 0.7049886 0.4890275 ] [0.37428427 0.2957852 0.30570543 0.2430815 0.4220109 0.65254766 0.63295805 0.6103731 0.5805844 0.60484505 0.6460553 0.6618993 0.58321255 0.5885006 0.6605449 0.3360422 0.21253854 0.25111696 0.7723906 0.7517723 1. 0.608629 0.40640068] [0.24736184 0.2686839 0.21658538 0.5083943 0.3173539 0.5885624 0.60991454 0.60054433 0.7031447 0.6599877 0.5310042 0.5887923 0.6288029 0.66748023 0.7328238 0.25401583 0.19239895 0.28571406 0.69467723 0.7049886 0.608629 0.9999999 0.71072406] [0.2419214 0.2731736 0.24800998 0.47137177 0.38958645 0.56753606 0.57745886 0.54025763 0.62083685 0.64470744 0.5503017 0.57531524 0.5283266 0.4153235 0.47240683 0.3055519 0.21984509 0.31290707 0.4835025 0.4890275 0.40640068 0.71072406 1. ]]
Universal Sentence Encoder Multilingual
USE Multilingual produces more of a single supercluster that merges the majority of the passages, but similarly scatters the short queries as isolates. As with the other models, text length appears to strongly influence the clustering:
[[0.99999964 0.3821517 0.45024776 0.38651383 0.553162 0.25524923 0.22102883 0.32575917 0.21365227 0.2170289 0.5318121 0.45615244 0.19694312 0.254474 0.326201 0.16801234 0.07011999 0.10948925 0.3408941 0.28637022 0.2770556 0.23285808 0.21739739] [0.3821517 0.9999999 0.24043909 0.4355793 0.3405605 0.3540783 0.3908475 0.40531445 0.2881738 0.2705875 0.21729288 0.24646911 0.3095141 0.3522442 0.3748135 0.4062819 0.3392794 0.20535107 0.4122028 0.35206956 0.31613356 0.31157243 0.20463523] [0.45024776 0.24043909 1. 0.01902577 0.5327867 0.17742856 0.20062669 0.20166083 0.28080076 0.2806768 0.44412607 0.49324393 0.31574422 0.19009343 0.23423089 0.11139169 0.05079632 0.02698344 0.35994992 0.3152632 0.31712604 0.24411768 0.23967221] [0.38651383 0.4355793 0.01902577 0.9999997 0.27775824 0.12822564 0.22574642 0.21662277 0.1552619 0.11788428 0.08494937 0.07213853 0.11911805 0.18822274 0.18384236 0.1552047 0.0665822 0.14660682 0.18368438 0.14223395 0.11446094 0.19777958 0.11455577] [0.553162 0.3405605 0.5327867 0.27775824 0.99999994 0.32087538 0.35851872 0.35436466 0.35799062 0.33294767 0.43064463 0.45154607 0.32704085 0.35085854 0.3779031 0.1452479 0.085164 0.16215767 0.43322062 0.4054423 0.43351364 0.29343185 0.28552458] [0.25524923 0.3540783 0.17742856 0.12822564 0.32087538 1.0000002 0.8303269 0.75472164 0.58807284 0.65173334 0.49487692 0.4828142 0.54401666 0.80204177 0.7983935 0.43030447 0.39958602 0.44617122 0.72013354 0.69145364 0.6657778 0.58631 0.48920587] [0.22102883 0.3908475 0.20062669 0.22574642 0.35851872 0.8303269 0.99999946 0.7793808 0.63699806 0.6220852 0.45033464 0.4767918 0.63617873 0.83685505 0.77367234 0.46951148 0.38395882 0.48134077 0.7709144 0.6961074 0.6672975 0.6229021 0.4815659 ] [0.32575917 0.40531445 0.20166083 0.21662277 0.35436466 0.75472164 0.7793808 1. 
0.65858257 0.63476354 0.513004 0.48222333 0.6229097 0.9187617 0.9377219 0.61136496 0.5528867 0.5679225 0.7626149 0.67726946 0.64472234 0.669981 0.54021424] [0.21365227 0.2881738 0.28080076 0.1552619 0.35799062 0.58807284 0.63699806 0.65858257 0.9999997 0.86684287 0.5215328 0.52702713 0.674093 0.6721638 0.67061234 0.30187237 0.26458496 0.29406923 0.71306384 0.69233435 0.64148486 0.7925354 0.6437404 ] [0.2170289 0.2705875 0.2806768 0.11788428 0.33294767 0.65173334 0.6220852 0.63476354 0.86684287 0.9999999 0.5587348 0.5677372 0.6638037 0.63127005 0.6437777 0.3136477 0.24749415 0.2292979 0.7061344 0.6541914 0.6154669 0.73056895 0.7364607 ] [0.5318121 0.21729288 0.44412607 0.08494937 0.43064463 0.49487692 0.45033464 0.513004 0.5215328 0.5587348 0.99999976 0.91581064 0.48511127 0.4543122 0.5634884 0.16346578 0.06950188 0.11009818 0.65729415 0.5938399 0.58826125 0.5951812 0.5639472 ] [0.45615244 0.24646911 0.49324393 0.07213853 0.45154607 0.4828142 0.4767918 0.48222333 0.52702713 0.5677372 0.91581064 0.9999997 0.50654185 0.43876058 0.54340917 0.17996837 0.10958616 0.08585687 0.6776138 0.60551715 0.58918977 0.60785043 0.5490624 ] [0.19694312 0.3095141 0.31574422 0.11911805 0.32704085 0.54401666 0.63617873 0.6229097 0.674093 0.6638037 0.48511127 0.50654185 0.99999994 0.6278553 0.6279174 0.3420914 0.3241136 0.22122341 0.727782 0.6925658 0.63957334 0.5744941 0.5597247 ] [0.254474 0.3522442 0.19009343 0.18822274 0.35085854 0.80204177 0.83685505 0.9187617 0.6721638 0.63127005 0.4543122 0.43876058 0.6278553 0.9999995 0.90714383 0.5242987 0.46965027 0.5120147 0.7665255 0.7125038 0.6851874 0.6719861 0.49727353] [0.326201 0.3748135 0.23423089 0.18384236 0.3779031 0.7983935 0.77367234 0.9377219 0.67061234 0.6437777 0.5634884 0.54340917 0.6279174 0.90714383 0.9999999 0.5596044 0.48495743 0.50928724 0.8035759 0.7378328 0.69840944 0.6916361 0.5437863 ] [0.16801234 0.4062819 0.11139169 0.1552047 0.1452479 0.43030447 0.46951148 0.61136496 0.30187237 0.3136477 0.16346578 0.17996837 
0.3420914 0.5242987 0.5596044 0.99999976 0.73053145 0.57301915 0.41159713 0.3456623 0.31934023 0.32307374 0.35096157] [0.07011999 0.3392794 0.05079632 0.0665822 0.085164 0.39958602 0.38395882 0.5528867 0.26458496 0.24749415 0.06950188 0.10958616 0.3241136 0.46965027 0.48495743 0.73053145 0.9999998 0.56954414 0.29346105 0.25837082 0.23032826 0.23102689 0.2140174 ] [0.10948925 0.20535107 0.02698344 0.14660682 0.16215767 0.44617122 0.48134077 0.5679225 0.29406923 0.2292979 0.11009818 0.08585687 0.22122341 0.5120147 0.50928724 0.57301915 0.56954414 1.0000005 0.27308065 0.23927358 0.23470189 0.24608383 0.2845137 ] [0.3408941 0.4122028 0.35994992 0.18368438 0.43322062 0.72013354 0.7709144 0.7626149 0.71306384 0.7061344 0.65729415 0.6776138 0.727782 0.7665255 0.8035759 0.41159713 0.29346105 0.27308065 0.99999976 0.9023773 0.8714614 0.7489587 0.5762777 ] [0.28637022 0.35206956 0.3152632 0.14223395 0.4054423 0.69145364 0.6961074 0.67726946 0.69233435 0.6541914 0.5938399 0.60551715 0.6925658 0.7125038 0.7378328 0.3456623 0.25837082 0.23927358 0.9023773 1. 0.9238929 0.73693407 0.5500272 ] [0.2770556 0.31613356 0.31712604 0.11446094 0.43351364 0.6657778 0.6672975 0.64472234 0.64148486 0.6154669 0.58826125 0.58918977 0.63957334 0.6851874 0.69840944 0.31934023 0.23032826 0.23470189 0.8714614 0.9238929 1.0000002 0.6809622 0.5252785 ] [0.23285808 0.31157243 0.24411768 0.19777958 0.29343185 0.58631 0.6229021 0.669981 0.7925354 0.73056895 0.5951812 0.60785043 0.5744941 0.6719861 0.6916361 0.32307374 0.23102689 0.24608383 0.7489587 0.73693407 0.6809622 1.0000001 0.65499306] [0.21739739 0.20463523 0.23967221 0.11455577 0.28552458 0.48920587 0.4815659 0.54021424 0.6437404 0.7364607 0.5639472 0.5490624 0.5597247 0.49727353 0.5437863 0.35096157 0.2140174 0.2845137 0.5762777 0.5500272 0.5252785 0.65499306 1.0000006 ]]
Universal Sentence Encoder Multilingual Large
The Large edition produces more diffuse clustering:
[[1.0000002 0.4783839 0.4817693 0.4712466 0.56261563 0.4116705 0.40338933 0.46088076 0.35040337 0.4749282 0.61163485 0.546823 0.30725718 0.26662785 0.35235658 0.16562518 0.07371345 0.15381053 0.44608092 0.36155173 0.33707517 0.23658532 0.24728313] [0.4783839 0.9999999 0.2311587 0.4041643 0.3660168 0.5736789 0.48349714 0.5470454 0.4313576 0.46703255 0.46911022 0.41964296 0.40888625 0.44780275 0.4704321 0.3938298 0.38047403 0.33135396 0.489581 0.4608932 0.4410277 0.35365957 0.28407016] [0.4817693 0.2311587 0.9999999 0.04651021 0.53833485 0.17333731 0.27813837 0.1388495 0.30594873 0.2869386 0.28197563 0.47594422 0.20708692 0.19444405 0.18600476 0.07511519 0.04178877 0.04330425 0.27745432 0.26156303 0.32679734 0.16453072 0.22371885] [0.4712466 0.4041643 0.04651021 0.99999976 0.32022828 0.27294943 0.31266963 0.3384179 0.2562589 0.38402516 0.32859105 0.17511822 0.23598936 0.15053937 0.19217007 0.17530839 0.0858243 0.17200087 0.3265208 0.20740175 0.1752184 0.4001314 0.29071417] [0.56261563 0.3660168 0.53833485 0.32022828 0.99999976 0.37502614 0.4289709 0.37250555 0.41165692 0.47287083 0.4540993 0.4724759 0.32095453 0.30697575 0.3378609 0.15670381 0.09087186 0.1926041 0.4652437 0.36346573 0.4284413 0.24400716 0.32350877] [0.4116705 0.5736789 0.17333731 0.27294943 0.37502614 0.9999999 0.87383664 0.8503391 0.7433723 0.7724922 0.80353296 0.71735084 0.7462467 0.73341656 0.7673867 0.3651783 0.30276972 0.28646082 0.76877654 0.7056 0.68174595 0.5849103 0.5471228 ] [0.40338933 0.48349714 0.27813837 0.31266963 0.4289709 0.87383664 0.9999999 0.80270296 0.7709918 0.8324299 0.74645853 0.712396 0.8008114 0.70363784 0.7265636 0.3130262 0.23378868 0.28001338 0.82145035 0.72943866 0.69478923 0.6064786 0.54108363] [0.46088076 0.5470454 0.1388495 0.3384179 0.37250555 0.8503391 0.80270296 1. 
0.71809113 0.77477235 0.82438564 0.6812904 0.7029244 0.74632025 0.86573315 0.45157665 0.39822814 0.40490943 0.81969887 0.75476825 0.7098756 0.5767327 0.54484475] [0.35040337 0.4313576 0.30594873 0.2562589 0.41165692 0.7433723 0.7709918 0.71809113 1.0000002 0.7815105 0.66829133 0.7019276 0.7710363 0.7496435 0.75822973 0.34063497 0.31060755 0.2699337 0.7875068 0.7532947 0.7318125 0.69354236 0.63263535] [0.4749282 0.46703255 0.2869386 0.38402516 0.47287083 0.7724922 0.8324299 0.77477235 0.7815105 0.99999976 0.7600659 0.6505023 0.69295573 0.5883616 0.6390996 0.2589752 0.22119242 0.2266622 0.8724171 0.7376735 0.69686186 0.5770696 0.51205754] [0.61163485 0.46911022 0.28197563 0.32859105 0.4540993 0.80353296 0.74645853 0.82438564 0.66829133 0.7600659 0.99999976 0.8516253 0.62686384 0.61572003 0.7006409 0.28873855 0.20147215 0.19993459 0.78736645 0.71506274 0.70505404 0.55063915 0.50215733] [0.546823 0.41964296 0.47594422 0.17511822 0.4724759 0.71735084 0.712396 0.6812904 0.7019276 0.6505023 0.8516253 0.99999976 0.6307117 0.6579759 0.6916436 0.28804064 0.20631424 0.14361782 0.70875734 0.68097925 0.69807595 0.5873989 0.52978307] [0.30725718 0.40888625 0.20708692 0.23598936 0.32095453 0.7462467 0.8008114 0.7029244 0.7710363 0.69295573 0.62686384 0.6307117 0.9999995 0.68220913 0.7247564 0.26835525 0.24240106 0.1963816 0.73443174 0.7380682 0.62939537 0.5801154 0.52344584] [0.26662785 0.44780275 0.19444405 0.15053937 0.30697575 0.73341656 0.70363784 0.74632025 0.7496435 0.5883616 0.61572003 0.6579759 0.68220913 1.0000001 0.90279484 0.3604626 0.2835781 0.27875248 0.74477637 0.8193475 0.7528708 0.6928923 0.49901915] [0.35235658 0.4704321 0.18600476 0.19217007 0.3378609 0.7673867 0.7265636 0.86573315 0.75822973 0.6390996 0.7006409 0.6916436 0.7247564 0.90279484 1.0000001 0.43648702 0.36721593 0.34458762 0.8019166 0.8483258 0.76164675 0.6667704 0.5525642 ] [0.16562518 0.3938298 0.07511519 0.17530839 0.15670381 0.3651783 0.3130262 0.45157665 0.34063497 0.2589752 0.28873855 
0.28804064 0.26835525 0.3604626 0.43648702 0.99999976 0.76353884 0.5477669 0.29286838 0.2940613 0.32247275 0.24737266 0.40067172] [0.07371345 0.38047403 0.04178877 0.0858243 0.09087186 0.30276972 0.23378868 0.39822814 0.31060755 0.22119242 0.20147215 0.20631424 0.24240106 0.2835781 0.36721593 0.76353884 0.99999976 0.54852295 0.23448858 0.22647385 0.23244977 0.17486073 0.31736386] [0.15381053 0.33135396 0.04330425 0.17200087 0.1926041 0.28646082 0.28001338 0.40490943 0.2699337 0.2266622 0.19993459 0.14361782 0.1963816 0.27875248 0.34458762 0.5477669 0.54852295 1.0000004 0.23438352 0.19086182 0.22459163 0.12922868 0.27695698] [0.44608092 0.489581 0.27745432 0.3265208 0.4652437 0.76877654 0.82145035 0.81969887 0.7875068 0.8724171 0.78736645 0.70875734 0.73443174 0.74477637 0.8019166 0.29286838 0.23448858 0.23438352 0.99999994 0.8994067 0.8242731 0.62944233 0.49624443] [0.36155173 0.4608932 0.26156303 0.20740175 0.36346573 0.7056 0.72943866 0.75476825 0.7532947 0.7376735 0.71506274 0.68097925 0.7380682 0.8193475 0.8483258 0.2940613 0.22647385 0.19086182 0.8994067 1.0000001 0.84492636 0.65656805 0.48804983] [0.33707517 0.4410277 0.32679734 0.1752184 0.4284413 0.68174595 0.69478923 0.7098756 0.7318125 0.69686186 0.70505404 0.69807595 0.62939537 0.7528708 0.76164675 0.32247275 0.23244977 0.22459163 0.8242731 0.84492636 1.0000001 0.61311233 0.5159383 ] [0.23658532 0.35365957 0.16453072 0.4001314 0.24400716 0.5849103 0.6064786 0.5767327 0.69354236 0.5770696 0.55063915 0.5873989 0.5801154 0.6928923 0.6667704 0.24737266 0.17486073 0.12922868 0.62944233 0.65656805 0.61311233 0.99999994 0.71614647] [0.24728313 0.28407016 0.22371885 0.29071417 0.32350877 0.5471228 0.54108363 0.54484475 0.63263535 0.51205754 0.50215733 0.52978307 0.52344584 0.49901915 0.5525642 0.40067172 0.31736386 0.27695698 0.49624443 0.48804983 0.5159383 0.71614647 1.0000004 ]]
LaBSE
In keeping with its optimization for translation bitexts, LaBSE strongly clusters based on word similarity:
[[1. 0.67247397 0.6873561 0.6621635 0.6597442 0.29197532 0.26837116 0.28702286 0.3772807 0.4072991 0.4662396 0.4268237 0.35115847 0.17442884 0.22643831 0.33111128 0.2689853 0.29177403 0.34960458 0.29355702 0.28749877 0.29163188 0.38512015] [0.67247397 0.9999999 0.5280295 0.5600559 0.47179648 0.37584177 0.33578432 0.35003945 0.3725159 0.40069604 0.33550632 0.31760746 0.35866022 0.27721694 0.29502714 0.41662568 0.43769234 0.4417482 0.35632822 0.28514242 0.30108005 0.28063238 0.3619454 ] [0.6873561 0.5280295 1.0000004 0.40815887 0.6120821 0.2809708 0.32867056 0.27958068 0.32400626 0.3810162 0.38754946 0.40008414 0.35401088 0.20015356 0.23462072 0.25702503 0.2520841 0.2553283 0.3668958 0.3719799 0.3524964 0.29915106 0.3745644 ] [0.6621635 0.5600559 0.40815887 1.0000002 0.55517906 0.2235544 0.24144243 0.26740825 0.26580682 0.258268 0.2549967 0.26239556 0.22444837 0.17244136 0.19851753 0.33246714 0.2996694 0.3349889 0.25105566 0.19002895 0.20106032 0.323749 0.40617222] [0.6597442 0.47179648 0.6120821 0.55517906 1. 0.43009108 0.44774318 0.41429332 0.45800772 0.46886605 0.48694855 0.48205328 0.45127392 0.33100677 0.35423723 0.2876352 0.31251812 0.3531081 0.4869077 0.44276902 0.45618027 0.43196487 0.45425522] [0.29197532 0.37584177 0.2809708 0.2235544 0.43009108 1. 
0.92051774 0.8325955 0.6902112 0.69353867 0.7621333 0.7393694 0.70059854 0.8201953 0.8681854 0.4575057 0.46683943 0.5588279 0.82445294 0.8092078 0.81156147 0.69890165 0.6120151 ] [0.26837116 0.33578432 0.32867056 0.24144243 0.44774318 0.92051774 0.9999999 0.86281335 0.6479807 0.6759499 0.73296213 0.72299594 0.71183556 0.8420916 0.8766359 0.47542226 0.4873417 0.5717679 0.831669 0.8192879 0.8088148 0.7118497 0.6137197 ] [0.28702286 0.35003945 0.27958068 0.26740825 0.41429332 0.8325955 0.86281335 0.99999964 0.7118065 0.66114825 0.664857 0.68127817 0.71772057 0.89279467 0.92441154 0.58793414 0.6718056 0.63145614 0.80328256 0.74737513 0.7518481 0.7228593 0.67589414] [0.3772807 0.3725159 0.32400626 0.26580682 0.45800772 0.6902112 0.6479807 0.7118065 1.0000001 0.876067 0.6609905 0.65553004 0.7912638 0.6748695 0.6998443 0.33468094 0.39594185 0.42542744 0.7381099 0.68476725 0.69852155 0.75673175 0.67395115] [0.4072991 0.40069604 0.3810162 0.258268 0.46886605 0.69353867 0.6759499 0.66114825 0.876067 1.0000001 0.71433574 0.70347977 0.7584033 0.6441624 0.65338415 0.3353158 0.3797719 0.40038526 0.75342995 0.6838908 0.6962477 0.72893083 0.68989235] [0.4662396 0.33550632 0.38754946 0.2549967 0.48694855 0.7621333 0.73296213 0.664857 0.6609905 0.71433574 0.99999994 0.9415981 0.6822444 0.6205362 0.6837848 0.34632504 0.3519457 0.33957145 0.8558923 0.80966187 0.79632115 0.7157341 0.67957157] [0.4268237 0.31760746 0.40008414 0.26239556 0.48205328 0.7393694 0.72299594 0.68127817 0.65553004 0.70347977 0.9415981 1. 0.67988056 0.64370406 0.6978285 0.3384518 0.3788278 0.3153078 0.8522828 0.79914284 0.7909465 0.72824407 0.65426624] [0.35115847 0.35866022 0.35401088 0.22444837 0.45127392 0.70059854 0.71183556 0.71772057 0.7912638 0.7584033 0.6822444 0.67988056 1. 
0.69830614 0.7134171 0.34626567 0.43513077 0.37541094 0.80245 0.7354835 0.7711449 0.74018645 0.64318216] [0.17442884 0.27721694 0.20015356 0.17244136 0.33100677 0.8201953 0.8420916 0.89279467 0.6748695 0.6441624 0.6205362 0.64370406 0.69830614 1. 0.9494139 0.4480049 0.5176003 0.47589067 0.8013717 0.7905541 0.79160196 0.7365806 0.55824775] [0.22643831 0.29502714 0.23462072 0.19851753 0.35423723 0.8681854 0.8766359 0.92441154 0.6998443 0.65338415 0.6837848 0.6978285 0.7134171 0.9494139 1. 0.47608182 0.5187434 0.48342943 0.83952564 0.8239398 0.8156655 0.75518864 0.59889096] [0.33111128 0.41662568 0.25702503 0.33246714 0.2876352 0.4575057 0.47542226 0.58793414 0.33468094 0.3353158 0.34632504 0.3384518 0.34626567 0.4480049 0.47608182 1. 0.70474404 0.6701243 0.3767983 0.34282792 0.35157815 0.30337715 0.45343128] [0.2689853 0.43769234 0.2520841 0.2996694 0.31251812 0.46683943 0.4873417 0.6718056 0.39594185 0.3797719 0.3519457 0.3788278 0.43513077 0.5176003 0.5187434 0.70474404 1. 0.6455714 0.41482013 0.35530913 0.38808122 0.35263947 0.45238963] [0.29177403 0.4417482 0.2553283 0.3349889 0.3531081 0.5588279 0.5717679 0.63145614 0.42542744 0.40038526 0.33957145 0.3153078 0.37541094 0.47589067 0.48342943 0.6701243 0.6455714 0.99999976 0.36328647 0.3244062 0.3284732 0.3004976 0.46657455] [0.34960458 0.35632822 0.3668958 0.25105566 0.4869077 0.82445294 0.831669 0.80328256 0.7381099 0.75342995 0.8558923 0.8522828 0.80245 0.8013717 0.83952564 0.3767983 0.41482013 0.36328647 1.0000001 0.9269943 0.9345279 0.8155589 0.6670839 ] [0.29355702 0.28514242 0.3719799 0.19002895 0.44276902 0.8092078 0.8192879 0.74737513 0.68476725 0.6838908 0.80966187 0.79914284 0.7354835 0.7905541 0.8239398 0.34282792 0.35530913 0.3244062 0.9269943 1.0000001 0.9567478 0.780187 0.6026601 ] [0.28749877 0.30108005 0.3524964 0.20106032 0.45618027 0.81156147 0.8088148 0.7518481 0.69852155 0.6962477 0.79632115 0.7909465 0.7711449 0.79160196 0.8156655 0.35157815 0.38808122 0.3284732 0.9345279 0.9567478 0.99999994 
0.7745755 0.6209729 ] [0.29163188 0.28063238 0.29915106 0.323749 0.43196487 0.69890165 0.7118497 0.7228593 0.75673175 0.72893083 0.7157341 0.72824407 0.74018645 0.7365806 0.75518864 0.30337715 0.35263947 0.3004976 0.8155589 0.780187 0.7745755 1. 0.77228796] [0.38512015 0.3619454 0.3745644 0.40617222 0.45425522 0.6120151 0.6137197 0.67589414 0.67395115 0.68989235 0.67957157 0.65426624 0.64318216 0.55824775 0.59889096 0.45343128 0.45238963 0.46657455 0.6670839 0.6026601 0.6209729 0.77228796 1. ]]
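For readers who want to produce matrices of this shape for their own passages, here is a minimal sketch. It assumes the matrices above are pairwise cosine similarities between passage embeddings, which the near-1 diagonals suggest but which is an assumption here. The function name cosine_similarity_matrix and the toy vectors are hypothetical stand-ins rather than the code used to produce the results above.

```python
import numpy as np

def cosine_similarity_matrix(embeds):
    """Pairwise cosine similarity for a (num_passages, dim) array of embeddings."""
    embeds = np.asarray(embeds, dtype=np.float64)
    # normalize each row to unit length so the dot product equals the cosine
    norms = np.linalg.norm(embeds, axis=1, keepdims=True)
    unit = embeds / norms
    return unit @ unit.T

# hypothetical toy embeddings standing in for real model output
toy_embeds = np.array([[1.0, 0.0],
                       [1.0, 1.0],
                       [0.0, 2.0]])
sim = cosine_similarity_matrix(toy_embeds)
print(np.round(sim, 4))
```

Row i, column j is the similarity between passage i and passage j, so the matrix is symmetric with a diagonal of 1. Diagonal values such as 0.9999997 or 1.0000002 in the matrices above are what one would expect from single-precision rounding rather than from any real difference.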
Updated PCA Visualizations
Because one of our hypotheses concerns input length, which is hard to judge from the truncated text in the graphs, we've updated our 2D and 3D PCA visualizers to prefix each truncated sentence with its ID number (zero-based) and its length in characters. We've also slightly adjusted the knee-point epsilon optimizer for DBSCAN to yield better clustering and added code to plot the k-distance graph so the epsilon calculation can be checked visually.
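To make the knee-point idea concrete, here is a small standalone sketch of the heuristic on toy data. It is an illustration under stated assumptions, not the visualizer code itself: it swaps scikit-learn's NearestNeighbors for a brute-force NumPy distance computation so that it has no other dependencies, and the function name knee_epsilon and the toy points are hypothetical.

```python
import numpy as np

def knee_epsilon(points):
    """Pick a DBSCAN epsilon at the largest jump in sorted nearest-neighbor distances."""
    points = np.asarray(points, dtype=np.float64)
    # brute-force pairwise Euclidean distances (assumption: small datasets only)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # ignore each point's zero distance to itself
    np.fill_diagonal(dists, np.inf)
    # distance from each point to its nearest other point, sorted ascending
    k_distances = np.sort(dists.min(axis=1))
    # the knee is taken as the position just before the largest jump
    knee_index = int(np.argmax(np.diff(k_distances)))
    return k_distances[knee_index], k_distances

# hypothetical toy data: four tightly grouped points and one distant outlier
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
epsilon, k_distances = knee_epsilon(toy_points)
print("Optimal Epsilon: ", epsilon)
```

The four grouped points each sit 1.0 from their nearest neighbor while the outlier is roughly 12.7 from its nearest, so the largest jump falls just before the outlier and epsilon is taken from the tight group. Adding 1 to the knee index would instead pick the distance after the jump, giving a far more permissive epsilon.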
The 2D Visualizer:
from sklearn.decomposition import PCA
import plotly.graph_objs as graph
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt #visualize kneeplot if desired

def embedPCAVisual2D(embeds, graphtitle):
    #collapse to 2D via PCA...
    pca = PCA(n_components=2)
    embeds_pca = pca.fit_transform(embeds)
    #print(embeds_pca.tolist())

    #compute the optimal epsilon value for DBSCAN
    #compute the k-distance graph
    neigh = NearestNeighbors(n_neighbors=2) #adjust to the minimum number of points required for a cluster
    neigh.fit(embeds)
    distances, _ = neigh.kneighbors(embeds)
    k_distances = np.sort(distances[:, 1])
    plt.plot(k_distances) #visualize kneeplot if desired

    #compute the knee point
    differences = np.diff(k_distances)
    knee_index = np.argmax(differences) + 0 #or +1
    #epsilon is the distance at knee point
    epsilon = k_distances[knee_index]
    print("Optimal Epsilon: ", epsilon)

    #cluster via DBSCAN
    cluster = DBSCAN(metric="euclidean", n_jobs=-1, eps=epsilon, min_samples=2)
    cluster.fit(embeds)
    print(cluster.labels_)

    trace = graph.Scatter(
        x=embeds_pca[:, 0],
        y=embeds_pca[:, 1],
        marker=dict(
            size=7,
            #color=np.arange(len(embeds_pca)), #color randomly
            color=cluster.labels_, #color via DBSCAN clusters
            colorscale='Rainbow',
            opacity=0.8
        ),
        #text = [sentence[:75] for sentence in sentences], #normal truncation...
        #text = ['(' + str(len(sentence)) + ')' + sentence[:75] for sentence in sentences], #add length before truncation
        text = ['(' + str(idx) + ')' + '(' + str(len(sentence)) + ')' + sentence[:75] for idx, sentence in enumerate(sentences)], #display offset, length and truncation
        hoverinfo='text',
        mode='markers+text',
        textposition='bottom right'
    )

    layout = graph.Layout(
        title=graphtitle,
        scene=dict(
            xaxis=dict(title=''),
            yaxis=dict(title=''),
        ),
        height=800,
        plot_bgcolor='rgba(0,0,0,0)'
    )

    fig = graph.Figure(data=[trace], layout=layout)
    fig.update_layout(hovermode='closest', hoverlabel=dict(bgcolor="white", font_size=12))
    #fig.update_xaxes(showline=True, linewidth=2, linecolor='lightgrey', gridcolor='lightgrey')
    #fig.update_yaxes(showline=True, linewidth=2, linecolor='lightgrey', gridcolor='lightgrey')
    fig.show()

    #override the max height of the cell to fully display the graph
    from IPython.display import Javascript
    display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 1000})'''))
The 3D Visualizer:
from sklearn.decomposition import PCA
import plotly.graph_objs as graph
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt #visualize kneeplot if desired

def embedPCAVisual3D(embeds, graphtitle):
    #collapse to 3D via PCA...
    pca = PCA(n_components=3)
    embeds_pca = pca.fit_transform(embeds)
    #print(embeds_pca.tolist())

    #compute the optimal epsilon value for DBSCAN
    #compute the k-distance graph
    neigh = NearestNeighbors(n_neighbors=2) #adjust to the minimum number of points required for a cluster
    neigh.fit(embeds)
    distances, _ = neigh.kneighbors(embeds)
    k_distances = np.sort(distances[:, -1])
    #plt.plot(k_distances) #visualize kneeplot if desired

    #compute the knee point
    differences = np.diff(k_distances)
    knee_index = np.argmax(differences) + 0 #or +1
    #epsilon is the distance at knee point
    epsilon = k_distances[knee_index]
    print("Optimal Epsilon: ", epsilon)

    #cluster via DBSCAN
    cluster = DBSCAN(metric="euclidean", n_jobs=-1, eps=epsilon, min_samples=2)
    cluster.fit(embeds)
    print(cluster.labels_)

    trace = graph.Scatter3d(
        x=embeds_pca[:, 0],
        y=embeds_pca[:, 1],
        z=embeds_pca[:, 2],
        marker=dict(
            size=5,
            #color=np.arange(len(embeds_pca)), #color randomly
            color=cluster.labels_, #color via DBSCAN clusters
            colorscale='Rainbow', #Viridis
            opacity=0.8
        ),
        #text = [sentence[:75] for sentence in sentences], #normal truncation...
        #text = ['(' + str(len(sentence)) + ')' + sentence[:75] for sentence in sentences], #add length before truncation
        text = ['(' + str(idx) + ')' + '(' + str(len(sentence)) + ')' + sentence[:75] for idx, sentence in enumerate(sentences)], #display offset, length and truncation
        hoverinfo='text',
        mode='markers+text'
    )

    layout = graph.Layout(
        title=graphtitle,
        scene=dict(
            xaxis=dict(title=''),
            yaxis=dict(title=''),
            zaxis=dict(title=''),
        ),
        height=1200
    )

    fig = graph.Figure(data=[trace], layout=layout)
    fig.update_layout(hovermode='closest', hoverlabel=dict(bgcolor="white", font_size=12))
    fig.show()

    #override the max height of the cell to fully display the graph
    from IPython.display import Javascript
    display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 1200})'''))