2 Comments

Thank you for the question. It gives me the opportunity to go into more technical detail, which I avoided to keep the post accessible.

My blog pages are already divided into chunks by the separator string "-+-+-+-+". This is deliberate: my aim in this exercise is to identify the best way of organising the source text to get 100% reliable LLM performance.
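
For concreteness, the chunking step is nothing more than splitting each page's text on that separator string. A minimal sketch (how the page text is fetched and cleaned is not shown here):

```python
# Minimal sketch of the chunking step, assuming the page text is already
# available as a string; the fetch and clean-up steps are not shown.
SEPARATOR = "-+-+-+-+"

def split_into_chunks(page_text: str) -> list[str]:
    """Split one page's text into chunks on the separator string."""
    return [chunk.strip() for chunk in page_text.split(SEPARATOR) if chunk.strip()]
```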

My 57 web page addresses are split into 411 chunks of text using that separator string, and the similarity search is performed over those 411 chunks. I started with nomic because it is very quick (about 1 minute, compared to 25 minutes for the Llama embedder), which makes debugging the script much easier. When I switched to Llama after the script had stabilised, I noticed that Llama was worse than nomic at identifying the right chunk for my queries.
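
Mechanically, the retrieval step amounts to something like the sketch below. The embed_text() helper is a placeholder for whichever embedder is plugged in (nomic-embed-text or the Llama embedder); the rest is just cosine-similarity ranking of the 411 chunk vectors against the query vector.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Hypothetical helper: returns the embedding vector for `text`
    from whichever model is plugged in (e.g. nomic-embed-text or a
    Llama-based embedder)."""
    raise NotImplementedError

def build_index(chunks: list[str]) -> np.ndarray:
    """Embed every chunk once; rows are L2-normalised so that a dot
    product with a normalised query equals cosine similarity."""
    vectors = np.array([embed_text(c) for c in chunks])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def top_k(query: str, chunks: list[str], index: np.ndarray, k: int = 5):
    """Return the k chunks most similar to the query."""
    q = embed_text(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                      # cosine similarities
    best = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in best]
```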

I use "cosine similarity" for distance measurement between two embedding vectors. I tried other metrics but decided on cosine similarity (which is basically the dot product of the two vectors).

Please ask again if the above is not clear enough.


Can you elaborate on the task where you thought nomic-embed-text outperformed the others? Is there a technical distinction between finding the correct document (or sections of a document) and finding clusters of content within one long document?

Example: you have a long, meandering multi-day diary by a man marooned on an island. His writing could conceivably be clustered conceptually into 'obtaining food', 'obtaining water', 'creating shelter' and 'improving visibility to planes'. But it is just one long document (which you could divide arbitrarily into chunks).

If you wanted Llama to summarise this and similar diary entries by first identifying all these relevant clusters, and then summarising the diary content for each cluster, are you saying the Nomic product would outperform Llama at the clustering part of this task, or is there a technical distinction between this task and finding the relevant document section(s)?
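
To make 'the clustering part' concrete, I mean something like the sketch below, which groups chunk embeddings by mutual similarity with no query involved (unlike retrieval, which ranks them against a query vector). The use of scikit-learn's KMeans and the cluster count are just placeholders for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_chunks(chunk_vectors: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """Group chunk embeddings into conceptual clusters (e.g. food, water,
    shelter, visibility) -- no query is involved, unlike retrieval."""
    # Normalise so Euclidean k-means behaves like clustering by cosine similarity.
    normed = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(normed)
```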
