How are text chunks for embeddings generated?

As documented here, when creating text vector entries in the Retool Vector DB, Retool does the following:

  1. If a document is provided, Retool extracts the text.
  2. Splits the text into chunks.
  3. Calculates the vector for each chunk.
  4. Stores this information in the vector.

To optimise usage of the vector DB, I want to make sure that the vectors are mutually exclusive and that each contains only one topic. However, I cannot find any information on how the text chunks are generated, for example a maximum number of characters or similar restrictions.

Can anyone provide more information on this? Thanks in advance!


Hey Nikita, engineer working on Retool Vectors here!

After we extract the text from the document, we pass it into a function called chunkText, which splits the text into smaller chunks. It does this using a class called TextSplitter, which breaks the text at specified points like spaces or punctuation while respecting a maximum chunk size (1000 characters). If a chunk is too large, it is split further. The result is an array of text chunks, each within the size limit, which makes the text easier to manage, especially where handling large texts directly is difficult due to size constraints.
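
To make the description above concrete, here is a minimal Python sketch of size-bounded splitting. It is not Retool's actual chunkText or TextSplitter code: the function name chunk_text, the separator list, and the hard-cut fallback are all assumptions made for illustration. Only the 1000-character limit comes from the description above.

```python
from typing import List

MAX_CHUNK_SIZE = 1000  # characters; the limit mentioned above
SEPARATORS = ["\n\n", "\n", ". ", " "]  # assumed break points, coarsest first


def chunk_text(text: str, max_size: int = MAX_CHUNK_SIZE,
               separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most max_size characters (illustrative only)."""
    text = text.strip()
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No break points left: hard-cut at the size limit.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
            current = ""
        if len(piece) <= max_size:
            current = piece  # start a new chunk with this piece
        else:
            # A single piece is too large: split it further on finer separators.
            chunks.extend(chunk_text(piece, max_size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = " ".join(["word"] * 600)
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

Note that a splitter like this only bounds chunk size. It does not guarantee one topic per chunk, so a chunk can still straddle a topic boundary if no paragraph break falls near it.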

We will be experimenting to see which splitting method works best for query performance, so feedback is always appreciated!


Hi @huytool157,

Can we have more granular control in the Retool Vector query? For example, when we choose to upload a document to a vector, it would be nice to have options to decide the splitters, maximum chunk size, etc. Essentially, it would be an explicit call of the chunkText and TextSplitter you mentioned above.

It would also be useful to be able to create document vectors from workflows / queries.