hey all: how do you ensure text fits into an embedding model? you can't know a priori what tokenizer an embedding model uses - or even its input size! Or can you somehow? if I have some arbitrary string 'text' and I need to trim it shorter so it fits into 'embed_model', what's the approach?
You can use tiktoken to count the number of tokens in the given text and trim it down to the model's limit. Another thing I'm doing is creating summaries of larger documents, where you summarize chunks with overlaps. Check out DocumentSummaryIndex.
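A minimal sketch of the tiktoken approach, assuming an OpenAI-style embedding model (so the cl100k_base encoding matches its tokenizer) and a token limit you've looked up from the model's docs (e.g. 8191 for text-embedding-ada-002); for non-OpenAI models you'd swap in that model's own tokenizer instead:

```python
import tiktoken

def trim_to_fit(text: str, max_tokens: int = 8191, encoding_name: str = "cl100k_base") -> str:
    """Truncate text so its token count fits the embedding model's input window.

    Assumes an OpenAI-style tokenizer; other embedding models may tokenize
    differently, so treat the encoding name and limit as model-specific.
    """
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # keep only the first max_tokens tokens and decode back to a string
    return enc.decode(tokens[:max_tokens])

# usage: make sure an arbitrary string fits before embedding it
short_text = trim_to_fit("some arbitrary string " * 10_000)
```

DocumentSummaryIndex is the LlamaIndex index that stores a summary per document, so you retrieve over the summaries instead of pushing whole documents through the embedding model.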