How to Trim Text to Fit an Embedding Model Without Knowing the Tokenizer or Input Size

hey all: how do you ensure text fits into an embedding model?
you can't know a priori what tokenizer an embedding model uses - or even its input size! Or can you somehow?
if I have some arbitrary string 'text' and I need to trim it so it fits into 'embed_model', what's the approach?

There must be a simple solution I am missing!
thanks, I'll look into this 🙂
Most embedding models also just truncate if the input goes over, so as long as you are "close enough" it's usually fine imo
I guess it's just the NVIDIA model that doesn't. if I'm 1 token over it throws an exception! I just assumed that was typical
Oh this is a param:

from llama_index.embeddings.nvidia import NVIDIAEmbedding

# truncate="END" drops tokens past the model's input limit instead of raising
embed_model = NVIDIAEmbedding(
    model="nvidia/nv-embedqa-e5-v5", truncate="END"
)

iirc it also accepts "START" and "NONE" (the default, which is what throws the exception).
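
A quick usage sketch to show the effect (assumes the llama-index-embeddings-nvidia package is installed and NVIDIA_API_KEY is set in the environment; get_text_embedding is the standard LlamaIndex embedding call):

# deliberately over-long input; with truncate="END" this embeds the
# leading tokens instead of raising an exception
vector = embed_model.get_text_embedding("some long document text " * 500)
print(len(vector))  # embedding dimensionality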
You can use tiktoken to count the number of tokens in the given text (rough sketch below). Another thing I'm doing is creating summaries of larger documents, where you summarize chunks with overlaps. Check out DocumentSummaryIndex
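
A minimal sketch of the counting approach. The helper name trim_to_token_limit and the 512-token budget are made up for illustration, and note that tiktoken ships OpenAI tokenizers, not the NVIDIA model's own, so the count is only an estimate; leave some headroom below the model's real cap:

import tiktoken

def trim_to_token_limit(text: str, max_tokens: int = 512) -> str:
    # cl100k_base is an OpenAI tokenizer, not the embedding model's,
    # so treat the count as approximate and keep headroom under the true limit
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # decode only the first max_tokens tokens back into a string
    return enc.decode(tokens[:max_tokens])

text = "some arbitrary string " * 300
safe_text = trim_to_token_limit(text, max_tokens=500)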