NLTK

At a glance

The post indicates that the answer to the question "Docker NLTK download" is "no". The comments discuss several approaches to handling NLTK downloads in a Docker environment, such as setting the cache directory with the LLAMA_INDEX_CACHE_DIR environment variable, preloading the NLTK data files in the Dockerfile, and relying on LlamaIndex's cache directory as a fallback. Community members also discuss issues with the NLTK downloader and note that LlamaIndex maintains its own version. Additionally, a community member points out a type hint issue in the split_by_sentence_tokenizer() function.

16 comments
You can set the cache dir it downloads to with the LLAMA_INDEX_CACHE_DIR env var
Then you couuuuld preload the dockerfile with the nltk files
Seems like they did something similar in that Stack Overflow thread
Are you sure that’s the one? Source looks like it’s NLTK_DATA
The fallback is Llama’s cache dir
I think that means it might work without modifying the dockerfile…seems llama already handles this support even though nltk doesn’t :)
haha yea, NLTK_DATA isn't even used by their downloader, which is super annoying, so we made our own version xD
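Putting the above together, a Dockerfile along these lines should work as a sketch; the base image, installed packages, data directory, and entrypoint are illustrative assumptions rather than anything taken from the thread:

Plain Text
FROM python:3.11-slim

# Illustrative dependencies; pin versions for real builds
RUN pip install --no-cache-dir llama-index nltk

# Bake the tokenizer data into the image at build time so nothing has to be
# downloaded when the container starts. NLTK_DATA is the env var the
# LlamaIndex source checks first; its own cache dir is only the fallback.
ENV NLTK_DATA=/usr/local/share/nltk_data
# punkt is what NLTK's sent_tokenize needs; add other corpora as required
RUN python -m nltk.downloader -d "$NLTK_DATA" punkt

WORKDIR /app
COPY . /app
# main.py is a placeholder entrypoint
CMD ["python", "main.py"]

Because the data is baked into the image and NLTK_DATA points at it, neither NLTK nor LlamaIndex should need to download anything at container startup.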
@Logan M btw the type hint here is wrong
Plain Text
def split_by_sentence_tokenizer() -> Callable[[str], List[str]]:
It should be Callable[[str, str], List[str]], since NLTK's sent_tokenize takes a second language argument:
Plain Text
def sent_tokenize(text, language="english"):
ah yea that's fair
For others referencing:

Plain Text
import typing
from typing import Callable, List
# split_by_sentence_tokenizer comes from llama_index (the exact import path varies by version)
TOKENIZER: Callable[[str, str], List[str]] = typing.cast(
    Callable[[str, str], List[str]], split_by_sentence_tokenizer()
)
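With the cast in place, a call like TOKENIZER(text, "english") matches NLTK's two-argument sent_tokenize signature and type-checks cleanly.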