NLTK

At a glance

The post indicates that the answer to the question "Docker NLTK download" is "no". The comments discuss several approaches to handling NLTK downloads in a Docker environment, such as setting the cache directory with the LLAMA_INDEX_CACHE_DIR environment variable, preloading the NLTK data files in the Dockerfile, and relying on LlamaIndex's cache directory as a fallback. Community members also discuss issues with the NLTK downloader and note that LlamaIndex maintains its own version. Additionally, a community member points out a type hint issue in the split_by_sentence_tokenizer() function.

16 comments
You can set the cache dir it downloads to with the LLAMA_INDEX_CACHE_DIR env var
Then you couuuuld preload the dockerfile with the nltk files
Seems like they did something similar in that Stack Overflow thread
Are you sure that’s the one? Source looks like it’s NLTK_DATA
The fallback is Llama’s cache dir
I think that means it might work without modifying the dockerfile…seems llama already handles this support even though nltk doesn’t :)
haha yea, NLTK_DATA isn't even used by their downloader, which is super annoying, so we made our own version xD
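Putting the above together, a Dockerfile along these lines should work as a sketch; the base image, installed packages, data directory, and entrypoint are illustrative assumptions rather than anything taken from the thread:

Plain Text
FROM python:3.11-slim

# Illustrative dependencies; pin versions for real builds
RUN pip install --no-cache-dir llama-index nltk

# Bake the tokenizer data into the image at build time so nothing has to be
# downloaded when the container starts. NLTK_DATA is the env var the
# LlamaIndex source checks first; its own cache dir is only the fallback.
ENV NLTK_DATA=/usr/local/share/nltk_data
# punkt is what NLTK's sent_tokenize needs; add other corpora as required
RUN python -m nltk.downloader -d "$NLTK_DATA" punkt

WORKDIR /app
COPY . /app
# main.py is a placeholder entrypoint
CMD ["python", "main.py"]

Because the data is baked into the image and NLTK_DATA points at it, neither NLTK nor LlamaIndex should need to download anything at container startup.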
@Logan M btw the type hint here is wrong
Plain Text
def split_by_sentence_tokenizer() -> Callable[[str], List[str]]:
It should be Callable[[str, str], List[str]], since NLTK's sent_tokenize takes a second language argument:
Plain Text
def sent_tokenize(text, language="english"):
ah yea that's fair
For others referencing:

Plain Text
import typing
from typing import Callable, List
# split_by_sentence_tokenizer comes from llama_index (the exact import path varies by version)
TOKENIZER: Callable[[str, str], List[str]] = typing.cast(
    Callable[[str, str], List[str]], split_by_sentence_tokenizer()
)
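With the cast in place, a call like TOKENIZER(text, "english") matches NLTK's two-argument sent_tokenize signature and type-checks cleanly.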