
At a glance

The community member is interested in using the openchat_3.5 Language Model (LLM) instead of ChatGPT for Retrieval-Augmented Generation. They have successfully downloaded the openchat_3.5.Q8_0.gguf model and are using the llama_cpp library to establish a connection to the LLM. The community member is seeking guidance on how to link LlamaIndex to the local LLM.

In the comments, another community member suggests using the HuggingFaceLLM class from the LlamaIndex library, which makes it easier to load other models, and provides a link to an example demonstrating this approach. The same community member also mentions using the AutoModelForCausalLM.from_pretrained method.

The community member who originally posted has reviewed the information and has a follow-up question about the HuggingFaceLLM class and the significance of the tokenizer_name and model_name parameters. They are relatively new to the AI field and are interested in developing Large Language Model Applications.

The comments also include a code snippet that fails with an error, and the community member asks for help. The issue is related to the Accelerate library, which is required when using low_cpu_mem_usage=True or a device_map; installing Accelerate (pip install accelerate) resolves it.

Hello everyone,

I intend to utilize openchat_3.5 as my Language Model (LLM) instead of ChatGPT for Retrieval-Augmented Generation. To achieve this, I've successfully downloaded the openchat_3.5.Q8_0.gguf model onto my computer. I'm employing the llama_cpp library to establish a connection to the LLM, as illustrated below:
Plain Text
from llama_cpp import Llama

llm = Llama(model_path="/Users/developer/ai/models/openchat_3.5.Q8_0.gguf", n_gpu_layers=1, n_ctx=2048)

Now, I'm seeking guidance on how to link LlamaIndex to the local LLM, such as openchat_3.5.Q8_0.gguf.

Thank you.
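
For reference, a minimal sketch of one way to do this with the legacy llama_index API used elsewhere in this thread: the LlamaCPP wrapper in llama_index.llms accepts a local model_path, so the GGUF file above can be plugged into LlamaIndex directly. The model path is the one from the question; the ./data directory and the other parameter values are illustrative.
Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import LlamaCPP

# LlamaCPP wraps llama-cpp-python, so it can load the local GGUF file directly.
llm = LlamaCPP(
    model_path="/Users/developer/ai/models/openchat_3.5.Q8_0.gguf",
    context_window=2048,
    max_new_tokens=256,
    model_kwargs={"n_gpu_layers": 1},  # passed through to llama_cpp.Llama
    verbose=True,
)

# Hand the local LLM to LlamaIndex; "local" embeddings avoid any OpenAI calls
# (requires sentence-transformers to be installed).
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

print(index.as_query_engine().query("What is this document about?"))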
15 comments
Hi,
I used Hugging Face to load other models; it makes it much easier. (https://gpt-index.readthedocs.io/en/latest/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html)
IDK if this answers your question, I'm just a newbie here, but I hope it helps πŸ™‚
However, I used the AutoModelForCausalLM.from_pretrained method; you can find some samples for that, too.
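
For reference, a rough sketch of the AutoModelForCausalLM route mentioned above, assuming the legacy llama_index HuggingFaceLLM class also accepts pre-loaded model and tokenizer objects; the openchat/openchat_3.5 repo ID is the one used later in the thread, and the other values are illustrative.
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

# Load the model and tokenizer yourself with transformers...
model = AutoModelForCausalLM.from_pretrained("openchat/openchat_3.5")
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5")

# ...then hand the loaded objects to LlamaIndex instead of repo names.
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=2048,
    max_new_tokens=256,
)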
Hi @Semirke,
Thank you for the response. I have reviewed the information at https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html, and it seems to align with my requirements. I have a question: Does the class HuggingFaceLLM aim to connect directly to HuggingFace? If so, could you explain the significance of the following parameters:
Plain Text
tokenizer_name="Writer/camel-5b-hf",
model_name="Writer/camel-5b-hf",

As I'm relatively new to exploring the AI field, my goal is to develop Large Language Model Applications.
Best regards
Look for "Hugging Face models"; you'll see they have a model repository.
It downloads automagically (the public ones).
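
In other words, tokenizer_name and model_name are Hugging Face Hub repository IDs, not URLs or local paths; roughly speaking, HuggingFaceLLM passes them to the usual from_pretrained calls, which download and cache the weights the first time they run. A minimal illustration with the repo ID from the linked example:
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer

# "Writer/camel-5b-hf" is a repository ID on the Hugging Face Hub; the weights
# are downloaded into the local Hugging Face cache the first time this runs.
tokenizer = AutoTokenizer.from_pretrained("Writer/camel-5b-hf")
model = AutoModelForCausalLM.from_pretrained("Writer/camel-5b-hf")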
The following code doesn't run
Plain Text
import logging
import sys
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = PromptTemplate(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="openchat/openchat_3.5",
    model_name="openchat/openchat_3.5",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
Could you help me?
Python is complaining:
Plain Text
  File "/Users/developer/Library/Caches/pypoetry/virtualenvs/playground-2AP3SaSf-py3.11/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2674, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
It seems to work
ye, it's always worth actually reading the output messages πŸ˜„