
At a glance

The community member is interested in using the openchat_3.5 Language Model (LLM) instead of ChatGPT for Retrieval-Augmented Generation. They have successfully downloaded the openchat_3.5.Q8_0.gguf model and are using the llama_cpp library to establish a connection to the LLM. The community member is seeking guidance on how to link LlamaIndex to the local LLM.

In the comments, another community member suggests using the HuggingFaceLLM class from the LlamaIndex library, which makes it easier to load other models, and provides a link to an example demonstrating this approach. The same community member also mentions using the AutoModelForCausalLM.from_pretrained method.

The community member who originally posted has reviewed the information and has a follow-up question about the HuggingFaceLLM class and the significance of the tokenizer_name and model_name parameters. They are relatively new to the AI field and are interested in developing Large Language Model Applications.

The comments also include a code snippet that fails with an error, and the community member asks for help. The issue is related to the Accelerate library, which is required when using low_cpu_mem_usage=True or a device_map; installing Accelerate (pip install accelerate) resolves it.

Hello everyone,

I intend to utilize openchat_3.5 as my Language Model (LLM) instead of ChatGPT for Retrieval-Augmented Generation. To achieve this, I've successfully downloaded the openchat_3.5.Q8_0.gguf model onto my computer. I'm employing the llama_cpp library to establish a connection to the LLM, as illustrated below:
Plain Text
from llama_cpp import Llama

llm = Llama(model_path="/Users/developer/ai/models/openchat_3.5.Q8_0.gguf", n_gpu_layers=1, n_ctx=2048)

Now, I'm seeking guidance on how to link LlamaIndex to the local LLM, such as openchat_3.5.Q8_0.gguf.

Thank you.
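
For reference, a minimal sketch of one way to do this with the legacy llama_index API used elsewhere in this thread: the LlamaCPP wrapper in llama_index.llms accepts a local model_path, so the GGUF file above can be plugged into LlamaIndex directly. The model path is the one from the question; the ./data directory and the other parameter values are illustrative.
Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import LlamaCPP

# LlamaCPP wraps llama-cpp-python, so it can load the local GGUF file directly.
llm = LlamaCPP(
    model_path="/Users/developer/ai/models/openchat_3.5.Q8_0.gguf",
    context_window=2048,
    max_new_tokens=256,
    model_kwargs={"n_gpu_layers": 1},  # passed through to llama_cpp.Llama
    verbose=True,
)

# Hand the local LLM to LlamaIndex; "local" embeddings avoid any OpenAI calls
# (requires sentence-transformers to be installed).
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

print(index.as_query_engine().query("What is this document about?"))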
15 comments
Hi,
I used Hugging Face to load other models; it makes it much easier. (https://gpt-index.readthedocs.io/en/latest/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html)
IDK if this answers your question, I'm just a newbie here, but I hope it helps πŸ™‚
However, I used the AutoModelForCausalLM.from_pretrained method; you can find some samples for that, too.
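
For reference, a rough sketch of the AutoModelForCausalLM route mentioned above, assuming the legacy llama_index HuggingFaceLLM class also accepts pre-loaded model and tokenizer objects; the openchat/openchat_3.5 repo ID is the one used later in the thread, and the other values are illustrative.
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

# Load the model and tokenizer yourself with transformers...
model = AutoModelForCausalLM.from_pretrained("openchat/openchat_3.5")
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5")

# ...then hand the loaded objects to LlamaIndex instead of repo names.
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=2048,
    max_new_tokens=256,
)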
Hi @Semirke,
Thank you for the response. I have reviewed the information at https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html, and it seems to align with my requirements. I have a question: Does the class HuggingFaceLLM aim to connect directly to HuggingFace? If so, could you explain the significance of the following parameters:
Plain Text
tokenizer_name="Writer/camel-5b-hf",
model_name="Writer/camel-5b-hf",

As I'm relatively new to exploring the AI field, my goal is to develop Large Language Model Applications.
Best regards
Look for "Hugging Face models"; you'll see they have a model repository.
It downloads automagically (the public ones).
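
In other words, tokenizer_name and model_name are Hugging Face Hub repository IDs, not URLs or local paths; roughly speaking, HuggingFaceLLM passes them to the usual from_pretrained calls, which download and cache the weights the first time they run. A minimal illustration with the repo ID from the linked example:
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer

# "Writer/camel-5b-hf" is a repository ID on the Hugging Face Hub; the weights
# are downloaded into the local Hugging Face cache the first time this runs.
tokenizer = AutoTokenizer.from_pretrained("Writer/camel-5b-hf")
model = AutoModelForCausalLM.from_pretrained("Writer/camel-5b-hf")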
The following code doesn't run
Plain Text
import logging
import sys
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = PromptTemplate(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="openchat/openchat_3.5",
    model_name="openchat/openchat_3.5",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
Could you help me?
Python is complaining:
Plain Text
  File "/Users/developer/Library/Caches/pypoetry/virtualenvs/playground-2AP3SaSf-py3.11/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2674, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
It seems to work
ye, it's always worth actually reading the output messages πŸ˜„