
JAX
Offline, last seen 3 months ago
Joined September 25, 2024
Another question - in the query engine we are using the default COMPACT response synthesizer, but we noticed that it does a significant amount of chunking, which noticeably increases both cost and latency. According to the documentation this seems to be expected behavior ( https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/ ).
Is there any way of disabling the chunking, in any shape or form?
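For reference, a minimal sketch of switching the response mode, assuming llama_index 0.9.x and that index is the existing VectorStoreIndex: simple_summarize packs the retrieved chunks into a single prompt (truncating if needed), so the synthesizer makes one LLM call instead of several compacted ones.
Plain Text
from llama_index.response_synthesizers import ResponseMode

# one LLM call, no repacking across several calls; a smaller similarity_top_k
# keeps the prompt (and the cost) down
query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode=ResponseMode.SIMPLE_SUMMARIZE,
)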
5 comments
JAX
hello!
I've been trying to use the retriever evaluation but I can't seem to get it working.
No matter what I run, it always returns this:
Metrics: {'mrr': 0.0, 'hit_rate': 0.0}
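For context, a minimal sketch of the evaluator wiring, assuming llama_index 0.9.x; all-zero metrics usually mean the expected_ids in the eval dataset do not match the node IDs actually stored in the index.
Plain Text
from llama_index.evaluation import RetrieverEvaluator

retriever = index.as_retriever(similarity_top_k=2)
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# expected_ids must be node IDs that exist in the index; if the dataset was
# generated from differently built nodes, hit_rate and mrr stay at 0.0
result = evaluator.evaluate(
    query="example question", expected_ids=["node_id_1", "node_id_2"]
)
print(result.metric_vals_dict)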
22 comments
Hey - I'm noticing that when I use the sentence window node postprocessor, it now seems to send the full metadata of the retrieved documents to the LLM instead of just the window content from inside the metadata. TBH I don't recall it being this way, so I'm curious whether it's a recent change or it was always like that (it could be I'm just imagining things 😄).
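For reference, a minimal sketch of hiding the bulky keys from the LLM, assuming the nodes come from the SentenceWindowNodeParser (nodes is a placeholder for the parsed node list); the MetadataReplacementPostProcessor still swaps the window text in as the node content.
Plain Text
# "window" and "original_text" stay available to the postprocessor but are
# excluded from the text the LLM (and the embedding model) sees
for node in nodes:
    node.excluded_llm_metadata_keys = ["window", "original_text"]
    node.excluded_embed_metadata_keys = ["window", "original_text"]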
3 comments
hello @Logan M! I'm trying to be an early adopter of the new Nomic AI embedding model, but I seem to be running into an error. Unfortunately I cannot use their API, so it must run locally; I am embedding around 100k nodes on a T4 machine into a Weaviate vector DB.

I am defining the model like this:
Plain Text
from transformers import AutoModel, AutoTokenizer
from llama_index.embeddings import HuggingFaceEmbedding

model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

embed_model = HuggingFaceEmbedding(
    model=model,
    tokenizer=tokenizer,
    max_length=2048,
)


Trying to keep a small insert batch size:
Plain Text
index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    service_context=service_context,
    show_progress=True,
    insert_batch_size=512,
)


This is the error I'm getting:
Plain Text
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 45.56 MiB is free. Including non-PyTorch memory, this process has 14.53 GiB memory in use. Of the allocated memory 14.08 GiB is allocated by PyTorch, and 335.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Any idea? 🙂
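For reference, a minimal sketch of the usual first knob, assuming llama_index 0.9.x: embed_batch_size controls how many texts go through the model per forward pass, and the default of 10 combined with max_length=2048 can exceed a 16 GB T4.
Plain Text
embed_model = HuggingFaceEmbedding(
    model=model,
    tokenizer=tokenizer,
    max_length=2048,
    embed_batch_size=2,  # default is 10; lower it until the OOM disappears
)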
19 comments
hello community! I have a more general question: our documentation is split into roughly 60% technical documentation and 40% more marketing-oriented documentation. Currently we're using a single index for everything.
Over the past few days I've been trying to finetune an embedding model on roughly 8k synthetic examples generated from the 60% technical documentation mentioned above. What seems to happen now is that the retriever tends to favor the marketing documents, which is not necessarily what I want.
My questions are:
  1. Is it bad practice to keep all documentation in the same index?
  2. If (1) is not so terrible, does it make sense to increase the top-k and introduce a reranker?
  3. If (1) is indeed bad practice, what are the recommendations? (one option is sketched below)
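On (3), a minimal sketch of one option, assuming separate indexes (tech_index and marketing_index are placeholder names) and llama_index 0.9.x: a router picks the right index per question instead of mixing the two corpora in one retrieval.
Plain Text
from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools import QueryEngineTool

router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=tech_index.as_query_engine(),
            description="Technical product documentation",
        ),
        QueryEngineTool.from_defaults(
            query_engine=marketing_index.as_query_engine(),
            description="Marketing-oriented documentation",
        ),
    ],
)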
1 comment
@Logan M is there any way of triggering the use of the GPU during embedding finetuning? I'm running the finetuning example from the LlamaIndex embedding documentation and it seems to use the CPU only; I can't find a way to make it use the GPU.
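For reference, a quick check rather than a fix: the finetune engine relies on sentence-transformers, which picks up CUDA automatically when PyTorch can see a GPU, so a CPU-only torch build is a common culprit.
Plain Text
import torch

# if this prints False, sentence-transformers silently falls back to CPU;
# installing a CUDA-enabled torch wheel usually fixes it
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))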
5 comments
@Logan M - quick question: I am running llama_index 0.9.31, typing_extensions 4.9.0 and openai 1.7.9
3 comments
Hello - I have a more architectural question, hopefully you can help me like with every other question I had before (thank you @Logan M 😄). Using the sentence window postprocessor, we end up with a ton of embeddings to calculate. We ingest new and updated documentation almost every week, and it has become difficult to keep up with the volume of embeddings. Running it on CPU takes a couple of hours; on GPU it's less than 30 minutes. From a cost standpoint, I am curious if there is any way of having on-demand GPUs / machines that can do the embedding calculations for us (e.g. serverless GPUs), or any solution that you would recommend? We are currently deployed on GCP, so that would make the most sense.
Thank you!
31 comments
hello @Logan M. Got a question, hopefully you can help me clear up my confusion. I recently read in the documentation that using OpenAI embeddings + reranking (BGE / Cohere) should significantly increase the retrieval hit rate. Currently I am using the all-mpnet-base-v2 embedding with sentence window retrieval. I am considering switching to OpenAI embeddings with the BGE reranker base, mainly because our documentation changes frequently and recalculating a ton of embeddings every week does not scale well. I am a little confused about how to make this type of retrieval work with reranking.
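For reference, a minimal sketch of the combination, assuming llama_index 0.9.x and an index built with the sentence window parser and OpenAI embeddings via the service context: the reranker is just another node postprocessor applied after the window replacement.
Plain Text
from llama_index.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

query_engine = index.as_query_engine(
    similarity_top_k=10,  # retrieve more candidates, let the reranker trim them
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=3),
    ],
)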
5 comments
hello! I have a question, hopefully you can help me out with it. I have multiple data sources (public documentation, community pages, articles written by architects), and sometimes documents contain conflicting information. For example, an article written by an architect might differ somewhat from the same topic in the public documentation (I know we should not have this, but oh well 😄). I am curious whether there is a way to set a sort of "preferred" source should different sources be retrieved for a question. For example, if I retrieve one doc from the architect-published docs and another from the public documentation, I'd like the LLM to prefer the architect's document. How do you recommend doing that? Should I just index everything in the same vector store or use separate vector stores? If I put them together in the same vector store and both get retrieved, the documents might each be semantically relevant to the question while their information conflicts, so I am not sure reranking would help here. Any suggestion is welcome 🙏
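For reference, a minimal sketch of one approach (source_type, architect_docs and public_docs are placeholder names): tag each document at ingest time and keep a single index, so a filtering or reranking step, or the prompt itself, can prefer the higher-priority source.
Plain Text
# tag provenance before building the index; the key and values are arbitrary
for doc in architect_docs:
    doc.metadata["source_type"] = "architect"
for doc in public_docs:
    doc.metadata["source_type"] = "public_docs"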
3 comments
Hey @Logan M - I've got a weird situation when switching from the default VectorStoreIndex storage to a vector DB (tried Chroma & FAISS so far). When writing the embeddings to the vector store, after about 1000 embeddings have been calculated, I get this:

Generating embeddings: 2%
1020/44049 [00:14<07:35, 94.40it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-15-b41c66c4891b> in <cell line: 3>()
1 vector_store = FaissVectorStore(faiss_index=faiss_index)
2 storage_context = StorageContext.from_defaults(vector_store=vector_store)
----> 3 index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=ctx, show_progress=True)

14 frames
/usr/local/lib/python3.10/dist-packages/transformers/models/mpnet/modeling_mpnet.py in compute_position_bias(self, x, position_ids, num_buckets)
376
377 rp_bucket = self.relative_position_bucket(relative_position, num_buckets=num_buckets)
--> 378 rp_bucket = rp_bucket.to(x.device)
379 values = self.relative_attention_bias(rp_bucket)
380 values = values.permute([2, 0, 1]).unsqueeze(0)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
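For reference, one common culprit rather than a confirmed fix: all-mpnet-base-v2 only has 512 position embeddings, so a node longer than that can produce an index-out-of-range on the GPU, which surfaces as exactly this kind of device-side assert in compute_position_bias; capping max_length rules that out.
Plain Text
from llama_index.embeddings import HuggingFaceEmbedding

# truncate inputs to the model's positional limit before they reach the GPU
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2",
    max_length=512,
)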
11 comments
Hello, I'm getting a weird AssertionError when using the FAISS vector store. Any idea?
Adding the code in the thread.
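For reference, a minimal sketch of the most common cause of that assertion (an assumption until the code is posted): the faiss index dimensionality has to match the embedding model's output size.
Plain Text
import faiss

# d must equal the embedding dimension of the embed model
# (e.g. 768 for all-mpnet-base-v2, 1536 for text-embedding-ada-002)
d = 768
faiss_index = faiss.IndexFlatL2(d)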
10 comments
JAX
hello community!
I have a question: is there a way to use Google PaLM 2 with the JSON credentials instead of an API key? I only see it with the API key here: https://gpt-index.readthedocs.io/en/latest/examples/llm/palm.html
6 comments