The community members are discussing performance issues with ingestion pipelines and docstores, particularly when dealing with large documents that produce a large number of nodes. The main points are:
- The ingestion pipeline makes too many calls to the vector store and docstore, and should batch everything at the end.
- There is a significant performance hit when doing document management like delete/add, because a key-value put is issued for every node operation.
- The community members are considering deferring the put for ref_doc_info until all of a document's nodes have been removed, to improve performance (see the sketch after this list).
- There is an issue with the ingestion pipeline not supporting Pinecone serverless, and a PR is in progress to address this.
- The community members have identified a bug where the ref_doc_info can balloon to over 1M refs, causing performance issues. A fix PR has been created to address this.
- There are discussions about how best to handle the high number of refs, with potential solutions like namespacing by document and deleting the namespace directly in the key-value store.
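A minimal sketch of the deferred-put idea from the list above; the toy key-value store and function names here are illustrative, not the actual LlamaIndex docstore internals. Instead of writing ref_doc_info back to the store after every node removal, all nodes are removed first and the store is touched once at the end:

```python
from typing import Dict


class ToyKVStore:
    """Stand-in for a real key-value backend (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self.put_count = 0  # counts how many writes actually hit the backend

    def put(self, key: str, val: dict) -> None:
        self.put_count += 1
        self._data[key] = val

    def get(self, key: str) -> dict:
        return self._data.get(key, {})

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def delete_ref_doc_per_node(kv: ToyKVStore, ref_doc_id: str) -> None:
    """Current pain point: one ref_doc_info put per node removed (N puts for N nodes)."""
    info = kv.get(f"ref_doc_info/{ref_doc_id}")
    for node_id in list(info.get("node_ids", [])):
        kv.delete(f"node/{node_id}")
        info["node_ids"].remove(node_id)
        kv.put(f"ref_doc_info/{ref_doc_id}", info)


def delete_ref_doc_deferred(kv: ToyKVStore, ref_doc_id: str) -> None:
    """Proposed behaviour: remove all nodes first, then touch ref_doc_info once."""
    info = kv.get(f"ref_doc_info/{ref_doc_id}")
    for node_id in info.get("node_ids", []):
        kv.delete(f"node/{node_id}")
    kv.delete(f"ref_doc_info/{ref_doc_id}")
```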
What's interesting about the high number of nodes is that the document chunking results in about a thousand nodes. But once the DocumentSummaryIndex gets created, the ref_doc_info balloons to 1M+ refs, so there's something strange going on there. Have to look into that. If that number of refs is unavoidable, we may have to namespace by document and delete the namespace in the KV store directly.
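If the ref count really can't be reduced, the namespacing idea above makes deletion cheap: keep each document's entries under their own collection so the whole namespace is dropped in one operation instead of one delete per node. A rough illustration with a toy store (not the LlamaIndex KV store API):

```python
from collections import defaultdict
from typing import Dict


class NamespacedKVStore:
    """Toy store where each ref_doc_id gets its own collection (illustrative only)."""

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, collection: str, key: str, val: dict) -> None:
        self._collections[collection][key] = val

    def drop_collection(self, collection: str) -> None:
        # Deleting a document becomes a single namespace drop,
        # regardless of how many node entries it holds.
        self._collections.pop(collection, None)


store = NamespacedKVStore()
for i in range(1000):  # imagine 1M+ of these entries
    store.put("doc-123", f"node-{i}", {"text": f"chunk {i}"})

store.drop_collection("doc-123")  # one operation, not one delete per node
```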
@Logan M I tracked down the cause of the 1M refs. It looks like if the doc store is used to store multiple indexes, each index will cause an exponential increase in refs. Here is a basic notebook to see the issue (I didn't test it this late on a Friday, but it should work).
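The notebook itself isn't reproduced here, but a rough sketch of that kind of repro, assuming llama-index 0.10.x style imports and mock models so it runs without API keys, would be to build more than one index over a shared docstore and check the ref_doc_info size after each build:

```python
from llama_index.core import (
    Document,
    DocumentSummaryIndex,
    MockEmbedding,
    Settings,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.llms import MockLLM
from llama_index.core.storage.docstore import SimpleDocumentStore

# Mock LLM and embeddings so the repro runs offline.
Settings.llm = MockLLM(max_tokens=64)
Settings.embed_model = MockEmbedding(embed_dim=8)

docs = [Document(text=f"some text for document {i}") for i in range(10)]
storage_context = StorageContext.from_defaults(docstore=SimpleDocumentStore())


def ref_count() -> int:
    """Total number of node refs tracked in the shared docstore's ref_doc_info."""
    info = storage_context.docstore.get_all_ref_doc_info() or {}
    return sum(len(v.node_ids) for v in info.values())


VectorStoreIndex.from_documents(docs, storage_context=storage_context)
print("refs after vector index:", ref_count())

DocumentSummaryIndex.from_documents(docs, storage_context=storage_context)
print("refs after summary index:", ref_count())
```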
@Logan M question: I am trying to use this updated version in my application and reference the git repo in requirements.txt, but it always pulls the wheel for 10.51 instead of using the updated core (because that's how it's pinned in the poetry config). How can I make it use the latest?
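One possible way to handle this with pip is a direct git reference in requirements.txt that points at the package subdirectory of the monorepo, so llama-index-core is built from the branch rather than resolved to the published wheel; the branch name below is a placeholder for whichever branch carries the fix:

```
llama-index-core @ git+https://github.com/run-llama/llama_index.git@<branch-with-fix>#subdirectory=llama-index-core
```

Note that if another installed package (for example the llama-index meta-package) pins llama-index-core to a released version, that pin may also need to be relaxed for the git install to stick.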