The community members are discussing performance issues with ingestion pipelines and docstores, particularly when dealing with large documents that produce a large number of nodes. The main points are:
- The ingestion pipeline makes too many calls to the vector store and docstore, and should batch everything at the end.
- There is a significant performance hit when doing document management like delete/add, because a key-value put is issued for every node operation.
- The community members are considering deferring the put for ref_doc_info until all of a document's nodes have been removed, to improve performance (see the sketch after this list).
- There is an issue with the ingestion pipeline not supporting Pinecone serverless, and a PR is in progress to address this.
- The community members have identified a bug where the ref_doc_info can balloon to over 1M refs, causing performance issues. A fix PR has been created to address this.
- There are discussions about how best to handle the high number of refs, with potential solutions like namespacing by document and deleting the namespace directly in the key-value store.
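A minimal sketch of the deferred-put idea from the list above; the toy key-value store and function names here are illustrative, not the actual LlamaIndex docstore internals. Instead of writing ref_doc_info back to the store after every node removal, all nodes are removed first and the store is touched once at the end:

```python
from typing import Dict


class ToyKVStore:
    """Stand-in for a real key-value backend (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self.put_count = 0  # counts how many writes actually hit the backend

    def put(self, key: str, val: dict) -> None:
        self.put_count += 1
        self._data[key] = val

    def get(self, key: str) -> dict:
        return self._data.get(key, {})

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def delete_ref_doc_per_node(kv: ToyKVStore, ref_doc_id: str) -> None:
    """Current pain point: one ref_doc_info put per node removed (N puts for N nodes)."""
    info = kv.get(f"ref_doc_info/{ref_doc_id}")
    for node_id in list(info.get("node_ids", [])):
        kv.delete(f"node/{node_id}")
        info["node_ids"].remove(node_id)
        kv.put(f"ref_doc_info/{ref_doc_id}", info)


def delete_ref_doc_deferred(kv: ToyKVStore, ref_doc_id: str) -> None:
    """Proposed behaviour: remove all nodes first, then touch ref_doc_info once."""
    info = kv.get(f"ref_doc_info/{ref_doc_id}")
    for node_id in info.get("node_ids", []):
        kv.delete(f"node/{node_id}")
    kv.delete(f"ref_doc_info/{ref_doc_id}")
```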
What's interesting about the high number of nodes is that the document chunking results in about a thousand nodes. But once the DocumentSummaryIndex gets created, the ref_doc_info balloons to 1M+ refs, so there's something strange going on there. Have to look into that. If that number of refs is unavoidable, we may have to namespace by document and delete the namespace in the KV store directly.
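If the ref count really can't be reduced, the namespacing idea above makes deletion cheap: keep each document's entries under their own collection so the whole namespace is dropped in one operation instead of one delete per node. A rough illustration with a toy store (not the LlamaIndex KV store API):

```python
from collections import defaultdict
from typing import Dict


class NamespacedKVStore:
    """Toy store where each ref_doc_id gets its own collection (illustrative only)."""

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, collection: str, key: str, val: dict) -> None:
        self._collections[collection][key] = val

    def drop_collection(self, collection: str) -> None:
        # Deleting a document becomes a single namespace drop,
        # regardless of how many node entries it holds.
        self._collections.pop(collection, None)


store = NamespacedKVStore()
for i in range(1000):  # imagine 1M+ of these entries
    store.put("doc-123", f"node-{i}", {"text": f"chunk {i}"})

store.drop_collection("doc-123")  # one operation, not one delete per node
```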
@Logan M I tracked down the cause of the 1M refs. It looks like if the doc store is used to store multiple indexes, each index will cause an exponential increase in refs. Here is a basic notebook to see the issue (I didn't test it this late on a Friday, but it should work).
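The notebook itself isn't reproduced here, but a rough sketch of that kind of repro, assuming llama-index 0.10.x style imports and mock models so it runs without API keys, would be to build more than one index over a shared docstore and check the ref_doc_info size after each build:

```python
from llama_index.core import (
    Document,
    DocumentSummaryIndex,
    MockEmbedding,
    Settings,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.llms import MockLLM
from llama_index.core.storage.docstore import SimpleDocumentStore

# Mock LLM and embeddings so the repro runs offline.
Settings.llm = MockLLM(max_tokens=64)
Settings.embed_model = MockEmbedding(embed_dim=8)

docs = [Document(text=f"some text for document {i}") for i in range(10)]
storage_context = StorageContext.from_defaults(docstore=SimpleDocumentStore())


def ref_count() -> int:
    """Total number of node refs tracked in the shared docstore's ref_doc_info."""
    info = storage_context.docstore.get_all_ref_doc_info() or {}
    return sum(len(v.node_ids) for v in info.values())


VectorStoreIndex.from_documents(docs, storage_context=storage_context)
print("refs after vector index:", ref_count())

DocumentSummaryIndex.from_documents(docs, storage_context=storage_context)
print("refs after summary index:", ref_count())
```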
@Logan M question: I am trying to use this updated version in my application and reference the git repo in requirements.txt, but it always pulls the wheel for 10.51 instead of using the updated core (because that's how it's pinned in the poetry config). How can I make it use the latest?
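One possible way to handle this with pip is a direct git reference in requirements.txt that points at the package subdirectory of the monorepo, so llama-index-core is built from the branch rather than resolved to the published wheel; the branch name below is a placeholder for whichever branch carries the fix:

```
llama-index-core @ git+https://github.com/run-llama/llama_index.git@<branch-with-fix>#subdirectory=llama-index-core
```

Note that if another installed package (for example the llama-index meta-package) pins llama-index-core to a released version, that pin may also need to be relaxed for the git install to stick.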