The (incorrect) interpretation I got from reading JerryLiu's response was that only the nodes that contribute to the response would be included in source_nodes, and since the default chunk size is large, that could mean many (in some cases all) documents. From what I see, however, it's a combination of two things: chunk size and the top_k setting. top_k will always return exactly the number of nodes specified, regardless of relevance. Specifically, in my case I had multiple (different) documents and knew that only one of them contained a specific subject not found in the others; yet the contents of source_nodes were driven more by the top_k value than by relevance, so that didn't help.
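To illustrate what I mean (a minimal sketch, assuming the llama_index.core import layout, default OpenAI credentials configured, and made-up document contents):

```python
from llama_index.core import Document, VectorStoreIndex

# Hypothetical documents: only the first one mentions the subject being queried.
docs = [
    Document(text="Quarterly revenue grew 12% year over year."),
    Document(text="The office relocated to a new building in March."),
    Document(text="Employee onboarding now uses a self-service portal."),
]

index = VectorStoreIndex.from_documents(docs)

# similarity_top_k=3 means three nodes come back in source_nodes,
# even though only one document is actually about revenue.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("How much did revenue grow?")
print(len(response.source_nodes))  # 3 -- top_k, not relevance, sets the count
```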
So my thought is that this is about probability scoring, and I'm looking at the scores in NodeWithScore.
Currently, node[x].score returns None for every node, though that may be due to a setting I'm using.
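For reference, this is how I'm inspecting the scores (continuing the sketch above; a plain vector retriever normally does populate NodeWithScore.score with the similarity value, so None may point at the index or retriever type instead):

```python
# Retrieve directly so the NodeWithScore objects can be inspected
# before any response synthesis happens.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("How much did revenue grow?")

for node_with_score in nodes:
    # A vector retriever fills .score with the similarity value;
    # some index/retriever combinations leave it as None.
    print(node_with_score.score, node_with_score.node.get_content()[:60])
```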
I'm looking here:
https://github.com/run-llama/llama_index/issues/14157

Looking at the fuzzy-citation approach that @WhiteFang_Jr shared, it seems this is post-processing similarity scoring. I also read that certain query engine types, such as those built on a TreeIndex, need post-processing to get scores. I already built a DocumentSummaryIndex and will be using a retriever query engine, so hopefully the scores will already be in the NodeWithScore objects.
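For the record, this is roughly the setup I'm planning (a sketch, reusing docs from the first snippet and assuming the llama_index.core import paths; I haven't yet verified whether this retriever actually populates the scores, which is exactly what I want to find out):

```python
from llama_index.core import DocumentSummaryIndex
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)
from llama_index.core.query_engine import RetrieverQueryEngine

# Build the summary index (uses the LLM to summarize each document).
summary_index = DocumentSummaryIndex.from_documents(docs)

# Embedding-based retriever over the per-document summaries.
retriever = DocumentSummaryIndexEmbeddingRetriever(
    summary_index,
    similarity_top_k=1,
)

query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("How much did revenue grow?")

for node_with_score in response.source_nodes:
    # Checking whether the score survives into the final response,
    # or whether it still comes back as None.
    print(node_with_score.score)
```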