
Updated 2 years ago

In the query response, you can get the list of source_nodes. Is there a parameter for retrieving the file/document the source node came from?
If you set the extra_info of each document to contain the filename, that will also show up in the source nodes

more details here

https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_documents.html
I followed this doc. All of the node.extra_info return None even when there's source text.
Works for me

Plain Text
>>> from llama_index import Document, GPTVectorStoreIndex
>>> doc = Document("this is some text", extra_info={'test_key': 'test_val'})
>>> index = GPTVectorStoreIndex.from_documents([doc])
>>> response = index.as_query_engine().query('hello world')
>>> response.source_nodes[0].node.extra_info
{'test_key': 'test_val'}
>>> 
It also works if you set doc.extra_info directly
The instructions show creating a lambda function for filenames. It's not clear how you do that with the SimpleDirectoryReader.
I essentially did this:
Plain Text
from llama_index import SimpleDirectoryReader
filename_fn = lambda filename: {'file_name': filename}

# automatically sets the extra_info of each document according to filename_fn
documents = SimpleDirectoryReader('./data', file_metadata=filename_fn)
Almost!

Plain Text
>>> from llama_index import SimpleDirectoryReader
>>> filename_fn = lambda filename: {'file_name': filename}
>>> documents = SimpleDirectoryReader('./paul_graham', file_metadata=filename_fn).load_data()
>>> documents[0].extra_info
{'file_name': 'paul_graham/paul_graham_essay.txt'}
>>> 
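For reference, the `file_metadata` callable only has to map a path string to a dict, so a named function can stand in for the lambda and carry more fields. A minimal sketch, assuming the reader hands over the path as a plain string (as the output above suggests); `file_metadata_fn` and the `file_stem` key are hypothetical names, not part of llama_index:

```python
import os


def file_metadata_fn(file_path: str) -> dict:
    """Build the metadata dict for one file from its path alone."""
    base_name = os.path.basename(file_path)
    return {
        "file_name": file_path,                       # same key the lambda above used
        "file_stem": os.path.splitext(base_name)[0],  # hypothetical extra key
    }


print(file_metadata_fn("paul_graham/paul_graham_essay.txt"))
```

It would be passed the same way as the lambda, e.g. `SimpleDirectoryReader('./paul_graham', file_metadata=file_metadata_fn)`.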
Should that be available in the source nodes in a query response?
I did this to create the index:
Plain Text
# Read in Documents
filename_fn = lambda filename: {'file_name': filename}
documents = []
print("Reading documents.")
for file_path in file_dirs:
    documents.extend(SimpleDirectoryReader(
        input_dir=file_path,
        file_metadata=filename_fn,
        recursive=True).load_data()
    )
        
print("Building index.")
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

Then, in the query, I do this:
Plain Text
evaluator = ResponseEvaluator(service_context=service_context)
response = query_engine.query(query)
return {
    "query": query,
    "response": str(response),
    "source_documents": [x.node.extra_info for x in response.source_nodes],
    "source_text": self._source_text(response.source_nodes),
    "evaluation": evaluator.evaluate_source_nodes(response)
}


But source_documents always shows [None, None, ...]
It should be 🤔 or at least it is for me
I will double check my sanity here. This should work lol
Thanks. I thought I was doing everything the same but with the source_documents portion added in.
hmmm yea it works for me in a test script 😅 Not sure what the difference is here...
Plain Text
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex

filename_fn = lambda filename: {'file_name': filename}
documents = SimpleDirectoryReader(
    input_dir="./paul_graham",
    file_metadata=filename_fn,
    recursive=True).load_data()

index = GPTVectorStoreIndex.from_documents(documents)

response = index.as_query_engine().query("what did the author do growing up?")

print(str(response))
print([x.node.extra_info for x in response.source_nodes])
Output

Plain Text
Growing up, the author wrote short stories, programmed on an IBM 1401, built a microcomputer with a Heathkit, wrote simple games and a word processor on a TRS-80, and studied philosophy in college.
[{'file_name': 'paul_graham/paul_graham_essay.txt'}, {'file_name': 'paul_graham/paul_graham_essay.txt'}]
Hmmm...I load the index after persisting it. Any chance that's an issue?
hmm, I will check, I'll add a save/load part to my test
added this before running the query, still works for me

Plain Text
index.storage_context.persist(persist_dir='./nodes_index')

from llama_index import StorageContext, load_index_from_storage
index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./nodes_index"))
Maybe try with a fresh venv? Not sure why it's not working on your end 😅

Plain Text
python -m venv venv
source venv/bin/activate
pip install llama-index
I have a query function that does this:
Plain Text
index_dir = os.path.join(self.indexes_dir, index_id)

# Load index from requested docs
storage_context = StorageContext.from_defaults(persist_dir=index_dir)
service_context = self.create_service_context(**kwargs)
index = load_index_from_storage(
    storage_context=storage_context,
    service_context=service_context,
)
query_engine = index.as_query_engine()
responses = [self._query(x, query_engine, service_context) for x in queries]

the _query function looks like this:
Plain Text
evaluator = ResponseEvaluator(service_context=service_context)
response = query_engine.query(query)
return {
    "query": query,
    "response": str(response),
    "source_documents": [x.node.extra_info for x in response.source_nodes],
    "source_text": self._source_text(response.source_nodes),
    "evaluation": evaluator.evaluate_source_nodes(response)
}

Do you see anything wrong?
Nah that looks right to me. And no extra_info I'm guessing?

Actually, we can confirm that the documents were ingested properly. If you run nodes = index.docstore.docs, you will get a dict of every node in the index, keyed by node ID.

From there, you can verify that the nodes look correct
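A minimal sketch of that check, assuming `index.docstore.docs` is a dict mapping node IDs to node objects that expose an `extra_info` attribute (which matches the output printed later in this thread). `nodes_missing_extra_info` is a hypothetical helper, and the stand-in objects are only there so the snippet runs without an index:

```python
from types import SimpleNamespace


def nodes_missing_extra_info(docs: dict) -> list:
    """Return the IDs of nodes whose extra_info is None or empty."""
    return [
        node_id
        for node_id, node in docs.items()
        if not getattr(node, "extra_info", None)
    ]


# Stand-in nodes; with a real index you would pass index.docstore.docs instead.
fake_docs = {
    "a": SimpleNamespace(text="kept", extra_info={"file_name": "a.txt"}),
    "b": SimpleNamespace(text="lost", extra_info=None),
}
print(nodes_missing_extra_info(fake_docs))
```

An empty list would mean every node kept its metadata; any IDs returned point at nodes that lost it somewhere between loading and indexing.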
oh good. that was my next question
They're all showing as None. Is there a way to look at the files to see if the data is there but not being ingested properly (vs. it not being stored in the first place)?
Actually. Just looked in the docstore.json. It's all None there too
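The same inspection of the persisted file can be scripted. A rough sketch that walks whatever JSON is in `docstore.json` and tallies `extra_info` values; it assumes only that the file is nested dicts and lists, and the sample layout below is made up for illustration rather than the real file format. `count_extra_info` is a hypothetical helper:

```python
import json


def count_extra_info(obj, counts=None):
    """Recursively tally 'extra_info' values that are set vs. missing."""
    if counts is None:
        counts = {"set": 0, "missing": 0}
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "extra_info":
                counts["set" if value else "missing"] += 1
            else:
                count_extra_info(value, counts)
    elif isinstance(obj, list):
        for item in obj:
            count_extra_info(item, counts)
    return counts


# Made-up sample; for a persisted index, json.load() the docstore.json
# file from your persist_dir instead.
sample = json.loads(
    '{"docs": {"n1": {"extra_info": null},'
    ' "n2": {"extra_info": {"file_name": "a.txt"}}}}'
)
print(count_extra_info(sample))
```

If every entry lands in `missing`, the metadata was never written, so the problem is upstream of persisting and loading.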
When you call from_documents(), are you 100% sure each document has an extra_info field filled in?
Sounds like it might not be for some reason
I just did this for 1 doc. Let me show what it prints
Here is the output. It looks like after from_documents extra_info disappears
bruh how is this possible 😅
why can't I replicate this...
Here is the code matching up to those print statements:
Plain Text
print("Printing documents...")
pprint(documents)
        
print("Building index.")
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

print("Printing Nodes")
pprint(index.docstore.docs)
I know I've asked before, but you are super sure you have a recent version of llama-index? pip show llama-index can check that
I will run exactly this as a sanity check lol
So, I'm running this inside a class. Any chance there's some global variable or something getting nerfed?
mmm I dont think so... what does your service_context look like again?
Why does it work for me lol

Plain Text
>>> from llama_index import Document, GPTVectorStoreIndex
>>> documents = [Document('text', extra_info={'test': 'val'})]
>>> documents[0]
Document(text='text', doc_id='03b4c6e9-2bd2-4687-8980-f388eeebd6d7', embedding=None, doc_hash='1d3f05b1647ad55d6c09b356fe5d1fe670be262d5c3ea0ccda070e365a94809b', extra_info={'test': 'val'})
>>> index = GPTVectorStoreIndex.from_documents(documents)
>>> print(index.docstore.docs)
{'faf195d4-1295-425b-acb9-4289dcbc1c33': Node(text='text', doc_id='faf195d4-1295-425b-acb9-4289dcbc1c33', embedding=None, doc_hash='1d3f05b1647ad55d6c09b356fe5d1fe670be262d5c3ea0ccda070e365a94809b', extra_info={'test': 'val'}, node_info={'start': 0, 'end': 4, '_node_type': <NodeType.TEXT: '1'>}, relationships={<DocumentRelationship.SOURCE: '1'>: '03b4c6e9-2bd2-4687-8980-f388eeebd6d7'})}
>>> 
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=False, ...
set that bad boy to True lol
I copy pasted some shit from somewhere
sanity restored !
I'll just delete it
Thank you for working through my stupidity. I don't even understand why that's a flag.
or where I copied someone setting it to false
No idea why that's a flag either haha glad we figured it out though!
I had started diving into the code that the lambda is called through and was like "IT'S JUST extra_info = str(filepath) WHY IS IT VANISHING!!!"
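To summarize the root cause: the documents carried `extra_info`, but the node parser was configured not to copy it onto the nodes it produced. The toy model below only illustrates that effect. `ToyDocument`, `ToyNode`, and `toy_parse` are hypothetical stand-ins, not llama_index classes, and the fixed-size chunking is an assumption made to keep the sketch short:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDocument:
    text: str
    extra_info: Optional[dict] = None


@dataclass
class ToyNode:
    text: str
    extra_info: Optional[dict] = None


def toy_parse(doc: ToyDocument, chunk_size: int, include_extra_info: bool = True) -> list:
    """Split doc.text into fixed-size chunks, optionally copying metadata."""
    chunks = [doc.text[i:i + chunk_size] for i in range(0, len(doc.text), chunk_size)]
    return [
        ToyNode(text=chunk, extra_info=doc.extra_info if include_extra_info else None)
        for chunk in chunks
    ]


doc = ToyDocument("this is some text", extra_info={"file_name": "a.txt"})
# Flag off: every node loses the metadata, like the [None, None, ...] symptom.
print([n.extra_info for n in toy_parse(doc, 8, include_extra_info=False)])
# Flag on: every node inherits the document's metadata.
print([n.extra_info for n in toy_parse(doc, 8, include_extra_info=True)])
```

In the real setup the fix was the one found above: set `include_extra_info=True` on the `SimpleNodeParser`, or drop the argument, then rebuild the index.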