
Updated last year

Question: What is the best way to summarize single documents? Would that be with a VectorIndex and then just prompting it "summarize this document for me"?
21 comments
probably SummaryIndex
@bmax even for a single document?
@kapa.ai what's the best index to use for summarization?
Let me check the knowledge base for you. One moment! :books:
a single document, how many nodes does it turn into?
----------------------
For summarization tasks, the best index to use with LlamaIndex is the SummaryIndex. As the name suggests, it is designed to generate a summary of the text in your documents. It stores all of the documents and returns all of them to your query engine. This index is best suited to queries where you are trying to generate a summary of the text in your documents.

For instance, a summarization query could look like one of the following:
  • “What is a summary of this collection of text?”
  • “Give me a summary of person X’s experience with the company.”
In general, a summary index would be suited for this use case. A summary index by default goes through all the data. Empirically, setting response_mode="tree_summarize" also leads to better summarization results.

Here is an example of how to use it:

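# Import path for llama-index 0.10+; older releases use `from llama_index import SummaryIndex`.
from llama_index.core import SummaryIndex

# `documents` is a list of Document objects you have already loaded.
index = SummaryIndex.from_documents(documents)

query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("<summarization_query>")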


You can find more information about the SummaryIndex in the documentation and notebooks.
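To make the "tree_summarize" idea above more concrete, here is a minimal pure-Python sketch of bottom-up summarization: groups of chunks are summarized, then the summaries are summarized, until one text remains. It is an illustration, not LlamaIndex's implementation. The `fan_in` parameter and the `fake_llm_summary` stand-in are assumptions made for this example; the real response mode packs text by prompt size, not by a fixed group count.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    fan_in: int = 4,
) -> str:
    """Summarize groups of `fan_in` texts, then summarize the summaries,
    repeating until a single summary remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        # One summarize() call per group at the current level of the tree.
        level = [
            summarize("\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


def fake_llm_summary(text: str) -> str:
    # Hypothetical stand-in for an LLM call: keep the first 60 characters.
    return text[:60]


chunks = [f"Chunk {i} text." for i in range(10)]
summary = tree_summarize(chunks, fake_llm_summary)
print(summary)
```

With 10 chunks and `fan_in=4`, the sketch makes three calls at the first level and one at the second, which is why this mode touches every chunk but still ends with a single answer.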
----------------------
Hmm, not sure, but larger PDFs of 50+ pages are possible.
you want to summarize the whole PDF?
yeah exactly.
the best for summarization purposes is going to be SummaryIndex, which will go through each node
but it will be expensive in both time and cost
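A rough way to see where that cost comes from is to estimate how many nodes a document becomes. The sketch below is back-of-the-envelope arithmetic, not a LlamaIndex call: the 500 tokens per page, the chunk size of 1024, and the overlap of 200 are all assumptions picked for illustration, and sentence-aware splitters will not hit these counts exactly.

```python
import math


def estimate_nodes(total_tokens: int, chunk_size: int = 1024, chunk_overlap: int = 200) -> int:
    """Estimate how many overlapping chunks are needed to cover `total_tokens`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # Each chunk after the first advances by (chunk_size - chunk_overlap) tokens.
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


pages = 50
tokens_per_page = 500  # assumption: varies a lot with layout and language
total_tokens = pages * tokens_per_page

nodes = estimate_nodes(total_tokens)
# Every node is sent to the LLM at least once, so this is a rough floor on input tokens.
min_llm_input_tokens = nodes * 1024
print(nodes, min_llm_input_tokens)
```

Under these assumptions a 50-page PDF comes out at 31 nodes, and a summary index reads all of them on every summarization query.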
VectorIndex will use semantic search, so you might not get as good a summary
but VectorIndex will probably do better with a high similarity_top_k
and response_mode="tree_summarize"
but isn't the logic behind it that it returns the top-k most similar vectors? That seems super unreliable when it searches for vectors similar to, for example, "Summarize this document for me"
that's the issue
which is why I said "SummaryIndex" to begin with 🙂
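The concern about top-k retrieval can be shown with a toy example. This sketch uses bag-of-words cosine similarity as a stand-in for real embeddings (an assumption made so the example runs without any model), and the sample chunks are invented for illustration. The point is only that a generic query ranks chunks by similarity to the query text, and anything outside the top k never reaches the LLM.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    # Toy "embedding": lowercase word counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int) -> List[str]:
    # Rank chunks by similarity to the query and keep only the first k.
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


chunks = [
    "This document is a summary of the quarterly report for readers.",
    "Revenue grew twelve percent, driven by the new subscription tier.",
    "Churn rose in the enterprise segment after the pricing change.",
    "Hiring is frozen until the next funding round closes.",
]

hits = top_k("Summarize this document for me", chunks, k=2)
skipped = [c for c in chunks if c not in hits]
print(hits)
print(skipped)
```

Here the introductory chunk ranks first because it shares words with the query, while half of the chunks are skipped entirely. Raising k shrinks the skipped set, and at k equal to the number of chunks the behavior matches reading every node.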
Ahh got it. Interesting. Thanks a lot @bmax I'll experiment a bit.
I wonder if we could combine a keyword extractor and then summarize based on keywords that are in the document maybe?
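One way to prototype that idea is sketched below. It is untested as a summarization strategy and only shows the plumbing: word frequency stands in for a real keyword extractor, the stopword list and sample chunks are made up for this example, and the selected chunks would then be passed to whatever summarizer you use.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list; a real extractor would use a fuller one.
STOPWORDS = {"the", "this", "from", "in", "to", "a", "of", "and", "for", "is"}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(text: str, n: int = 5) -> List[str]:
    # Frequency-based keywords: most common non-stopwords longer than two letters.
    words = [w for w in tokenize(text) if w not in STOPWORDS and len(w) > 2]
    return [word for word, _ in Counter(words).most_common(n)]


def chunks_for_keywords(chunks: List[str], keywords: List[str]) -> List[str]:
    # For each keyword, keep the first chunk that contains it as a whole word.
    selected: List[str] = []
    for keyword in keywords:
        for chunk in chunks:
            if keyword in set(tokenize(chunk)):
                if chunk not in selected:
                    selected.append(chunk)
                break
    return selected


chunks = [
    "Revenue grew this quarter. Revenue from subscriptions led the growth.",
    "Churn rose in the enterprise segment. Churn followed the pricing change.",
    "The office moved to a new floor.",
]

keywords = top_keywords(" ".join(chunks), n=2)
selected = chunks_for_keywords(chunks, keywords)
print(keywords)
print(selected)
```

The trade-off is the same as with top-k retrieval: chunks that contain none of the chosen keywords are left out, so the summary covers the dominant topics but not necessarily the whole document.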