Question: What is the best way to

kevingoed

Question: What is the best way to summarize single documents? Would that be with a VectorIndex and then just prompting it "summarize this document for me"?

21 comments

bmax

probably SummaryIndex

kevingoed

@bmax even for a single document?

bmax

@kapa.ai what's the best index to use for summarization?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

bmax

a single document, how many nodes does it turn into?

kapa.ai

----------------------
For summarization tasks, the best index to use with LlamaIndex is the SummaryIndex. It stores all of your documents and returns all of them to the query engine, which makes it well suited to queries that ask for a summary of the text in your documents.

For instance, a summarization query could look like one of the following:

“What is a summary of this collection of text?”
“Give me a summary of person X’s experience with the company.”

A summary index goes through all of the data by default. Empirically, setting response_mode="tree_summarize" also leads to better summarization results.

Here is an example of how to use it:

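# Import path for llama-index >= 0.10; older releases use `from llama_index import SummaryIndex`.
from llama_index.core import SummaryIndex

# `documents` is a list of already-loaded Document objects.
index = SummaryIndex.from_documents(documents)

query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("<summarization_query>")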

You can find more information about the SummaryIndex in the documentation and notebooks.
----------------------
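As a rough illustration of what response_mode="tree_summarize" does conceptually, here is a minimal sketch in plain Python. It is not LlamaIndex's implementation: `tree_summarize`, `fan_in`, and `first_words` are hypothetical names, and `first_words` only stands in for the LLM call that would normally write each intermediate summary.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 4,
) -> str:
    """Summarize groups of `fan_in` texts, then summarize the summaries,
    repeating until a single text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        # One pass: every group of up to `fan_in` texts becomes one summary.
        level = [
            summarize(level[i : i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


def first_words(texts: List[str]) -> str:
    # Hypothetical stand-in for an LLM: keep the first word of each text.
    return " ".join(t.split()[0] for t in texts if t.split())


chunks = ["alpha one", "beta two", "gamma three", "delta four", "epsilon five"]
summary = tree_summarize(chunks, first_words, fan_in=2)
```

Every chunk is read once at the bottom level, which is why this style of summarization covers the whole document but needs more calls as the document grows.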

kevingoed

hmm, not sure, but larger PDFs of 50+ pages are possible.

bmax

you want to summarize the whole PDF?

kevingoed

yeah exactly.

bmax

the best option for summarization is going to be SummaryIndex, which goes through every node

bmax

but it will be expensive in both time and cost

bmax

VectorIndex uses semantic search, so you might not get as good a summary

bmax

but VectorIndex will probably do better with a high similarity_top_k

bmax

and response_mode="tree_summarize"
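To get a feel for the "expensive in both time and cost" point, here is a back-of-the-envelope sketch that counts chunks and summarization calls for a hierarchical summary. Everything in it is an assumption for illustration: `estimate_llm_calls` is a hypothetical helper, and the 500 tokens per page, 1024-token chunks, and `fan_in` of 4 are made-up inputs. LlamaIndex packs chunks according to the model's context window, so real numbers will differ.

```python
import math


def estimate_llm_calls(total_tokens: int, chunk_tokens: int = 1024, fan_in: int = 4) -> dict:
    """Rough count of chunks and summarization calls when summaries are
    merged `fan_in` at a time until one remains."""
    if chunk_tokens < 1 or fan_in < 2:
        raise ValueError("chunk_tokens must be >= 1 and fan_in must be >= 2")
    chunks = max(1, math.ceil(total_tokens / chunk_tokens))
    calls, level = 0, chunks
    while True:
        level = math.ceil(level / fan_in)  # summaries produced by this pass
        calls += level
        if level == 1:
            break
    return {"chunks": chunks, "calls": calls}


# Assumed input: a 50-page PDF at roughly 500 tokens per page.
estimate = estimate_llm_calls(50 * 500)
```

Under these assumptions the 50-page PDF becomes 25 chunks and needs 10 summarization calls, and every chunk's text is sent to the model at least once.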

kevingoed

but isn't the logic behind it that it returns the top-k most similar vectors? That seems super unreliable when it searches for vectors similar to, for example, "Summarize this document for me".

bmax

yes, exactly, that's the issue

which is why I said "SummaryIndex" to begin with 🙂
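The concern above can be reproduced with a toy retriever. This sketch uses bag-of-words cosine similarity as a crude stand-in for real embeddings (an assumption; embedding models behave differently in detail), and `bow`, `cosine`, `top_k`, and the sample chunks are all hypothetical. It shows that a top-k search ranks chunks by resemblance to the query text, not by how much of the document they cover.

```python
import math
import re
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    # Lowercased word counts; a toy replacement for an embedding vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int) -> List[int]:
    """Return the indices of the k chunks most similar to the query."""
    q = bow(query)
    scored = [(cosine(q, bow(c)), i) for i, c in enumerate(chunks)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [i for _, i in ranked[:k]]


chunks = [
    "Revenue grew twelve percent in the third quarter.",
    "This document is provided for information only.",
    "The new factory opened in March.",
    "Please summarize feedback for me.",
]
picked = top_k("Summarize this document for me", chunks, k=2)
```

Here the two chunks that share words with the query (indices 3 and 1) are retrieved, while the chunks carrying the actual content (revenue, factory) are left out of the summary.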

kevingoed

Ahh got it. Interesting. Thanks a lot @bmax I'll experiment a bit.

kevingoed

I wonder if we could combine a keyword extractor and then summarize based on keywords that are in the document maybe?
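One way the keyword idea could look, as a minimal plain-Python sketch rather than a tested recipe: pull the most frequent non-trivial words from the document, then group chunks by keyword so each group can be summarized on its own. `extract_keywords`, `chunks_by_keyword`, the stopword list, and the sample chunks are all hypothetical; a real keyword extractor would be more robust than raw word counts.

```python
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "is", "to", "after", "while", "and", "of", "a", "in", "for"}


def words(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(text: str, n: int = 3) -> List[str]:
    """Return the n most frequent words longer than 3 letters, ties broken alphabetically."""
    counts = Counter(w for w in words(text) if w not in STOPWORDS and len(w) > 3)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]


def chunks_by_keyword(chunks: List[str], keywords: List[str]) -> Dict[str, List[int]]:
    """Map each keyword to the indices of the chunks that contain it."""
    return {
        kw: [i for i, c in enumerate(chunks) if kw in words(c)]
        for kw in keywords
    }


chunks = [
    "The factory output rose after the factory upgrade.",
    "Revenue rose while costs fell.",
    "Factory revenue is forecast to grow.",
]
keywords = extract_keywords(" ".join(chunks))
groups = chunks_by_keyword(chunks, keywords)
```

Each group in `groups` could then be passed to a summarizer, giving one short summary per keyword instead of relying on a single similarity search for the whole document.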
