
Updated 2 years ago

Help with Querying across and within Transcripts

Hey @jerryjliu0 , fantastic work! I have read through all of the docs, and tried getting the different query modes, response modes, and the overall style of using gpt index down. I'm having a hard time figuring out what considerations are needed wrt cost/performance/detail when it comes to choosing an index and what style of composing is needed if at all.

My use case is I have 100s of long meeting transcripts, and am trying to run two types of queries: one which runs only on a given transcript, and a second which runs across transcripts. I decided to start by making a ListIndex out of each transcript document, and doing set_text on each with a tree_summarize summary. Now I'm not sure if that's better than making it a tree, and I'm not sure at all how to think about the impact if I change it to a tree index.

I then build a GPT SimpleVectorIndex over these document ListIndexes (this I understand uses openai embeddings) and save it to disk.

Finally, for querying across transcripts, I query the SimpleVectorIndex with, say, "How has the speakers' perception about the topic changed through the meetings?" with response_mode tree_summarize and topK=50.

If you or others could help me understand what my considerations here should be wrt choosing indices and composing styles, I would be so grateful, and would love to pay it forward by helping the community here as well 🙂
I also read all the previous threads here on transcripts, and composability,
Still trying to figure this out, would appreciate any insights
hey @ShantanuNair , apologies for the delay. i'll try to respond to other messages soon!

Thanks for the questions. I'll try to answer here but let me know if you have any followup thoughts:
No worries, truly appreciate all the support and rapid PRs 🙂 It's hard work!
What I'm figuring out now - I am still trying to figure out the structuring/composing needs for my transcripts use case. I made a thread on it above.
  • for big collections of documents, use gpt simple vector index (which it seems like you're doing). use gptlistindex over a carefully selected set of nodes, don't do it over everything (a query will iterate through every node)
  • gpt simple vector index is probably the cheapest/quickest to index your documents. tree index is an interesting experimental tool but i find it more effective as a high-level router, can get expensive if you're trying to use it over a large corpus
  • it's cool you're using composability. I think the way you're using it makes a lot of sense. By default, the chunk sizes are really big, and so top_k=50 will fetch 50 of these big chunks (around ~4000 tokens each). You can set it to smaller with chunk_size_limit=512 (or 256 or any other number) when building an index - you could try that out
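To make the chunk-size point above concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not gpt_index code; `retrieved_tokens` is a hypothetical helper, and the ~4000 and 512 token figures are the ones quoted in the bullet above. It only estimates an upper bound on how much context a top-k retrieval pulls in.

```python
def retrieved_tokens(top_k: int, chunk_tokens: int) -> int:
    """Upper bound on context tokens fetched when top_k chunks are retrieved."""
    return top_k * chunk_tokens


# Default-sized chunks (~4000 tokens each, per the bullet above).
default_budget = retrieved_tokens(top_k=50, chunk_tokens=4000)

# Smaller chunks, as with chunk_size_limit=512.
small_budget = retrieved_tokens(top_k=50, chunk_tokens=512)

print(default_budget)  # 200000
print(small_budget)    # 25600
```

With the same top_k, smaller chunks cut the retrieved context by roughly 8x, which is why lowering the chunk size (or top_k) is worth trying before changing index types.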
Ahh my bad, you are addressing the thread!
for 1) are you saying use GPT simple vector over a listindex, with each listindex being a individual transcript?
using list index over an individual transcript makes a lot of sense (well unless the transcripts are super super long), i was mostly saying don't use it over large collections of documents because slow and expensive
Unfortunately it looks like my recursive query stops at the list indexes summary. And so the fact that the transcripts are indeed long hasn't been an issue yet. The query runs on the summaries but doesn't work on the subindex of listindex.
are you specifying mode="recursive"?
oh wait you mentioned that in a diff thread
Yes, I do set it. And I think tree_summarize is the right response mode for the list and simplevector indices?
An example output from querying
response = higherIndex.query("Given these individual meeting summaries, how many meetings have there been in total, and what are their main topics?", mode='recursive', query_configs=query_configs, verbose=True)
---


Loaded higherIndex from disk
Top 6 nodes:
[Node 25a389c3-8414-4665-b1a3-aaefae20c627] [Similarity score: 0.747065]
This meeting transcript is from the podcast JS Party, hosted by Nick Nisi and featuring Divya Sa...
[Node 5c323242-a7b9-4567-b533-ed7c9bf8575b] [Similarity score: 0.746532]
This meeting transcript features Nick Nisi, Kevin Ball, and special guest Matteo Collina, a tech...
[Node cf729133-d8c9-4617-b771-51689e9f07ba] [Similarity score: 0.742899]
This meeting transcript is from the yayQuery podcast, hosted by Alex Sexton, Rebecca Murphey, Pa...
[Node b0b29f15-65a7-4b28-bd12-d4d06c71eb27] [Similarity score: 0.742324]
In this meeting transcript, Jerod Santo, Kevin Ball, and Emma Wedekind discussed the importance ...
[Node e5f6b35e-9328-4f5a-97ad-4c6a189481cc] [Similarity score: 0.738915]
This meeting transcript is between Mikeal Rogers, Alex Sexton, and Rachel White, three technolog...
[Node 8bbb7cd4-c0b6-47cd-a641-83729ade9473] [Similarity score: 0.725232]
Kevin Ball and Phil Hawksworth, a developer experience engineer at Netlify, met at JAMstack Conf...
Searching in chunk:
This meeting transcript is from the podcast JS...
Searching in chunk:
This meeting transcript features Nick Nisi, Ke...
Searching in chunk:
This meeting transcript is from the yayQuery p...
Searching in chunk:
In this meeting transcript, Jerod Santo, Kevin...
Searching in chunk:
This meeting transcript is between Mikeal Roge...
Searching in chunk:
Kevin Ball and Phil Hawksworth, a developer ex...
Building index from nodes: 0 chunks
0/5, summary:
There have been five meetings in total, and th...
Initial response:
There have been five meetings in total, and their main topics are:
  1. The confusion surrounding the name of the programming language JavaScript, and the idea of rebranding it.
  2. The complexities of the Node.js Streams API and how it relates to the WHATWG Streams API.
  3. The use of the Chrome DevTools Coverage feature, the Explodal plugin, the LABjs script loader, the yayQuery Beginner's Corner, and the Redux course on Egghead.io.
  4. The importance of collaboration between designers and developers when building design systems.
  5. The potential of WebGL and augmented reality applications.
[query] Total LLM token usage: 2062 tokens
[query] Total embedding token usage: 23 tokens

There have been five meetings in total, and their main topics are:
  1. The confusion surrounding the name of the programming language JavaScript, and the idea of rebranding it.
  2. The complexities of the Node.js Streams API and how it relates to the WHATWG Streams API.
  3. The use of the Chrome DevTools Coverage feature, the Explodal plugin, the LABjs script loader, the yayQuery Beginner's Corner, and the Redux course on Egghead.io.
  4. The importance of collaboration between designers and developers when building design systems.
  5. The potential of WebGL and augmented reality applications.
-------

There are 6 meetings in the index, and it retrieved the top 6 but only gave the top 5. Results otherwise pretty good!
@ShantanuNair i'll look into the composability issue where it doesn't go into subindices, but in the meantime you can also try response_mode="compact" or response_mode="default" - this is a different response mode that iterates over the list instead of doing hierarchical summarization
Yes, I'll try that now.
Ahh, you're saying try that in the query_configs, and for the list index in particular? I did try default, and the issue with default is that it treats them like one meeting and refines that one meeting's description with each consecutive context. I haven't tried compact, though I did wonder if it would increase the context passed per call while reducing the number of LLM calls.
But that would then try and refine once it went past the prompt length anyway, and result in the same issue as default.
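A toy model of the trade-off being discussed here, in plain Python. This is not the library's implementation: `refine_calls` and `compact_calls` are hypothetical helpers, and the token numbers are made up for illustration. The assumption is that default mode makes one LLM call per chunk, while compact packs as many chunks as fit into each prompt and then refines across the packed prompts.

```python
import math


def refine_calls(num_chunks: int) -> int:
    """Default (refine) mode in this toy model: one LLM call per chunk."""
    return num_chunks


def compact_calls(num_chunks: int, chunk_tokens: int, prompt_budget: int) -> int:
    """Compact mode in this toy model: pack chunks into each prompt, then refine."""
    per_prompt = max(1, prompt_budget // chunk_tokens)  # chunks that fit in one call
    return math.ceil(num_chunks / per_prompt)


print(refine_calls(6))                 # 6
print(compact_calls(6, 300, 3000))     # 1
print(compact_calls(6, 300, 1000))     # 2
```

Once the packed text exceeds a single prompt, compact still has to refine across calls, so it reduces the number of refine steps but does not remove them.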
Oh that's interesting.
Just to make sure I understand it correctly: in the original case, the subindices are ListIndexes of entire documents. Now say the higher vectorIndex returns 10 embeddings; each of those would be mapped to a ListIndex summary. But then what? What traversal do we expect? That it traverses each of those 10 ListIndex documents in its entirety, correct? Which is why you stated the consideration that they shouldn't be super super long.
that's exactly right
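The expected traversal confirmed above can be sketched as a toy model in plain Python. None of this is gpt_index code: `Transcript` and `expected_recursive_nodes` are hypothetical names, and the transcripts are assumed to be already ranked by similarity to the query.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    summary: str        # text set on the sub-index; this is what the top-level index embeds
    chunks: List[str]   # the transcript's own nodes inside its list index


def expected_recursive_nodes(ranked: List[Transcript], top_k: int) -> List[str]:
    """Return every chunk a fully recursive query would read, in order."""
    visited: List[str] = []
    for transcript in ranked[:top_k]:       # top-level vector index keeps the top_k summaries
        visited.extend(transcript.chunks)   # each list index is then read in its entirety
    return visited


transcripts = [
    Transcript("summary A", ["a1", "a2", "a3"]),
    Transcript("summary B", ["b1", "b2"]),
    Transcript("summary C", ["c1"]),
]
nodes = expected_recursive_nodes(transcripts, top_k=2)
print(len(nodes))  # 5
```

The cost of the second stage grows with the total length of the selected transcripts, not with top_k alone, which is the reason for the caveat about very long transcripts.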
Thank you so much! I'll make sure to give back to the community as well, now that I'm getting a stronger idea of the entire API!
ofc! would love your help
Apologies - I was mistaken above. The scenario I explained with default and compact happens when I set the response mode for the vectorIndex, not the ListIndex.
The ListIndex never gets traversed anyway, so regardless of what response_mode I enter, the response is the same. It runs the queries on the summaries of each ListIndex and stops there.