JasonV
Joined September 25, 2024
Any idea when Python 3.13 will be supported across llama-index?
28 comments
Any practical tips for handling invalidly formatted JSON results from the model?
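One common approach is a tolerant parser that tries progressively looser repairs before giving up. The sketch below is stdlib-only and makes no claim about what llama-index does internally; `parse_model_json` and `FENCE` are hypothetical names, and the repairs (stripping a Markdown code fence, trimming chatter around the object, dropping trailing commas) cover only the most frequent failure modes.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_model_json(raw: str) -> Optional[Any]:
    """Best-effort parse of JSON emitted by a model; returns None if nothing parses."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence (with or without a language tag).
    text = re.sub(rf"^{FENCE}[a-zA-Z]*\s*|\s*{FENCE}$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which skips prose around the object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start : end + 1]
    # Remove trailing commas before a closing brace or bracket.
    # Caveat: this would also alter a string value that contains ",}" or ",]".
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

When the function returns None, re-prompting the model with the parse error is usually the next step; that retry loop is left out here.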
1 comment
I see from llama_index.program.openai import OpenAIPydanticProgram. Is there an equivalent for Ollama? I can't seem to find one.
6 comments
Congrats on the workflows, I'll definitely kick the tires on that.
1 comment
Did instrumentation change? Arize Phoenix, which was totally happy, is now constantly warning me. Hmm.
2 comments
Anyone have best practices in mind?
22 comments
I love how many people are using tqdm now. It really took off.
1 comment
Quick question. My ingestion pipeline works just fine for building my vector store. But as my postgres DB expanded, I needed to embed a few other fields not related to the original ingestion. Would folks here hand-roll a new embedding table? Just create a new VectorStoreIndex? I definitely don't want to add the new embeddings to the original. Any perspectives welcome.
6 comments
Has anyone used instructor yet?
2 comments
How has folks' experience been with Anthropic? I heard good things in a meeting today, but in my hands for the past few hours, it's been abysmal.
3 comments
Looks like I need a hand, if someone so wise is around. 😎

I've been screwing up my node filtering all along. The only reason I've gotten such good results is that the queries embed a relevant cue and I'm getting lucky on filtering.

Here's my use case. I want to ingest lots of documents -- estimating close to 300,000 into pgvector. During ingestion, I set a metadata key business_id. I can verify that each node in the table has .metadata['business_id'] set to the correct value.

I need to, at query time, pull only those docs with the specific metadata['business_id'] == some_value, then filter the top_k from that set, NOT pull top_k from all nodes and then return those matching. Make sense? I just need a where clause on my SQL query. 🙂
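To make the difference concrete, here is a minimal sketch in plain Python, not llama-index or pgvector code. The node shape, `top_k_prefilter`, and `top_k_postfilter` are hypothetical names for illustration. The pre-filter version is the in-memory analogue of a SQL WHERE clause followed by ORDER BY distance and LIMIT k.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical node shape: {"id": str, "business_id": int, "embedding": list of floats}
Node = Dict[str, object]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_prefilter(nodes: List[Node], query: Sequence[float], business_id: int, k: int) -> List[Node]:
    """Restrict to the business first, then rank only the survivors."""
    candidates = [n for n in nodes if n["business_id"] == business_id]
    ranked = sorted(candidates, key=lambda n: cosine(n["embedding"], query), reverse=True)
    return ranked[:k]


def top_k_postfilter(nodes: List[Node], query: Sequence[float], business_id: int, k: int) -> List[Node]:
    """Rank every node, keep the top k, then drop the ones from other businesses."""
    ranked = sorted(nodes, key=lambda n: cosine(n["embedding"], query), reverse=True)[:k]
    return [n for n in ranked if n["business_id"] == business_id]


NODES: List[Node] = [
    {"id": "a1", "business_id": 1, "embedding": [1.0, 0.0]},
    {"id": "a2", "business_id": 1, "embedding": [0.9, 0.1]},
    {"id": "b1", "business_id": 2, "embedding": [0.2, 0.8]},
    {"id": "b2", "business_id": 2, "embedding": [0.0, 1.0]},
]
QUERY = [1.0, 0.0]  # closest to business 1's nodes
```

With k=2 and business_id=2, the pre-filter version returns both of business 2's nodes, while the post-filter version returns nothing because the global top 2 belong to business 1. That is the "getting lucky" effect: post-filtering only works when the query already lands near the right business.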
10 comments
I'm embarrassed to even ask this, but here goes. 😰

I have a very strange issue. I recursively load a directory full of HTML using
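from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import UnstructuredReader

documents = SimpleDirectoryReader(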
    input_dir=source_directory,
    file_extractor={".html": UnstructuredReader()},
    file_metadata=lambda x: {"biz_id": int(biz_id)},
    required_exts=[".html"],
    recursive=True,
).load_data()


It loads all 193 documents and the data look correct. BUT, when I run the ingestion pipeline off the loaded docs, I always only get 7 nodes! Furthermore, if I change up the transformations in the pipeline, swapping params and even different transformers, I still always only get 7 nodes back!

There's a person w/a very unique name in the docs. I can search the doc text and find it. But, it's not in the transformed nodes; I'm missing data. What am I doing wrong?

Here's the pipeline. (The commented out code was me trying different variants. It makes no difference.):
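from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(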
    transformations=[
        # Option 1: Use SemanticSplitterNodeParser for semantic splitting
        # SemanticSplitterNodeParser(
        #     buffer_size=512,
        #     breakpoint_percentile_threshold=95,
        #     embed_model=embed_model,
        #     verbose=True,
        # ),
        # Option 2: Use SentenceSplitter for sentence-level splitting
        SentenceSplitter(),
        # Option 3: Use UnstructuredElementNodeParser for custom parsing
        # UnstructuredElementNodeParser(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=documents, show_progress=True, in_place=True)
23 comments
Anyone else seeing duplicate OpenAI calls when using MultiStepQueryEngine?
3 comments
Can't I do a query-time metadata filter?

Let's say I indexed 5 documents each from different authors. The node's metadata has the author on it. The docs seem to indicate I can only add a metadata filter to the retriever then instantiate the query_engine. That means I'm constantly having to re-create the engine if the metadata over which I'm querying changes, like searching for author1 in query1 then author2 in query2.

Other frameworks allow me to filter at query time.
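If filters can only be supplied at construction time, one possible workaround is to build the engine lazily per filter value and cache it, so switching between author1 and author2 does not rebuild anything after the first use. The sketch below is plain Python with hypothetical stand-ins (`DOCS`, `build_engine`, `engine_for`, `query`); it is not llama-index's API, and `build_engine` just marks where the real filtered retriever and query engine would be created.

```python
from functools import lru_cache
from typing import Callable, List

# Hypothetical corpus standing in for indexed nodes with author metadata.
DOCS = [
    {"author": "author1", "text": "notes on retrieval"},
    {"author": "author2", "text": "notes on ingestion"},
]


def build_engine(author: str) -> Callable[[str], List[str]]:
    """Stand-in for constructing a retriever with an author filter plus its query engine."""
    scoped = [d["text"] for d in DOCS if d["author"] == author]

    def engine(question: str) -> List[str]:
        # Toy keyword match; a real engine would do vector search over the scoped nodes.
        words = question.lower().split()
        return [t for t in scoped if any(w in t for w in words)]

    return engine


@lru_cache(maxsize=128)
def engine_for(author: str) -> Callable[[str], List[str]]:
    """Build each author's engine once and reuse it on later queries."""
    return build_engine(author)


def query(question: str, author: str) -> List[str]:
    """Query-time entry point: the filter value is just another argument."""
    return engine_for(author)(question)
```

The cache only helps when the set of filter values is small and repeated; for arbitrary filter combinations, constructing a lightweight retriever per query is the simpler choice.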
6 comments