Integrating OpenAI's gpt-4o-audio-preview audio model with agents

Has anyone tried using OpenAI's gpt-4o-audio-preview audio model with agents? The integration seems quite challenging because streaming events are not supported... 😥 I think the only way to get streaming events is to use speech-to-text instead and convert the audio to text first. Any ideas?
I'm not sure what you mean by "streaming events are not supported" 🤔 Are you talking about the realtime API, or just chat messages with audio?
I've tested it very lightly, and it seems to work ok-ish -- but yeah, I'm not actually sure what stream_chat will return
https://docs.llamaindex.ai/en/stable/examples/llm/openai/#audio-support
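For reference, the basic pattern from those docs looks roughly like this (a sketch; the audio file path is a placeholder):

```python
from llama_index.core.llms import AudioBlock, ChatMessage, TextBlock
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-audio-preview")
messages = [
    ChatMessage(
        role="user",
        blocks=[
            AudioBlock(path="speech.wav", format="wav"),  # placeholder path
            TextBlock(text="Describe the content of this audio."),
        ],
    )
]
response = llm.chat(messages)  # plain, non-streaming chat handles audio blocks
print(response)
```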
@Logan M Hey again 🙂 So, the issue is that I get this error:
`llama_index.core.workflow.errors.WorkflowRuntimeError: Error in step 'run_agent_step': Audio is not supported for chat streaming`
I added a custom step in my workflow:
```python
from llama_index.core.llms import AudioBlock, ChatMessage, TextBlock
from llama_index.core.workflow import Context, step

# ...inside my custom workflow class (AudioInput / ProcessedInput are my own events):

    @step
    async def handle_audio(self, ctx: Context, ev: AudioInput) -> ProcessedInput:
        # Turn the audio into a text description and pass it along
        result = self.process_audio(ev.audio_path)
        return ProcessedInput(result=result)

    ...

    # Used by the step above to process audio
    def process_audio(self, audio_path: str) -> str:
        """Process audio input and return a description."""
        messages = [
            ChatMessage(
                role="user",
                blocks=[
                    AudioBlock(path=audio_path, format="wav"),
                    TextBlock(text="Describe the content of this audio."),
                ],
            )
        ]
        llm = self.agents[self.root_agent].llm
        response = llm.chat(messages)  # non-streaming call
        return str(response)
```
I haven't enabled streaming anywhere, hence my question. I just have an AgentWorkflow with function calling. I've also set up human-in-the-loop (HITL) by iterating over handler.stream_events(), as we've discussed before.
The actual traceback will be higher up (there are two tracebacks in one).

Do you have the full thing?
I think I know the issue though -- if you are using AgentWorkflow, it will automatically call llm.astream_chat() on the chat messages -- so if you have audio messages, this will probably cause an issue if OpenAI doesn't support streaming here
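i.e. conceptually, something like this happens inside the workflow (a simplified sketch, not the actual library code):

```python
# Simplified view of what AgentWorkflow does with the chat history:
response_gen = await llm.astream_chat(messages)  # always streams
# If `messages` contains an AudioBlock, the OpenAI LLM rejects this with
# "Audio is not supported for chat streaming" -- the error in your traceback.
# The non-streaming entry points are fine:
response = await llm.achat(messages)
```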
You might have to avoid AgentWorkflow and make your own custom workflow if you want audio to work here 🤔 Alternatively, I could maybe add a flag to disable LLM streaming, but that feels a tad hacky
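Something like this minimal sketch (the class name is hypothetical; the point is that it only ever calls the non-streaming achat()):

```python
from llama_index.core.llms import ChatMessage
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step


class AudioChatWorkflow(Workflow):  # hypothetical name
    """Minimal custom workflow that never streams the LLM response."""

    def __init__(self, llm, **kwargs):
        super().__init__(**kwargs)
        self.llm = llm

    @step
    async def chat(self, ev: StartEvent) -> StopEvent:
        # `messages` is passed in via workflow.run(messages=[...])
        messages: list[ChatMessage] = ev.get("messages")
        response = await self.llm.achat(messages)  # non-streaming, audio-safe
        return StopEvent(result=str(response))
```

You lose the built-in tool-calling loop that way though, so it's a tradeoff.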
@Logan M Here's the entire traceback. I've also attached the custom workflow steps I've defined: shouldn't these override the default AgentWorkflow step you mentioned, where `await self.llm.astream_chat_with_tools(...)` is being called? It's a step that returns AgentOutput, actually... hmm, I haven't overridden that one. Maybe I need to override that too.
Yeah exactly -- this is being called inside run_agent_step in the AgentWorkflow, which is just calling agent.take_step() -- so I think you'd want to override that last method
@Logan M I need to override take_step, right? It's not enough to define a step that returns AgentOutput -- or would that work too?
I think you need to override take_step, yeah -- or override run_agent_step such that it never calls take_step
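Roughly like this, if it helps (a sketch only -- double-check take_step's exact signature against your llama-index version before copying):

```python
from llama_index.core.agent.workflow import AgentOutput, FunctionAgent


class NonStreamingFunctionAgent(FunctionAgent):  # hypothetical name
    """Overrides take_step to use a non-streaming LLM call so audio blocks work."""

    async def take_step(self, ctx, llm_input, tools, memory) -> AgentOutput:
        # achat_with_tools is the non-streaming counterpart of astream_chat_with_tools
        response = await self.llm.achat_with_tools(tools, chat_history=llm_input)
        tool_calls = self.llm.get_tool_calls_from_response(
            response, error_on_no_tool_call=False
        )
        return AgentOutput(
            response=response.message,
            tool_calls=tool_calls or [],
            raw=response.raw,
            current_agent_name=self.name,
        )
```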