Integrating OpenAI's gpt-4o-audio-preview audio model with agents

Has anyone tried using OpenAI's gpt-4o-audio-preview audio model with agents? The integration seems quite challenging because streaming events are not supported... 😥 I think the only way to get streaming events is to use speech-to-text instead and convert the audio to text first. Any ideas?
I'm not sure what you mean by "streaming events are not supported" 🤔 Are you talking about the realtime API, or just chat messages with audio?
I've tested it very lightly, and it seems to work ok-ish -- but yeah, I'm not actually sure what stream_chat will return
https://docs.llamaindex.ai/en/stable/examples/llm/openai/#audio-support
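For reference, the basic pattern from those docs looks roughly like this (a sketch; the audio file path is a placeholder):

```python
from llama_index.core.llms import AudioBlock, ChatMessage, TextBlock
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-audio-preview")
messages = [
    ChatMessage(
        role="user",
        blocks=[
            AudioBlock(path="speech.wav", format="wav"),  # placeholder path
            TextBlock(text="Describe the content of this audio."),
        ],
    )
]
response = llm.chat(messages)  # plain, non-streaming chat handles audio blocks
print(response)
```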
@Logan M Hey again 🙂 So, the issue is that I get this error:
`llama_index.core.workflow.errors.WorkflowRuntimeError: Error in step 'run_agent_step': Audio is not supported for chat streaming`
I added a custom step in my workflow:
```python
from llama_index.core.llms import AudioBlock, ChatMessage, TextBlock
from llama_index.core.workflow import Context, step

# ...inside my custom workflow class (AudioInput / ProcessedInput are my own events):

    @step
    async def handle_audio(self, ctx: Context, ev: AudioInput) -> ProcessedInput:
        # Turn the audio into a text description and pass it along
        result = self.process_audio(ev.audio_path)
        return ProcessedInput(result=result)

    ...

    # Used by the step above to process audio
    def process_audio(self, audio_path: str) -> str:
        """Process audio input and return a description."""
        messages = [
            ChatMessage(
                role="user",
                blocks=[
                    AudioBlock(path=audio_path, format="wav"),
                    TextBlock(text="Describe the content of this audio."),
                ],
            )
        ]
        llm = self.agents[self.root_agent].llm
        response = llm.chat(messages)  # non-streaming call
        return str(response)
```
I haven't enabled streaming anywhere, hence my question. I just have an AgentWorkflow with function calling. I've also set up human-in-the-loop (HITL) by iterating over handler.stream_events(), as we've discussed before.
The actual traceback will be higher up (there are two tracebacks in one).

Do you have the full thing?
I think I know the issue though -- if you are using AgentWorkflow, it will automatically call llm.astream_chat() on the chat messages -- so if you have audio messages, this will probably cause an issue if OpenAI doesn't support streaming here
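i.e. conceptually, something like this happens inside the workflow (a simplified sketch, not the actual library code):

```python
# Simplified view of what AgentWorkflow does with the chat history:
response_gen = await llm.astream_chat(messages)  # always streams
# If `messages` contains an AudioBlock, the OpenAI LLM rejects this with
# "Audio is not supported for chat streaming" -- the error in your traceback.
# The non-streaming entry points are fine:
response = await llm.achat(messages)
```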
You might have to avoid AgentWorkflow and make your own custom workflow if you want audio to work here 🤔 Alternatively, I could maybe add a flag to disable LLM streaming, but that feels a tad hacky
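Something like this minimal sketch (the class name is hypothetical; the point is that it only ever calls the non-streaming achat()):

```python
from llama_index.core.llms import ChatMessage
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step


class AudioChatWorkflow(Workflow):  # hypothetical name
    """Minimal custom workflow that never streams the LLM response."""

    def __init__(self, llm, **kwargs):
        super().__init__(**kwargs)
        self.llm = llm

    @step
    async def chat(self, ev: StartEvent) -> StopEvent:
        # `messages` is passed in via workflow.run(messages=[...])
        messages: list[ChatMessage] = ev.get("messages")
        response = await self.llm.achat(messages)  # non-streaming, audio-safe
        return StopEvent(result=str(response))
```

You lose the built-in tool-calling loop that way though, so it's a tradeoff.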
@Logan M Here's the entire traceback. I've also attached the custom workflow steps I've defined: shouldn't these override the default AgentWorkflow step you mentioned, where `await self.llm.astream_chat_with_tools(...)` is being called? It's a step that returns AgentOutput, actually... hmm, I haven't overridden that one. Maybe I need to override that too.
Yeah exactly -- this is being called inside run_agent_step in the AgentWorkflow, which is just calling agent.take_step() -- so I think you'd want to override that last method
@Logan M I need to override take_step, right? It's not enough to define a step that returns AgentOutput -- or would that work too?
I think you need to override take_step, yeah -- or override run_agent_step such that it never calls take_step
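Roughly like this, if it helps (a sketch only -- double-check take_step's exact signature against your llama-index version before copying):

```python
from llama_index.core.agent.workflow import AgentOutput, FunctionAgent


class NonStreamingFunctionAgent(FunctionAgent):  # hypothetical name
    """Overrides take_step to use a non-streaming LLM call so audio blocks work."""

    async def take_step(self, ctx, llm_input, tools, memory) -> AgentOutput:
        # achat_with_tools is the non-streaming counterpart of astream_chat_with_tools
        response = await self.llm.achat_with_tools(tools, chat_history=llm_input)
        tool_calls = self.llm.get_tool_calls_from_response(
            response, error_on_no_tool_call=False
        )
        return AgentOutput(
            response=response.message,
            tool_calls=tool_calls or [],
            raw=response.raw,
            current_agent_name=self.name,
        )
```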