Building autonomous agents that respond in near real-time is a challenge. When we launched our first version, average round-trip latency was hovering around 3.4 seconds. This was acceptable for asynchronous tasks but jarring for chat interfaces.
This post details how we identified bottlenecks and re-architected our event loop to bring latency down to sub-800ms.
The Latency Problem
Our initial diagnostics showed that LLM inference itself accounted for only about 40% of the delay. The remaining 60% was overhead:
- Serial tool execution
- Redundant context loading
- Database round-trips for session state
Architecture Changes
We moved from a serial execution model to a speculative parallel execution model. Instead of waiting for the LLM to request a tool, we now start pre-fetching relevant context based on the user's intent classification.
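To make that concrete, here is a minimal sketch of the speculative pattern. The helper names (classifyIntent, prefetchForIntent, callLLM) are illustrative placeholders rather than our actual internals; the key point is that the pre-fetch is started before the model asks for anything and only awaited later if it turns out to be needed.

async function handleWithSpeculation(message) {
  // Cheap intent classification up front (a small model or a heuristic).
  const intent = await classifyIntent(message.content);

  // Kick off the pre-fetch for the predicted intent, but do not await it yet.
  // Swallow errors here; a failed speculation is handled below.
  const prefetchPromise = prefetchForIntent(intent, message).catch(() => null);

  // The main LLM call starts immediately, so inference and pre-fetch overlap.
  const draft = await callLLM({ message, intent });

  if (draft.needsContext) {
    // Usually resolved already; fall back to a direct fetch if speculation failed.
    const context = (await prefetchPromise) ?? (await prefetchForIntent(intent, message));
    return callLLM({ message, intent, context });
  }

  // Wrong speculation wastes a little work but never adds latency to this path.
  return draft;
}

The trade-off is the occasional wasted fetch when the intent classifier guesses wrong, which is why the classification step has to be much cheaper than the work it speculates about.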
Code Optimization
Here is a simplified example of how we parallelize context gathering using standard JavaScript Promises.
async function handleUserMessage(message) {
  // Old Serial Way
  // const user = await db.getUser(message.userId);
  // const history = await db.getHistory(message.sessionId);
  // const context = await vecDB.search(message.content);

  // New Parallel Way
  const [user, history, context] = await Promise.all([
    db.getUser(message.userId),
    db.getHistory(message.sessionId),
    vecDB.search(message.content)
  ]);

  return generateResponse({ user, history, context, message });
}
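One caveat with Promise.all: it rejects as soon as any of the lookups fails. If partial context were acceptable, Promise.allSettled would be the more forgiving option.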
This simple change saved us approximately 400ms per request.
Results
After deploying these changes, along with moving our vector database to the edge, P99 latency improved significantly, and average round-trip latency dropped from roughly 3.4 seconds to under 800ms.
We are continuing to optimize our stack. Next up: experimenting with smaller, specialized models for routing to further reduce inference cost and latency.