Field Notes · Strategy · Feb 12, 2026

Optimizing Latency in Real-time Agents

Dev Team

Building autonomous agents that respond in near real-time is a challenge. When we launched our first version, average round-trip latency was hovering around 3.4 seconds. This was acceptable for asynchronous tasks but jarring for chat interfaces.

This post details how we identified bottlenecks and re-architected our event loop to bring latency down to sub-800ms.

The Latency Problem

Our initial diagnostics showed that LLM inference itself accounted for only about 40% of the delay. The remaining 60% was overhead (a sketch of how this kind of per-stage breakdown can be gathered follows the list):

  • Serial tool execution
  • Redundant context loading
  • Database round-trips for session state
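For readers who want to run a similar diagnosis, here is a minimal timing wrapper, kept deliberately generic; the stage names and the reportMetric hook are illustrative placeholders, not our production telemetry.


// Minimal per-stage timing sketch. reportMetric is a hypothetical hook;
// substitute whatever logging or metrics sink you use.
async function timed(stage, fn) {
    const start = Date.now();
    try {
        return await fn();
    } finally {
        reportMetric(stage, Date.now() - start); // elapsed milliseconds
    }
}

// Example: wrap each stage of a request to see where the time goes.
// const user = await timed('db.getUser', () => db.getUser(id));
// const reply = await timed('llm.inference', () => llm.complete(prompt));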

Architecture Changes

We moved from a serial execution model to a speculative parallel execution model. Instead of waiting for the LLM to request a tool, we now start pre-fetching relevant context based on the user's intent classification.
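As a rough illustration of that speculative pre-fetch, here is a sketch; classifyIntent, prefetchersFor, and the prefetcher objects are hypothetical stand-ins for our internal components, not a literal excerpt from our codebase.


// Sketch of speculative pre-fetching keyed on intent classification.
// classifyIntent and prefetchersFor are illustrative placeholders.
async function speculativePrefetch(message) {
    // A cheap intent classifier runs before the main LLM call.
    const intent = await classifyIntent(message.content);

    // Start the context fetches this intent is likely to need, but don't
    // await them yet; the LLM call proceeds in parallel.
    const pending = new Map();
    for (const fetcher of prefetchersFor(intent)) {
        pending.set(fetcher.name, fetcher.run(message));
    }
    return pending;
}

// If the LLM later requests a tool, its input is often already in flight:
// const docs = await pending.get('vectorSearch');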

Code Optimization

Here is a simplified example of how we parallelize context gathering using standard JavaScript Promises.


async function handleUserMessage(message) {
    // Old Serial Way
    // const user = await db.getUser(message.userId);
    // const history = await db.getHistory(message.sessionId);
    // const context = await vecDB.search(message.content);

    // New Parallel Way
    const [user, history, context] = await Promise.all([
        db.getUser(message.userId),
        db.getHistory(message.sessionId),
        vecDB.search(message.content)
    ]);

    return generateResponse({ user, history, context, message });
}

This simple change saved us approximately 400ms per request.
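One caveat with this pattern: Promise.all rejects as soon as any of the lookups fails, taking the whole request with it. If some of the context were optional, a Promise.allSettled variant, sketched below with the same db and vecDB clients, would let the request proceed with whatever resolved.


// Fault-tolerant variant: optional lookups degrade to null instead of
// failing the whole request. Sketch only; error handling will vary.
async function handleUserMessageTolerant(message) {
    const results = await Promise.allSettled([
        db.getUser(message.userId),
        db.getHistory(message.sessionId),
        vecDB.search(message.content)
    ]);
    const [user, history, context] = results.map(r =>
        r.status === 'fulfilled' ? r.value : null
    );

    // The user record is still required; history and context are optional.
    if (user === null) throw new Error('user lookup failed');
    return generateResponse({ user, history, context, message });
}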

Results

After deploying these changes, along with moving our vector database to the edge, we observed a significant improvement in P99 latency.

  • Before: 3.4s
  • After: 0.8s
  • Improvement: 4.2x

We are continuing to optimize our stack. Next up: experimenting with smaller, specialized models for routing to further reduce inference cost and latency.