
How to detect agent drift in production

Agent drift is the failure mode every AI team talks about and nobody measures. Here is how to detect it, which signals matter, and how structured memory stops it.

Most teams building production AI agents can describe agent drift in a single sentence: the agent starts aligned with its instructions and gradually isn't. By turn ten, it has dropped a constraint. By session three, it has forgotten a user preference. By week two, it is recommending things the system prompt explicitly disallowed.

Everyone who runs agents in production has seen it. Very few have instrumented for it.

This post is about what agent drift actually looks like in traces, which signals reliably indicate it, and how to build a detection pipeline that catches it before your users do.

What agent drift actually is

Agent drift is the gradual divergence of agent behavior from the system's original instructions, constraints, or learned user state. It is not one failure. It is a family of related ones.

Concretely, it shows up as:

  • Constraint drift. The agent drops a rule it was given at the start. "Never recommend exercise for post-op patients without medical clearance" works for three turns, then quietly stops applying.
  • Tone drift. The agent starts formal, the user nudges casual, and by turn eight the agent is using slang the system prompt never authorized.
  • Scope drift. The agent was instructed to help with Python. It is now writing bash, rationalizing it as "related."
  • Memory drift. The agent retrieved the user's stated preference on turn 2, then generated a response on turn 9 that contradicts it.
  • Persona drift. The agent's self-description changes mid-conversation. "I am a tutoring assistant" quietly becomes "I am an AI language model" after a tricky question.

Each of these is structurally the same problem: the agent's output is a function of the current prompt, and the current prompt does not carry enough history to enforce what was true earlier.

Why traditional observability misses it

If you look at agent drift through the lens of a conventional observability tool (LangSmith, Langfuse, Basalt, generic OpenTelemetry), you see nothing interesting. Latency is fine. Token counts are normal. No errors. The model returned a response. The span graph is green.

This is because drift is not an infrastructure failure. It is a behavioral one. The agent is technically working. It just is not doing what you wanted.

Traditional observability answers three questions:

  • Did the call succeed?
  • How long did it take?
  • How much did it cost?

None of those questions catch a tutor that started drilling the student twenty minutes after being told not to. The request succeeded. It was fast. It was cheap. The student was also miserable.

To detect drift, you need observability that answers a fourth question: is the agent still behaving consistently with its state? That is a behavioral question, not an infrastructure one, and it requires a different data model. Silent failures don't crash. They just slowly corrupt the user experience while your dashboards stay green.

The four signals of drift

In practice, drift shows up in four measurable ways. You can detect at least three of them with data you already have in your traces.

1. Constraint violation

This is the cleanest signal. You extract constraints from the system prompt or from explicit memory ("user is vegetarian," "never discuss pricing," "always verify identity"). Then for each agent turn, you check whether the output violates any of them.

The detection is a classifier run over the agent output: given these constraints, did this response violate any of them? Use a smaller model like Haiku, gpt-4o-mini, or a fine-tuned classifier to keep the cost down. Run it async so it does not block the response.
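
To make that concrete, here is a minimal sketch of a classify_violations helper, assuming the OpenAI Python SDK with gpt-4o-mini as the judge. The prompt wording and the return convention (fraction of constraints violated) are illustrative choices, not part of any fixed API.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def classify_violations(agent_output: str, constraints: list[str]) -> float:
    # Score = fraction of known constraints the response violates
    if not constraints:
        return 0.0
    constraint_list = "\n".join(f"- {c}" for c in constraints)
    prompt = (
        "You check whether an assistant response violates any of these constraints.\n"
        f"Constraints:\n{constraint_list}\n\n"
        f"Response:\n{agent_output}\n\n"
        'Reply with JSON: {"violated": [<indices of violated constraints>]}'
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    violated = json.loads(result.choices[0].message.content).get("violated", [])
    return len(violated) / len(constraints)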

Constraint violation is the signal that maps most cleanly to "the agent forgot what it was told."

2. User correction pressure

Users do not file a bug when an agent drifts. They push back. They restate what they already said. They get terser. They rewrite the agent's response back at it in a more acceptable form.

These are detectable patterns in the user's side of the conversation:

  • Repetition of previously-stated information ("As I said before...")
  • Tone escalation, measurable as a sentiment delta across turns
  • Explicit corrections ("No, I meant...", "That's wrong")
  • Shorter replies, a collapse in engagement length

Correction pressure is a lagging indicator. By the time you see it, the drift already happened. But it is also the cheapest signal to compute, because it does not require any LLM call. It is text features on the user turns you already have.
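
A rough sketch of detect_correction_patterns using nothing but text features; the phrase list, the weights, and the assumption that previous_turns holds the user's prior messages as strings are all illustrative.

import re

# Illustrative phrase list; tune for your domain
CORRECTION_PATTERNS = [
    r"\bas i (said|mentioned) (before|already)\b",
    r"\bno, i meant\b",
    r"\bthat'?s (wrong|not what i)\b",
    r"\bi already told you\b",
]

def detect_correction_patterns(user_input: str, previous_turns: list[str]) -> float:
    # previous_turns: assumed to be the user's earlier messages as plain strings
    text = user_input.lower()
    score = 0.0
    # Explicit corrections and restatements of earlier information
    for pattern in CORRECTION_PATTERNS:
        if re.search(pattern, text):
            score += 0.4
    # Engagement collapse: current reply much shorter than the user's average
    lengths = [len(t) for t in previous_turns if t]
    if lengths and len(user_input) < 0.3 * (sum(lengths) / len(lengths)):
        score += 0.3
    return min(score, 1.0)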

3. Output delta from anchor

This is the most powerful signal and the hardest to implement correctly.

At the start of the session, anchor the agent's behavior: take the first three or four turns and summarize them along the axes you care about (tone, scope, role, constraints applied). That summary is your anchor. For every subsequent agent turn, generate the same summary and compare.

If the agent's tone delta exceeds threshold, or the scope has expanded, or the role description has shifted, you have drift. The threshold is noisy and you will tune it, but the signal is real and it catches slow drift that no single-turn check can see.

The trick is not to compare against the very first turn (too rigid) or the immediately previous turn (too permissive). Compare against a rolling anchor of the first N healthy turns.
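
One way to sketch the comparison, under a couple of assumptions: the anchor is a dict mapping each axis to a short label, built from the first few healthy turns, and summarize_axes is a hypothetical helper that produces the same dict for a single turn via a small judge model. Exact label matching is crude; an embedding distance or a judge-model comparison works just as well.

import json
from openai import OpenAI

client = OpenAI()
AXES = ["tone", "scope", "role", "constraints_applied"]  # illustrative axes

def summarize_axes(text: str) -> dict:
    # Hypothetical helper: judge-model summary of one turn along the anchor axes
    prompt = (
        f"Summarize this assistant output along these axes: {', '.join(AXES)}.\n"
        "Reply with a JSON object mapping each axis to a short label.\n\n" + text
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)

def compare_to_anchor(agent_output: str, anchor_summary: dict) -> float:
    if not anchor_summary:
        return 0.0
    current = summarize_axes(agent_output)
    axes = list(anchor_summary.keys())
    # Score = fraction of axes where the current turn disagrees with the anchor
    mismatches = sum(1 for a in axes if current.get(a) != anchor_summary.get(a))
    return mismatches / len(axes)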

4. Memory injection miss

If you are already injecting structured memory into prompts, you have a fourth signal for free: did the agent use the memory field that was injected?

This sounds obvious but is rarely checked. You inject learning_preferences: visual, step-by-step on turn 4. Did the agent actually produce a visual, step-by-step response? Or did it ignore the field and revert to whatever its default behavior was?

Memory injection miss is detected by running a classifier over the agent output asking: given this injected memory, did the response use it? If not, either the injection failed, the model ignored it, or the prompt is drowning it out. All three are fixable, but you have to detect them first.
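
A sketch of check_memory_usage, following the same judge-model pattern as the constraint classifier; the USED/IGNORED prompt and the binary score are illustrative.

from openai import OpenAI

client = OpenAI()

def check_memory_usage(agent_output: str, injected_memory: dict) -> float:
    # Returns 1.0 when the injected memory was ignored (the miss fires),
    # 0.0 when the response reflects it
    if not injected_memory:
        return 0.0
    fields = "\n".join(f"{k}: {v}" for k, v in injected_memory.items())
    prompt = (
        "These memory fields were injected into the agent's prompt:\n"
        f"{fields}\n\nAgent response:\n{agent_output}\n\n"
        "Does the response actually make use of these fields? Answer USED or IGNORED."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return 1.0 if "IGNORED" in result.choices[0].message.content.upper() else 0.0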

A detection framework you can actually ship

Here is what a minimal drift detection pipeline looks like. It runs async against your trace stream. It assumes you already have traces with user turns, agent turns, and any structured memory you have injected.

import flowlines

flowlines.init(api_key="fl_...")

# Constraint violation: classifier over agent output vs. known constraints
flowlines.signals.register(
    "constraint_violation",
    detect=lambda turn, ctx: classify_violations(
        turn.agent_output,
        ctx.constraints,
    ),
    severity="danger",
)

# Correction pressure: text features on the user side, no LLM call
flowlines.signals.register(
    "correction_pressure",
    detect=lambda turn, ctx: detect_correction_patterns(
        turn.user_input,
        ctx.previous_turns,
    ),
    severity="warning",
)

# Output delta: compare current turn summary to rolling anchor
flowlines.signals.register(
    "output_delta",
    detect=lambda turn, ctx: compare_to_anchor(
        turn.agent_output,
        ctx.anchor_summary,
    ),
    severity="warning",
)

# Memory injection miss: did the agent actually use what was injected?
flowlines.signals.register(
    "memory_injection_miss",
    detect=lambda turn, ctx: check_memory_usage(
        turn.agent_output,
        ctx.injected_memory,
    ),
    severity="info",
)

Each signal takes a turn and a context object and returns a score. When the score crosses a threshold, the signal fires. Fired signals get attached to the session, correlated across sessions, and surfaced in a dashboard.

Flowlines does this automatically on every trace. If you are not using Flowlines, you can build the same thing with any observability stack and a classifier model. The hard part is not the detection logic, it is connecting detection to memory so the fix can be automated. LLMs are stateless, but systems shouldn't be.

Correlate signals with missing memory fields

A single drift signal is information. Correlated drift signals are a diagnosis.

The highest-leverage move after detection is to correlate drift signals with memory field coverage. For every session that fired a drift signal, check which memory fields were missing or stale. For every session that did not fire a drift signal, check the same thing. Compute the lift.

You will often find one or two memory fields that are extremely predictive. "Sessions without avoidance_patterns had constraint violations 34% of the time. Sessions with it had them 11% of the time." That is a 23-point reduction waiting to be captured.
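
The arithmetic is simple enough to sketch in a few lines, assuming each session record carries the memory fields that were present and the signals that fired; the record shape, field names, and the all_sessions placeholder are illustrative.

def violation_lift(sessions, field, signal="constraint_violation"):
    # Lift = violation rate without the field minus the rate with it
    def rate(group):
        if not group:
            return 0.0
        return sum(1 for s in group if signal in s["fired_signals"]) / len(group)
    with_field = [s for s in sessions if field in s["memory_fields"]]
    without_field = [s for s in sessions if field not in s["memory_fields"]]
    return rate(without_field) - rate(with_field)

# e.g. 0.34 - 0.11 = 0.23: a 23-point reduction from injecting the field
# (all_sessions stands in for whatever your trace store returns)
lift = violation_lift(all_sessions, "avoidance_patterns")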

This is the feedback loop that makes drift detection actually useful. Detection on its own just tells you there is a problem. Correlation tells you what to do about it: inject the field that is missing. This is the difference between memory as a write API and memory as observation. You cannot correlate what you did not capture.

What comes after detection

Detecting drift is half the problem. The other half is fixing it without waiting for a human to update the prompt.

The fix has three parts:

  • Extract the missing state. If the correlation analysis says avoidance_patterns is the gap, extract it from the existing traces. Structured memory extraction runs on the session transcripts and produces the field.
  • Inject it into the next call. Add it to the prompt at the right scope (per-user, per-session, per-turn) and make sure the agent actually uses it; a sketch of this step follows the list. If the memory-injection-miss signal fires, the injection is not landing and you need to rewrite the prompt.
  • Verify the signal drops. Rerun the correlation after the fix is live. If constraint violations dropped on sessions with the new field, the fix worked. If they did not, the field was not the right diagnosis.
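
A minimal sketch of the injection step, assuming a plain prompt-assembly helper; the "Known user state" framing, the field name, and the base_prompt placeholder are illustrative.

def inject_memory(base_system_prompt: str, memory: dict) -> str:
    # Append the extracted fields so they survive every turn, not just turn one
    if not memory:
        return base_system_prompt
    lines = "\n".join(f"- {k}: {v}" for k, v in memory.items())
    return (
        f"{base_system_prompt}\n\n"
        "Known user state (apply this on every turn):\n"
        f"{lines}"
    )

system_prompt = inject_memory(
    base_prompt,  # your existing system prompt
    {"avoidance_patterns": "no rapid-fire drilling after a wrong answer"},
)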

This is the loop: observe, detect, correlate, fix, verify. Every turn of the loop tightens the agent's behavior. Over time the rate of drift signals goes down and stays down. That is how you actually make an agent learn in production instead of relying on quarterly prompt-engineering reviews.

Start with one signal

If you are instrumenting drift detection for the first time, do not try to ship all four signals at once. Start with constraint violation.

Constraint violation has the cleanest mapping to "the agent did something it was explicitly told not to do." It has the lowest false-positive rate, the clearest action item (add a guardrail, extract the constraint into memory, or rewrite the prompt), and it is the most common form of drift in production agents.

Once constraint violation is running and you trust the signal, add correction pressure. It is free: no LLM call, just text features. After that, add output delta if you care about long conversations, and memory injection miss if you are already injecting structured memory.

What drift tells you about your agent architecture

If your agent is drifting, it is not because the model is bad. It is because the agent architecture expected the model to carry state it cannot carry.

Every drift signal is a signal about missing memory infrastructure. Constraint drift means the constraint was only in the system prompt and got buried. Memory drift means the user preference was retrieved once and then never reinforced. Persona drift means the agent's identity was stated once and then overwritten by the conversation. Intent itself drifts when there is nowhere for it to persist between turns.

The right fix is not longer prompts. It is a memory layer that persists between calls and re-injects what matters when it matters. That is what turns drift from a recurring problem into a detectable, correctable, diminishing one.

The measurement changes the behavior

Once you start measuring drift, two things happen. The first is obvious: you catch drift faster and fix it. The second is less obvious but more important: measuring drift changes how you build agents.

You stop writing system prompts that hope for the best. You start writing system prompts that assume the model will drift, and you design feedback loops to catch it. You stop treating memory as a feature and start treating it as infrastructure.

That shift is what separates agents that work in demos from agents that work in production. The demos do not need drift detection because nothing has time to drift. Production agents do. And the teams that are running drift detection are the ones whose agents are actually getting more reliable over time.

Drift is not a bug you fix once. It is a steady-state failure mode you instrument against. Once you see it, you can't unsee it. Once you catch it, you can correlate it. Once you correlate it, you can fix it. Once you fix it enough times, you have a system that compounds.
