Flowlines is a behavioral observability and structured memory platform for production AI agents.
Your logs see tokens.
→ we_see_behavior.
Execution traces measure one call. Flowlines measures behavior across sessions, across users, across the weeks it takes a failure mode to emerge, and surfaces the exact memory fix that prevents the next occurrence. Evidence-backed. Reversible.
"/api/search: 60 req/min per user. Add tests. Commit when passing."
"feat: rate limit /api/search."
"Ready for review."
Three channels,
one closed_loop().
Traces stream in. Signals fire on behavior. Memory is extracted, reviewed, and injected back into future calls. Each layer is observable, each write is reversible, each decision carries its evidence.
Every call, captured.
Two-line SDK. Every prompt, tool call, memory read, and decision becomes a node in a structured execution graph: replayable, searchable, joined on user and session.
Behavior, decoded.
Every trace is scored across 12 failure modes. Patterns that correlate with missing memory fields get surfaced with statistical evidence.
past_escalations
Context that persists.
Approve the fix and Flowlines extracts typed, scoped, versioned memory and injects it into future calls. Reversible.
One trace →
fix_approved()
A real session, end-to-end: raw trace, session build, signal fire, statistical evidence, drafted fix, approved write. Click a stage to inspect it, or let it auto-advance.
Flowlines will catch it.
The scope
behavioral ⊃ execution ∪ memory
Execution observability sees inside one call. Memory stores give the agent a place to write things down. The behavioral layer is the one that sees the pattern and names the fix. We run alongside both on day one.
What ships in the box.
Nine building blocks across trace, signal, and memory. Each one instrumented, queryable, and independently replaceable.
Two-line instrumentation
Wrap your agent entrypoint. Every call, tool, and memory read is captured with full context and cost.
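The wrap-the-entrypoint pattern can be sketched in a few lines of Python. This is a minimal mock, not the published SDK: the decorator name, the span shape, and the in-memory trace log (standing in for the Flowlines ingestion endpoint) are all assumptions for illustration.

```python
import functools
import time
import uuid

TRACE_LOG = []  # stand-in for the real ingestion endpoint


def trace(agent_fn):
    """Hypothetical wrapper: record one span per agent call."""
    @functools.wraps(agent_fn)
    def wrapper(*args, **kwargs):
        span = {
            "id": str(uuid.uuid4()),
            "fn": agent_fn.__name__,
            "start": time.time(),
        }
        result = agent_fn(*args, **kwargs)
        span["end"] = time.time()
        span["result"] = result
        TRACE_LOG.append(span)
        return result
    return wrapper


@trace
def answer(question: str) -> str:
    # stand-in for a real agent loop that calls an LLM
    return f"echo: {question}"


answer("hello")
print(len(TRACE_LOG))  # one span captured
```

The decorator is the "two lines" in spirit: one import, one `@trace` above the entrypoint; everything inside the call is captured without touching agent logic.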
Session replay with branching
Step through a conversation turn-by-turn. Branch from any span, swap the model or memory, see what would have happened.
Cost attribution to the span
Dollars per session, per user, per cohort, per failure mode. Join on anything you trace.
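Joining cost on anything you trace reduces to a group-by over span attributes. A toy sketch, with invented span fields and costs:

```python
from collections import defaultdict

# Invented example spans; real spans would carry many more attributes.
spans = [
    {"user": "u1", "cohort": "free", "cost_usd": 0.004},
    {"user": "u1", "cohort": "free", "cost_usd": 0.010},
    {"user": "u2", "cohort": "pro",  "cost_usd": 0.021},
]


def cost_by(spans, key):
    """Sum dollar cost grouped by any traced attribute."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost_usd"]
    return dict(totals)


print(cost_by(spans, "user"))
print(cost_by(spans, "cohort"))
```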
12 failure modes out of the box
Drift, frustration, context loss, repetition, hallucination, constraint violation, abandonment, plus your own.
Evidence-backed correlation
Every alert comes with the population it's based on, the confidence interval, and the correlated memory gap.
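To make "population, confidence interval, correlated memory gap" concrete, here is one way such an alert could be computed. The counts are invented, and a 95% Wald interval is used purely as an illustration; nothing here claims to be the statistics Flowlines actually runs.

```python
import math

# Invented populations: sessions with vs. without a `past_escalations` field.
with_field = {"n": 180, "failures": 9}
without_field = {"n": 140, "failures": 35}


def rate_and_ci(group, z=1.96):
    """Failure rate with a 95% Wald confidence interval."""
    p = group["failures"] / group["n"]
    half = z * math.sqrt(p * (1 - p) / group["n"])
    return p, (p - half, p + half)


p_with, ci_with = rate_and_ci(with_field)
p_without, ci_without = rate_and_ci(without_field)
print(f"with field:    {p_with:.1%}, CI ({ci_with[0]:.3f}, {ci_with[1]:.3f})")
print(f"without field: {p_without:.1%}, CI ({ci_without[0]:.3f}, {ci_without[1]:.3f})")
```

When the two intervals don't overlap, the missing field is a credible correlate of the failure mode rather than noise.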
Slack alerts, grouped by cohort
Pipe behavioral alerts into your team channel. Group by cohort, rate-limit by severity, digest the rest.
Typed, scoped, versioned fields
Define the schema once: preferences, constraints, task state, recurring patterns. Scope by user, session, or cohort.
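A typed, scoped, versioned field might look like the following sketch. The class and field names are illustrative, not the Flowlines schema; the point is that versions are immutable, so every write is reversible by construction.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal


@dataclass(frozen=True)
class MemoryField:
    key: str
    value: str
    scope: Literal["user", "session", "cohort"]
    scope_id: str
    version: int
    written_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def bump(self, new_value: str) -> "MemoryField":
        """Return a new version; the old one stays around for rollback."""
        return MemoryField(
            self.key, new_value, self.scope, self.scope_id, self.version + 1
        )


pref = MemoryField("tone_preference", "concise", "user", "u42", 1)
pref_v2 = pref.bump("detailed")
print(pref_v2.version)  # 2; pref is untouched at version 1
```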
Reviewed before they land
Every write shows the draft diff, the evidence behind it, and the expected impact. Approve, reject, or let the auto-merge rules handle it.
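An auto-merge rule like the one described could be as simple as a threshold gate: strong evidence lands automatically, everything else queues for a human. The thresholds and field names below are invented for illustration.

```python
def route_write(draft: dict) -> str:
    """Route a drafted memory write: auto-merge or hold for review.

    Thresholds are illustrative, not Flowlines defaults.
    """
    strong_evidence = (
        draft["sessions_observed"] >= 100 and draft["p_value"] < 0.01
    )
    if strong_evidence and draft["reversible"]:
        return "auto_merge"
    return "needs_review"


print(route_write(
    {"sessions_observed": 240, "p_value": 0.003, "reversible": True}
))  # auto_merge
print(route_write(
    {"sessions_observed": 40, "p_value": 0.04, "reversible": True}
))  # needs_review
```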
Every field, traceable
Click any memory field in any call. See the interactions that produced it, the signals that justified it, the engineer who approved it.
Where behavioral matters most.
Any agent that serves the same users over time benefits. These are the production domains where cross-session behavior and structured memory compound into real economic outcomes.
Catch the recurring vulnerability before the sixth time you ship it.
SQL injection, hardcoded secrets, unscoped permissions, learned across the developer's history, fixed in their style.
Detect the third contact before your customer gives up.
Repeat contacts, declining sentiment, resolution failure. The missing field surfaces before the escalation.
See frustration in the language before you see it in the churn.
Engagement drops, avoidance patterns, topic abandonment, correlated to the exact field that would reduce it.
Typed memory with full provenance, for regulated environments.
Every write traceable to the interaction that produced it. PII redaction by default. Self-hostable on your cloud.
Things people ask.
Short answers to the questions we hear most. For anything deeper, book a 20-minute call with the founders. The calendar is in your welcome email.
Is Flowlines a replacement for LangSmith?
Not on day one. LangSmith instruments one call; Flowlines correlates behavior across thousands. We run alongside. As you grow, the behavioral layer becomes primary and per-call traces fold into it.
How is this different from Mem0, Zep, or Letta?
Memory stores are the filing cabinet. Flowlines is the archivist: reading every interaction, deciding what's worth filing, and showing the evidence behind every write before you approve it.
What languages and frameworks do you support?
Node.js and Python SDKs today. Framework-agnostic: raw API calls, LangChain, LlamaIndex, LangGraph, CrewAI, Vercel AI, custom agent loops. Anything that talks to an LLM.
Do you store prompts and responses?
Yes, encrypted at rest in the region you select. PII redaction is on by default for regulated use cases. Enterprise plans support self-hosted deployment entirely inside your cloud.
How long until we see signal?
Individual traces appear in seconds. Cross-session behavioral patterns with statistical evidence typically surface after 100–200 sessions per cohort.
What will this cost us?
Free during early access, including production. When we launch paid tiers, pricing is per-trace with volume bands; memory writes are always free. Design partners get grandfathered rates for 24 months.
Your agent is learning.
Make sure it's learning_the_right_thing().
We onboard a small cohort each week with a strong bias for teams running production agents with real users. If that's you, expect a reply within 24 hours.