Built for variable outputs
Semantic equivalence and structural assertions instead of brittle string matches.
Describe a real user journey and we'll watch it across every release. Here's how that looks for the most common AI app shapes.
Standard QA tooling assumes deterministic output and stable UI. AI products break that assumption at every layer, and the bugs that ship are the ones nobody saw coming.
Same prompt, different responses. Snapshot diffs are useless - you need semantic and journey-level checks.
A prompt tweak silently breaks a downstream tool call or a refusal pattern. Standard E2E suites miss it entirely.
Latency, refusal rate, hallucination, tool success - these are user-facing metrics, not just internal SLOs.
Evals score the model in isolation. Production breaks at the seams: prompts, tools, retrieval, UI. Journey-level tests catch what evals can't see.
When a journey breaks, you don't get just a stack trace and a vague screenshot. You get the full evidence trail - exactly what changed since your last green run.
Natural-language journeys, quality bars that travel with each release, and evidence you can act on the moment something breaks.

Drop in your critical AI flows - for example: "user asks for a refund and the agent processes it through the support tool." No DSL, no SDK.

Define what "good" looks like: tool calls, refusals, latency budget, semantic match. We verify outcomes, not exact wording.

TesterArmy executes against staging or prod previews on each deploy. GitHub App or webhook from any CI pipeline.
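From a generic CI pipeline, the webhook hookup could be as small as the sketch below. The endpoint, payload fields, and environment variable names here are illustrative assumptions, not the documented API.

```typescript
// Minimal sketch of kicking off a TesterArmy run from CI after a deploy.
// The webhook URL, payload fields, and env vars are illustrative assumptions.
const res = await fetch(process.env.TESTERARMY_WEBHOOK_URL!, {
  method: "POST",
  headers: {
    "content-type": "application/json",
    authorization: `Bearer ${process.env.TESTERARMY_TOKEN}`,
  },
  body: JSON.stringify({
    environment: "staging",           // or the preview URL for this deploy
    targetUrl: process.env.PREVIEW_URL,
    commit: process.env.GITHUB_SHA,   // ties results back to the release under test
  }),
});
if (!res.ok) throw new Error(`TesterArmy run failed to start: ${res.status}`);
```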

Query returns the correct filtered items and pagination controls work as expected
Recordings, transcripts, tool traces, and a clean diff against your last green release. Triage in minutes, not hours.
Yes. Journeys can include multi-turn agent steps and tool calls. We assert on outcomes (did the agent call the right tool with the right arguments and reach the right end state) rather than on exact wording.
Semantic checks plus structural assertions on tool calls, plus quality-signal thresholds: latency, refusal rate, format validity, citation presence. Variability in phrasing won't break tests; broken behavior will.
Yes - via the TesterArmy GitHub App or Vercel preview integration. You get pass/fail check statuses and report links right on each pull request.
Full session recording, complete LLM transcript with token-level diffs against the last green run, tool-call trace, latency and cost deltas, and the failing journey step with a one-click rerun.
No, we complement them. Evals are unit-level and score the model in isolation. TesterArmy is journey-level QA - it catches issues that only show up when prompts, tools, retrieval, and UI run together.
No. We're functional QA - does the journey work for real users? Red-teaming is complementary and we integrate cleanly with red-team tooling.
Assertions target user-visible outcomes, not exact wording. A model upgrade that improves answers won't break your suite. One that breaks a tool call or violates a refusal pattern will.
Web today, with mobile coverage available. The same natural-language journey definitions apply across platforms.
A prompt tweak. A model upgrade. A retrieval index drift. The eval still passes; the user journey breaks. TesterArmy is the safety net you wire in once and trust on every release.
Follow real user paths across auth, agent turns, tool calls, and outcomes.
Recordings, transcripts, and clean diffs against your last green run.