How TrueClara detects deploy regressions: CUSUM change-point detection, baselines, and bounding false positives

A technical walkthrough of the detection mechanism: route-level value reach, baseline windows, CUSUM cumulative-sum change-point detection, sample-size gating, and how we keep the false-positive rate bounded.

TrueClara Engineering

Detection team

May 22, 2026 · 8 min read

Deploy-linked route evidence, summarized for review.

The hardest objection we hear from engineers evaluating TrueClara is not "does this UI look nice" — it is "how do I know your detection is real, and not a dashboard that cries wolf?" That is the right question to ask of anything that promises to page you when a deploy quietly breaks revenue. This post is the honest answer: the actual statistical machinery, the parameters that govern it, and the specific way we bound false positives.

We will walk through a single worked example end to end — a checkout value-route whose conversion falls from 6.2% to 4.1% after a deploy — and show exactly when, and why, TrueClara raises an observation.

What we measure: route-level value reach, not page views

TrueClara does not alert on raw traffic. Traffic counts prove activity, not outcome — a page can load fine while the flow behind it stops converting. Instead, each project marks a small number of value-routes: the routes that represent a business outcome, such as checkout success or plan upgraded. For every value-route the runtime SDK reports a rate: a numerator (sessions that reached the outcome) over a denominator (sessions that had the opportunity), bucketed by hour.

So the unit of detection is a conversion rate p = numerator / denominator per subject per hour, where a subject is either a value-route ("did sessions reach checkout success?") or an edge between two routes ("did sessions that hit /pricing go on to /signup?"). Edges matter because a regression often lives in a transition, not a page.
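As an illustration of that unit, here is a minimal sketch of one hourly bucket. The class name, field names, and subject identifiers are hypothetical, chosen for this post rather than taken from the SDK's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HourlyBucket:
    """One hour of data for one subject (hypothetical shape, for illustration)."""
    subject: str      # a value-route or an edge between two routes
    hour_start: str   # ISO-8601 start of the hour bucket
    numerator: int    # sessions that reached the outcome
    denominator: int  # sessions that had the opportunity

    def rate(self) -> Optional[float]:
        # No eligible sessions means there is no rate to report, not a rate of zero.
        if self.denominator == 0:
            return None
        return self.numerator / self.denominator


route = HourlyBucket("checkout_success", "2026-05-20T14:00:00Z", 62, 1000)
edge = HourlyBucket("/pricing -> /signup", "2026-05-20T14:00:00Z", 0, 0)
print(route.rate())  # 0.062
print(edge.rate())   # None
```

Returning None for an empty hour, rather than 0, keeps "nobody had the opportunity" distinct from "nobody converted".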

The baseline: what "normal" looked like before this deploy

A change-point detector needs a reference. Ours is a baseline computed from the hours preceding the deploy under test. Two details make the baseline trustworthy rather than naive:

Hour-of-week structure. Conversion is not stationary — Sunday 3am does not convert like Tuesday noon. The baseline stores a rate per hour-of-week, plus an overall blended rate as a fallback. When we score a current hour, we compare it against the baseline rate for that hour-of-week, so we are not fooled by ordinary daily and weekly seasonality.
A minimum-sessions floor, with a corpus prior for cold starts. A baseline built on a handful of sessions is noise. We require a minimum number of baseline sessions before a baseline is considered usable. For brand-new projects that have not yet accumulated enough history, we blend the customer's thin data toward a conservative corpus prior, with the prior's weight decaying to zero as the customer's own session count crosses the readiness threshold. A young project leans on the prior; an established one ignores it entirely.

If neither real history nor a prior is available, we return no baseline and raise nothing. Silence is the correct output when you cannot make a defensible claim.
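The baseline rules above can be sketched roughly as follows. This is an illustrative sketch, not the production code: the function names are hypothetical, and the linear decay of the prior's weight is an assumption (the text only says the weight reaches zero at the readiness threshold).

```python
from typing import Dict, Optional


def hour_of_week(weekday: int, hour: int) -> int:
    """Map (weekday with 0 = Monday, hour 0-23) to a slot in 0..167."""
    return weekday * 24 + hour


def blend_with_prior(customer_rate: float, customer_sessions: int,
                     prior_rate: Optional[float],
                     readiness_sessions: int) -> Optional[float]:
    """Blend thin customer data toward a corpus prior; None means 'no baseline'."""
    if customer_sessions >= readiness_sessions:
        return customer_rate  # established project: the prior is ignored
    if prior_rate is None:
        return None           # thin history and no prior: raise nothing
    # ASSUMPTION: linear decay of the prior's weight, hitting 0 at readiness.
    w_prior = 1.0 - customer_sessions / readiness_sessions
    return w_prior * prior_rate + (1.0 - w_prior) * customer_rate


def baseline_rate_for(slot: int, per_slot: Dict[int, float],
                      overall: Optional[float]) -> Optional[float]:
    """Prefer the hour-of-week rate; fall back to the overall blended rate."""
    return per_slot.get(slot, overall)


tuesday_noon = hour_of_week(1, 12)
print(baseline_rate_for(tuesday_noon, {tuesday_noon: 0.07}, 0.062))  # 0.07
print(blend_with_prior(0.08, 10, None, 1000))                        # None
```

A young project with a quarter of the readiness sessions would, under this assumed decay, weight the prior at 0.75 and its own data at 0.25.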

CUSUM: accumulating evidence instead of trusting one bad hour

The naive approach — "alert if this hour's rate dropped more than X%" — is exactly what produces alert fatigue. Any single hour can dip from ordinary binomial noise, especially on lower-traffic routes. You either set the threshold loose (and miss real, gradual regressions) or tight (and page on every fluctuation).

TrueClara uses CUSUM (cumulative sum), a sequential change-point method designed for precisely this problem: detecting a small, sustained shift in a stream while ignoring transient noise. The intuition is that a real regression leaves a consistent, same-direction signal hour after hour, and CUSUM accumulates that consistency.

For each hour we compute a standardized residual: how many standard deviations the observed rate sits from the baseline rate (negative when it is below), using the binomial standard deviation sqrt(p0 * (1 - p0) / n), where p0 is the baseline rate and n is the hour's session count:

z = (observed_rate - baseline_rate) / sqrt(baseline_rate * (1 - baseline_rate) / n)

Then we update the running CUSUM statistic with a slack term k that absorbs ordinary noise:

S = max(0, S_prev + (-z) - k)

(The -z is for drop detection; we run a mirrored statistic for spikes.) The slack — we use k = 0.5 standard deviations by default — is the key to noise rejection. Small fluctuations get subtracted away and S stays pinned near zero. Only when residuals are persistently negative does S climb. We raise an observation when S crosses a threshold h. Severity escalates with how far past the threshold it goes: above h is a warning, above 2h is critical.
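A minimal sketch of the update and severity rules just described, taking the residual z as given. The function names are hypothetical; the default k = 0.5 and the h / 2h severity bands come from the text.

```python
from typing import Optional


def cusum_drop_step(s_prev: float, z: float, k: float = 0.5) -> float:
    """One update of the drop-side statistic: S = max(0, S_prev + (-z) - k)."""
    return max(0.0, s_prev + (-z) - k)


def cusum_spike_step(s_prev: float, z: float, k: float = 0.5) -> float:
    """Mirrored statistic for spikes: accumulates positive residuals instead."""
    return max(0.0, s_prev + z - k)


def severity(s: float, h: float) -> Optional[str]:
    """Above h is a warning, above 2h is critical, otherwise no observation."""
    if s > 2 * h:
        return "critical"
    if s > h:
        return "warning"
    return None


# A mild dip smaller than the slack is absorbed and S stays at zero.
print(cusum_drop_step(0.0, -0.3))  # 0.0
# A strong negative residual pushes S upward.
print(cusum_drop_step(1.0, -2.0))  # 2.5
```

The max(0, ...) reset is what keeps good hours from banking credit: S cannot go negative, so a run of healthy hours before a regression does not delay detection.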

Crucially, n enters the residual. A 2-point drop on 5,000 sessions is a large, confident z; the same drop on 40 sessions is statistically unremarkable, produces a small z, and barely moves the CUSUM. Sample size is baked into the math, not bolted on afterward.
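To make the effect of n concrete, here is a small sketch that scores the same 2-point drop at the two volumes mentioned above. The baseline of 6.2% is an assumption borrowed from the worked example, and the function name is hypothetical.

```python
import math


def residual(observed_rate: float, baseline_rate: float, n: int) -> float:
    """Standardized residual using the binomial standard deviation at the baseline."""
    sd = math.sqrt(baseline_rate * (1.0 - baseline_rate) / n)
    return (observed_rate - baseline_rate) / sd


p0 = 0.062  # assumed baseline rate, for illustration only
z_large = residual(0.042, p0, 5000)  # 2-point drop on 5,000 sessions
z_small = residual(0.042, p0, 40)    # same drop on 40 sessions
print(round(z_large, 2), round(z_small, 2))  # -5.86 -0.52
```

With the default slack of 0.5, the 40-session hour adds only about 0.02 to the CUSUM, while the 5,000-session hour adds more than 5.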

A worked example

Take a checkout value-route with a baseline conversion of 6.2% (p0 = 0.062), built from the prior week and comfortably over the baseline-sessions floor. A deploy ships. Over the next several hours the route runs roughly 1,000 eligible sessions per hour at a new rate of about 4.1% (p = 0.041).

The per-hour standard deviation at n = 1000 is sqrt(0.062 * 0.938 / 1000) ≈ 0.00763. The residual each hour is:

z = (0.041 - 0.062) / 0.00763 ≈ -2.75

Now accumulate, with slack k = 0.5:

Hour 1: S = max(0, 0 + 2.75 - 0.5) = 2.25
Hour 2: S = max(0, 2.25 + 2.75 - 0.5) = 4.50
Hour 3: S = max(0, 4.50 + 2.75 - 0.5) = 6.75

If the calibrated threshold for this route is h ≈ 4.7, the statistic is still below it after hour 2 (4.50) and crosses it at hour 3 (6.75). TrueClara opens an observation naming the checkout value-route, the suspected deploy SHA, the baseline rate (6.2%), the current weighted rate (~4.1%), the sessions observed, and the CUSUM value. Because the drop is large relative to the per-hour noise, it crosses within three hours rather than dragging on. A smaller, genuine 0.5-point drift would still be caught; it would simply take more hours to accumulate past h, which is exactly the behavior you want from a detector that is supposed to catch slow bleeds without screaming at jitter.
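The arithmetic of this example can be replayed with a short sketch. It assumes the statistic is evaluated once per hourly bucket and that every hour produces exactly the same residual (no noise), which is an idealization; the helper names are hypothetical.

```python
import math


def cusum_trace(z_values, k=0.5):
    """Run the drop-side CUSUM over a sequence of residuals; return S after each hour."""
    s, trace = 0.0, []
    for z in z_values:
        s = max(0.0, s - z - k)
        trace.append(s)
    return trace


def first_crossing(trace, h):
    """1-based hour at which S first exceeds h, or None if it never does."""
    for hour, s in enumerate(trace, start=1):
        if s > h:
            return hour
    return None


# The worked example: z of about -2.75 every hour, threshold h = 4.7.
trace = cusum_trace([-2.75] * 3)
print(trace)                       # [2.25, 4.5, 6.75]
print(first_crossing(trace, 4.7))  # 3

# The slower bleed: 6.2% drifting to 5.7% at the same 1,000 sessions per hour.
sd = math.sqrt(0.062 * 0.938 / 1000)
slow_z = (0.057 - 0.062) / sd
slow_hours = first_crossing(cusum_trace([slow_z] * 60), 4.7)
print(slow_hours)
```

In this noise-free idealization the 0.5-point drift adds only about 0.16 to S per hour, so it needs on the order of thirty hourly updates to pass the same threshold.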

Compare this to the simple-threshold approach: the 6.2% → 4.1% drop here (about 34% relative) would trip a percentage alarm, but so would a single noisy hour on a low-traffic route. CUSUM only fires when the drop is large enough, sustained enough, and backed by enough samples.

Bounding false positives: calibrating the threshold by simulation

Here is the part that turns this from "a formula" into "a detector you can trust." The threshold h is not a magic number we picked. It is calibrated to a target false-alarm rate per subject.

We pick a target — 0.05 by default, i.e. we are willing to accept a 5% chance that a perfectly healthy deploy window trips the alarm by chance. Then, for each subject, we run a Monte Carlo simulation: thousands of synthetic deploy windows generated from the route's own baseline rate and traffic volume, with no real regression injected — pure binomial noise. For each trial we run the exact same CUSUM and record the maximum statistic it reaches. We then take the 95th percentile of those maxima and set h to it.

By construction, only about 5% of genuinely-healthy windows produce a CUSUM that high. The threshold is tuned to each route's actual baseline rate and sample volume, so a high-traffic checkout route and a low-traffic upgrade route get different thresholds appropriate to their noise floors. This is what stops false positives from being a tuning afterthought — the false-positive budget is the input, and the threshold is derived from it.
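The calibration loop can be sketched with nothing but the standard library. This is an illustrative sketch under stated assumptions, not the production calibrator: the window length, trial count, per-hour volume, and function names are all hypothetical, and binomial draws are done by summing Bernoulli trials for portability.

```python
import math
import random


def simulate_max_cusum(p0, n_per_hour, hours, k, rng):
    """One healthy synthetic window: pure binomial noise, no regression injected."""
    sd = math.sqrt(p0 * (1.0 - p0) / n_per_hour)
    s = s_max = 0.0
    for _ in range(hours):
        successes = sum(rng.random() < p0 for _ in range(n_per_hour))
        z = (successes / n_per_hour - p0) / sd
        s = max(0.0, s - z - k)  # same drop-side CUSUM as the live detector
        s_max = max(s_max, s)
    return s_max


def calibrate_threshold(p0, n_per_hour, hours, target_far=0.05,
                        k=0.5, trials=2000, seed=0):
    """Set h to the (1 - target_far) quantile of the per-window CUSUM maxima."""
    rng = random.Random(seed)
    maxima = sorted(simulate_max_cusum(p0, n_per_hour, hours, k, rng)
                    for _ in range(trials))
    idx = min(trials - 1, math.ceil((1.0 - target_far) * trials) - 1)
    return maxima[idx]


# Hypothetical subject: 6.2% baseline, 200 sessions/hour, 12-hour window.
h = calibrate_threshold(0.062, 200, 12, trials=400, seed=1)
print(h)
```

Because the simulation uses each subject's own baseline rate and volume, re-running it with different inputs yields a different h, and tightening target_far can only raise the threshold for a fixed set of simulated windows.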

Three more guards keep us honest:

Minimum current sessions. We require a floor of sessions in the deploy window before scoring at all. Below it, we wait rather than guess.
Minimum baseline sessions. Already covered — a thin baseline yields no detector.
Direction, not variance. A wider spread of outcomes is not a regression; only a sustained shift in the rate in the regression direction accumulates CUSUM. Noisier traffic (fewer sessions per hour) raises the per-hour standard deviation, which shrinks each residual, making the detector more conservative, not less.
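The two session floors can be sketched as a gate that runs before any scoring. The floor values and names below are hypothetical placeholders, not TrueClara's actual defaults.

```python
# Hypothetical floors, for illustration only.
MIN_BASELINE_SESSIONS = 1000
MIN_CURRENT_SESSIONS = 200


def scoring_decision(baseline_sessions: int, current_sessions: int,
                     has_prior: bool = False) -> str:
    """Decide whether a subject may be scored yet."""
    if baseline_sessions < MIN_BASELINE_SESSIONS and not has_prior:
        return "no_baseline"  # thin baseline and no prior: no detector at all
    if current_sessions < MIN_CURRENT_SESSIONS:
        return "wait"         # not enough post-deploy sessions: wait, do not guess
    return "score"


print(scoring_decision(50, 5000))     # no_baseline
print(scoring_decision(20000, 40))    # wait
print(scoring_decision(20000, 3000))  # score
```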

Why this is defensible

Put together, the mechanism is: measure outcome rates on routes that matter, build a seasonality-aware baseline with enough data to be real, accumulate evidence with CUSUM so single noisy hours cannot fire, gate on sample size so small-n routes cannot fire on coincidence, and calibrate the alarm threshold by simulation so the false-positive rate is a number we chose — not one we hope for.

That is why an observation in TrueClara is not "a chart dropped." It is "this route's conversion shifted by an amount, sustained over a window, on enough sessions, past a threshold tuned to a 5% false-alarm budget, attributed to this deploy." The detector is built to earn the interruption — and to stay quiet the rest of the time.

If you want to see it on your own traffic, start free and connect a project, or read how a value-route turns observations into evidence.
