Statistical Significance, Certainty & Priors in Made With Intent

Statistical significance & false positives across hundreds of intent combinations

What we report at the aggregate level

At the campaign (aggregate) level, MWI reports two Bayesian quantities:

Probability to beat control: the posterior probability that the variant's true reward (for example its conversion rate) exceeds control's, estimated by comparing random draws from the two posterior distributions. A value of, say, 80% means "given everything observed so far, there is an 80% chance this variant is genuinely better than control." This is a Bayesian statement, not a frequentist p-value.
Certainty: a 0–1 score reflecting how decisively the accumulated evidence has identified a winner (0 = pure noise, 1 = maximum confidence).

Both are computed from Bayesian posterior distributions (Beta for binary conversion outcomes, Normal for continuous outcomes like revenue), updated on every training run. A 95% credible interval here means what people intuitively expect: "there's a 95% probability the true value lies in this range."
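As an illustration of how these quantities can be derived for a binary conversion outcome, here is a minimal sketch. It assumes a uniform Beta(1, 1) prior and plain Monte Carlo sampling; the function names and the example counts are hypothetical and this is not MWI's production code.

```python
import random


def prob_to_beat_control(conv_v, n_v, conv_c, n_c, draws=20000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) under Beta posteriors."""
    rng = random.Random(seed)
    # Assumption for this sketch: a uniform Beta(1, 1) prior on each conversion rate.
    a_v, b_v = 1 + conv_v, 1 + (n_v - conv_v)
    a_c, b_c = 1 + conv_c, 1 + (n_c - conv_c)
    wins = 0
    for _ in range(draws):
        # One draw from each posterior; count how often the variant's draw is higher.
        if rng.betavariate(a_v, b_v) > rng.betavariate(a_c, b_c):
            wins += 1
    return wins / draws


def credible_interval(conversions, visitors, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval for a conversion rate, from posterior samples."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + visitors - conversions)
        for _ in range(draws)
    )
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


# Hypothetical counts: 60/1000 conversions for the variant, 40/1000 for control.
p_beat = prob_to_beat_control(60, 1000, 40, 1000)
interval = credible_interval(60, 1000)
```

With identical data in both arms the estimate sits near 50%, which is the "pure noise" end of the scale.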

Why we don't run a significance test per segment

The natural objection from an experimentation agency is: "You're slicing the audience into hundreds of intent combinations. Each slice needs its own significance. Don't you need enormous traffic to reach significance in every slice — and aren't you inflating false positives by testing so many?"

The answer turns on a distinction that often isn't appreciated: for content/segment decisions, MWI uses a contextual bandit, not a multi-armed bandit. (Campaign timing is handled by a separate "oneshot" bandit, which optimises across the whole campaign rather than per segment.)

A multi-armed bandit (MAB) only learns from the data each arm actually receives. If you split a campaign across hundreds of segments, each segment is effectively its own little experiment getting a small slice of traffic. To build significance in every slice you genuinely would need a lot of data — and if you then tested significance separately in each slice, you'd hit the multiple-comparisons problem and inflate false positives.
A contextual bandit is different. It trains a model that predicts the conversion rate for a given context (segment), and that model generalises across slices. It does not need to have observed a particular combination to estimate its conversion rate.

Tom's example: suppose we've collected plenty of data on mobile devices, on low intent, and on a particular page. The specific combination "mobile + low intent + that page" may never have occurred together. A MAB would have no estimate for it. The contextual model, having learned how mobile, intent level and page each move conversion, can predict what that cohort's conversion would be — and use that prediction to decide how to allocate traffic — without waiting for that exact combination's data to arrive. That is precisely what a contextual bandit is designed to do: you can feed in many slices and it predicts rather than waits.
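The idea of predicting an unseen combination can be sketched with a small additive model. The example below assumes a plain logistic regression over three binary features, fitted by gradient descent on toy data; the effect sizes, visitor counts and function names are all hypothetical, and the real contextual model may be quite different. The combination (mobile, low intent, that page) is deliberately left out of the training data.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical "true" effects on the log-odds scale, used only to build toy data.
TRUE_INTERCEPT = -3.0
TRUE_EFFECTS = (-0.4, -0.8, 0.5)  # mobile, low_intent, on_page


def true_rate(ctx):
    return sigmoid(TRUE_INTERCEPT + sum(e * x for e, x in zip(TRUE_EFFECTS, ctx)))


UNSEEN = (1, 1, 1)  # mobile + low intent + that page: never observed together

# Each row: (context, visitors, expected conversions). Fractional conversions
# keep the toy data noise-free so the result is deterministic.
training = [
    (ctx, 5000, 5000 * true_rate(ctx))
    for ctx in product((0, 1), repeat=3)
    if ctx != UNSEEN
]


def features(ctx):
    return (1.0,) + tuple(float(v) for v in ctx)  # leading 1.0 is the intercept


def fit(rows, steps=20000, learning_rate=1.0):
    """Logistic regression on aggregated counts via batch gradient descent."""
    weights = [0.0, 0.0, 0.0, 0.0]
    total = sum(n for _, n, _ in rows)
    for _ in range(steps):
        grad = [0.0] * 4
        for ctx, n, conversions in rows:
            x = features(ctx)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            residual = n * p - conversions  # predicted minus observed conversions
            for j in range(4):
                grad[j] += residual * x[j] / total
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


def predict(weights, ctx):
    return sigmoid(sum(w * xi for w, xi in zip(weights, features(ctx))))


weights = fit(training)
predicted_unseen = predict(weights, UNSEEN)
```

The prediction for the unseen cohort is composed from the separately learned device, intent and page effects. The sketch also shows the limit of the approach: an additive model like this one cannot capture an interaction that only appears when the three conditions coincide.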

Mechanically, the trained model feeds into a Bayesian Monte Carlo allocation method. This is what lets MWI take "priors from places that have never happened before and use them over here" — the same logic that lets the Intent model work on a new site without having seen it, because it has learned the underlying patterns of intent.

So how do we "account for false-positive inflation"?

Honest framing: we don't try to, because we don't create the problem in the first place. False-positive inflation is a consequence of running many independent significance tests: the family-wise error rate climbs with the number of tests. At α = 0.05 and with no true effects present, about 14 independent tests already give roughly a 50% chance of at least one false positive, and 100 tests yield about 5 false positives on average by chance alone. MWI does not run hundreds of per-segment significance tests. Instead:

Significance (probability to beat control, certainty) is assessed at the aggregate level, where the data is pooled and the statistic is robust.
Per-segment outcomes are governed by the model's predicted conversion, not by a per-segment significance test. The model's reliability comes from how well-calibrated and discriminating it is, not from passing a significance threshold in each slice.

It is not statistically robust to start splitting the aggregate and measuring significance across every single segment — you start getting spurious and inverse correlations that don't actually relate to anything. MWI trialled this, found it unsound, and removed it.
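The family-wise error figures quoted above follow from simple arithmetic, sketched here under the standard assumptions that the tests are independent and that no real effect exists in any of them. The function names are illustrative.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests, alpha=0.05):
    """Average number of false positives when every null hypothesis is true."""
    return num_tests * alpha


fwer_14 = family_wise_error_rate(14)        # roughly 0.51
fwer_100 = family_wise_error_rate(100)      # above 0.99
false_positives_100 = expected_false_positives(100)  # about 5
```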

The calibration angle

This is reinforced by how the underlying Intent model is evaluated. It is a probabilistic model, so it isn't judged by a single "% accurate" number or a per-slice significance test; it's judged on:

Calibration — when it says 70%, does the event happen ~70% of the time? (measured by Expected Calibration Error)
Discrimination — can it rank likely converters above unlikely ones? (measured by AUC, e.g. ~0.84)
Decisioning — precision/recall/F1 when a threshold is applied.

Well-calibrated probabilities are what make it safe to act on a prediction for a segment rather than waiting for that segment to reach independent significance.
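For readers unfamiliar with Expected Calibration Error, the sketch below shows one common way to compute it: bucket predictions into equal-width bins and take the visitor-weighted gap between the mean predicted probability and the observed rate in each bin. The bin count and function name are assumptions for illustration, not a description of MWI's evaluation pipeline.

```python
def expected_calibration_error(probs, outcomes, num_bins=10):
    """Weighted average gap between predicted probability and observed rate."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that a prediction of exactly 1.0 falls in the last bin.
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_predicted = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_predicted - observed_rate)
    return ece


# A model that says 70% and is right 7 times out of 10 is well calibrated.
well_calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
# A model that says 90% but is right only half the time is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A lower value is better; the second example scores 0.4 because every prediction overstates the observed rate by 40 percentage points.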

Uninformed vs informed priors, and per-campaign vs aggregated

Within a campaign: informed priors

Yes — within a campaign, priors are informed, and this is one of the reasons campaigns can evolve over time:

No cold start for new variants. If you add an experience half-way through a campaign, the performance of the existing variants initially informs the new one. If the average conversion rate across the running variants is ~3%, the new variant's starting estimate is ~3% as well, rather than starting with no information and waiting to gather equivalent data. As it gathers its own data, its estimate moves away from that shared starting point and tracks its own performance.
Warm-up periods. Something new gets its fair / equal share of traffic for a warm-up window before the bandit is allowed to push it down. This protects new variants from being starved before they've had a fair chance.
Net effect. This is what lets you introduce and remove variants over the life of a campaign without resetting everything — which is really what the "informed vs uninformed priors" question is getting at.

(The Agentic Campaigns design echoes this: existing variants retain their posteriors when a new experience is added; only the genuinely new variant starts fresh, and the bandit blends it in as data accumulates.)
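One way to picture an informed prior for a newly added variant is a Beta prior centred on the average rate of the running variants, with a limited weight so that the variant's own data soon dominates. The sketch below assumes exactly that; the `prior_strength` of 100 pseudo-visitors and the function names are hypothetical choices, not MWI's actual settings.

```python
def informed_beta_prior(existing_rates, prior_strength=100):
    """Beta(alpha, beta) prior centred on the mean rate of the running variants.

    prior_strength is the number of pseudo-visitors the prior is worth
    (an assumed value for this sketch).
    """
    mean_rate = sum(existing_rates) / len(existing_rates)
    return mean_rate * prior_strength, (1.0 - mean_rate) * prior_strength


def posterior_mean(prior, conversions, visitors):
    """Posterior mean conversion rate after observing the variant's own data."""
    alpha, beta = prior
    return (alpha + conversions) / (alpha + beta + visitors)


# Running variants average ~3%, so the new variant starts at ~3%.
prior = informed_beta_prior([0.028, 0.031, 0.031])
at_launch = posterior_mean(prior, conversions=0, visitors=0)

# After 2,000 of its own visitors converting at 5%, its estimate has moved
# most of the way from the shared 3% towards its own observed 5%.
after_data = posterior_mean(prior, conversions=100, visitors=2000)
```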

Across campaigns: not aggregated (yet)

Priors are not carried across campaigns:

The contextual information changes from one experience to the next, so you wouldn't reliably know what to carry over.
It isn't really necessary. Even if you seeded a new campaign with, say, the site-wide average conversion rate, the campaign would start accumulating its own real data almost immediately, so the seed would add little.
This is a deliberate "not yet" — a robust method for cross-campaign priors hasn't been settled — rather than a fundamental limitation.

This is consistent with how contexts are handled in agentic campaigns: each context keeps its own posterior, contexts are locked once a campaign is running, and reward data is not carried across a context regeneration because the audiences are no longer directly comparable.

Per-segment certainty / significance visibility

Short answer: No — certainty and significance are surfaced at the aggregate / variant level, not broken out as a per-intent-segment significance read-out, and that is a deliberate design choice rather than a gap.

What is shown: probability to beat control and a certainty score at the aggregate/variant level, plus performance ratings (Great / Good / Promising / Uncertain / Unlikely). Per segment, the system acts on the model's predicted conversion and allocates traffic accordingly. (Exactly which per-context figures are exposed in the platform UI is worth confirming with the product team before this is published.)
What is not shown: a separate statistical-significance or certainty figure for each individual intent segment.
Why not: measuring and displaying significance across every individual segment is not statistically robust. When you slice that finely you start surfacing spurious and inverse correlations — patterns that look real but don't relate to anything. MWI built a version of this, saw that it misled people (it was "what people wanted to see" but not the right thing to do), and removed it. The intention is not to reintroduce per-segment significance reporting.

The honest, defensible message for a prospect: per-segment decisions are driven by a calibrated predictive model, and we report statistical confidence where it is sound to do so (the aggregate). We intentionally avoid presenting per-segment significance because it would imply a rigour that fine-grained slicing cannot support.

Technical appendix — Contextual bandit vs multi-armed bandit

The core difference

Dimension	Multi-armed bandit (MAB)	Contextual bandit
What it optimises	The single best arm/variant overall	The best variant given the context (segment)
How it learns	Only from data each arm actually receives	Trains a model that maps context → predicted reward
Unseen combinations	No estimate until data arrives	Predicts reward for combinations never observed together
Data efficiency across segments	Splits traffic thinly across slices; each needs its own volume	Generalises across slices; shares statistical strength
Significance testing	Tempts per-segment tests → multiple-comparisons / false-positive inflation	Decisions from calibrated predictions; significance assessed in aggregate

Why a contextual bandit needs less traffic per slice

Because it generalises. A MAB treats each segment as an isolated experiment, so the data "splits out so much that everyone gets a small slice" and you need lots of data to learn anything per slice. A contextual bandit pools statistical strength: it learns how each feature (device, intent level, page, etc.) affects conversion and composes those effects to estimate any combination — including ones with little or no direct data. MWI's implementation trains the contextual model and then uses a Bayesian Monte Carlo method to draw from posteriors and allocate traffic (Thompson-sampling style: sample each variant's distribution, route to whichever samples highest, so better variants win more often while uncertain ones still get explored).
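The Thompson-sampling step described above can be sketched as follows for a single context. The Beta posteriors, variant names and visitor counts are hypothetical, and the contextual model that would normally supply the posteriors is left out for brevity.

```python
import random


def thompson_choose(posteriors, rng):
    """Sample each variant's posterior once and route to the highest sample."""
    samples = {
        name: rng.betavariate(alpha, beta)
        for name, (alpha, beta) in posteriors.items()
    }
    return max(samples, key=samples.get)


def allocation_shares(posteriors, visitors=10000, seed=0):
    """Share of traffic each variant receives over many simulated visitors."""
    rng = random.Random(seed)
    counts = {name: 0 for name in posteriors}
    for _ in range(visitors):
        counts[thompson_choose(posteriors, rng)] += 1
    return {name: count / visitors for name, count in counts.items()}


# Hypothetical Beta(alpha, beta) posteriors for one context:
# control ~3% on 1,000 visitors, variant_a ~4.5% on 1,000 visitors,
# variant_b only 50 visitors so far, hence a much wider posterior.
posteriors = {
    "control": (31, 971),
    "variant_a": (46, 956),
    "variant_b": (2, 50),
}
shares = allocation_shares(posteriors)
```

In this toy setup `variant_a` takes most of the traffic because its posterior sits highest, while `variant_b` still receives a meaningful share purely because its posterior is wide. That is the exploration behaviour the paragraph above refers to.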

Important caveat — "so you don't need much traffic" is only half true

This should not be oversold. The right framing (per Tom): yes, a contextual bandit lets you allocate sensibly at lower volumes than a naïve MAB — but volume is only one factor. It also depends on:

How many variants you're splitting between.
The effect size you're trying to detect (tiny effects need far more data).
How long the campaign runs.
How stable the underlying data and control are.

With plenty of volume, all of that "drains out" and becomes a non-issue. The failure mode is trying to use it like a MAB to detect a tiny cosmetic change (e.g. a button from blue to yellow) — that genuinely wouldn't have enough signal, but it also isn't the kind of intervention MWI is built for. Used appropriately (meaningful interventions, sensible number of variants), the traffic most of our customers' sites generate is sufficient.
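To give a feel for why effect size matters so much, the sketch below uses the classical fixed-sample approximation for comparing two proportions. This is a frequentist rule of thumb chosen only to show the scaling; it is an assumption of this example and not how MWI sizes campaigns. The 3% base rate and the lift values are hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm to detect a relative lift."""
    p1 = base_rate
    p2 = base_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


meaningful_change = visitors_per_arm(0.03, 0.20)  # a 20% relative lift
cosmetic_change = visitors_per_arm(0.03, 0.02)    # a 2% relative lift
```

Required volume grows with the inverse square of the effect, so a lift ten times smaller needs on the order of a hundred times more visitors. That is why a cosmetic tweak is a poor fit regardless of the allocation method.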

On false positives in adaptive allocation

For balance, it's worth knowing the literature: naïve MAB allocation can increase false-positive rates, particularly where there are temporal biases in when users enter the experiment, so results from adaptive allocation need careful interpretation. MWI mitigates this through Bayesian estimation, a maintained exploration floor, a standing (adaptive) control group, and bias-correction on the allocation policy (re-weighting observations by the inverse of the probability each variant was chosen) — rather than through per-segment significance testing.
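The re-weighting idea can be illustrated with a small inverse-propensity example. The log below is a hypothetical toy: both arms truly convert at the same rate, but conversion drops site-wide in the second period, which is also when the allocator sends 90% of traffic to the variant. The function names and numbers are assumptions for illustration only.

```python
def make_log():
    """Toy log of (chosen_variant, propensity, reward) tuples."""
    # (variant, probability it was chosen, visitors, conversions)
    cells = [
        ("variant", 0.5, 250, 25),  # period 1: even split, 10% conversion
        ("control", 0.5, 250, 25),
        ("variant", 0.9, 450, 9),   # period 2: skewed split, 2% conversion
        ("control", 0.1, 50, 1),
    ]
    log = []
    for name, propensity, visitors, conversions in cells:
        log += [(name, propensity, 1)] * conversions
        log += [(name, propensity, 0)] * (visitors - conversions)
    return log


def naive_rate(log, variant):
    """Raw conversion rate among visitors who happened to see the variant."""
    rewards = [reward for chosen, _, reward in log if chosen == variant]
    return sum(rewards) / len(rewards)


def ipw_rate(log, variant):
    """Inverse-propensity-weighted estimate of the variant's conversion rate."""
    weighted = sum(
        reward / propensity
        for chosen, propensity, reward in log
        if chosen == variant
    )
    return weighted / len(log)


log = make_log()
naive = {name: naive_rate(log, name) for name in ("control", "variant")}
corrected = {name: ipw_rate(log, name) for name in ("control", "variant")}
```

The naive rates make control look better simply because the variant received most of its traffic during the weaker period. After weighting each observation by the inverse of its selection probability, both arms come out at the same rate, which matches how the toy data was constructed.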

FAQ

Is your significance testing Bayesian or frequentist?
Bayesian. We report probability-to-beat-control and a certainty score derived from Bayesian posterior distributions, updated on every training run. A 95% credible interval means what people intuitively expect: there's a 95% probability the true value lies in that range — unlike a frequentist confidence interval, which has a subtler interpretation.
Don't you need a huge amount of traffic to reach significance across hundreds of intent combinations?
That's true for a multi-armed bandit, which only learns from the data each slice actually receives. We use a contextual bandit, which trains a model that predicts conversion for any combination — including ones never observed together — and feeds those predictions into a Bayesian Monte Carlo allocator. It doesn't wait for every slice to reach significance independently, so it works at far lower per-slice volumes.
How do you account for false-positive inflation?
We avoid creating it. False-positive inflation comes from running many independent significance tests: the family-wise error rate climbs fast, and at α = 0.05 about 14 tests already give roughly a 50% chance of at least one false positive. We don't run hundreds of per-segment significance tests; significance is assessed in aggregate and per-segment routing is driven by calibrated model predictions.
Why not just run a standard A/B test?
A fixed A/B test splits traffic evenly for a set period and declares a winner at the end, so you pay the full exploration cost the whole time — even once the winner is obvious. The bandit shifts traffic toward the winner as evidence accumulates (typically 80–90% within two to three weeks) and keeps running and adapting indefinitely, so you pay a smaller total exploration cost and respond to changes over time.
Can the system pick the wrong winner?
It can, in two situations: when there's too little data (random noise dominates — the wide credible intervals signal this, and you shouldn't act on an "insufficient data" forecast), or when user behaviour shifts after convergence (a previously best variant may no longer be best). The bandit detects drift through its standing exploration and corrects, with a lag. The Bayesian approach is more robust to small samples than frequentist A/B testing, but it isn't immune to noise.
Do you use informed or uninformed priors?
Informed, within a campaign. A variant added mid-campaign starts from a prior based on the performance of the existing variants (plus a warm-up period of fair traffic) rather than cold-starting with no information, then moves to its own observed performance as it gathers data. This is what lets you add and remove variants over a campaign's life without resetting everything.
Are priors shared or aggregated across campaigns?
No — not currently. The context differs from one experience to the next, so there's nothing reliable to carry over, and a new campaign sees its own real data almost immediately. It's a deliberate "not yet", not a fundamental limitation.
Can I see certainty or significance broken down per intent segment?
No — certainty and probability-to-beat-control are shown at the aggregate / variant level. Measuring significance across every individual segment isn't statistically robust (you start surfacing spurious, inverse correlations), so we deliberately don't present it. We trialled per-segment significance, found it misleading, and removed it. Per-segment routing is driven by the model's predicted conversion rather than a per-segment significance test.
What's the difference between "probability to beat control" and a p-value?
"Probability to beat control" is the Bayesian posterior probability that the variant is genuinely better than control given everything observed — e.g. 80% means an 80% chance it's truly better. A frequentist p-value answers a different, less intuitive question (the probability of seeing data this extreme if there were no difference). They are not interchangeable.
Why don't you just publish an "X% accurate" number for the model?
Because it's a probabilistic model, not a yes/no classifier. Its quality is measured by calibration (when it says 70%, does it happen ~70% of the time?), discrimination (can it rank likely converters above unlikely ones, e.g. AUC ~0.84?), and decisioning (precision/recall/F1 at a chosen threshold) — not a single accuracy figure.
Will the forecast and certainty figures keep changing?
Yes, and that's correct. Every training run incorporates new data, so the posteriors sharpen, allocation shifts toward the winner, and the certainty score and forecast stabilise over time. Expect more movement early in a campaign and stability in a mature one.