My interest in statistics started with a joke I heard as a kid. Someone asked: “What’s the chance of seeing a dinosaur on the metro?”
Being a data-minded person, I’d say “0%. Dinosaurs are extinct.”
They’d say “No, it’s 50%. You either see one or you don’t.”
The logic felt oddly airtight, yet completely wrong. I’d push back: “I’ve ridden the metro more than a hundred times and never seen a dinosaur. So, how can it be 50%?” That small puzzle stuck with me, and it captures something essential about how we think about probability.
Today, I want to use that intuition to break down the three statistical approaches used in A/B testing — frequentist, Bayesian, and sequential — explain how they differ in practice, and help you decide which one fits your team best.
First, a quick grounding in probability
Probability is the branch of mathematics that quantifies uncertainty by measuring how likely an event is to occur. It’s typically calculated by dividing the number of favourable outcomes by the total number of possible outcomes.
Think back to high school maths: “There are 7 balls in a bag — 3 white and 4 black. What’s the probability of picking a white ball?” The answer is 3/7. Simple enough.
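If you want to see that long-run idea in action, here’s a tiny simulation (purely illustrative) that draws from the bag many times and checks that the observed frequency of white balls settles near 3/7:

```python
import random

# A bag with 3 white and 4 black balls; the theoretical P(white) is 3/7 ≈ 0.43.
bag = ["white"] * 3 + ["black"] * 4

# Draw from the bag many times (with replacement) and compare the observed
# frequency of white balls with the theoretical probability.
draws = 100_000
whites = sum(random.choice(bag) == "white" for _ in range(draws))
print(f"theoretical: {3 / 7:.3f}, simulated: {whites / draws:.3f}")
```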
The interesting part is what happens when the two main schools of statistics — frequentist and Bayesian — apply this idea to real experiments.
Frequentist vs. Bayesian: what’s the actual difference?
The explanation that finally made this click for me: frequentist approaches assign probability to data, while Bayesian approaches assign probability to a hypothesis.
Let that sink in.
Say you’re running an experiment to test whether a new homepage banner improves click-through rate. Your hypothesis is: “This new banner will get more clicks than the existing one.”
- Bayesian interpretation: “Variation B has a 90% chance of being better than the control.”
- Frequentist interpretation: “If there were no real difference between the variations, the chance of seeing data this extreme or more extreme would be only 10%, so it is likely there is a real difference.”
The Bayesian statement is intuitive, as it directly answers “is this variation better?” The frequentist statement is more precise but harder to explain to a non-technical stakeholder.
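To make the contrast concrete, here’s a small Python sketch using made-up banner numbers (the counts are assumptions for illustration, not real data). It computes a frequentist p-value with a standard two-proportion z-test and a Bayesian “probability B beats A” from Beta posteriors with flat priors:

```python
import numpy as np
from scipy import stats

# Hypothetical banner results — these numbers are made up for illustration.
clicks_a, visitors_a = 520, 10_000   # control
clicks_b, visitors_b = 580, 10_000   # new banner

# Frequentist view: a two-proportion z-test asks how surprising data like this
# would be if there were no real difference between the variations.
p_pool = (clicks_a + clicks_b) / (visitors_a + visitors_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (clicks_b / visitors_b - clicks_a / visitors_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"frequentist p-value: {p_value:.3f}")

# Bayesian view: with flat Beta(1, 1) priors, each click-through rate has a
# Beta posterior, so we can estimate P(B is better than A) by sampling.
rng = np.random.default_rng(42)
post_a = rng.beta(1 + clicks_a, 1 + visitors_a - clicks_a, 100_000)
post_b = rng.beta(1 + clicks_b, 1 + visitors_b - clicks_b, 100_000)
print(f"P(B > A): {(post_b > post_a).mean():.3f}")
```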
How these differences play out across the experiment lifecycle
The two approaches diverge not just in how you interpret results, but also in how you plan and run experiments.
1. Experiment planning
With a classic frequentist (fixed-horizon) approach, you need to calculate the required sample size before you start. This means knowing your baseline metric (for example, what percentage of visitors typically click the existing banner) and defining a minimum detectable effect (MDE): the smallest improvement you want to be able to detect.
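As a rough illustration, here’s what that pre-experiment calculation can look like with the standard two-proportion sample size approximation. The baseline, MDE, significance level, and power below are assumed values, not recommendations:

```python
from scipy.stats import norm

# Hypothetical planning inputs — assumptions for illustration only.
baseline = 0.05          # 5% of visitors click the existing banner
mde = 0.01               # smallest absolute lift we want to detect (1 pp)
alpha, power = 0.05, 0.80

# Standard two-proportion sample size approximation (per variation).
p1, p2 = baseline, baseline + mde
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"visitors needed per variation: {round(n):,}")
```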
Getting the MDE right is one of the trickiest parts of experiment design. Set it too small, and you’ll need a massive sample that takes months to collect. Set it too large, and you risk running an underpowered test that misses real effects smaller than the one you assumed.

With a Bayesian approach, you have two options: informative priors or non-informative priors. Informative priors are a mathematical representation of your existing knowledge or beliefs about the metric — not just historical data, but a probability distribution built from it. Non-informative priors mean you’re starting fresh, with no pre-experiment assumptions, and letting the incoming data speak for itself.
Informative priors require you to define your prior beliefs mathematically before the experiment starts. Non-informative priors need no pre-experiment input at all. The data does all the work.
Most online experimentation platforms default to non-informative priors, and for good reason: defining informative priors correctly is genuinely hard. Get them wrong, and your results will be incorrect from the start.
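To make the distinction tangible, here’s a small sketch contrasting a flat Beta(1, 1) prior with an informative Beta prior built from hypothetical historical data. The prior parameters and experiment counts are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experiment data: 120 clicks out of 2,000 visitors on the new banner.
clicks, visitors = 120, 2_000

# Non-informative prior: Beta(1, 1) — "let the data speak for itself".
# Informative prior: suppose historical data suggests a ~5% click rate and we
# encode roughly 1,000 visitors' worth of confidence in that belief.
priors = {"non-informative": (1, 1), "informative": (50, 950)}

for label, (a, b) in priors.items():
    # The Beta prior is conjugate to the binomial likelihood, so the posterior
    # is simply Beta(a + clicks, b + non-clicks).
    posterior = rng.beta(a + clicks, b + visitors - clicks, 100_000)
    print(f"{label:>16}: posterior mean CTR ≈ {posterior.mean():.3%}")
```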
2. Running the experiment
With classic frequentist testing, you commit to collecting all your data before analysing results. Peeking at the data mid-experiment — calculating p-values and confidence intervals before you’ve hit your target sample size — significantly inflates the rate of false positives. You need to wait.
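A quick simulation shows the scale of the problem. The sketch below runs many A/A tests (where there is no real difference by construction) and “peeks” with a z-test every 1,000 visitors per arm; the false positive rate it reports comes out well above the nominal 5%. The traffic numbers and peeking schedule are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Many A/A tests: both arms share the same true 5% conversion rate, so any
# "significant" result is by definition a false positive.
n_experiments, n_per_arm, peek_every, alpha = 2_000, 10_000, 1_000, 0.05
false_positives = 0

for _ in range(n_experiments):
    a = rng.binomial(1, 0.05, n_per_arm)
    b = rng.binomial(1, 0.05, n_per_arm)
    for n in range(peek_every, n_per_arm + 1, peek_every):
        pa, pb = a[:n].mean(), b[:n].mean()
        pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pool * (1 - pool) * 2 / n)
        if se > 0 and 2 * (1 - stats.norm.cdf(abs(pb - pa) / se)) < alpha:
            false_positives += 1
            break  # an eager experimenter would stop here and ship the "winner"

print(f"false positive rate with peeking: {false_positives / n_experiments:.1%}")
```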
Bayesian testing doesn’t have the same false-positive constraint, so you can check in on results throughout the experiment without compromising their validity.
Enter sequential testing: the best of both worlds
Sequential testing takes the robustness of the frequentist framework and pairs it with the flexibility of Bayesian monitoring. You get rigorous results and the ability to peek at the data as it comes in, without inflating false positives.
How? Optimizely’s sequential testing algorithm, mSPRT (mixture Sequential Probability Ratio Test), derives valid p-values from a mixture likelihood ratio — computed by averaging the likelihood ratio across a prior distribution of possible effect sizes — keeping false positive rates under control throughout the experiment. There’s no need to pre-specify an MDE, which makes it especially useful when you’re testing a new primary metric or operating in an unfamiliar part of your product.
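For intuition, here’s a minimal one-sample sketch of the mixture likelihood ratio idea — a simplified illustration, not Optimizely’s production implementation. It assumes normally distributed observations with known variance and a normal prior on the effect size, a setting where the mixture has a closed form and the running minimum of its reciprocal gives an always-valid p-value:

```python
import numpy as np

def msprt_p_values(x, theta0=0.0, sigma=1.0, tau=1.0):
    """Always-valid p-values from a simplified one-sample mixture SPRT.

    Observations are assumed to be N(theta, sigma^2); H0: theta = theta0.
    The likelihood ratio is averaged over a N(theta0, tau^2) prior on the
    effect, which yields a closed-form mixture likelihood ratio Lambda_n.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n
    s2, t2 = sigma ** 2, tau ** 2
    lam = np.sqrt(s2 / (s2 + n * t2)) * np.exp(
        n ** 2 * t2 * (mean - theta0) ** 2 / (2 * s2 * (s2 + n * t2))
    )
    # The always-valid p-value is the running minimum of 1 / Lambda_n, capped at 1.
    return np.minimum(1.0, np.minimum.accumulate(1.0 / lam))

# Example: simulated data with a genuine effect of +0.2 standard deviations.
rng = np.random.default_rng(1)
p = msprt_p_values(rng.normal(0.2, 1.0, 5_000))
print(f"p-value after 1,000 observations: {p[999]:.2g}; after 5,000: {p[4999]:.2g}")
```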
So which approach should you use?
There’s no universally “better” stats method. It depends on how your team works and what your experiment goals are.
- Classic frequentist (fixed horizon): It suits teams with a well-defined business cycle, a solid understanding of their baseline metrics, and the discipline to wait for full data before calling a winner.
- Bayesian: It works well when your stakeholders think in terms of probability — “what’s the chance this variation is better?” — and aren’t comfortable with p-values or confidence intervals.
- Sequential: It is the right fit when you want frequentist rigour but need flexibility: you’re unsure of the right MDE, you’re working with a new metric, or your business requires the ability to act on results at any point during the experiment.
The right choice is about matching the method to the way your team makes decisions. Here’s a quick guide to choosing the right stats method for your team:
| | Frequentist | Sequential | Bayesian |
| --- | --- | --- | --- |
| Results interpretation | Assigns probability to data — “If there were no real difference between the variations, the chance of seeing data this extreme or more extreme would be only 10%, so it is likely there is a real difference.” | Assigns probability to data — “If there were no real difference between the variations, the chance of seeing data this extreme or more extreme would be only 10%, so it is likely there is a real difference.” | Assigns probability to a hypothesis — “Variation B has a 90% chance of being better than the control.” |
| Experiment set-up | Requires a known baseline metric and an accurate MDE to calculate sample size before starting. | No pre-experiment calculations needed — start right away. | No pre-experiment calculations needed — start right away (assuming non-informative priors). |
| Experiment running | Fixed timeline, determined by the sample size calculation. | Flexible — stop at any time. | Flexible — stop at any time. |
Common misconceptions about stats methods
1. “Bayesian experiments require a smaller sample size to reach significance.”
It’s more nuanced than that. Bayesian experiments with well-defined custom priors can indeed reduce the required sample size. But as discussed earlier, defining priors accurately is genuinely difficult and requires a solid statistics background.
When comparing a classic frequentist fixed-horizon approach against Bayesian with non-informative priors, the required sample sizes are actually very similar.
2. “Optimizely’s sequential stats engine is too conservative and needs a larger sample size.”
Also nuanced. Compared to classic frequentist fixed-horizon testing, if your MDE is very small, the fixed-horizon approach may require a slightly smaller sample size — though the difference is marginal. If your MDE is large, Optimizely’s sequential engine will typically reach significance with a smaller sample size.
On the topic of being “conservative”: both our sequential engine and the frequentist fixed-horizon approach control the false discovery rate by default, which protects the quality of your results. Controlling for false discoveries isn’t optional when using sequential or frequentist fixed horizon stats approaches — there’s no multi-comparison scenario where it isn’t needed. Importantly, it doesn’t reduce the chances of your primary metric reaching statistical significance in a simple A/B test.
“I pulled data from the results page using the sequential engine, ran it through an online calculator, and it shows significance — but the results page doesn’t.”
This is exactly the false positive scenario described earlier. If you were to calculate the sample size needed to reach significance for your given baseline metric and MDE, as you would with a classic frequentist method, you’d likely find you haven’t reached it yet. It is then difficult to rule out the possibility that the significance result is a false positive, as it is well known that premature peeking significantly increases false positives.
3. “Frequentist is more reliable than Bayesian.”
It isn’t. A well-designed experiment performs well regardless of the statistical method used. What matters is that you interpret the results correctly for the method you’ve chosen. Use the decision guide in this post to pick the right approach for your experiment, and the reliability will follow.
