
A/B Test Sample Size Calculator

Drag the sliders. Watch the bar fill. The number you need is on the right.

Controls · v1.0

Baseline conversion rate: 4.0% (slider range 0.1%–50.0%)
Today's rate for the metric you're testing.

Minimum detectable effect: +10% relative (slider range +1% to +100%)
Treatment target: 4.40%.

Statistical power: 80% (slider range 50–99)
Probability of catching a real effect when one exists. 80% is convention.

Significance level: 95% (slider range 80–99.9)
Two-sided. 95% is convention. Caps the false-positive rate at 5% per test.

Daily traffic per variant: 2,000 (slider range 50–50,000)
What each variant gets per day after splitting.

Output

Sample per variant: 39,473 (plotted on a log scale, 100–10M)

Total (both): 78,946

Days to call it: 20

Detect a lift from 4.00% to 4.40% at 95% significance and 80% power.

two-proportion z-test · two-sided

Assumptions

Three notes from someone who's run this calculation in anger hundreds of times.

80% power and 95% significance are the numbers you see in every textbook because somebody picked them in 1925 and we never argued back. They're fine starting points. They are not the right answer for every test you run.

If a wrong call costs you a quarter of revenue, push significance to 99% and accept the longer test. If you're testing copy on a button and the worst-case downside is shrugging and reverting, 90% is plenty. The defaults exist so you don't have to think. Most senior experimenters think anyway.

The calculator above doesn't know which of those situations you're in. You do.

Questions people ask

How does this calculator decide how big my A/B test should be?

It uses the standard two-proportion z-test formula. Given a baseline conversion rate, the smallest effect you want to detect, the statistical power you want, and the significance level you're willing to accept, it solves for the sample size per variant that satisfies all four. The result is what you need before you peek, not after.
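For the curious, here is a minimal Python sketch of that calculation, using the unpooled-variance form of the two-proportion formula. The function name and signature are mine, not the calculator's actual code, but it reproduces the 39,473 per-variant figure from the example above.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Two-sided two-proportion z-test sample size, one variant."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # treatment target
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% significance
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled Bernoulli variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_variant(0.04, 0.10))  # 39473 -- matches the output above
```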

What's the difference between relative and absolute minimum detectable effect (MDE)?

Absolute is in conversion-rate points. If your baseline is 4% and your absolute MDE is 1%, you're trying to detect a lift to 5%. Relative is a percentage of the baseline. A 25% relative MDE on a 4% baseline is also a lift to 5%. Relative MDE is usually the more honest framing — it tells you how much extra revenue you can pay for.
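The same conversion, as a few lines of Python with the example values from this answer:

```python
baseline = 0.04
absolute_mde = 0.01                     # in conversion-rate points
relative_mde = absolute_mde / baseline  # 0.25, i.e. a +25% relative lift
target = baseline * (1 + relative_mde)  # 0.05 -- same target either way
```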

Why is 80% statistical power the default?

Eighty percent is the convention because it's a workable balance between false negatives and how long you have to wait. At 80% power you'll miss roughly one in five real wins. Bump it to 90% if missing a real winner is more expensive than running the test a bit longer.
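Plugging the worked example (4.00% baseline, +10% relative MDE) into the sample_size_per_variant sketch from the first answer shows what that insurance costs:

```python
# Assumes sample_size_per_variant() from the sketch in the first answer.
print(sample_size_per_variant(0.04, 0.10, power=0.80))  # 39473 per variant
print(sample_size_per_variant(0.04, 0.10, power=0.90))  # 52842 -- about 34% more
```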

Why is 95% significance the default?

Ninety-five percent is the convention because it caps the false-positive rate at 5% per test. If you run a lot of tests in parallel or peek at results, that effective rate climbs fast. Pushing significance to 99% buys you a quieter dashboard at the cost of larger samples.
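The climb is just compounding. A quick sketch, assuming the tests are independent:

```python
alpha = 0.05
for k in (1, 5, 10, 20):
    family_rate = 1 - (1 - alpha) ** k  # chance of at least one false positive
    print(k, round(family_rate, 3))     # 1: 0.05, 5: 0.226, 10: 0.401, 20: 0.642
```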

Can I trust the days-to-significance estimate?

Treat it as a floor, not a forecast. The math assumes evenly distributed traffic, no day-of-week effects, no holidays, and no segment skew. In practice, plan for at least one full business cycle (typically two weeks) even if the calculator says you can call the test sooner.
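The floor itself is simple division. With the example's 39,473 per variant and 2,000 daily visitors per variant:

```python
from math import ceil

n_per_variant = 39_473     # sample size from the output above
daily_per_variant = 2_000  # what each variant gets per day
print(ceil(n_per_variant / daily_per_variant))  # 20 -- the "20 days" in the output
```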

What sample size should I use if my baseline conversion rate is very low?

Low baselines need disproportionately large samples because the variance of a Bernoulli outcome doesn't shrink the way the rate does. If your baseline is 1% and you want to detect a 10% relative lift, expect well over 150,000 users per variant at the default 95%/80%. If you don't have that volume, test a higher-funnel metric where the baseline is bigger.
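Running the sample_size_per_variant sketch from the first answer at a 1% baseline makes the point:

```python
# Assumes sample_size_per_variant() from the sketch in the first answer.
print(sample_size_per_variant(0.01, 0.10))  # 163092 per variant
print(sample_size_per_variant(0.04, 0.10))  # 39473 -- 4x the baseline, ~4x fewer users
```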