A/B Testing Calculator | Sample Size, Duration, and Significance

Know how many visitors you need before you start, then check statistical significance when you're done. Plan your test duration, avoid stopping early, and know when you have a real winner.

Planning tool, not a guarantee. This calculator helps you determine sample sizes and interpret results based on standard statistical methods. Actual outcomes depend on test implementation, traffic consistency, and factors this tool cannot measure.
Before your test: Calculate sample size

Use the "Sample Size" tab to figure out how many visitors you need and how long to run. Enter your current conversion rate, the improvement you want to detect, and your daily traffic.

Run the full duration

Once you start, commit to running for the calculated number of days. Don't peek at results early or stop when it "looks like a winner."

After your test: Check significance

Use the "Significance" tab to interpret what happened. Enter your final numbers to see if you have a real winner or just noise.

Make your decision

If you have a winner, implement the change. If inconclusive, either run longer or accept that the change probably doesn't make a meaningful difference.

Calculate Sample Size for Your A/B Test

Figure out how many visitors you need and how long to run your test before starting. This prevents you from stopping too early or waiting too long. Important: Once you start, commit to the full duration.

Test Parameters

Current conversion rate

This is your starting point: the percentage of visitors who currently take the action you're measuring (buy, sign up, click, etc.).

Example
If 100 people visit your page and 3 buy something, your conversion rate is 3%.

Check your analytics for this number. If you're not sure, estimate based on recent data.

Minimum detectable effect (MDE)

This is the smallest improvement worth detecting. You can enter it two ways:

Relative (default)
A percentage of your current rate. 10% relative on 3% baseline = looking for 3% → 3.3%
Absolute
Percentage points added. 1pp absolute on 3% baseline = looking for 3% → 4%

The tradeoff: Smaller MDE = longer test. Bigger MDE = faster test but you might miss smaller wins.
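If you want to check the conversion arithmetic yourself, here's a minimal sketch (the function name target_rate is just for illustration, not part of this calculator):

```python
def target_rate(baseline, mde, mode="relative"):
    """Variant conversion rate the test is sized to detect.

    Rates are proportions: 3% -> 0.03. For "relative", mde is a
    fraction of the baseline (10% -> 0.10); for "absolute", mde is
    percentage points as a proportion (1pp -> 0.01).
    """
    if mode == "relative":
        return baseline * (1 + mde)
    return baseline + mde

print(round(target_rate(0.03, 0.10), 4))              # 0.033 (3% -> 3.3%)
print(round(target_rate(0.03, 0.01, "absolute"), 4))  # 0.04  (3% -> 4%)
```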

Daily visitors

How many people visit the specific page you're testing per day, on average.

This isn't your whole site's traffic. Just the page where the test will run. Check your analytics for the page URL.

More traffic = faster tests. If you have low traffic, you'll need to either wait longer or look for bigger improvements (higher MDE).
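The duration math itself is simple division, as sketched below (numbers are illustrative and assume an even traffic split):

```python
import math

def min_test_days(sample_per_variation, variations, daily_visitors):
    """Minimum days needed to collect the full required sample."""
    total_needed = sample_per_variation * variations
    return math.ceil(total_needed / daily_visitors)

print(min_test_days(1600, 2, 400))  # 8 days
```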

Confidence level

How sure you want to be that the result isn't just random luck.

95% means: "If there was no difference, there's only a 5% chance I'd see results this extreme by accident."

Just use 95% unless you have a specific reason not to. It's the standard for business decisions.

Number of variations

How many versions are you testing, including the original?

Standard A/B test = 2 (original + one variant).

More variations need more traffic. If you're new to testing, stick with 2.

Your Test Requirements

Sample Size Per Variation: - visitors needed
Total Visitors: - needed
Test Duration: - days minimum

Detection Zone: your baseline is 3%; improvements below 3.3% are too small to detect reliably, while 3.3%+ falls in the detectable range.
What this assumes
  • Statistical power of 80% (if a real improvement exists, this test has an 80% chance of catching it)
  • Your daily traffic stays roughly consistent during the test
  • Traffic is split evenly between versions (50/50 for A/B)
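For readers who want the math behind numbers like these, here is a minimal sketch of the standard two-proportion sample-size calculation (normal approximation, two-sided test, 80% power). It's the common textbook formula, not necessarily the exact code this calculator runs:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative,
                              confidence=0.95, power=0.80):
    """Visitors needed per variation (two-sided, normal approximation)."""
    p1 = baseline                       # control rate, e.g. 0.03
    p2 = baseline * (1 + mde_relative)  # variant rate to detect, e.g. 0.033
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)                      # ~0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# 3% baseline, 10% relative MDE: roughly 53,000 visitors per variation
print(sample_size_per_variation(0.03, 0.10))
```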

Check Statistical Significance of Your Results

Your test finished. Enter the final numbers to check if your results are statistically significant or just noise. Get these from your testing tool (Optimizely, VWO, Convert, etc.) or Google Analytics.
Before analyzing: Did you run the test for the full planned duration? Results checked early often look significant due to random noise, then even out over time. Stopping when it "looks like a winner" means you might just be catching a lucky streak.

A Control (Original)

B Variant (Test Version)

The threshold for declaring a winner. Results are significant if the p-value is below (100 - confidence)%.

At 95% confidence, you need a p-value below 0.05 to call a winner.
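Under the hood, a common way to compute that p-value is a two-proportion z-test. A minimal sketch with illustrative numbers (again, this is the standard method, not necessarily this calculator's exact implementation):

```python
import math
from statistics import NormalDist

def two_proportion_p_value(visitors_a, conv_a, visitors_b, conv_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / visitors_a, conv_b / visitors_b
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool)
                   * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 10,000 visitors each; 3.0% vs 3.6% conversion
p = two_proportion_p_value(10_000, 300, 10_000, 360)
print(f"p-value: {p:.3f}")  # ~0.018, below 0.05 -> significant at 95%
```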

Results

Control Rate: -
Variant Rate: -
Relative Lift: -
P-Value: -

The probability of seeing results this extreme if there was no real difference between versions.

Lower is better. Below 0.05 = statistically significant at 95% confidence.

Version       Visitors   Conversions   Rate
A - Control   -          -             -
B - Variant   -          -             -

Need help with your testing strategy?

Knowing what to test is half the battle. If you need help identifying high-impact opportunities or building a testing roadmap, let's talk.

Get in Touch

Frequently Asked Questions

Common questions about A/B testing sample size, statistical significance, and test duration.

What is an A/B test?

An A/B test compares two versions of a page, email, or ad to see which performs better. You split your traffic between version A (the original) and version B (the variant), then measure which one converts more visitors into customers, subscribers, or whatever action you're optimizing for.

How do you calculate A/B test sample size?

Sample size depends on your baseline conversion rate, the minimum effect you want to detect, and your desired confidence level. The formula uses z-scores for your significance level and statistical power (typically 80%).

Most A/B tests need at least 1,000 visitors per variation to detect a 10-20% relative improvement. Use the calculator above to get your specific number.
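One common form of that formula, under the normal approximation, gives the sample size per variation (here $p_1$ is your baseline rate, $p_2$ the rate after applying the MDE, $\bar{p}$ their average, $\alpha = 1 - $ confidence, and $1 - \beta$ the power):

$$ n = \frac{\left(z_{1-\alpha/2}\sqrt{2\,\bar{p}(1-\bar{p})} \;+\; z_{1-\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)}\right)^{2}}{(p_2 - p_1)^{2}} $$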

How long should an A/B test run?

Run your test long enough to reach statistical significance, typically at least 7 days to capture weekly traffic patterns. The exact duration depends on your traffic volume and the size of improvement you're trying to detect.

Use the Sample Size tab above to calculate your specific duration before you start.

What is statistical significance?

Statistical significance tells you how confident you can be that your results aren't due to random chance. A 95% confidence level means that if there was no real difference, there's only a 5% chance you'd see results this extreme by luck.

Most businesses use 95% as their threshold for making decisions.

What is minimum detectable effect (MDE)?

MDE is the smallest improvement you want your test to be able to detect. It can be expressed as:

Relative: A percentage of your current rate (e.g., 10% relative on 3% baseline = looking for 3% → 3.3%)

Absolute: Percentage points added (e.g., 1pp absolute on 3% baseline = looking for 3% → 4%)

Smaller MDE requires more traffic and longer tests.

Why shouldn't I stop my A/B test early?

Early results are unreliable because random variation can make one version look like a winner when there's no real difference. If you check results daily and stop when it looks good, you're essentially cherry-picking a lucky moment.

This inflates your false positive rate from 5% to potentially 20-30%.
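You can see this effect in a quick simulation: run A/A tests (two identical versions, so any "winner" is a false positive), apply a naive z-test at a daily peek, and count how often a winner is declared anyway. A minimal sketch with illustrative parameters:

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(days=20, daily_per_arm=100, rate=0.03,
                                alpha=0.05, trials=500, seed=42):
    """Share of A/A tests declared 'significant' at any daily peek."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at alpha=0.05
    stopped_early = 0
    for _ in range(trials):
        n = c_a = c_b = 0
        for _ in range(days):
            c_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            c_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm  # visitors per arm so far
            p_pool = (c_a + c_b) / (2 * n)
            se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
            if se > 0 and abs(c_b / n - c_a / n) / se > z_crit:
                stopped_early += 1  # "winner" declared in an A/A test
                break
    return stopped_early / trials

print(peeking_false_positive_rate())  # well above the nominal 0.05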

What does a p-value mean?

A p-value represents the probability of seeing results as extreme as yours if there was actually no difference between versions.

A p-value of 0.03 means that if there was no real difference, there'd be only a 3% chance of seeing results this extreme by random variation. Lower p-values indicate stronger evidence that the difference is real. Below 0.05 is typically considered significant.