
Testing A/B test results for statistical significance (or, “is there really a difference?”)

Those of us working in digital marketing or CRO know that, to optimise a site, we compare how different versions of the same page, call to action, or advert perform. Sometimes this is part of an A/B or multivariate test; other times it's a comparison of the conversion rates of 2 releases.

If you’re comparing how well 2 or more versions of a page, button, content, or anything else performs, you really need to know about statistical significance. Why? Because if you don’t, you’re running the risk of thinking the lowest-performing variation of what you’re testing is the winner, which will cost you or your organisation money.

What is statistical significance?

For A/B testing, the purpose of a statistical significance test is to tell you whether the difference between 2 variations is actually real, or just down to chance.

To explain it more fully, let’s work through an example:

Imagine we’re testing 2 versions of the same checkout page; our aim is to increase the conversion rate (or to reduce the abandonment rate if you prefer to look at it that way). We run both versions at the same time in an A/B test.

If we let the test run for a couple of weeks, the 2 pages will have different conversion rates. This might be down to the changes we made, driven by anything from extensive user research to someone’s intuition. If that’s the case it’s great, as even if the conversion rate’s lower, at least you’ve eliminated one option for improving the page. However, it might also be due to chance – just the natural variation that happens from day to day and week to week.

This is what statistical significance tests are designed to help you work out – how confident you are that the difference isn’t due to chance (for example, one version having a good or bad week or month).

One way to illustrate this is a demonstration using 2 dice.

Imagine each of the dice is a different version of the page. Over a couple of weeks we get 200 visits that see the checkout page, each represented by a roll of one of the dice. We'll say that for each die, rolling a 1-4 is an abandoned checkout, while rolling a 5 or 6 is a completed checkout (meaning we've got a 1-in-3 chance, roughly 33%, of converting on each roll, and that the true conversion rate of both versions is exactly the same). So, grab a couple of dice, roll each of them 100 times, and record the results. If you don't have the dice, time, or inclination to record 200 rolls, here are the results I got (OK, I cheated and used a random number generator; I've not included the whole table, as it's not very exciting!):

[Image: table of the dice-roll results, showing completed checkouts and conversion rate for versions A and B over 100 rolls each]

As you can see, the "conversion rate" of the 2 dice during the test is different. However, this doesn't mean one of the dice has a worse conversion rate; what we're seeing is the natural variation that happens in these kinds of situations. Although both dice would have a one-in-three (roughly 33%) conversion rate if rolled an infinite number of times, over a small number of rolls the observed conversion rate is likely to differ from this.
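If you'd rather simulate the rolls than record 200 of them by hand, here's a minimal Python sketch of the same experiment (the exact counts will differ every run, which is rather the point):

```python
import random

def roll_variant(rolls=100, seed=None):
    """Roll one die `rolls` times; a 5 or 6 counts as a completed checkout."""
    rng = random.Random(seed)
    return sum(1 for _ in range(rolls) if rng.randint(1, 6) >= 5)

conversions_a = roll_variant(seed=1)
conversions_b = roll_variant(seed=2)

# Both dice share the same true conversion rate (2 in 6, about 33%),
# yet the observed rates over 100 rolls will usually differ.
print(f"Version A: {conversions_a}/100 completed checkouts")
print(f"Version B: {conversions_b}/100 completed checkouts")
```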

If you took these results at face value and assumed that version A is better than version B, you'd be wrong. We know that with certainty in this case, because we know the true conversion rate of both dice (over all time) is identical, at one in three. So if you don't test the results for statistical significance, there's a risk you'll believe version B is in some way inferior to version A when it isn't.

This might not seem that important when the 2 versions have an identical conversion rate, but it means you incorrectly think the change you made had an impact when it didn't. You might then spend money implementing a change that won't lead to an improvement. A more serious situation is when version A looks better than version B but is actually worse: you implement the lower-performing version and cost yourself or your organisation money.

OK, so testing for statistical significance is a good thing. How do you go about doing it?

There are 2 possible answers:

  1. Learn the underlying principles of probability theory and statistical significance, which involves a lot of maths (if you do want to learn more about probability and statistics, I'd recommend Robert Pagano's excellent book "Understanding Statistics in the Behavioral Sciences")
  2. Use an online statistical significance calculator

I guess most people will want to use an online calculator. If you do, be careful: for an A/B test, where the outcome is binary (people either complete an action/convert or they don't), the chi-square test is the appropriate test of statistical significance. A lot of online statistical significance calculators use a Z test, which is designed to compare mean averages rather than proportions like these. A lot also use incorrect calculations of the chi-square test and give you the wrong answer.

I've generally found online chi-square calculators are either accurate or easy to use, but not both. The exception is Evan Miller's, which I'd recommend. Using it, we find that the difference between version A and version B in our dice-rolling exercise is not statistically significant, which translates as "we can't be confident they're really different".
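If you'd rather run the test in code than in a web calculator, SciPy's chi-square test of independence does the same job. Here's a sketch using made-up counts purely for illustration (say version A completed 38 of 100 checkouts and version B 30 of 100; substitute your own figures):

```python
from scipy.stats import chi2_contingency

# 2x2 table of observed counts: [completed, abandoned] for each version.
# These figures are purely illustrative -- substitute your own results.
observed = [
    [38, 62],   # version A: 38 of 100 converted
    [30, 70],   # version B: 30 of 100 converted
]

# chi2_contingency applies Yates' continuity correction by default for 2x2 tables.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Statistically significant at the 95% confidence level.")
else:
    print("Not statistically significant -- the difference could just be chance.")
```

With counts like these the p-value comes out well above 0.05, matching the "not statistically significant" verdict above.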

For A/B tests, most of the time I'd recommend using a 95% confidence level. This sets the bar so that, if there were really no difference between the versions, the test would wrongly report one only 5% of the time. Put another way, we're accepting a 1-in-20 chance of being fooled by random variation.

There are 2 things that the chi-square test looks at: the difference between the proportions of conversions, and the number of observations (visits that see the checkout page, in our example). It's the combination of these that matters; you need a big enough difference and a big enough number of visits for a change to be statistically significant. So if there's a very big difference between how well the 2 pages perform, you'll need fewer observations before you know the difference is statistically significant. On the other hand, if one version is only marginally better than the other, you'll need a lot of observations before your results are statistically significant. This means that if you leave a test running long enough, you might find a difference that's statistically significant but not "real world" significant. How valuable is a 0.25% increase in conversion rate? Is it worth running a test for 9 months to find out, or should you be looking for a change that has a bigger impact in that time?
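To see that interplay in numbers, here's a small sketch comparing the same 20% vs 21% conversion rates (invented figures) at two very different traffic levels:

```python
from scipy.stats import chi2_contingency

def p_value(conversions_a, visits_a, conversions_b, visits_b):
    """p-value from a chi-square test on a 2x2 table of converted vs not converted."""
    table = [
        [conversions_a, visits_a - conversions_a],
        [conversions_b, visits_b - conversions_b],
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

# 20% vs 21% conversion rate in both cases -- only the traffic changes.
print(p_value(100, 500, 105, 500))              # a few hundred visits per version: large p, not significant
print(p_value(10_000, 50_000, 10_500, 50_000))  # tens of thousands of visits: the same gap is now significant
```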

If we test 2 versions of a page, and after a couple of weeks or months we find there's a statistically significant difference (one is better than the other), that's great. But what if, after 4 months, there's still no statistically significant winner? When do we end the test?

This is where we skip back to the start. Before you start a test, you should work out the minimum difference you're interested in. For example, does the new version need to be 10% better than the old one to be worth bothering with (and if it's not at least 10% better, do you try something else)? This should be driven by some kind of rational look at the business rather than gut feeling, but however you do it, you need to make this decision beforehand.

Once you've decided what the minimum difference you're interested in is, you can use a sample size calculator to work out how many people need to see each version before you can tell whether there's a statistically significant difference of at least that size between the two. This gives you an end point for the test if it turns out the new version doesn't make the grade, and it's much better to have decided this in advance than to hang on hoping to see an improvement, or to decide at some random point that you can't wait any longer and arbitrarily end the test.

One point worth noting: you should always let the test run until the required sample size has been reached, even if a statistical significance test says the versions are different halfway through. The reason is that randomness in the results can make a difference look statistically significant part way through the test when it isn't by the end. So try not to peek, and if you can't stop yourself, don't act on the findings!
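If you want to convince yourself that peeking is a problem, here's a simulation sketch: both versions share the same true conversion rate (20%, assumed purely for illustration), yet checking for significance at every interim checkpoint declares a "winner" noticeably more often than the roughly 5% you'd expect from a single check at the planned end point:

```python
import random
from scipy.stats import chi2_contingency

def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Chi-square test on a 2x2 table of converted vs not converted."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    _, p, _, _ = chi2_contingency(table)
    return p < alpha

def one_aa_test(true_rate=0.20, checkpoints=(500, 1000, 1500, 2000), seed=None):
    """Simulate one A/A test (both versions share the same true rate).

    Returns whether it ever looked significant at an interim checkpoint,
    and whether it looked significant at the planned end point.
    """
    rng = random.Random(seed)
    conv_a = conv_b = visits = 0
    peeked_significant = False
    for n in checkpoints:
        batch = n - visits  # next batch of visitors for each version
        conv_a += sum(rng.random() < true_rate for _ in range(batch))
        conv_b += sum(rng.random() < true_rate for _ in range(batch))
        visits = n
        if is_significant(conv_a, visits, conv_b, visits):
            peeked_significant = True
    final_significant = is_significant(conv_a, visits, conv_b, visits)
    return peeked_significant, final_significant

runs = 1000
peek_hits = final_hits = 0
for i in range(runs):
    peeked, final = one_aa_test(seed=i)
    peek_hits += peeked
    final_hits += final

print(f"'Winner' declared when peeking at every checkpoint: {peek_hits / runs:.1%}")
print(f"'Winner' declared when testing only at the end:     {final_hits / runs:.1%}")
```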

Handily, there are sample size calculators available that tell you how many observations you need before a difference of a given size would be statistically significant (again, Evan Miller has produced a good one). From these results, you can set your own and others' expectations about how long the test needs to run, and when you'll stop it if there's no significant difference between the versions.
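If you prefer code to a web calculator, statsmodels can run the same kind of power calculation. A sketch below, assuming (purely for illustration) a 20% baseline conversion rate, a 10% relative lift as the minimum difference worth detecting, a 95% confidence level, 80% power, and an even traffic split:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20   # current conversion rate (assumed for illustration)
minimum_lift = 0.10    # smallest relative improvement worth detecting
target_rate = baseline_rate * (1 + minimum_lift)   # i.e. 22%

# Cohen's h effect size for the difference between the two proportions.
effect_size = proportion_effectsize(target_rate, baseline_rate)

sample_size = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # 95% confidence level
    power=0.80,    # 80% chance of detecting the lift if it's really there
    ratio=1.0,     # equal traffic to both versions
)

print(f"Visitors needed per version: {sample_size:,.0f}")
```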

That’s probably about all the stats most people will want to read in one post. If you’ve got this far, well done!

If you do need help with setting up or analysing the results of A/B tests, or other web analytics services, drop us a line and we’ll see if we can help you out.



