A/B testing, also called split testing, is extremely useful to determine if a particular action is better than another. Today, A/B tests are run on many websites, by many services. However, it’s relatively simple to A/B test anything you want. In this post, I’ll discuss A/B testing, the statistics behind it, how to run your own A/B test, and how to do it with SendGrid’s General Statistics Endpoint.

**If you just want to compare A/B test results of SendGrid transactional emails, you can use this handy app I made.**

## A/B Testing Explained

A/B testing seems simple on the surface. Simply display two different versions of whatever you’re testing (e.g. webpages, ads, emails) and compare their statistics. Essentially, that’s all it is, but to run a proper A/B test, you need to dive into the world of statistics. This gets you deep into terms like *binomial distributions*, *normal approximation confidence intervals*, and *high school statistics notes*.

First, you need to decide what “success” means: this could be a purchase, form submission, email open, or something else. You must track this, and show some portion of your *population* (i.e. your users) the “A” variation and another portion the “B” variation. Once you’ve done this and collected some data, you can dive into the statistics!

### The Mathematics

Initially, the mathematics are pretty simple: you divide the number of successes you had with a particular variation by the number of times you showed that variation. This number is called *p-hat (p̂)*. Put simply, it’s the proportion of users that satisfied your success criteria (a percentage written as a decimal).

Having that number is just *p*eachy; however, you still can’t tell which variation won. *This seems weird, because you have two numbers, and it seems like you should just be able to pick the largest.* However, these numbers might not tell the whole story, and you want to be confident that they’re right before you act on them. The percentages you’ve calculated tell you something about your *sample* (i.e. the people you showed the variations to), but they do not tell you much about your population. To put it another way, taking the percentages at face value is like looking at a single tree and making a judgement about the whole orchard.

Really, we need to see if the numbers are different enough to apply to your *entire* population. To do so, we’ll create what’s called a *confidence interval*. This takes into account the *variance* of your results and how confident you want to be in your decision.

To get started creating a confidence interval, we’ll need to find the *standard deviation* of your results. This number is called *sigma (σ)*, and it measures how much p̂ would be expected to vary from one sample to the next. For a variation shown n times, σ = √(p̂(1 − p̂)/n).

Next up, we need to determine the level of confidence you want your results to hold. This is done by choosing a confidence level, which is 1 − *alpha (α)*, where alpha is the chance of being wrong that you’re willing to accept. The confidence level should, *generally*, be greater than or equal to 95% (i.e. α ≤ 0.05).

From the alpha, we can obtain the *z-score* of our data. This number represents how many standard deviations on either side of the mean you need to go to capture a given percentage of the data. *Sound confusing?* Don’t worry too much about it; you can look z-scores up, or remember that a 95% confidence interval has a z-score of 1.96 and a 99% confidence interval, 2.58.
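If you would rather compute a z-score than look it up, Python’s standard library can do it. This is a minimal sketch, assuming a two-sided interval; the function name `z_score` is my own and takes the confidence level (e.g. 0.95) as its input.

```python
from statistics import NormalDist


def z_score(confidence_level: float) -> float:
    """Two-sided z-score for a confidence level such as 0.95."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be between 0 and 1")
    # Whatever is left over is split evenly between the two tails.
    tail = (1.0 - confidence_level) / 2.0
    # The z-score is the point on the standard normal curve below which
    # all but the upper tail of the data lies.
    return NormalDist().inv_cdf(1.0 - tail)


print(round(z_score(0.95), 2))  # 1.96
print(round(z_score(0.99), 2))  # 2.58
```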

Finally, we get to bring this all together into the aforementioned confidence interval, which is p̂ ± z·σ.

This confidence interval gives us a *range* for each variation (e.g. 50%-75%). If these ranges *do not* intersect, then we can say, *confidently*, that the difference is significant, and can choose the variation with the higher p̂.
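The steps so far can be sketched in a few lines of Python. This assumes the normal approximation interval, p̂ ± z·σ with σ = √(p̂(1 − p̂)/n); the function names `confidence_interval` and `intervals_overlap` are mine, chosen for illustration.

```python
from math import sqrt


def confidence_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval (low, high) for one variation."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials                   # share of users who succeeded
    sigma = sqrt(p_hat * (1 - p_hat) / trials)   # spread of p_hat between samples
    return p_hat - z * sigma, p_hat + z * sigma


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two (low, high) ranges share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up numbers: 50 successes out of 100 versus 80 out of 100.
a = confidence_interval(50, 100)
b = confidence_interval(80, 100)
print(a, b, intervals_overlap(a, b))
```

If `intervals_overlap` returns `False`, the ranges do not intersect and the variation with the higher p̂ is the one to pick.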

### Example

This may be easier to see when put into an example. Say I have two variations, where I show someone the color yellow or blue, and then ask for a high-five. We’ll consider receiving a high-five a success, and being denied a failure. I decide that I’ll perform the test on 1,000 people and use a 95% confidence interval. After performing the test, I get the following results:

Color | Sample Size | Success | Failure |
---|---|---|---|
Yellow | 498 | 390 | 108 |
Blue | 502 | 299 | 203 |

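We can then calculate that p̂ is 390 / 498 ≈ 0.783 for yellow and 299 / 502 ≈ 0.596 for blue.

Thus, with σ = √(p̂(1 − p̂)/n), σ ≈ 0.0185 for yellow and σ ≈ 0.0219 for blue.

Finally, remembering that z = 1.96, we bring everything together as p̂ ± z·σ, getting the following confidence intervals: 74.7% to 81.9% for yellow, and 55.3% to 63.9% for blue.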

We then see that these confidence intervals do not intersect, and therefore showing people yellow gives us a statistically significant higher chance of getting a high-five.
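The arithmetic for the high-five example can be checked with a short script. This is a sketch under the same assumption as the rest of the post, namely the normal approximation interval p̂ ± z·σ with σ = √(p̂(1 − p̂)/n); the counts come straight from the table above.

```python
from math import sqrt

Z_95 = 1.96  # z-score for a 95% confidence interval

# (successes, sample size) for each color, taken from the results table
results = {"Yellow": (390, 498), "Blue": (299, 502)}


def interval(successes, trials, z=Z_95):
    """Return the (low, high) confidence interval for one color."""
    p_hat = successes / trials
    sigma = sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - z * sigma, p_hat + z * sigma


intervals = {color: interval(*counts) for color, counts in results.items()}
for color, (low, high) in intervals.items():
    print(f"{color}: {low:.1%} to {high:.1%}")
# Yellow: 74.7% to 81.9%
# Blue: 55.3% to 63.9%

yellow, blue = intervals["Yellow"], intervals["Blue"]
# The result is significant only if one range sits entirely above the other.
significant = yellow[0] > blue[1] or blue[0] > yellow[1]
print("Significant:", significant)  # Significant: True
```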

### Caveats & Warnings

The method I gave for determining a confidence interval is the simplest one available. There are more complicated methods that are more accurate, especially with small samples. For a simple calculation, this method should be fine. However, if you want **rock solid statistics**, you should opt for a different one.

Additionally, you should pick a number of users that you want to see the test *or* a time period in which you plan to run the test, and *only* after this condition is met should you act on the data. This is due to some *important* funny business with statistics: if you keep peeking at the results and stop the moment they look significant, you are far more likely to act on a fluke.

## Applying A/B Testing Methodologies to SendGrid

Knowing all this, we can test how different emails affect the statistics SendGrid tracks. If you set categories for each variation, this tracking becomes easy. You can do this by hand by looking at your Email Statistics Dashboard, and filtering by category. Then Delivered becomes your total sample size and any other metric can be your success variable, and its complement (Delivered – Success), your failure.
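Done by hand, that mapping might look like the sketch below. It assumes you have copied Delivered and one success metric (unique opens, here) off the dashboard for each category. The category names, dictionary keys, and numbers are all made up for illustration, and nothing here calls SendGrid’s API.

```python
from math import sqrt

# Hypothetical figures copied by hand from the dashboard, one entry per category.
stats = {
    "subject-a": {"delivered": 1000, "unique_opens": 250},
    "subject-b": {"delivered": 1000, "unique_opens": 310},
}


def metric_interval(delivered, successes, z=1.96):
    """Confidence interval for a metric, using Delivered as the sample size."""
    failures = delivered - successes          # the complement: Delivered - Success
    p_hat = successes / delivered
    sigma = sqrt(p_hat * (failures / delivered) / delivered)
    return p_hat - z * sigma, p_hat + z * sigma


ranges = {
    category: metric_interval(row["delivered"], row["unique_opens"])
    for category, row in stats.items()
}
for category, (low, high) in ranges.items():
    print(f"{category}: {low:.1%} to {high:.1%}")

a, b = ranges["subject-a"], ranges["subject-b"]
print("Ranges overlap:", a[0] <= b[1] and b[0] <= a[1])
```

Swapping `unique_opens` for another tracked metric only changes which number you treat as success; Delivered stays the sample size.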

If doing it by hand doesn’t sound like your favorite thing to do, you can just use the app I made. If you’re interested, you can find the code for it on GitHub.
