Web experimentation is a great tool for increasing engagement and conversion rates. The primary strength of experiments is the ability to isolate variables, and thus examine the causal relationship between a page element (such as a tagline) and a metric (such as conversion rate).
Much of the literature on experimental design has its roots in statistics and can be quite intimidating. To make it more accessible, I introduce the illustrated guide to web experiments (with some help from my brother, Andreas Høgenhaven, who kindly made the illustrations).
Before getting started on the experiment, you need to get the basics right: test metrics that align with your long-term business goals, and test big changes, not small ones. And remember that the test winner is not the optimal variation, only the best performing variation you have tested; it doesn't mean you have found the best possible variation of all time. You can (almost) always do better in another test.
A/B or MVT
One of the first things to consider is the experimental design. An A/B test design is usually preferred when a single factor is tested, while a multivariate test (MVT) design is used when two or more independent factors are tested. However, it is worth noting that two or more factors can also be tested with A/B/n tests or with sequential A/B tests. The downside of using A/B tests for several factors is that they do not capture interaction effects.
MVT Face-off: Full Factorial vs Fractional Factorial
So you want to go multivariate, huh? Wait a second. There are different kinds of multivariate tests. If you have ever visited Which MVT, you probably came across terms such as full factorial, fractional factorial, and modified Taguchi. Before getting into these wicked words, let's get our multivariate test down to earth with an example. In this example we have three different factors, and each factor has two conditions.
In this case there are 3 factors, each with 2 conditions, giving a total of 2^3 = 8 groups. In the full factorial design, all possible combinations are tested. This means 8 variations are created, and users are split between them. In the following table, +1 indicates condition 1 while -1 indicates condition 2.
This design is not too bad when we have 3 factors with 2 conditions each. But if we want to test 4 factors, each comprising 4 conditions, we will have 4^4 = 256 groups. Or if we want to test 10 different factors with 2 conditions each, we will end up with 2^10 = 1,024 groups. This will require a lot of subjects to detect any significant effect of the factors. That is not a problem if you are Google or Twitter, but it is if you are selling sausages in the wider Seattle area. (You can estimate the test duration with Google's calculator and Visual Website Optimizer's calculator. These calculators are, however, based on very imprecise inputs because the change in conversion rate is unknown. That is kinda the point of the test.)
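To get a feel for how quickly the group count explodes, here is a small Python sketch (my own illustration, not part of any testing tool) that enumerates full factorial combinations with itertools:

```python
# A small sketch (my own illustration): enumerating full factorial combinations
# with itertools to see how quickly the number of groups grows.
from itertools import product

def full_factorial(levels_per_factor):
    """Return every combination of factor levels (one variation per combination)."""
    return list(product(*levels_per_factor))

# 3 factors with 2 conditions each -> 2^3 = 8 variations
small = full_factorial([["headline 1", "headline 2"],
                        ["image 1", "image 2"],
                        ["button 1", "button 2"]])
print(len(small))                                 # 8

# 4 factors with 4 conditions each -> 4^4 = 256 variations
print(len(full_factorial([range(4)] * 4)))        # 256

# 10 factors with 2 conditions each -> 2^10 = 1,024 variations
print(len(full_factorial([range(2)] * 10)))       # 1024
```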
Enter fractional factorial design. The fractional factorial design was popularized by Genichi Taguchi and is sometimes called the Taguchi design. In a fractional factorial design, only a fraction of the total number of combinations is included in the experiment. Hence the name. Instead of testing all possible combinations, the fractional factorial design only tests enough combinations to estimate the conversion rate of all possible combinations.
In this example, it is sufficient to run 4 different combinations and use the interactions between the included factors to estimate the combinations not included in the experiment. The 4 groups included are ABC; A + (BC); B + (CA); C + (BA).
Instead of testing Factor A 3 times, it is only tested once while holding B and C constant. Similarly, Factor B is tested once while holding A and C constant, and Factor C is tested once while holding A and B constant. I won't dive too deeply into the statistics here, as the experimental software does the math for us anyway.
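If you want to see what that fraction looks like in practice, here is a minimal Python sketch (my own illustration, using the same +1/-1 coding as above) that keeps only the half of the 2^3 design in which factor C is aliased with the A x B interaction:

```python
# A minimal sketch (my own illustration, same +1/-1 coding as the table above):
# keep only the runs of the full 2^3 design where A * B * C = +1, so factor C
# is aliased with the A x B interaction. That leaves 4 of the 8 combinations.
from itertools import product

full = list(product([+1, -1], repeat=3))                        # all 8 runs
half = [run for run in full if run[0] * run[1] * run[2] == +1]  # 4 runs kept

for a, b, c in half:
    print(f"A={a:+d}  B={b:+d}  C={c:+d}")
# A=+1  B=+1  C=+1
# A=+1  B=-1  C=-1
# A=-1  B=+1  C=-1
# A=-1  B=-1  C=+1
```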
The fractional factorial test assumes that the factors are independent of one another. If there are interactions between factors (e.g. image and headline), this affects the validity of the test. One caveat of the fractional factorial design is that one factor (e.g. A) might be confounded with a two-factor interaction (e.g. BC). This means there is a risk that we end up not knowing whether the variance is caused by A or by the interaction BC. Thus, if you have enough time and visitors, a full factorial design is often preferable to a fractional factorial design.
Testing The Test Environment With The A/A Test
Most inbound marketers are quite familiar with A/B tests. Less well known is the A/A test. The A/A test is useful as a test of the experimental environment, and is worth running before starting A/B or MVT tests. The A/A test shows whether users are split correctly, and whether there are any potentially misleading biases in the test environment.
In the A/A design, users are split just as they are in an A/B or MVT test, but all groups see the same variation. We want the test result to be non-significant, showing no difference between the groups. If the test is significant, something is wrong with the test environment, and subsequent tests are likely to be flawed. But as discussed below, an A/A test will sometimes come out significant purely due to random error / noise.
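If you are curious what "significant sometimes, due to random error" looks like, here is a small simulation sketch (hypothetical numbers, my own illustration, not from any testing tool). Both groups share the same true conversion rate, yet roughly 5% of runs still come out significant at the 95% level:

```python
# A simulation sketch (hypothetical numbers): run many A/A "tests" where both
# groups share the same true conversion rate, and count how often a two-sided
# z-test still calls the difference significant at the 95% level (~5% of runs).
import random
from statistics import NormalDist

def two_proportion_p_value(conversions_a, n_a, conversions_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conversions_a / n_a - conversions_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
runs, visitors_per_group, true_rate = 1000, 5000, 0.05      # made-up numbers
false_positives = 0

for _ in range(runs):
    a = sum(random.random() < true_rate for _ in range(visitors_per_group))
    b = sum(random.random() < true_rate for _ in range(visitors_per_group))
    if two_proportion_p_value(a, visitors_per_group, b, visitors_per_group) < 0.05:
        false_positives += 1

print(f"'Significant' A/A tests: {false_positives / runs:.1%}")  # roughly 5%
```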
The A/A test is also a good way to show co-workers, bosses, and clients how data fluctuate, and that they should not get too excited when they see an increase in conversion rate at 80% confidence, especially in the early phases of an experiment.
Statistical Significance
In the ideal experiment, all variables are held constant except the independent variable (the thing we want to investigate, e.g. a tagline, call to action, or image). But in the real world, many variables are not constant. For example, when conducting an A/B test, the users are split between two groups. As people are different, the two groups will never comprise identical individuals. This is not a problem as long as the other variables are randomized, but it does introduce noise into the data. This is why we use statistical tests.
We conclude that a result is statistically significant when there is only a low probability that the difference between groups is caused by random error. In other words, the purpose of statistical tests is to examine the likelihood that the two samples of scores were drawn from populations with the same mean, meaning there is no "true" difference between the groups and all variation is caused by noise.
In most experiments and experimental software, 95% confidence is used as the threshold of significance, although this number is somewhat arbitrary. If the difference between two group means is significant at 98% confidence, we accept it as significant even though there is a 2% probability that the difference is caused by chance. Thus, statistical tests show us how confident we can be that differences in results are not caused by chance / random error. In Google Website Optimizer, this probability is called chance to beat original.
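As a rough illustration of how such a probability can be computed (the numbers below are made up, and real tools may use slightly different methods), here is a two-proportion z-test sketch where the one-sided confidence plays the role of "chance to beat original":

```python
# A rough sketch (made-up results): a two-proportion z-test for an A/B result.
# The one-sided confidence that the variation's true rate exceeds the control's
# plays the role of "chance to beat original" here; real tools may differ.
from statistics import NormalDist

control_conversions, control_visitors = 200, 4300        # hypothetical numbers
variation_conversions, variation_visitors = 250, 4250    # hypothetical numbers

p_control = control_conversions / control_visitors
p_variation = variation_conversions / variation_visitors
se = (p_control * (1 - p_control) / control_visitors
      + p_variation * (1 - p_variation) / variation_visitors) ** 0.5
z = (p_variation - p_control) / se

chance_to_beat_original = NormalDist().cdf(z)
print(f"Chance to beat original: {chance_to_beat_original:.1%}")
```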
Pro Tip: Ramp Up Traffic To Experimental Conditions Gradually
One last tip I really like is ramping up the percentage of traffic sent to the experimental condition(s) slowly. If you start out sending 50% of visitors to the control condition and 50% to the experimental condition, you might have a problem if something in the experimental condition is broken. A better approach is to start by sending only 5% of users to the experimental condition(s). If everything is fine, go to 10%, then 25%, and finally 50%. This will help you discover critical errors before too many users do.
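Here is a minimal sketch of what such a ramp-up might look like (the schedule, visitor IDs, and hash-based bucketing are my own illustrative assumptions, not a specific tool's implementation):

```python
# A minimal sketch (hypothetical schedule and hash-based bucketing): ramp up the
# share of visitors sent to the experimental condition in stages, and keep the
# assignment deterministic so a returning visitor always sees the same version.
import hashlib

RAMP_STAGES = [0.05, 0.10, 0.25, 0.50]   # share of traffic to the variation

def assign(visitor_id: str, experiment_share: float) -> str:
    digest = hashlib.md5(visitor_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000    # pseudo-uniform value in [0, 1)
    return "experiment" if bucket < experiment_share else "control"

# Start at the first stage; only move to the next one after checking for errors.
current_share = RAMP_STAGES[0]
print(assign("visitor-123", current_share))
```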
PS: If you want to know more / share your knowledge on experiments and CRO tools, you might want to have a look at this CRO Tools Facebook Group.
Yup, A/B is good! No reason not to use it... unless it is too much work... sigh!
Something about this reminds me of Ferris Bueller's Day Off. Aren't you Abe Froman, Sausage King of Seattle? ;)
Nice post, Thomas. This is a great introduction to A/B and MVT testing in a way that's accessible.
Nice! I was just about to dig into my new advanced analytics book, this will be an awesome supplement! What useful books have you recently read (or written)?
Hi Ben,
The best thing I have read recently is some of the research coming out of Microsoft. They have been working on creating a test-driven culture for years. The person in charge of this, Ron Kohavi, has written some excellent articles:
Those articles are fantastic - thanks for the links : )
Excellent post as well Thomas, some really useful stuff there for improving your own CRO efforts
Great Post and something we need to see more of at SEOmoz. Yes, someone preaching the A/A test! We've used these in the past and it's a great sanity check to make sure the testing tool is doing what it's supposed to as well as just getting used to setting up a test and minimizing your risk. I've had success and been pleased with Visual Website Optimizer's tool.
A commenter above mentioned the need/desire to view results based on traffic sources. This is a very valid point and one that all testers should consider. However, there is a counter-point: What do you do if one traffic source's conversion rate drops and another increases, but overall revenue increases? In an ideal situation you would then roll out a control page for each traffic source that reflects the performance of that page + traffic source combo. However, most basic testing solutions outside of Adobe Test and Target, SiteSpect, and other enterprise level tools do not offer this CMS-style testing management. So, I think it's definitely valid to look at performance gain/loss of each traffic source when running a test, but simply moving the needle in the positive direction when performance is viewed as a whole is usually the best option for those starting out with split testing with the majority of the tools that are out there.
Yes! Sanity check is just the right word for what A/A tests are about. There are always people who dismiss all test results as random, and others who accept test results as valid 5 minutes after launch. A/A tests help both become a little more sane.
Thomas -
I love your A/A test thought here. I had never thought of doing that before, but it makes so much sense for the reasons you outline, as well as the potential of showing clients/bosses the power of data.
Great post.
Great post on MVT, especially bringing up the concept of A/A testing. There are a few aspects of MVT which can be intimidating for new users, including the statistical aspect you outlined here. Another item that can be intimidating is knowing which aspects of a page to test. Personally, I have tested everything from colors and layouts to headlines and CTAs with great success.
A/A testing was the big takeaway for me in this post. I've been split testing for years and never thought to test the reliability of the split this way. Looking forward to giving it a run before my next round of A/Bs.
Agreed. We have a custom MVT solution, which I've been using for the past 6 months. Just kicked off an A/A test today - hoping that the results come back insignificant!
Nice one!
Something extra I noticed through my last conversion testing experiment - time is a major variable, or more precisely, the times. Seasonal changes in user behaviour such as search volume spikes, or dips, can skew results by throwing a lot of unexpected users at your test.
It obviously depends on what you're selling, but I've found that if you're running an experiment over an extended period of time, it's best to try and avoid holidays such as Christmas.
Google Insights can be useful for helping to minimise the effects of time as a variable as it can give you an idea of when search volume tends to be most consistent.
For the models used in CRO, I think the influence of time is already accounted for. There are two prerequisites: the conversion process doesn't change over time (it is stationary) and follows a normal distribution (for infinite data sets). But such a process doesn't exist in the real world. That's why testing tool developers already add shifts to the distribution, and the time effect is non-significant. Anyway, Google Insights is a great tool, but in my view for other goals.
Nice Post.
You could add click-density tools like crazyegg, clicktale, etc. to your experiment pages to get more insight. For example, if you're split testing a form page, you can see the form interactions between your experiment pages, i.e. drop-off and abandonment on specific parts of the form. This then allows you to monitor page element interactions from A/B split tests.
Fabian
On the reliability of the test: for an A/B test I like to run the data through analytics segmented by medium. Often, just looking at what percentage of each medium was sent to each page showed that the test actually had a significant bias, and that if the traffic had been split equally by medium, the result would sometimes have changed.
That's a great approach. Software such as Google Website Optimizer and Visual Website Optimizer does not control for such variance, although it definitely ought to, to avoid false positives and false negatives. Hopefully they will get those features right some day.
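For anyone who wants a quick sanity check in the meantime, here is a rough sketch (made-up visit counts, and assuming scipy is available) of a chi-square test for whether the split is independent of traffic medium:

```python
# A rough sketch (made-up visit counts, assuming scipy is installed): a chi-square
# test of whether the A/B split is independent of traffic medium. A low p-value
# suggests the split itself is biased towards some mediums.
from scipy.stats import chi2_contingency

# rows: mediums, columns: visitors sent to page A vs page B
visits_by_medium = [
    [1200, 1180],   # organic
    [800, 950],     # ppc
    [300, 310],     # referral
]

chi2, p_value, dof, expected = chi2_contingency(visits_by_medium)
print(f"p-value for 'split is independent of medium': {p_value:.3f}")
```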
Yes @Overtake, this is so true. The problem with A/B split testing is that page A could skew more towards PPC and targeted keywords while page B skews towards SEO, which would create a biased test. Overall, great post.
A good post, except for the part where you recommend ramping up traffic gradually.
Every experience should be thoroughly QA'd before launch to identify errors. For enterprise-level organisations, any degree of imperfection could spell disaster in terms of lost revenue - something which is fine if it's part of the experiment, but definitely not fine if it's because of bugs!
Test first, ensure perfection, then launch in full.
This is a great place to start if you're not familiar with testing. This illustrated guide tackles the simplest to the most advanced testing methods without leaving beginners in the dust. Well done.
I didn't fully understand the points that you made, but the pictures definitely helped (laughs). I have a site that I'd like to get some feedback on. I'm new to SEOMoz, so I don't know if I can private message you? I'll just post my site here, I guess. I'd love to get some feedback from you. https://www.xbox360gamesarefree.webs.com/
Testing and CRO are playing more and more into search engine marketing these days that I've seen in the past. You can learn more about conversions and testing through the weekly #CROchat on Twitter taking place on Thursdays at Noon EST, 9 am Pacific, 5 pm in the UK.
Great post. It's especially useful as a newbie guide. And thanks for the link to the CRO group.
To be honest, the post went right over my head at supersonic speed... some part of it made it to my brain, but it's so highly encrypted I need to meditate for a while to be able to decrypt it...
Awesome post. Great guide for noobs like me.
Great post. Love the imagery. We do tonnes of multivariate testing for our clients. It's by far the best method to increase click-through and conversion. Thanks for making it so easy to understand; I think the UK is a little slow on the uptake of this great tech.
Thanks
George
Great post - we are looking at doing a number of CRO tests in the near future to tie in more with specialised adwords campaigns.
The info presented here is very easy to understand (and present to other members within our company).