At our recent SearchLove conferences, I’ve been talking about things we need to do differently as marketers amidst the big trends that are reshaping search. My colleague Tom Anthony, who heads up the R&D team at Distilled, spoke about 5 emerging trends:
- Implicit signals
- Compound queries
- Keywords vs intents
- Web search to data search
- Personal assistants
All of these trends are powered by Google’s increasing reliance on machine learning and artificial intelligence, and they mean that ranking factors are harder to understand, less predictable, and less uniform across keywords. The system is becoming so complex that we often can't really know how a change will affect our own site until we roll it out. This lack of transparency and confidence in results has two major impacts on marketers:
- It damages our ability to make business cases to justify targeted projects or initiatives (or even just to influence the order in which a technical backlog is addressed)
- It raises the ugly possibility of seemingly good ideas having unforeseen negative impacts
You might have seen the recent news about RankBrain, Google’s name for the application of some of this machine learning technology. Before that announcement, I presented this deck which highlighted four strategies designed to succeed in this fast-changing world:
- Desktop is the poor relation to mobile
- Understand app search
- Optimize for what would happen if you ranked
- Test to figure out what Google wants from your site
It’s this last point that I want to address in detail today — by looking at the benefits of testing, the structure of a test, and some of the methodology for assessing winning tests.
The benefits of A/B testing for SEO
Earlier in the year, the Pinterest engineering team wrote a fascinating article about their work with SEO experiments, one of the first public discussions of a technique that has been in use on a number of large sites for some time now.
In it, they highlighted two key benefits:
1. Justifying further investment in promising areas
One of their experiments concerned the richness of content on a pin page:
For many Pins, we picked a better description from other Pins that contained the same image and showed it in addition to the existing description. The experiment results were much better than we expected ... which motivated us to invest more in text descriptions using sophisticated technologies, such as visual analysis.
– Pinterest engineering blog
Other experiments failed to show a return, which allowed them to focus their efforts far more aggressively than they otherwise could have. In the case of the richer descriptions, this work ultimately resulted in an uplift of almost 30% to those pages.
2. Avoiding disastrous decisions
For non-SEO-related UX reasons, the Pinterest team really wanted to be able to render content client-side in JavaScript. Luckily, they didn’t blindly roll out a change and assume that their content would continue to be indexed just fine. Instead, they made the change only to a limited number of pages and tracked the effect. When they saw a significant and sustained drop, they turned off the experiment and cancelled plans to roll out such changes across the site.
In this case, although there was some ongoing damage done to the performance of the pages in the test group, it paled in comparison to the damage that would have been done had the change been rolled out to the whole site at once.
How does A/B testing for SEO work?
Unlike regular A/B testing that many of you will be familiar with from conversion rate optimization (CRO), we can’t create two versions of a page and separate visitors into two groups each receiving one version. There is only one googlebot, and it doesn’t like seeing near-duplicates (especially at scale). It’s a bad idea to create two versions of a page and simply see which one ranks better; even ignoring the problem of duplicate content, the test would be muddied by the age of the page, its current performance, and its appearance in internal linking structures.
Instead of creating groups of users, the kind of testing we are proposing here works by creating groups of pages. This is safe — because there is just one version of each page, and that version is shown to regular users and googlebot alike — and effective because it isolates the change being made.
In general, the process should look like this:
- Identify the set of pages you want to improve
- Choose the test to run across those pages
- Randomly group the pages into the control and variant groups (a minimal sketch of this step follows the list)
- Measure the resulting changes and declare a test a success if the variant group outperforms its forecast while the control group does not
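As a minimal sketch of the random grouping step (the page URLs and the roughly 50/50 split below are purely illustrative; in practice you would pull the page list from your CMS or analytics and persist the assignment for the life of the test):

```r
# Minimal sketch: randomly split a set of similar pages into control and variant buckets.
# The URLs below are hypothetical; the split is roughly 50/50 rather than exactly equal.
set.seed(42)  # make the assignment reproducible

pages <- sprintf("/product/%d/", 1:1000)          # hypothetical page URLs
assignment <- sample(c("control", "variant"),      # random assignment per page
                     length(pages), replace = TRUE)

buckets <- split(pages, assignment)
lengths(buckets)  # sanity check: the two groups should be of similar size
```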
All A/B testing needs a certain amount of fancy statistics to understand whether the change has had an effect, and its likely magnitude. In the case of SEO A/B testing, there is an additional level of complexity: our two groups of pages are not even statistically identical. Rather than simply comparing the performance of the two buckets of pages directly, we instead need to forecast the performance of both sets, and declare an experiment a success when the control group matches its forecast and the variant group beats its forecast by a statistically significant amount.
Not only does this cope with the differences between the groups of pages, but it also protects against site-wide effects like:
- A Google algorithm update
- Seasonality or spikes
- Unrelated changes to the site
(Since none of these things would be expected to affect only the variant group).
The statistics and underlying mathematics behind all of this are quite hairy in places, but if you are interested in learning more, you can check out:
- The section of my SearchLove presentation that covered this briefly
- Predicting the present with Bayesian structural time series [PDF]
- Inferring causal impact using Bayesian structural time series [PDF]
- CausalImpact R package
- Finding the ROI of title tag changes
- My colleague Ben Estes has also written about R and analytics forecasting
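As a concrete (and heavily simplified) illustration of the counterfactual approach, here is a sketch using the CausalImpact package listed above. It assumes a daily export of organic sessions for the variant and control groups; the file name, column names, and deployment date are hypothetical. The control series acts as the covariate from which the counterfactual for the variant group is built.

```r
# Minimal sketch of counterfactual forecasting with the CausalImpact R package.
# The file name, column names and deployment date below are hypothetical.
library(CausalImpact)
library(zoo)

traffic <- read.csv("organic_sessions_by_group.csv")  # columns: date, variant_sessions, control_sessions
dates   <- as.Date(traffic$date)

# Response series first, then the control series used to build the counterfactual.
series <- zoo(cbind(traffic$variant_sessions, traffic$control_sessions), dates)

deploy_date <- as.Date("2015-11-01")                  # hypothetical deployment date
pre.period  <- c(min(dates), deploy_date - 1)
post.period <- c(deploy_date, max(dates))

impact <- CausalImpact(series, pre.period, post.period)
summary(impact)   # estimated uplift vs. the counterfactual, with credible intervals
plot(impact)      # observed vs. predicted, pointwise and cumulative effects
```

Running a similar analysis with the control group as the response (against a covariate you believe was unaffected by the test, such as site-wide or another channel's traffic) is what lets you check that the control merely matched its forecast while the variant beat its own.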
Good metrics for measuring the success of tests
We generally advise that organic search traffic is the best success metric for these kinds of tests — often coupled with improvements in rankings, as these can sometimes be detected more quickly.
It is tempting to think that rankings alone would be the best metric of success for a test like this, since the whole point is to figure out what Google prefers. At the very least, we believe rankings must be combined with traffic data because:
- It’s hard to identify the long tail of keywords to track in a (not provided) world
- Some changes could improve clickthrough rate without improving ranking position — and we certainly want to guard against the opposite
You could set up a test to measure the improvement in total conversions between the groups of pages, but on many sites this is likely to converge too slowly to be practical. We generally take the pragmatic view that as long as a page remains focused on the same topic, growing its search traffic is a valid goal. In particular, it's important to note that, unlike in a CRO test (where traffic is assumed to be unaffected by the test), conversion rate is a very bad metric for SEO tests: the visitors you're already getting are likely the most qualified ones, so doubling the traffic will increase, but not double, the total number of conversions. In other words, conversion rate will fall even though the change was clearly worthwhile. (Purely as an illustration: going from 1,000 visits and 50 conversions to 2,000 visits and 80 conversions drops the conversion rate from 5% to 4% while adding 30 conversions.)
How long should tests run for?
One advantage of SEO testing is that Google is both more "rational" and consistent than the collection of human visitors that decide the outcome of a CRO test. This means that (barring algorithm updates that happen to target the thing you are testing) you should quickly be able to ascertain whether anything dramatic is happening as a result of a test.
In deciding how long to run tests for, you first need to decide on an approach. If you simply want to verify that tests have a positive impact, then due to the rational and consistent nature of Google, you can take a fairly pragmatic approach to assessing whether there's an uplift — by looking for any increase in rankings for the variant pages over the control group at any point after deployment — and roll that change out quickly.
If, however, you are more cautious or want to measure the scale of impact so you can prioritize future types of tests, then you need to worry more about statistical significance. How quickly you will see the effect of a change is a factor of the number of pages in the test, the amount of traffic to those pages, and the scale of impact of the change you have made. All tests are going to be different.
Small sites will find it difficult to get statistical significance for tests with smaller uplifts — but even there, uplifts of 5–10% (to that set of pages, remember) are likely to be detectable in a matter of weeks. For larger sites with more pages and more traffic, smaller uplifts should be detectable.
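For the quick, pragmatic approach described above, even a rough before/after comparison of tracked rankings for the two groups can flag whether anything dramatic is happening. A minimal sketch (with a hypothetical rank-tracking export and deployment date):

```r
# Minimal sketch of the pragmatic check: compare the average tracked rank of the
# control and variant groups before and after the change was deployed.
# The file name, column names and deployment date are hypothetical.
library(dplyr)

ranks <- read.csv("daily_ranks.csv")        # columns: url, group, date, rank
deploy_date <- as.Date("2015-11-01")        # hypothetical deployment date

ranks %>%
  mutate(period = ifelse(as.Date(date) < deploy_date, "pre", "post")) %>%
  group_by(group, period) %>%
  summarise(mean_rank = mean(rank, na.rm = TRUE)) %>%
  arrange(group, period)
```

If the variant group's average rank improves (i.e. falls) while the control group's stays flat, something is probably working; the forecasting approach described earlier is what you would reach for to quantify the effect.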
Is this a legitimate approach?
As I outlined above, the experimental setup is designed specifically to avoid any issues with cloaking, as every visitor to the site gets the exact same experience on every page — whether that page is part of the test group or not. This includes googlebot.
Since the intention is that improvements we discover via this testing form the basis for new and improved regular site pages, there is also no risk of creating doorway/gateway pages. These should be better versions of legitimate pages that already exist on your site.
It is obviously possible to design terrible experiments and do things like stuffing keywords into the variant pages or hiding content. This is as inadvisable for A/B tests as it is for your site in general. Don’t do it!
In general, though, whereas a few years ago I might have been worried that the winning tests would bias towards some form of manipulation, I think that's less and less likely to be true (for context, see Wil Reynolds’ excellent post from early 2012 entitled how Google makes liars out of the good guys in SEO). In particular, I believe that sensibly-designed tests will now effectively use Google as an oracle to discover which variants of a page most closely match and satisfy user intent, and which pages signal that to new visitors most effectively. These are the pages that Google is seeking to rank, and whether we are pleasing algorithms designed to please people or pleasing people directly isn’t too important — we’ll converge on the right result.
What are the downsides?
So, this all sounds great. Why isn’t everyone doing it?
Well, the truth is that it's quite hard. Not only do most content management systems (CMS) fail to offer the ability to make changes to arbitrary groups of pages, but it's also hard to gather and analyze the data needed to come to the right conclusions. There are also theoretical limitations, even on big sites, particularly around understanding and analyzing the effects of changes to things like internal linking structure, which cascade through the site in unpredictable ways.
We do, however, know of a handful of large sites, with tons of traffic and huge development resources, that have gone down this path and are reaping substantial rewards from it.
Although there will always be sizes of website and levels of traffic below which it's uneconomical or impractical to perform sensible tests, we want to make the ability to run these tests available to a much wider audience than currently has it. To achieve this, we've been working on our own platform, designed both to make it easy to run tests and to gather and analyze the output. (It also happens to make it easy to roll out quick changes that are hard to get bumped up your development backlog, for whatever reason.)
Distilled’s Optimization Delivery Network (ODN)
You can read more about the tool in my launch announcement over on the Distilled blog.
As I said over there:
We are calling this type of platform an Optimization Delivery Network or ODN. It works like this:
- It sits in your web stack like a Content Delivery Network (CDN) (or behind your CDN if you are using one).
- It allows you to make arbitrary changes to the HTML (and HTTP headers) of any page or group of pages on your website — operating a little like a CMS over the output of your CMS and avoiding the need for a lengthy wait for your development backlog.
- In addition, it makes it possible to make certain kinds of changes to subsets of pages in order to test to see what will work best.
If you’re interested in hearing more, seeing a demo, or even signing up to the beta, please go ahead and fill out our form.
Decent post. An ODN sounds like a fantastic idea, but it makes you wonder why no one has thought of it and created one until now. I think the majority of businesses/marketers have seen the need for something like this for years.
Hey David! I think the answer to why nobody has made one until now is that it's hard! There are lots of details to worry about, so it has taken us 7 months of work just to get to where we are now, but I think we are now at a place where it is viable and capable of providing a huge amount of value. :)
You can try out counterfactual forecasting using the R package that Will mentions here: https://www.distilled.net/resources/statistical-forecasting-for-seo-analytics-and-a-free-tool/
Now that's how sponsored content should be done!
I really like most of the content published here. However, over the last couple of months I've gotten the feeling that the Moz Blog is becoming a place where Moz-recommended companies/partners can plug new products (this article) and/or republish content from their site (like this WBF from Dejan and this article as well), or just put out some generic truths like this one.
To stay on topic: it's nice to outline the general concept of this tool, but this article would have been much more interesting if it had included some actual results. I understand that it's still at an early stage, but I would imagine that before publicly announcing it, it had been tested and had generated some positive results. Talking about those results would have made this article really worthwhile. The Pinterest article mentioned in the post is, for me, the main takeaway and is also what I would have expected this post to look like.
Hey Dirk,
Sorry you didn't get as much out of this as you hoped. I chose to talk about this at recent conferences and to write it up here because (a) I've been getting excited about the potential and (b) I've been doing a lot of thinking about this area - along with the rest of the team here.
My biggest concern was that it could become something that didn't add enough value to the audience - which is why I spent 90%+ of the effort (and of the post!) writing about the general principles and giving away essentially all the things we've learned so far. If you want to go deeper into the details, I'd encourage you to check out the links in the middle to the deeper mathematics / statistics - there is a ton of meat in there and I guarantee that'll have you learning new things for days (if not weeks) to come.
Will,
Really appreciate that you take the time to answer. I have no doubt that A/B testing is a promising technique and that you're really excited about it as a tool. As you already indicate in your article, the complexity of doing this is daunting and puts this topic out of reach for most of us. As a result, for me it looks a bit too much like a pre-sales text (hey, we built this great and very complex tool which can do incredible stuff: come over to Distilled and try it). The Pinterest article is also out of reach (and as such has equally little practical value) but it's still an interesting read (much like the articles written on the Tech blog here on Moz, which are just fun to read even if I understand only half of them).
Gotcha - in which case - at the other end of the complexity spectrum, I think it'd be worth checking out section 3 of my SearchLove presentation (linked above) which gives some practical ideas and hints for how to think about ranking factors (and ranking more generally) in a world where Google is applying machine learning technology to user happiness metrics.
Hope that helps!
In order to truly understand Google's ranking factors, split testing is a good idea, because it is the only way to truly know what works and what doesn't. However, there is a danger in relying completely on the results of the test. No matter what the results are, it is important to consistently follow an effective SEO program, which would include on-site SEO, creating excellent, informative content, and continually using social media. That said, the results of split tests can give valuable insights into the evolving nature of SEO.
I smell a WordPress plugin coming soon.
Will, maybe this is covered in one of the other presentations or links (haven't had a chance to read through yet), but have you found in testing that the control and variant pages need to have very similar page metrics prior to the test, especially traffic? It seems like there may be many more factors that would influence the test other than the proposed changes (depending on what those changes are) when you are testing two separate pages or groups of pages. Yes, you would be testing on percentages of traffic and ranking increases or decreases, but might that curve be somewhat skewed if the original page metrics are not near exact? Or have you found a method to compensate for this? Great concept. Thanks!
Hey Blake,
Yep - you are on to a good point here (and an important one). You cannot rely on the page metrics etc. being the same, and any approach that tries to find two comparable sets of pages is doomed to suffer from all sorts of problems.
We do various things to offset this in a number of ways, and the exact blend of spices is part of our secret sauce, but a great starting place is the counterfactual forecasting paper from Google (here; see also this post, which will help wrap your head around it). A good post for getting started with this technique is this one by Mark Edmondson.
Using this method, you can actually account for variations between two groups of pages. You'll see that in Mark's post he uses a set of pages to forecast its own traffic forward; then, by seeing whether that set of pages performs above or below the forecast, he can try to establish whether a change had a positive impact. You can use a similar technique with multiple groups of pages, which allows you to account for the variation between those groups and for any additional outside interventions that occur during the period, improving your accuracy.
There are lots of gotchas and we are still working on it, but once we are a bit more established we'll try to do some follow-up posts somewhere releasing more of the details (and possibly some code down the line).
Thanks for the insightful question! :)
Tom,
Great, I'll have a look. I have played with your Distilled Forecaster tool a bit, and I can see where that can be useful in a testing situation such as this (and I can see some great uses for it once there are further improvements, like multiple data points). Thanks for the feedback.
Hi Tom and Blake, I came across your question and link in my blog referrals, thanks for that!
The CausalImpact package has a concept of control time series that does something to mitigate what you're talking about, Blake; that is, it tries to account for other variation that may have happened during the test period. If you are using GA data, then you can import SEO traffic and a control segment in my app "GA Effect", which just wraps CausalImpact in an interface that fetches GA data.
It is here: https://gallery.shinyapps.io/ga-effect/
I usually do something like an SEO segment with a control of direct traffic or direct + referrals. My app only uses one control, but if you use the R package directly then you can have many control channels. The original paper used traffic segmented by country, to test if a specific national campaign had an effect.
To build on Tom's response, the CausalImpact R package is a cool and relatively simple way of seeing "what would have happened if you did nothing" (i.e. the counterfactual) compared to what actually happened. However, there is still some setup required, in that the other time series you're using to build the counterfactual should be somewhat related.
In the Google paper it says:
"Such control series can be based, for example, on the same product in a different region that did not receive the intervention or on a metric that reflects activity in the industry as a whole."
Additionally, if you don't necessarily want to use the Bayesian model, you can more simply use a linear regression model against the various cohorted variants. That way you'll know their behaviors are relatively the same even though the orders of magnitude might be off. This is actually a fairly common technique when evaluating things like regional offline marketing campaigns.
With SEO specifically, though, one of the factors that is harder to account for is the time it takes for pages to be indexed or re-indexed so that the change is actually seen. If you run the test "long enough" then you won't need to worry about this, since you can assume that the search engine has discovered, reindexed, and reranked the pages being tested. But it's a good rule of thumb to understand your re-index rate first, especially for the section being tested, before concluding the test.
Great concept! I would love to see this in action!
Amazing post, had to bookmark 5 different links. Thank you!
This is AWESOME. Thanks, Moz, for delivering yet another amazing blog post. These SEO split testing changes are going to help me big time. Thanks again!
We are just planning to make a few pages mobile responsive for a client's website. They are currently ranking number 1 for many keywords (though the pages are not mobile responsive). The problem is they are losing a number of conversions due to the fact that more than 50 percent of visits are from mobile devices. This test should help us in slowly rolling out a split test strategy to make sure that making the pages responsive won't affect their current links.
This is a good method but how long can we change to B/A Test Changes?
Hi there. If I've understood you correctly, you're asking about how long tests should run for? I addressed that in the article:
"uplifts of 5–10% (to that set of pages, remember) are likely to be detectable in a matter of weeks"
Hope that helps. Let me know if I have misunderstood you.
Great article, thnx for the information!
A/B testing is a simple and cost-effective method in comparison with other product analytics instruments. It is a good UX tool which allows you to improve your product and increase the efficiency of its monetization. I can recommend a good service that will help you with A/B testing - Applead.
[link removed by editor]
Works only if you've got decent traffic :(
SEO A/B testing... yes, sure (:
Spam method vs. spam method.
Google manipulation is not allowed, no matter what you call it.
Interesting - not sure why you'd think it would be only useful / only used for testing spam methods.
Frankly, I love doing experiments in SEO, as they teach me more than my usual practices do. Really useful A/B outcomes.
Useful article, thanks for sharing it with us.
Great article and post. I have been split testing my landing pages for all of my paid campaigns, and over time I have learned that it's worth testing not just two pages but two analytics sources as well. It's good to compare reports not just for two landing pages but from two sources too, so a third-party tool like GoStats helps a lot.
Great article, and I think this will only become more important in the future as Google makes more changes to the algorithm.
Nice points, Will Critchlow. Nice presentation too!
It's a really advanced SEO technique. I'd never seen testing used in SEO marketing before, so from this blog I learned about A/B testing and its benefits for SEO. Really, many thanks to Will for introducing a new concept in website optimization. I also downloaded your slide presentation to improve my skills; the presentation is superb! Nice examples and images here.
LOL, by the time you get your A/B results, Google's algorithm has already changed, especially if it is that obvious and measurable. The exceptions would be stuff like title tags, original content, authority of links, etc. Not like the old days, when stuff didn't change for years. Slow algorithm change made for easy reverse engineering :)
A truly unique post.
This article is very different from traditional SEO posts. For me, this post was a great learning experience.