Sampling is a process used in statistics when it's infeasible or impractical to analyse all the data that exists. Instead, a small, randomly selected subset is used to keep things manageable. Many analytics platforms use some form of sampling to keep report loading times in check, and there seem to be three schools of thought when it comes to sampling in analytics. There are those who are terrified of it, insisting on unsampled versions of any report. Then there are those who are relaxed about it, trusting the statistical logic. And then, lastly, there are those who are oblivious.
All three are misguided.
Sampling isn't something to fear, but, in Google Analytics in particular, it can't always be trusted. Because of that, it's definitely worth your time to understand when it occurs, how it affects your work, and how it can be avoided.
When it happens
You can always tell when sampling is being used, because of this line at the top of every report:
If the percentage is less than 100%, then sampling is in progress. You'll notice above that I've produced a report based on more than half a billion sessions without any sampling — sampling isn't just about the sheer number of sessions involved in a report. It's about the complexity of what you're asking the platform to report on. Contrast the below (apologies for the small screenshots; I wanted to make sure the whole context was included, so have added captions explaining just what you're looking at):
No segment applied, report based on 100% of sessions
Segment applied, report based on 0.17% of sessions
The two are identical apart from the use of a segment in the second case. Google Analytics can always provide unsampled data for top-line totals like that first case, but segments in particular are very prone to prompting sampling.
The exact same level of sampling can also be induced through use of a secondary dimension:
Secondary dimension applied, report based on 0.17% of sessions
A few other specialised reports are also prone to this level of sampling, most notably:
- The Ecommerce Overview
- "Flow Reports"
Report based on 0.17% of sessions
Report based on <0.1% of sessions
To summarise so far, sampling can happen when we use:
- A segment
- More than one dimension
- Certain detailed reports (including Ecommerce Overview and AdWords Campaigns)
- "Flow" reports
The accuracy of sampling
Sampling, for the most part, is actually pretty reliable. Take the below two numbers for organic traffic over the same period, one taken from a tiny 0.17% sample, and one taken without sampling:
Report based on 0.17% of sessions, reports 303,384,785 sessions via organic
Report based on 100% of sessions, reports 296,387,352 sessions via organic
The difference is just 2.4%, from a sample of 0.17% of actual sessions. Interestingly, when I repeated this comparison over a shorter period (last quarter), the size of the sample went up to 71.3%, but the margin of error was fairly similar at 2.3%.
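To put a number on that gap yourself, the relative error is just the difference between the sampled and unsampled figures, divided by the unsampled figure. A quick sketch using the session counts from the captions above:

```python
# Organic session counts from the two report captions above.
sampled = 303_384_785    # estimate from the 0.17% sample
unsampled = 296_387_352  # figure from the full, unsampled report

# Relative error of the sampled estimate against the unsampled figure.
relative_error = abs(sampled - unsampled) / unsampled
print(f"{relative_error:.1%}")  # roughly 2.4%
```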
It's worth noting, of course, that the deeper you dig into your data, the smaller the effective sample becomes. If you're looking at a sample of 1% of data and you notice a landing page with 100 sessions in a report, that's based on 1 visit — simply because 1 is 1% of 100. For example, take the below:
Report based on 45 sessions
Eight percent of a whole year's traffic to Distilled is a lot, but 8% of organic traffic to my profile page is not, so we end up viewing a report (above) based on 45 visits. Whether or not this should concern you depends on the size of the changes you're looking to detect and your threshold for acceptable levels of uncertainty. These topics will be familiar to those with experience in CRO, but I recommend this tool to get you started, and I've written about some of the key concepts here.
In extreme cases like the one above, though, your intuition should suffice: that click-through from my /about/ page to /resources/...tup-guide/ claims to feature in 12 sessions, and is based on 8.11% of sessions. As 8% of 12 is roughly 1, we know that this figure is in fact based on a single session. Not something you'd want to base a strategy on.
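That back-of-the-envelope check can be written down directly: multiply the reported figure by the sampling rate to estimate how many real sessions it rests on. A minimal sketch, using the numbers from the screenshot above:

```python
reported_sessions = 12   # sessions claimed in the report
sampling_rate = 0.0811   # report based on 8.11% of sessions

# Estimated number of real, observed sessions behind the reported figure.
underlying = reported_sessions * sampling_rate
print(round(underlying))  # about 1 session
```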
If any of the above concerns you, then I've some solutions later in this post. Either way, there's one more thing you should know about. Check out the below screenshot:
Report based on 100% of sessions, but "All Users" only accounts for 38.81% "of Total"
There's no sampling here, but the number displayed for "All Users" in fact accounts for only 38.8% of sessions. This happens when a report has more than 1,000,000 rows (as indicated by the yellow "high-cardinality" warning at the top of the report) and a segment is applied: the rows grouped into "(other)" are hidden whenever a segment is active, so they're missing from the segment totals. The numbers in the individual rows below remain as accurate as they would be otherwise (apart from "(other)" being absent), but the segment totals at the top end up of limited use.
So, we've now gone over:
- Sampling is generally pretty accurate (+/- 2.5% in the examples above).
- When you're looking at small numbers in reports with a high level of sampling, you can work out how many sessions they're based on.
- For example, 1% sampling showing 100 sessions means 1 session was the basis of the number in the report.
- You should keep an eye out for that yellow high-cardinality warning when also using segments.
What you can do about it
Often it's possible to recreate the key data you want in alternative ways that do not trigger sampling. Mainly this means avoiding segments and secondary dimensions. For example, if we wanted to view the session counts for the top organic landing pages, we might ordinarily use the Landing Pages report and apply a segment:
Landing Pages report with Organic Traffic segment, based on 71.27% of sessions
In the above report, I've simply applied a segment to the landing pages report, resulting in sampling. However, I can get the same data unsampled — in the below case, I've instead gone to the "Channels" report and clicked on "Organic Search" in the report:
Channels > Organic Search report, with primary dimension "Landing Page", based on 100% of sessions
This takes me to a report where I'm only looking at organic search sessions, and I can pick a primary dimension of my choice — in this case, Landing Page. It's worth noting, however, that this trick does not function reliably — when I replicated the same method starting from the "Source / Medium" report, I still ended up with sampling.
A similar trick applies to custom segments — if I wanted to create a segment to show me only visits to certain landing pages, I could instead write a regex advanced filter to replicate the functionality with less chance of sampling:
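An advanced filter regex like that behaves as ordinary alternation. The sketch below shows the idea in Python's `re` module; the landing-page paths are made up purely for illustration:

```python
import re

# A hypothetical advanced-filter regex matching two specific landing pages.
pattern = re.compile(r"^/about/$|^/resources/$")

landing_pages = ["/about/", "/blog/post-1/", "/resources/", "/contact/"]
matches = [p for p in landing_pages if pattern.search(p)]
print(matches)  # ['/about/', '/resources/']
```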
Lastly, there are a few more extreme solutions. Firstly, you can create duplicate views, then apply view-level filters, to replicate segment functionality (permanently for that view):
Secondly, you can use the API and Google Sheets to break up a report into smaller date ranges, then aggregate them. My colleague Tian Wang wrote about that tool here.
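The idea behind that approach can be sketched in a few lines: split the overall date range into calendar months, request each month separately (each smaller request is less likely to trip sampling), then sum the results. The API request itself is stubbed out below; `fetch_sessions` is a placeholder, not a real GA API call:

```python
from datetime import date, timedelta

def month_ranges(start, end):
    """Split [start, end] into consecutive calendar-month date ranges."""
    ranges = []
    current = start
    while current <= end:
        # First day of the following month, handling the December rollover.
        next_month = date(current.year + current.month // 12,
                          current.month % 12 + 1, 1)
        month_end = min(next_month - timedelta(days=1), end)
        ranges.append((current, month_end))
        current = month_end + timedelta(days=1)
    return ranges

def fetch_sessions(start, end):
    # Placeholder for a real GA API request covering just this range.
    return 0

chunks = month_ranges(date(2016, 1, 1), date(2016, 3, 15))
total = sum(fetch_sessions(s, e) for s, e in chunks)
```

Each tuple in `chunks` becomes one small, hopefully unsampled, request; the per-chunk results are then summed client-side.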
Lastly, there's GA Premium, which, for a not inconsiderable cost, gets you this button:
To summarise, here's how you can avoid sampling:
- You can construct reports differently to avoid segments or secondary dimensions and thus reduce the chance of sampling being triggered.
- You can create duplicate views to show you subsets of your data that you'd otherwise have to view sampled.
- You can use the GA API to request large numbers of smaller reports then aggregate them in Google Sheets.
- For larger businesses, there's always the option of GA Premium to receive unsampled reports.
Discussion
I hope you've found this post useful. I'd love to read your thoughts and suggestions in the comments below.
Great post. It would be a great help in making the best use of sampled data in analytics.
Great sharing, thanks! I've always wondered about the accuracy of sampling in Google Analytics reports. In your opinion, would going for Google Analytics Premium (Analytics 360 now) be a good sustainable solution for sites with high traffic (over 1 billion hits)?
The main advantage GA premium gets you is unsampled exports, but they're not without their limits - row limits still apply, you get a limited number of exports per day, and they take hours to process. GA premium also increases your available slots for things like custom dimensions.
Whether that's worth the normally 6-figure price is up to you.
Last time I heard, GA Premium starts around $150k a year but it depends on your website's traffic, so can go a lot higher than that.
We upgraded to Premium about 6 months ago and the first reason was sampling, we had over 500M monthly hits while the free Analytics hit limit is 10M before sampling.
At the same time, if all that Premium (360) can do for you is give you unsampled data, then this is a very expensive feature. And the main reason we upgraded was that Premium has full integration with Google's DFP.
One more thing, $150K is a myth. Google wants us to think it's 150 so that any price below would sound really good to us. Google only sells Premium through resellers, and I spoke to 6 of them before we signed. The actual price range is around 100-120K with a yearly support by the reseller.
If you have over 1B hits, then the next package is 1B to 5B hits and it's an extra $25K/year (more or less).
Thanks Igal! Lots of information we can take away from your advice and experience of switching to Premium. We're also looking at leveraging other features offered by the 360 Suite beyond just having the hit limit raised, so that we can justify the hefty price we'd pay to move to Premium.
We'll definitely look into discussing this in detail with more resellers to get a more holistic view of the support and packages we can get.
Great sharing and helpful information.
The first thing you notice when you switch from the free Google Analytics sampled reports to GA Premium or Adobe Analytics, is that report pages take a lot longer to load. Especially if you analyse yearly trends and work with hundreds of millions of sessions, you miss the quick sampled GA reports. :-)
Hi Tom!
I'm taking the Google Partners certification and the truth is that this post helps me a lot! If questions arise, I'll ask you, ok? :)
Thank you!
Thanks Tom, it's funny we were having a conversation this morning about sampling inconsistencies. Knowing you can request smaller subsections of reports will help us ramp up productivity for a handful of our larger clients. No doubt, this will save us a good chunk of time. Thanks for this!
Hi Tom
Excellent guide to better understand and use Google analytics. The truth is that it is a tool that can provide us with information that no one else can give us.
A related question about Google Analytics: is it best to install it through a plugin, or better to copy the JavaScript code?
Thanks
Either should be equivalent.
I tend to prefer using the raw code or GTM, however, as this allows more control over customisation later on.
I always prefer to implement the code rather than use plugins. It's simple and doesn't burden the CMS much, but it's just a personal preference (if you work with WordPress, for example, it's best not to go over 20 plugins; the fewer you have the better). Functionality is the same.
I was unaware of these little tricks to "get around" sampling. Good knowledge.
Worth a mention: Google Analytics Premium's "not inconsiderable cost" starts at a whopping $150,000 per year.
Hey Tom
Great piece of advice! Isn't GA API also used in most SEO tools these days? Are there any advantages using those tools instead of GA in paid. What is the cheapest/effective solution?
The GA API is just as vulnerable to sampling as the interface, unfortunately. However, you can use it to more easily aggregate reports with smaller date ranges, and thus less chance of sampling.
Thanks. I'd love to see a guide on aggregating reports from GA API in spreadsheet. Do you plan to publish it somewhere? Or is there nothing to publish, it just sounds like there are more than 2 steps :)
So the guide I linked to should tell you most of what you need to know.
Beyond that, basically you just need to add an extra dimension of "ga:yearMonth" to each request, then use a combination of that and the "sumifs" formula to aggregate.
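In spreadsheet terms that's a SUMIFS keyed on the ga:yearMonth column; the same aggregation sketched in Python looks like this (the rows are made-up sample data, not real API output):

```python
from collections import defaultdict

# Hypothetical rows of (ga:yearMonth, ga:medium, sessions)
# collected from several smaller per-month requests.
rows = [
    ("201601", "organic", 1200),
    ("201601", "referral", 300),
    ("201602", "organic", 1100),
]

# SUMIFS-equivalent: total sessions per medium across all months.
totals = defaultdict(int)
for _month, medium, sessions in rows:
    totals[medium] += sessions
print(dict(totals))  # {'organic': 2300, 'referral': 300}
```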
Tom, great piece of writing. Worth reading. One quick question about accuracy when the data sample is not huge: is sampling still worth considering?
Thank you
It is worth considering, but the extent to which it's worth considering is something you need to analyse yourself using some of the tools I mention in the post above.
hmm. Got it
Great Article Tom.
What's your take on Sampling when it comes to A/B testing using Google-experiment feature?
Have to say I've not much experience of sampling with the experiment feature. Anything interesting come up?
Thanks Tom! Great article to share!
I have read your article and this is the best article on Google Analytics.
I've used this information with my website and will implement it.
Awesome!
Great article. Most of the time we fear that the sample isn't accurate enough, but you show what a fine tool analytics can be.
thanks a lot
Awesome article! Very informative
Hi there,
It's really great information.
Nice one to share, thank you.
Great article. Thank you.
Thanks Tom , interesting article that I will implement.
great article, Thanks for sharing your detailed analysis on sampling. This is awesome!
Thanks for sharing this article. Now it's time to implement it!
Thanks Tom! Nice article to share! :)
Great Article - To me, there is no justification for spending $150,000 per year for premium. We use Analytics Edge Free Plugin for un-sampled data - not user-friendly, but effective. https://www.analyticsedge.com/
Great article. I've used this information with my domain - screenplay.biz - and it worked well.
I'm confused. your comment references screenplay.biz and your profile page says screenplays.biz. Well, which is it? I always come to Moz's Analytics articles and browse the comments to find the next great new screenplay website... :|
Really enjoyed that, thanks Andrew.