One of the classic problems in SEO is that while complex navigation schemes may be useful to users, they create problems for search engines. Many publishers rely on tags such as rel=canonical, or the parameter settings in Webmaster Tools, to try to solve these types of issues. However, each of these potential solutions has limitations. In today's post, I am going to outline how you can use JavaScript solutions to eliminate the problem altogether.
Note that I am going to keep code to a minimum in this post and focus on how the approach works at a conceptual level; any snippets below are illustrative sketches rather than production code. If you are interested in learning more about Ajax/JSON/jQuery, there are plenty of tutorials and references online you can check out.
Defining the problem with faceted navigation
Having a page of products and then allowing users to sort those products the way they want (for example, from highest to lowest price), or to use a filter to pick a subset of the products (only those over $60), makes good sense for users. We typically refer to these types of navigation options as "faceted navigation."
However, faceted navigation can cause problems for search engines because they don't want to crawl and index all of your different sort orders or all your different filtered versions of your pages. They would end up with many different variants of your pages that are not significantly different from a search engine user experience perspective.
Solutions such as rel=canonical tags and parameter settings in Webmaster Tools have some limitations. For example, rel=canonical tags are treated as "hints" by the search engines, which may choose not to honor them; and even when they are honored, they do not necessarily keep the search engines from continuing to crawl those pages.
A better solution might be to use JSON and jQuery to implement your faceted navigation so that a new page is not created when a user picks a filter or a sort order. Let's take a look at how it works.
Using JSON and jQuery to filter on the client side
The main benefit of the implementation discussed below is that no new URL is created when a user is on one of your pages and applies a filter or sort order. When you use JSON and jQuery, the entire process happens on the client device, without involving your web server at all once the page has loaded.
When a user initially requests one of the product pages on your web site, the interaction looks like this:
The server transfers the page, including the product data, to the browser that requested it. Now, when a user picks a sort order (or filter) on that page, here is what happens:
When the user picks one of those options, jQuery reads and filters (or sorts) the JSON data object that was delivered with the page. Translation: the entire interaction happens within the client's browser, and the sort or filter is applied there. Simply put, the smarts to handle that sort or filter reside entirely within the code on the client device that was transferred with the initial request for the page.
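To make the concept concrete, here is a minimal sketch of the kind of client-side code involved (the element IDs, the sample data, and the renderProductList() helper are all hypothetical, invented for illustration):

// Product data embedded in the page as a JSON object when it was first served
var products = [
    { "name": "Blue Widget", "price": 75 },
    { "name": "Red Widget", "price": 45 },
    { "name": "Green Widget", "price": 120 }
];

// Filter: only products over $60, applied entirely in the browser
$('#filter-over-60').on('click', function () {
    var filtered = $.grep(products, function (product) {
        return product.price > 60;
    });
    renderProductList(filtered); // hypothetical helper that redraws the list in place
});

// Sort: highest to lowest price, again using only the in-memory data
$('#sort-price-desc').on('click', function () {
    var sorted = products.slice().sort(function (a, b) {
        return b.price - a.price;
    });
    renderProductList(sorted);
});

Neither handler makes a request back to the server; everything works against the data that arrived with the page.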
As a result, there is no new page created and no new URL for Google or Bing to crawl. Any concerns about crawl budget or inefficient use of PageRank are completely eliminated. This is great stuff! However, there remain limitations in this implementation.
Specifically, if your list of products spans multiple pages on your site, the sorting and filtering will only be applied to the data set already transferred to the user's browser with the initial request. In short, you may only be sorting the first page of products, and not across the entire set of products. It's possible to have the initial JSON data object contain the full set of pages, but this may not be a good idea if the page size ends up being large. In that event, we will need to do a bit more.
What Ajax does for you
Now we are going to dig in slightly deeper and outline how Ajax will allow us to handle sorting, filtering, AND pagination. Warning: There is some tech talk in this section, but I will try to follow each technical explanation with a layman's explanation about what's happening.
The conceptual Ajax implementation looks like this:
In this structure, we are using an Ajax layer to manage the communications with the web server. Imagine that we have a set of 10 pages; the user has received the first of those 10 pages on their device and then requests a change to the sort order. The Ajax layer requests a fresh set of data from your web server, much like a normal HTML transaction, except that it runs asynchronously in a separate thread.
If you don't know what that means, the benefit is that the rest of the page (things like your main menu, your footer links to related products, and other page elements) can load completely while the process that fetches the data Ajax will display runs in parallel. This can improve the perceived performance of the page.
To support this, the code registers an event handler on a given object (e.g., an HTML element or other DOM object); when a user selects a different sort order, that handler fires and executes the action. The browser performs the work in the background and triggers the callback in the main thread when the response is ready. This happens without a full page refresh; only the content controlled by the Ajax is refreshed.
To translate this for the non-technical reader, it just means that we can update the sort order of the page, without needing to redraw the entire page, or change the URL, even in the case of a paginated sequence of pages. This is a benefit because it can be faster than reloading the entire page, and it should make it clear to search engines that you are not trying to get some new page into their index.
Effectively, it does all of this within the existing Document Object Model (DOM), which you can think of as the basic structure of the document and a spec for the way the document is accessed and manipulated.
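Here is a minimal sketch of what that might look like, assuming jQuery; the endpoint path, parameter names, and the renderProductList() helper are hypothetical placeholders rather than a prescribed API:

// Handler registered on the sort control; fires when the user picks a new sort order
$('#sort-order').on('change', function () {
    var sortOrder = $(this).val();
    // Asynchronous request for the full, re-sorted product set as JSON;
    // the rest of the page keeps working while this runs
    $.getJSON('/products/data', { sort: sortOrder, page: 1 }, function (data) {
        // Only the product listing inside the existing DOM is redrawn;
        // the URL, menu, footer, and other page elements are untouched
        renderProductList(data.products);
    });
});

Pagination works the same way: request page 2 with a different page parameter and redraw the listing in place, still on the same URL.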
How will Google handle this type of implementation?
For those of you who read Adam Audette's excellent recent post on the tests his team performed on how Google reads JavaScript, you may be wondering whether Google will render all of these page variants on the same URL anyway, and whether it will frown on that.
I had the same question, so I reached out to Google's Gary Illyes to get an answer. Here is the dialog that transpired:
Eric Enge: I'd like to ask you about using JSON and jQuery to render different sort orders and filters within the same URL. I.e., the user selects a sort order or a filter, and the content is reordered and redrawn on the page on the client side. Hence no new URL would be created. It's effectively a way of canonicalizing the content, since each variant is a strict subset.
Then there is a second level consideration with this approach, which involves doing the same thing with pagination. I.e. you have 10 pages of products, and users still have sorting and filtering options. In order to support sorting and filtering across the entire 10 page set, you use an Ajax solution, so all of that still renders on one URL.
So, if you are on page 1, and a user executes a sort, they get that all back in that one page. However, to do this right, going to page 2 would also render on the same URL. Effectively, you are taking the 10 page set and rendering it all within one URL. This allows sorting, filtering, and pagination without needing to use canonical, noindex, prev/next, or robots.txt.
If this was not problematic for Google, the only downside is that it makes the pagination not visible to Google. Does that make sense, or is it a bad idea?
Gary Illyes: If you have one URL only, and people have to click on stuff to see different sort orders or filters for the exact same content under that URL, then typically we would only see the default content.
If you don't have pagination information, that's not a problem, except we might not see the content on the other pages that are not contained in the HTML within the initial page load. The meaning of rel-prev/next is to funnel the signals from child pages (page 2, 3, 4, etc.) to the group of pages as a collection, or to the view-all page if you have one. If you simply choose to render those paginated versions on a single URL, that will have the same impact from a signals point of view, meaning that all signals will go to a single entity, rather than distributed to several URLs.
Summary
Keep in mind, the reason why Google implemented tags like rel=canonical, NoIndex, rel=prev/next, and others is to reduce their crawling burden and overall page bloat and to help focus signals to incoming pages in the best way possible. The use of Ajax/JSON/jQuery as outlined above does this simply and elegantly.
On most e-commerce sites, there are many different "facets" of how a user might want to sort and filter a list of products. With the Ajax-style implementation, this can be done without creating new pages. The end users get the control they are looking for, the search engines don't have to deal with excess pages they don't want to see, and signals in to the site (such as links) are focused on the main pages where they should be.
The one downside is that Google may not see all the content when it is paginated. If your incremental pages simply contain more of what's on the first page (for example, lots of very similar products in a paginated list), that isn't much of a concern. Sites whose additional pages contain materially different content, however, might not want to use this approach.
These solutions do require JavaScript coding expertise, but they are not really that complex. If you have the ability to consider a path like this, you can free yourself from trying to understand the various tags, their limitations, and whether or not they truly accomplish what you are looking for.
Credit: Thanks to Clark Lefavour for providing a review of the above for technical correctness.
Hi!!
Nice article, I found it very instructive.
Hi Eric,
One thing that is not addressed in this article is that the URL you call on the server side to deliver the filtered list of products (as JSON or XML) will be crawled by Google.
Let's say that in order to get the JSON list of products, your client-side script makes a GET request to website.com/server-script/?filter=filter1&otherfilter=filter2
Google will always try to crawl those URLs, and will sometimes even index that content.
So I would say that the crawl budget won't be spared just by using AJAX, but by not giving the bots hints about where to find the server-side script that delivers the JSON.
I would like to know your opinion about that, though.
Hi All,
To help illustrate, here is a sample of some potential code:
function showfacet() {
    // Work out the current page name from the path
    var currentPath = window.location.pathname;
    var currentPage = currentPath.substring(currentPath.lastIndexOf('/') + 1);
    // GetXmlHttpObject() and handleHttpResponse() are helpers assumed to be defined elsewhere in the app
    req = GetXmlHttpObject();
    if (req == null) {
        alert("Browser does not support AJAX Request");
        return;
    }
    var proxy = "/app/qs_api.php?p=" + currentPage;
    req.open("GET", proxy, false);
    req.onreadystatechange = handleHttpResponse;
    req.send(null);
}
You will note that there are no clear URLs presented in this. Google can, and may well, sample some of this code to see what happens, but I have not seen any evidence that they do this in any extensive way. Keep in mind what Gary Illyes said above:
"If you have one URL only, and people have to click on stuff to see different sort orders or filters for the exact same content under that URL, then typically we would only see the default content."
For that reason, my experience is that this approach will in fact save you substantial crawl budget.
Hi Eric,
In that case there is a chance that Googlebot starts to crawl /app/qs_api.php?p=
Besides that, it will be treated as a relative URL. So my point here is that even though that's not a real page, Googlebot will spend time crawling it, raising the question of whether those fake URLs will leak some link power too.
Google has officially said that they may crawl things that look like URLs in order to better understand a website, but there is still a lack of feedback about how they end up treating them.
I have seen Google index the content delivered from those AJAX calls and rank the original page for that content; at the same time, I have seen Google index fake URLs that delivered a complete HTML page instead of a JSON object, and then, after weeks, remove those URLs from the index and rank the original page for that content.
I would like to hear an official statement about how Google is handling those URLs and how much we should care about them.
Update: I found this --> https://support.google.com/webmasters/answer/24094...
In Crawl Errors, you might occasionally see 404 errors for URLs you don't believe exist on your own site or on the web. These unexpected URLs might be generated by Googlebot trying to follow links found in JavaScript, Flash files, or other embedded content.
...the link may appear as a 404 (Not Found) error in the Crawl Errors feature in Search Console.
Google strives to detect these types of issues and resolve them so that they will disappear from Crawl Errors.
So, still "Google strives to detect these type of issues" is not helping that much.
proxy="/app/qs_api.php?p="+currentPage;
^^ there is your URL :)
and it is dynamic as well, since you add the p parameter. I have seen Google crawling this stuff in the log files.
Can you add some code please?
As far as I understand your explanation, Google will of course crawl this data. Yes, they might not index it because there is no specific URL, but you still have a problem with your crawl budget (which is usually the bigger problem).
Hi fiacyberz,
if you have this in your HTML:
<a href="#color-white">white</a>
<a href="#color-black">black</a>
<a href="#color-red">red</a>
Then the following script should be executed when the user clicks on one of those links:
$.ajax({
    type: "GET",
    dataType: "json", // or "xml"
    url: "/server-side-script-that-deliver-the-json?color=" + theColorClicked, // theColorClicked = the color the user clicked
    success: function (response) {
        // update the product listing using the JSON list in the response
    }
});
In that case the bots will try to crawl /server-side-script-that-deliver-the-json?color=[the-color-clicked]
and as you can expect, you will end up with the same amount of combinations as if you didn't use AJAX at all.
My approach would be to use POST instead of GET, so Google won't have this "alternative URL navigation", and to not expose any URL at all. Therefore the page that delivers the JSON is the same page you are already on (your IT team will want to kill you, but that's our job, isn't it?).
With that approach the bot won't have any alternative navigation and the crawl budget will be spared. At the same time, PageRank and all the link metrics won't flow through the faceted-navigation pages.
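Something like this rough sketch (selectedColor and the parameter name are just placeholders):

$.ajax({
    type: "POST", // POST, so there is no crawlable GET URL for this facet
    url: window.location.pathname, // same URL as the page the user is already on
    data: { color: selectedColor }, // selectedColor = the color the user clicked
    dataType: "json",
    success: function (response) {
        // update the product listing from the JSON response
    }
});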
Does that make sense to you?
Best
Sure, POST is a much better approach. You don't have to use Ajax (that's only for the user's convenience), but you should be able to combine the two.
If you want to reload the page, take a look at PRG => Post/Redirect/Get.
And don't use a-tags for the filters; labels and invisible input fields are much better (in case Google counts the number of a-tags).
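A rough sketch of what I mean (all names and the endpoint are placeholders):

<label for="color-white">white</label>
<input type="checkbox" id="color-white" name="color" value="white" style="display:none">

// On change, send the selected filter via POST to the same URL (no a-tag, no new crawlable URL)
$('input[name="color"]').on('change', function () {
    $.post(window.location.pathname, { color: this.value }, function (response) {
        // update the product listing from the JSON response
    }, "json");
});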
btw, I hadn't seen that you were asking the same question in a comment... ;-)
Hi, thanks for the tips.
PRG is a good idea as well, but then, as you said, without the "coolness" of not reloading the whole DOM.
Using labels was something I didn't take into account, but sure, it would make it much less probable for Google to treat them as URLs.
A few days ago I noticed that Google is treating something that is not a URL as a relative URL, so it is reporting 404 errors coming from every single page on our shop.
Now I have to deal with IT to try to make them change a piece of code that is totally correct, but that the crazy bot is trying to crawl. Either that, or include those fake URLs in the robots.txt.
That raises the question: would Google transfer link juice through such fake URLs? In that case I would prefer not to use the robots.txt option.
Any opinion about that?
Robots.txt is always the worst solution. It's only reasonable when your server crashes because Google doesn't understand your site. I've seen this once in the last 5 years...
About these URLs... Google used to grab everything that looks like a URL, e.g. data="/fubar", and then domain.com/fubar was crawled, even if this was just in some JavaScript without any connection to an a-tag, a span, or any other HTML code.
So do I understand you correctly that you believe these Ajax/JSON facet links (using POST) are counted by Google when calculating link juice?
If so, to avoid link juice dilution we may need to look into hiding the links completely (such as hiding them if cookies are disabled). The downside here is that facets frequently contain a lot of topically relevant keywords that may increase organic traffic to the pages. So if Google does not consider them when calculating link juice, it may be better to still keep them visible to Google.
Googlebot can POST too (https://googlewebmastercentral.blogspot.com.es/2011...). I see so many POST requests in logs, and it also loves making up any URL that looks remotely like a URL, whether it's plain text or JS code.
Robots.txt blocks for these pesky requests have always been the right answer for me, that is, unless the resource you're blocking affects the display of the page (case study: https://yoast.com/google-panda-robots-css-js/)
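For example, for the qs_api.php endpoint that came up earlier in this thread, a block might look something like this (purely illustrative; the path depends on your own setup):

User-agent: *
Disallow: /app/qs_api.php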
Fiacyberz, why did you say robots.txt is the worst solution? Maybe I missed some of the context?
Sure, they follow POST sometimes. But since you can use the same URL where you have your form as the target, you don't create new URLs, and therefore crawl budget is not wasted.
Since Google indexes (and internally saves) URLs blocked by robots.txt, and of course counts links to these URLs, this is the worst thing you can do. Why not remove (or mask in a good way) these links and put a noindex on the linked page?
You just waste resources with this tactic (which was working very well a few years ago).
Not sure I follow you. Just so I'm clear, I'm talking about blocking 'pages' that shouldn't be crawled as they either have no value to users or Googlebot, and are likely being manufactured by Googlebot as a result of trying to index JS.
There has never been a real calculation of crawl budget, and frankly, that term subtly implies that Google has a finite amount of space/capacity to crawl a site. That's obviously not true. The point about 'crawl budget' that makes some sense is that by diverting the crawler's attention away from useless pages, we should be able to get the pages that should be crawled, crawled more regularly. Often, higher crawl activity = higher organic traffic to a page, but it's not guaranteed. I broadly agree that removing, masking, or obfuscating these URL paths is best, but in real corporate environments, fixes are often limited.
Sure, Google's crawl budget is limited. Each domain has its own limit (which can of course vary if Google thinks a domain is stronger or can handle more crawling).
And since blocking is not removing pages, it is a bad idea. Google will never forget them and will always take them into account.
Never once in my 6 years in SEO have I seen Google fail to crawl and index sites with millions (sometimes billions) of pages. How often they crawl might be determined by a variety of factors, like PageRank, how often content changes, etc. I don't think it's dependent on whether the site being crawled can 'handle' being crawled, unless the server constantly responds with 4xx or 5xx.
You can't always remove pages; it is never that simple. If it were, there would be no need for canonicals, robots.txt, noindex, etc. Removing pages isn't always an option, especially when you get into enterprise environments. Robots.txt blocks have their uses, and in cases where Google is making up URLs via JavaScript (which is what I've been referring to), yes, a block is right, especially if that URL/script/resource needs to be available for the functionality of the site but not crawled.
Take a look at your log files; you will find a lot of crawling issues. Or just take a look at Search Console and the crawl graph. Google has always had problems with crawling.
Most webmasters (like 99%) wouldn't see it or need to deal with it. But when it comes to big domains you have to, because it is so powerful and important. To clarify: for me, big domains start at 100,000 pages, and the bigger the domain, the more important this topic is.
Of course it is not always easy to remove pages, and sometimes it's not even possible (in an acceptable timeframe). Then you can use robots.txt. But it is like putting a patch on broken bones.
I would use robots.txt for two reasons:
- your server crashes because of a broken script and Google is crawling too much (happened once in the last 5 years)
- you want to hide something from Google (some masking methods work this way)
- optional: you can't fix an issue in the next 6 months
I've looked at log files quite a bit (https://moz.com/blog/server-log-essentials-for-seo). When you say problems with crawling, I assume you mean that they overcrawl (invent pages) rather than miss pages. Again, I have never seen Google fail to completely index (or at least crawl through every page of) even the biggest sites. For example, I worked on a site 5 years ago with over 400 million pages, and Google crawled the 400 million and more. Going through those heavy log files was tough, but it was worth seeing how much Google is able to handle; I know firsthand that Google has no problem with how many pages it will crawl. They want absolutely everything, including the hidden web. I don't believe in crawl budget as it's been defined.
Also, the crawl activity graph in Search Console is not a fair representation of what is actually crawled on your site; I've compared this graph to actual log files and it's been way off (30-40% or more) many times. For me, that graph is useful for identifying skim or deep crawls, and potentially for signalling an evaluation before an algorithm update (unconfirmed; this is from experience).
I think we're saying the same things. I would add a few more reasons to use robots.txt:
- blocking pages that are necessary for users but not for search engines; for example, search pages (query=) can provide an endless bot path on certain sites
- scripts that are requested too often (> 50% of requests in a day/week) and are not critical to design/layout
I agree that robots.txt is a patch, but then again, so is the canonical tag, right?
Nice chatting with you :)
These talks are the best ;-)
Your article is just what I meant: Google is crawling the wrong pages.
What I see a lot is a crawl budget problem here. You have, say, 100,000 (real) pages. Google is crawling 150,000 a day. When you look at the log files, you see Google crawling 100,000 wrong pages. This means Google needs at least two days to crawl your real pages. It also means that Google set the crawl budget for your domain at 100,000 pages a day (which is OK, as you only have 100,000 real pages). But since Google is crawling a lot of wrong pages, you need to tell Google what to do.
In this case (as you described in your article) robots.txt is the fastest method. But then you see these pages in WMT, and Google still knows them (and sometimes ranks them as well). As far as I can tell, this is a negative point for your domain. It worked back in the days before 2010, but then they changed it somehow, and my best solution now is to remove links and noindex/canonicalize these pages.
Sure, the WMT data is not very good, but it is a good indicator. I've seen Google crawling 3 million pages a day (according to WMT) on a domain with just about 100,000 pages. Big brand, therefore big crawl budget. But since Google now crawls only the real pages, this domain got a big ranking boost.
OK, so I did some blocking recently for an e-commerce site with ~190,000 indexable pages; pages crawled by Google were around ~250,000. I noticed that we had some weird stuff being crawled, and just like you mentioned, we were trying to ensure that the daily crawl by Googlebot was used efficiently. The daily crawl average in a 2-week sample was ~1,000 requests per day, and ~20% of that was used on pages we didn't need crawled.
So we blocked them. Google didn't request them anymore, but the average crawl requests per day dropped to around ~800. In this case, we didn't replace/divert Googlebot's attention; we ended up losing 'crawl budget'.
That was the first time I'd seen something like that, which does validate what you're saying; this was < 1 year ago.
Hi guys, I will share what I have seen as well:
A not-very-strong website with faceted navigation.
The links were tagged with nofollow, and the filtered URLs with parameters had a canonical to the "parent" URL without parameters.
We switched to followed links + canonical.
Google went crazy and jumped from 3,000 pages crawled per day to 150,000.
Google sent a notification: "you have too many URLs." In the parameter handling tool I saw some parameters with > 15 million pages.
I used the handling tool to tell Google not to crawl them, but Google ignored it.
I decided to block those faceted URLs with robots.txt.
Crawling went back to 3,000 pages a day, and the parameters that Google reported in the handling tool dropped from 15 million to 100K.
So, crawling seems stable, indexing working fine.
Next step is going for AJAX.
My feeling is that Google actually may have crawling problems in such cases, not because of the crawling itself but because then it has to "understand" those URLs in order to create a clean index and integrate duplicate URLs and signals
"My feeling is that Google actually may have crawling problems in such cases, not because of the crawling itself but because then it has to "understand" those URLs in order to create a clean index and integrate duplicate URLs and signals" Absolutely :)
It's quite an interesting and, yet again, important post from Eric.
This sounds like a great solution for SEO, but what about paid search? All the different URLs that are created through faceted search make great landing pages.
Good question, and a different kind of issue.
Late to the party (been pretty busy) but...
"Keep in mind, the reason why Google implemented tags like rel=canonical, NoIndex, rel=prev/next, and others is to reduce their crawling burden and overall page bloat and to help focus signals to incoming pages in the best way possible. The use of Ajax/JSON/jQuery as outlined above does this simply and elegantly."
Pretty much sums it up perfectly.
Lots of vendors are obsessed with facets and user choices etc. but Google often isn't. Do you really need a separate category for your 1 pink dress? An understanding of simple IA combined with an understanding of what Google is trying to achieve / crawl control works wonders for most sites.
Also, monitor your internal site search functionality and expose more facets (via static URLs) as you grow. If 2,000 people are searching for 'pink dresses' internally, that is a strong signal that you should, firstly, stock more pink dresses and, secondly (once you have a decent product offering worthy of attention), make a static category page for 'pink dresses'.
Great post as always Eric.
Oh, and regarding the AdWords landing page benefits of deep facets: most of the time the same principles apply, in that the conversion rates on weak offerings like the 'pink dresses' example above are so low you wouldn't want them in your campaign anyway. Even then, though, you could open up a URL that is only available through PPC and isn't indexable by the main bot, without affecting your Quality Score.
Malc.
I don't think this is a great solution.
The first point against this technical implementation is that it can lead to a usability nightmare for the user, and good SEO must always meet usability.
You are suggesting something that doesn't solve what the crawler will crawl, as fiacyberz and David commented above, and at the same time you are reducing the user's filtering capabilities and the affordance of the tools users rely on to navigate a website (from mobile, tablet, or desktop).
In my opinion it is a very bad thing to do, at least in the way you explained it (considering also your comment).
Very informative article, thanks.
Hi,
I've got a question regarding facets being served via AJAX requests, as I couldn't find a definitive answer on an issue we currently face:
(We are working on an indexable facet solution: only a few selected facets will be indexed, while others won't.)
When visitors on our site select a facet in the facet panel, the site doesn't fully reload. As a consequence, only the URL and certain parts of the content (H1, description, etc.) are updated, while other tags, like the canonical URL, the meta noindex/nofollow tag, or the title tag, do not update unless you refresh the page.
We have no information about how this will be crawled and indexed yet, but I was wondering if any of you know how this will impact SEO?
Question: using the Ajax code, how might this affect your site's domain authority if you're saying that certain pages (that are monotonous) need not be crawled? I guess page authority would be the only thing that would take a hit; then again, it really doesn't matter, because it's a reproduction of an item that might show up in duplicate all over the site?
Hi All,
One of the problems with this kind of solution is on the campaign side. If you have separate URLs for filters, you can send customers to those pages from Google AdWords and other campaigns, which can increase your Quality Score.
Thanks for the good post!
Eric, always good to see your post on Moz :)
Duplication is a primary issue for most sites, and especially for eCommerce sites. If we dig deeper into this topic, pagination is the function that gives rise to duplicate URLs on most eCommerce sites; it generates dynamic URLs and many query-string URLs that are not very friendly.
As you mention here that JS is required to implement this: is this common across all platforms, or does it vary by platform, like Magento, Shopify, etc.? If I am not wrong, AJAX/JSON requires JS to run this and prevent duplication. If you have common code, then drop it here! That would be a great help!
Great information Eric! Thanks for the insights!
This was a great read, Eric! Thanks for sharing this data with all of us. I've personally never thought about dealing with duplicate content like this before. I might have to try this out in the future.
Hello Eric,
Good explanation, I have learned a few new things today :)
Anyway, I want to ask a question based on your explanation. Are Amazon, Flipkart, SnapDeal, and all the big eCommerce sites using JSON and jQuery currently? And if not, should they start using them?
I think the best thing about a well-known eCommerce site is that they show only the things the user needs; they don't confuse users with too many options, and this is the best way to sell a product. Because of this, the user will go in-depth only into the things he/she is interested in.
I just want to see both sides of the coin: the one you explained and the one I asked about ;) Waiting for your response.
Hello Sir,
Thanks for another very useful post. I'd like to know your take on two popular techniques that have been proposed over the years for crawling AJAX, i.e. the Hijax approach and Google's AJAX crawling scheme. Do they still matter?
Thanks!
And then there's Bing... might be important to consider for some.
Very good point about duplicate content, especially in online stores, where the same content is repeated more often because content providers supply the product with its own description, and sometimes it cannot be changed for legal reasons.
I've never had problems with the rel=canonical tag. Is this typical only on product pages, or on content pages also?
Same question here, because I have always used rel=canonical and have not had any problems.
Thank you for explaining something I never really understood!
Hi Eric Enge,
I like your article. Great!
Please share more... good luck.