One of the classic problems in SEO is that while complex navigation schemes may be useful to users, they create problems for search engines. Many publishers rely on tags such as rel=canonical, or the parameter settings in Webmaster Tools, to try to solve these types of issues. However, each of these potential solutions has limitations. In today's post, I am going to outline how you can use JavaScript solutions to eliminate the problem altogether.
Note that I am going to keep code to a minimum in this post and focus on how the approach works at a conceptual level; any snippets below are illustrative sketches rather than production code. If you are interested in learning more about Ajax/JSON/jQuery, there are plenty of tutorials and references online you can check out.
Defining the problem with faceted navigation
Having a page of products and then allowing users to sort those products the way they want (for example, from highest to lowest price), or to use a filter to pick a subset of the products (only those over $60), makes good sense for users. We typically refer to these types of navigation options as "faceted navigation."
However, faceted navigation can cause problems for search engines because they don't want to crawl and index all of your different sort orders or all your different filtered versions of your pages. They would end up with many different variants of your pages that are not significantly different from a search engine user experience perspective.
Solutions such as rel=canonical tags and parameter settings in Webmaster Tools have some limitations. For example, rel=canonical tags are treated as "hints" by the search engines, which may choose not to honor them; and even when they are honored, they do not necessarily keep the search engines from continuing to crawl those pages.
A better solution might be to use JSON and jQuery to implement your faceted navigation so that a new page is not created when a user picks a filter or a sort order. Let's take a look at how it works.
Using JSON and jQuery to filter on the client side
The main benefit of the implementation discussed below is that no new URL is created when a user is on one of your pages and applies a filter or sort order. When you use JSON and jQuery, the entire process happens on the client device, without involving your web server at all once the page has loaded.
When a user initially requests one of the product pages on your web site, the interaction looks like this:
The server transfers the page, including the product data, to the browser that requested it. Now, when a user picks a sort order (or filter) on that page, here is what happens:
When the user picks one of those options, jQuery reads and filters (or sorts) the JSON data object that was delivered with the page. Translation: the entire interaction happens within the client's browser, and the sort or filter is applied there. Simply put, the smarts to handle that sort or filter reside entirely within the code on the client device that was transferred with the initial request for the page.
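To make the concept concrete, here is a minimal sketch of the kind of client-side code involved (the element IDs, the sample data, and the renderProductList() helper are all hypothetical, invented for illustration):

// Product data embedded in the page as a JSON object when it was first served
var products = [
    { "name": "Blue Widget", "price": 75 },
    { "name": "Red Widget", "price": 45 },
    { "name": "Green Widget", "price": 120 }
];

// Filter: only products over $60, applied entirely in the browser
$('#filter-over-60').on('click', function () {
    var filtered = $.grep(products, function (product) {
        return product.price > 60;
    });
    renderProductList(filtered); // hypothetical helper that redraws the list in place
});

// Sort: highest to lowest price, again using only the in-memory data
$('#sort-price-desc').on('click', function () {
    var sorted = products.slice().sort(function (a, b) {
        return b.price - a.price;
    });
    renderProductList(sorted);
});

Neither handler makes a request back to the server; everything works against the data that arrived with the page.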
As a result, there is no new page created and no new URL for Google or Bing to crawl. Any concerns about crawl budget or inefficient use of PageRank are completely eliminated. This is great stuff! However, there remain limitations in this implementation.
Specifically, if your list of products spans multiple pages on your site, the sorting and filtering will only be applied to the data set already transferred to the user's browser with the initial request. In short, you may only be sorting the first page of products, and not across the entire set of products. It's possible to have the initial JSON data object contain the full set of pages, but this may not be a good idea if the page size ends up being large. In that event, we will need to do a bit more.
What Ajax does for you
Now we are going to dig in slightly deeper and outline how Ajax will allow us to handle sorting, filtering, AND pagination. Warning: There is some tech talk in this section, but I will try to follow each technical explanation with a layman's explanation about what's happening.
The conceptual Ajax implementation looks like this:
In this structure, we are using an Ajax layer to manage the communications with the web server. Imagine that we have a set of 10 pages; the user has received the first of those 10 pages on their device and then requests a change to the sort order. The Ajax layer requests a fresh set of data from your web server, much like a normal HTML transaction, except that it runs asynchronously in a separate thread.
If you don't know what that means, the benefit is that the rest of the page (things like your main menu, your footer links to related products, and other page elements) can load completely while the process that fetches the data Ajax will display runs in parallel. This can improve the perceived performance of the page.
To support this, the code registers an event handler on a given object (e.g., an HTML element or other DOM object); when a user selects a different sort order, that handler fires and executes the action. The browser performs the work in the background and triggers the callback in the main thread when the response is ready. This happens without a full page refresh; only the content controlled by the Ajax is refreshed.
To translate this for the non-technical reader, it just means that we can update the sort order of the page, without needing to redraw the entire page, or change the URL, even in the case of a paginated sequence of pages. This is a benefit because it can be faster than reloading the entire page, and it should make it clear to search engines that you are not trying to get some new page into their index.
Effectively, it does all of this within the existing Document Object Model (DOM), which you can think of as the basic structure of the document and a spec for the way the document is accessed and manipulated.
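Here is a minimal sketch of what that might look like, assuming jQuery; the endpoint path, parameter names, and the renderProductList() helper are hypothetical placeholders rather than a prescribed API:

// Handler registered on the sort control; fires when the user picks a new sort order
$('#sort-order').on('change', function () {
    var sortOrder = $(this).val();
    // Asynchronous request for the full, re-sorted product set as JSON;
    // the rest of the page keeps working while this runs
    $.getJSON('/products/data', { sort: sortOrder, page: 1 }, function (data) {
        // Only the product listing inside the existing DOM is redrawn;
        // the URL, menu, footer, and other page elements are untouched
        renderProductList(data.products);
    });
});

Pagination works the same way: request page 2 with a different page parameter and redraw the listing in place, still on the same URL.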
How will Google handle this type of implementation?
For those of you who read Adam Audette's excellent recent post on the tests his team performed on how Google reads JavaScript, you may be wondering whether Google will render all of these page variants on the same URL anyway, and whether it will frown on that.
I had the same question, so I reached out to Google's Gary Illyes to get an answer. Here is the dialog that transpired:
Eric Enge: I'd like to ask you about using JSON and jQuery to render different sort orders and filters within the same URL. I.e., the user selects a sort order or a filter, and the content is reordered and redrawn on the page on the client side. Hence no new URL would be created. It's effectively a way of canonicalizing the content, since each variant is a strict subset.
Then there is a second level consideration with this approach, which involves doing the same thing with pagination. I.e. you have 10 pages of products, and users still have sorting and filtering options. In order to support sorting and filtering across the entire 10 page set, you use an Ajax solution, so all of that still renders on one URL.
So, if you are on page 1, and a user executes a sort, they get that all back in that one page. However, to do this right, going to page 2 would also render on the same URL. Effectively, you are taking the 10 page set and rendering it all within one URL. This allows sorting, filtering, and pagination without needing to use canonical, noindex, prev/next, or robots.txt.
If this was not problematic for Google, the only downside is that it makes the pagination not visible to Google. Does that make sense, or is it a bad idea?
Gary Illyes: If you have one URL only, and people have to click on stuff to see different sort orders or filters for the exact same content under that URL, then typically we would only see the default content.
If you don't have pagination information, that's not a problem, except we might not see the content on the other pages that are not contained in the HTML within the initial page load. The meaning of rel-prev/next is to funnel the signals from child pages (page 2, 3, 4, etc.) to the group of pages as a collection, or to the view-all page if you have one. If you simply choose to render those paginated versions on a single URL, that will have the same impact from a signals point of view, meaning that all signals will go to a single entity, rather than distributed to several URLs.
Summary
Keep in mind, the reason why Google implemented tags like rel=canonical, NoIndex, rel=prev/next, and others is to reduce their crawling burden and overall page bloat and to help focus signals to incoming pages in the best way possible. The use of Ajax/JSON/jQuery as outlined above does this simply and elegantly.
On most e-commerce sites, there are many different "facets" of how a user might want to sort and filter a list of products. With the Ajax-style implementation, this can be done without creating new pages. The end users get the control they are looking for, the search engines don't have to deal with excess pages they don't want to see, and signals in to the site (such as links) are focused on the main pages where they should be.
The one downside is that Google may not see all the content when it is paginated. If your incremental pages simply contain more of what's on the first page (for example, lots of very similar products in a paginated list), that isn't much of a concern. Sites whose additional pages contain materially different content, however, might not want to use this approach.
These solutions do require JavaScript coding expertise, but they are not really that complex. If you have the ability to consider a path like this, you can free yourself from trying to understand the various tags, their limitations, and whether or not they truly accomplish what you are looking for.
Credit: Thanks to Clark Lefavour for providing a review of the above for technical correctness.
Hi!!
Nice article, I found it very instructive.
Hi Eric,
One thing that is not addressed in this article is that the URL you call on the server side to deliver the filtered list of products (as JSON or XML) will be crawled by Google.
Let's say that in order to get the JSON list of products, your client-side script makes a GET request to website.com/server-script/?filter=filter1&otherfilter=filter2
Google will always try to crawl those URLs, and will sometimes even index that content.
So I would say that the crawl budget won't be spared just by using AJAX, but by not giving the bots hints about where to find the server-side script that delivers the JSON.
I would like to know your opinion about that, though.
Hi All,
To help illustrate, here is a sample of some potential code:
function showfacet() {
    // Work out the current page name from the path
    var currentPath = window.location.pathname;
    var currentPage = currentPath.substring(currentPath.lastIndexOf('/') + 1);
    // GetXmlHttpObject() and handleHttpResponse() are helpers assumed to be defined elsewhere in the app
    req = GetXmlHttpObject();
    if (req == null) {
        alert("Browser does not support AJAX Request");
        return;
    }
    var proxy = "/app/qs_api.php?p=" + currentPage;
    req.open("GET", proxy, false);
    req.onreadystatechange = handleHttpResponse;
    req.send(null);
}
You will note that there are no clear URLs presented in this. Google can, and may well, sample some of this code to see what happens, but I have not seen any evidence that they do this in any extensive way. Keep in mind what Gary Illyes said above:
"If you have one URL only, and people have to click on stuff to see different sort orders or filters for the exact same content under that URL, then typically we would only see the default content."
For that reason, my experience is that this approach will in fact save you substantial crawl budget.
Hi Eric,
In that case there is a chance that Googlebot starts to crawl /app/qs_api.php?p=
Besides that, it will be treated as a relative URL. So my point here is that even though that's not a real page, Googlebot will spend time crawling it, raising the question of whether those fake URLs will leak some link power too.
Google has officially said that they may crawl things that look like URLs in order to better understand a website, but there is still a lack of feedback about how they end up treating them.
I have seen Google index the content delivered from those AJAX calls and rank the original page for that content; at the same time, I have seen Google index fake URLs that delivered a complete HTML page instead of a JSON object, and then, after weeks, remove those URLs from the index and rank the original page for that content.
I would like to hear an official statement about how Google is handling those URLs and how much we should care about them.
Update: I found this --> https://support.google.com/webmasters/answer/24094...
In Crawl Errors, you might occasionally see 404 errors for URLs you don't believe exist on your own site or on the web. These unexpected URLs might be generated by Googlebot trying to follow links found in JavaScript, Flash files, or other embedded content.
...the link may appear as a 404 (Not Found) error in the Crawl Errors feature in Search Console.
Google strives to detect these types of issues and resolve them so that they will disappear from Crawl Errors.
So, still "Google strives to detect these type of issues" is not helping that much.
proxy="/app/qs_api.php?p="+currentPage;
^^ there is your URL :)
and it is dynamic as well, since you add the p parameter. I have seen Google crawling this stuff in the log files.
Can you add some code please?
As far as I understand your explanation, Google will of course crawl this data. Yes, they might not index it because there is no specific URL, but you still have a problem with your crawl budget (which is usually the bigger problem).
Hi fiacyberz,
if you have this in your HTML:
<a href="#color-white">white</a>
<a href="#color-black">black</a>
<a href="#color-red">red</a>
Then the following script should be executed when the user clicks on one of those links:
$.ajax({
    type: "GET",
    dataType: "json", // or "xml"
    url: "/server-side-script-that-deliver-the-json?color=" + theColorClicked, // theColorClicked = the color the user clicked
    success: function (response) {
        // update the product listing using the JSON list in the response
    }
});
In that case the bots will try to crawl /server-side-script-that-deliver-the-json?color=[the-color-clicked]
and as you can expect, you will end up with the same amount of combinations as if you didn't use AJAX at all.
My approach would be to use POST instead of GET, so Google won't have this "alternative URL navigation", and to not expose any URL at all. Therefore the page that delivers the JSON is the same page you are already on (your IT team will want to kill you, but that's our job, isn't it?).
With that approach the bot won't have any alternative navigation and the crawl budget will be spared. At the same time, PageRank and all the link metrics won't flow through the faceted-navigation pages.
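Something like this rough sketch (selectedColor and the parameter name are just placeholders):

$.ajax({
    type: "POST", // POST, so there is no crawlable GET URL for this facet
    url: window.location.pathname, // same URL as the page the user is already on
    data: { color: selectedColor }, // selectedColor = the color the user clicked
    dataType: "json",
    success: function (response) {
        // update the product listing from the JSON response
    }
});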
Does that make sense to you?
Best
Sure, POST is a much better approach. You don't have to use Ajax (that's only for the user's convenience), but you should be able to combine the two.
If you want to reload the page, take a look at PRG => Post/Redirect/Get.
And don't use a-tags for the filters; labels and invisible input fields are much better (in case Google counts the number of a-tags).
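A rough sketch of what I mean (all names and the endpoint are placeholders):

<label for="color-white">white</label>
<input type="checkbox" id="color-white" name="color" value="white" style="display:none">

// On change, send the selected filter via POST to the same URL (no a-tag, no new crawlable URL)
$('input[name="color"]').on('change', function () {
    $.post(window.location.pathname, { color: this.value }, function (response) {
        // update the product listing from the JSON response
    }, "json");
});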
btw, I hadn't seen that you were asking the same question in a comment... ;-)
Hi, thanks for the tips.
PRG is a good idea as well, but then, as you said, without the "coolness" of not reloading the whole DOM.
Using labels was something I didn't take into account, but sure, it would make it much less probable for Google to treat them as URLs.
A few days ago I noticed that Google is treating something that is not a URL as a relative URL, so it is reporting 404 errors coming from every single page on our shop.
Now I have to deal with IT to try to make them change a piece of code that is totally correct, but that the crazy bot is trying to crawl. Either that, or include those fake URLs in the robots.txt.
That raises the question: would Google transfer link juice through such fake URLs? In that case I would prefer not to use the robots.txt option.
Any opinion about that?
Robots.txt is always the worst solution. It's only reasonable when your server crashes because Google doesn't understand your site. I've seen this once in the last 5 years...
About these URLs... Google used to grab everything that looks like a URL, e.g. data="/fubar", and then domain.com/fubar was crawled, even if this was just in some JavaScript without any connection to an a-tag, a span, or any other HTML code.
So do I understand you correctly that you believe these Ajax/JSON facet links (using POST) are counted by Google when calculating link juice?
If so, to avoid link juice dilution we may need to look into hiding the links completely (such as hiding them if cookies are disabled). The downside here is that facets frequently contain a lot of topically relevant keywords that may increase organic traffic to the pages. So if Google does not consider them when calculating link juice, it may be better to still keep them visible to Google.
Googlebot can POST too (https://googlewebmastercentral.blogspot.com.es/2011...). I see so many POST requests in logs, and it also loves making up any URL that looks remotely like a URL, whether it's plain text or JS code.
Robots.txt blocks for these pesky requests have always been the right answer for me, that is, unless the resource you're blocking affects the display of the page (case study: https://yoast.com/google-panda-robots-css-js/)
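For example, for the qs_api.php endpoint that came up earlier in this thread, a block might look something like this (purely illustrative; the path depends on your own setup):

User-agent: *
Disallow: /app/qs_api.php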
Fiacyberz, why did you say robots.txt is the worst solution? Maybe I missed some of the context?
Sure, they follow POST sometimes. But since you can use the same URL where you have your form as the target, you don't create new URLs, and therefore crawl budget is not wasted.
Since Google indexes (and internally saves) URLs blocked by robots.txt, and of course counts links to these URLs, this is the worst thing you can do. Why not remove (or mask in a good way) these links and put a noindex on the linked page?
You just waste resources with this tactic (which was working very well a few years ago).
Not sure I follow you. Just so I'm clear, I'm talking about blocking 'pages' that shouldn't be crawled as they either have no value to users or Googlebot, and are likely being manufactured by Googlebot as a result of trying to index JS.
There has never been a real calculation of crawl budget, and frankly, that term subtly implies that Google has a finite amount of space/capacity to crawl a site. That's obviously not true. The point about 'crawl budget' that makes some sense is that by diverting the crawler's attention away from useless pages, we should be able to get the pages that should be crawled, crawled more regularly. Often, higher crawl activity = higher organic traffic to a page, but it's not guaranteed. I broadly agree that removing, masking, or obfuscating these URL paths is best, but in real corporate environments, fixes are often limited.
Sure, Google's crawl budget is limited. Each domain has its own limit (which can of course vary if Google thinks a domain is stronger or can handle more crawling).
And since blocking is not removing pages, it is a bad idea. Google will never forget them and will always take them into account.
Never once in my 6 years in SEO have I seen Google fail to crawl and index sites with millions (sometimes billions) of pages. How often they crawl might be determined by a variety of factors, like PageRank, how often content changes, etc. I don't think it's dependent on whether the site being crawled can 'handle' being crawled, unless the server constantly responds with 4xx or 5xx.
You can't always remove pages; it is never that simple. If it were, there would be no need for canonicals, robots.txt, noindex, etc. Removing pages isn't always an option, especially when you get into enterprise environments. Robots.txt blocks have their uses, and in cases where Google is making up URLs via JavaScript (which is what I've been referring to), yes, a block is right, especially if that URL/script/resource needs to be available for the functionality of the site but not crawled.
Take a look at your log files; you will find a lot of crawling issues. Or just take a look at Search Console and the crawl graph. Google has always had problems with crawling.
Most webmasters (like 99%) wouldn't see it or need to deal with it. But when it comes to big domains you have to, because it is so powerful and important. To clarify: for me, big domains start at 100,000 pages, and the bigger the domain, the more important this topic is.
Of course it is not always easy to remove pages, and sometimes it's not even possible (in an acceptable timeframe). Then you can use robots.txt. But it is like putting a patch on broken bones.
I would use robots.txt for two reasons:
- your server crashes because of a broken script and Google is crawling too much (happened once in the last 5 years)
- you want to hide something from Google (some masking methods work this way)
- optional: you can't fix an issue in the next 6 months
I've looked at log files quite a bit (https://moz.com/blog/server-log-essentials-for-seo). When you say problems with crawling, I assume you mean that they overcrawl (invent pages) rather than miss pages. Again, I have never seen Google fail to completely index (or at least crawl through every page of) even the biggest sites. For example, I worked on a site 5 years ago with over 400 million pages, and Google crawled the 400 million and more. Going through those heavy log files was tough, but it was worth seeing how much Google is able to handle; I know firsthand that Google has no problem with how many pages it will crawl. They want absolutely everything, including the hidden web. I don't believe in crawl budget as it's been defined.
Also, the crawl activity graph in Search Console is not a fair representation of what is actually crawled on your site; I've compared this graph to actual log files and it's been way off (30-40% or more) many times. For me, that graph is useful for identifying skim or deep crawls, and potentially for signalling an evaluation before an algorithm update (unconfirmed; this is from experience).
I think we're saying the same things. I would add a few more reasons to use robots.txt:
- blocking pages that are necessary for users but not for search engines; for example, search pages (query=) can provide an endless bot path on certain sites
- scripts that are requested too often (> 50% of requests in a day/week) and are not critical to design/layout
I agree that robots.txt is a patch, but then again, so is the canonical tag, right?
Nice chatting with you :)
These talks are the best ;-)
Your article is just what I meant: Google is crawling the wrong pages.
What I see a lot is a crawl budget problem here. You have, say, 100,000 (real) pages. Google is crawling 150,000 a day. When you look at the log files, you see Google crawling 100,000 wrong pages. This means Google needs at least two days to crawl your real pages. It also means that Google set the crawl budget for your domain at 100,000 pages a day (which is OK, as you only have 100,000 real pages). But since Google is crawling a lot of wrong pages, you need to tell Google what to do.
In this case (as you described in your article) robots.txt is the fastest method. But then you see these pages in WMT, and Google still knows them (and sometimes ranks them as well). As far as I can tell, this is a negative point for your domain. It worked back in the days before 2010, but then they changed it somehow, and my best solution now is to remove links and noindex/canonicalize these pages.
Sure, the WMT data is not very good, but it is a good indicator. I've seen Google crawling 3 million pages a day (according to WMT) on a domain with just about 100,000 pages. Big brand, therefore big crawl budget. But since Google now crawls only the real pages, this domain got a big ranking boost.
OK, so I did some blocking recently for an e-commerce site with ~190,000 indexable pages; pages crawled by Google were around ~250,000. I noticed that we had some weird stuff being crawled, and just like you mentioned, we were trying to ensure that the daily crawl by Googlebot was used efficiently. The daily crawl average in a 2-week sample was ~1,000 requests per day, and ~20% of that was used on pages we didn't need crawled.
So we blocked them. Google didn't request them anymore, but the average crawl requests per day dropped to around ~800. In this case, we didn't replace/divert Googlebot's attention; we ended up losing 'crawl budget'.
That was the first time I'd seen something like that, which does validate what you're saying; this was < 1 year ago.
Hi guys, I will share what I have seen as well:
A not-very-strong website with faceted navigation.
The links were tagged with nofollow, and the filtered URLs with parameters had a canonical to the "parent" URL without parameters.
We switched to followed links + canonical.
Google went crazy and jumped from 3,000 pages crawled per day to 150,000.
Google sent a notification: "you have too many URLs." In the parameter handling tool I saw some parameters with > 15 million pages.
I used the handling tool to tell Google not to crawl them, but Google ignored it.
I decided to block those faceted URLs with robots.txt.
Crawling went back to 3,000 pages a day, and the parameters that Google reported in the handling tool dropped from 15 million to 100K.
So, crawling seems stable, indexing working fine.
Next step is going for AJAX.
My feeling is that Google actually may have crawling problems in such cases, not because of the crawling itself but because then it has to "understand" those URLs in order to create a clean index and integrate duplicate URLs and signals
"My feeling is that Google actually may have crawling problems in such cases, not because of the crawling itself but because then it has to "understand" those URLs in order to create a clean index and integrate duplicate URLs and signals" Absolutely :)
It's quite an interesting and, yet again, important post from Eric.
This sounds like a great solution for SEO, but what about paid search? All the different URLs that are created through faceted search make great landing pages.
Good question, and a different kind of issue.
Late to the party (been pretty busy) but...
"Keep in mind, the reason why Google implemented tags like rel=canonical, NoIndex, rel=prev/next, and others is to reduce their crawling burden and overall page bloat and to help focus signals to incoming pages in the best way possible. The use of Ajax/JSON/jQuery as outlined above does this simply and elegantly."
Pretty much sums it up perfectly.
Lots of vendors are obsessed with facets and user choices etc. but Google often isn't. Do you really need a separate category for your 1 pink dress? An understanding of simple IA combined with an understanding of what Google is trying to achieve / crawl control works wonders for most sites.
Also, monitor your internal site search functionality and expose more facets (via static URLs) as you grow. If 2,000 people are searching for 'pink dresses' internally, that is a strong signal that you should, firstly, stock more pink dresses and, secondly (once you have a decent product offering worthy of attention), make a static category page for 'pink dresses'.
Great post as always Eric.
Oh, and regarding the AdWords landing page benefits of deep facets: most of the time the same principles apply, in that the conversion rates on weak offerings like the 'pink dresses' example above are so low you wouldn't want them in your campaign anyway. Even then, though, you could open up a URL that is only available through PPC and isn't indexable by the main bot, without affecting your Quality Score.
Malc.
I don't think this is a great solution.
The first point against this technical implementation is that it can lead to a usability nightmare for the user, and good SEO must always meet usability.
You are suggesting something that doesn't solve what the crawler will crawl, as fiacyberz and David commented above, and at the same time you are reducing the user's filtering capabilities and the affordance of the tools users rely on to navigate a website (from mobile, tablet, or desktop).
In my opinion it is a very bad thing to do, at least in the way you explained it (considering also your comment).
Very informative article, thanks.
Hi,
I've got a question regarding facets being served via AJAX requests, as I couldn't find a definitive answer on an issue we currently face:
(We are working on an indexable facet solution: only a few selected facets will be indexed, while others won't.)
When visitors on our site select a facet in the facet panel, the site doesn't fully reload. As a consequence, only the URL and certain parts of the content (H1, description, etc.) are updated, while other tags, like the canonical URL, the meta noindex/nofollow tag, or the title tag, do not update unless you refresh the page.
We have no information about how this will be crawled and indexed yet, but I was wondering if any of you know how this will impact SEO?
Question: using the Ajax code, how might this affect your site's domain authority if you're saying that certain pages (that are monotonous) need not be crawled? I guess page authority would be the only thing that would take a hit; then again, it really doesn't matter, because it's a reproduction of an item that might show up in duplicate all over the site?
Hi All,
One of the problems with this kind of solution is on the campaign side. If you have separate URLs for filters, you can send customers to those pages from Google AdWords and other campaigns, which can increase your Quality Score.
Thanks for the good post!
Eric, always good to see your post on Moz :)
Duplication is a primary issue for most sites, and especially for eCommerce sites. If we dig deeper into this topic, pagination is the function that gives rise to duplicate URLs on most eCommerce sites; it generates dynamic URLs and many query-string URLs that are not very friendly.
As you mention here that JS is required to implement this: is this common across all platforms, or does it vary by platform, like Magento, Shopify, etc.? If I am not wrong, AJAX/JSON requires JS to run this and prevent duplication. If you have common code, then drop it here! That would be a great help!
Great information Eric! Thanks for the insights!
This was a great read, Eric! Thanks for sharing this data with all of us. I've personally never thought about dealing with duplicate content like this before. I might have to try this out in the future.
Hello Eric,
Good explanation, I have learned a few new things today :)
Anyway, I want to ask a question based on your explanation. Are Amazon, Flipkart, SnapDeal, and all the big eCommerce sites using JSON and jQuery currently? And if not, should they start using them?
I think the best thing about a well-known eCommerce site is that they show only the things the user needs; they don't confuse users with too many options, and this is the best way to sell a product. Because of this, the user will go in-depth only into the things he/she is interested in.
I just want to see both sides of the coin: the one you explained and the one I asked about ;) Waiting for your response.
Hello Sir,
Thanks for another very useful post. I'd like to know your take on two popular techniques that have been proposed over the years for crawling AJAX, i.e. the Hijax approach and Google's AJAX crawling scheme. Do they still matter?
Thanks!
And then there's Bing... might be important to consider for some.
Very good point about duplicate content, especially in online stores, where the same content is repeated more often because content providers supply the product with its own description, and sometimes it cannot be changed for legal reasons.
I've never had problems with the rel=canonical tag. Is this typical only on product pages, or on content pages also?
Same question here, because I have always used rel=canonical and have not had any problems.
Thank you for explaining something I never really understood!
Hi Eric Enge,
I like your article. Great!
Please share more... good luck.