Matt Cutts announced at Pubcon that Googlebot is “getting smarter.” He also announced that Googlebot can crawl AJAX to retrieve Facebook comments, coincidentally only hours after I unveiled Joshua Giardino's research at SearchLove New York suggesting that Googlebot is actually a headless browser built on the Chromium codebase. I'm going to challenge Matt Cutts' statement: Googlebot hasn't just recently gotten smarter; it hasn’t been a text-based crawler for some time now, and neither have BingBot or Slurp for that matter. There is evidence that search robots are headless web browsers, and that the search engines have had this capability since 2004.
Disclaimer: I do not work for any Search Engine. These ideas are speculative, based on patent research done by Joshua Giardino and myself, some direction from Bill Slawski, and what can be observed on Search Engine Results Pages.
A headless browser is simply a full-featured web browser with no visual interface. Much like the TSR (Terminate and Stay Resident) programs that live in your Windows system tray, they run without you seeing anything on your screen, but other programs can interact with them. You can interface with a headless browser via a command line or scripting language, and therefore load a webpage and programmatically examine the same output a user would see in Firefox, Chrome or (gasp) Internet Explorer. Vanessa Fox alluded to the possibility that Google may be using these to crawl AJAX back in January of 2010.
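To make that concrete, here is a minimal sketch of my own (not anything Google has published) showing how a headless WebKit browser like PhantomJS can load a page from the command line and inspect its fully rendered output; the URL is whatever you pass in:

// render-title.js (a minimal sketch). Run with: phantomjs render-title.js http://www.example.com
var page = require('webpage').create();
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    }
    // evaluate() runs inside the page itself, against the fully rendered
    // DOM (i.e. after any JavaScript on the page has executed)
    var title = page.evaluate(function () {
        return document.title;
    });
    console.log('Rendered title: ' + title);
    phantom.exit();
});

Nothing ever gets painted to a screen, yet the script sees exactly what a user's browser would.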
However, the Search Engines would have us believe that their crawlers are still similar to Unix’s Lynx browser and can only see and understand text and its associated markup. Basically, they have trained us to believe that Googlebot, Slurp and Bingbot are a lot like Pacman: you point it in a direction and it gobbles up everything it can without being able to see where it’s going or what it’s looking at. Think of the dashes that Pacman eats as webpages. Every once in a while he hits a wall and is forced in another direction. Think of SEOs as the power pills, and of ghosts as the technical SEO issues that might trip up Pacman and cause him to not complete the level that is your page. When an SEO gets involved with a site, they help the search engine spider eat the ghosts; when they don’t, Pacman dies and starts another life on another site.
That’s what they have been selling us for years; the only problem is that it’s simply not true anymore and hasn’t been for some time. To be fair, though, Google normally only lies by omission, so it’s our fault for taking so long to figure it out.
I encourage you to read Josh’s paper in full, but some highlights that indicate this are:
- A patent filed in 2004 entitled “Document Segmentation Based on Visual Gaps” discusses methods Google uses to render pages visually and traverse the Document Object Model (DOM) to better understand the content and structure of a page. A key excerpt from that patent says, “Other techniques for generating appropriate weights may also be used, such as based on examination of the behavior or source code of Web browser software or using a labeled corpus of hand-segmented web pages to automatically set weights through a machine learning process.”
- The wily Mr. Cutts suggested at Pubcon that GoogleBot will soon be taking into account what is happening above the fold as an indication of user experience quality, as though it were a new feature. That’s curious because, according to the “Ranking Documents Based on User Behavior and/or Feature Data” patent from June 17, 2004, they have been able to do this for the past seven years. A key excerpt from that patent describes “Examples of features associated with a link might include the font size of the anchor text associated with the link; the position of the link (measured, for example, in a HTML list, in running text, above or below the first screenful viewed on an 800.times.600 browser display, side (top, bottom, left, right) of document, in a footer, in a sidebar, etc.); if the link is in a list, the position of the link in the list; font color and/or attributes of the link (e.g., italics, gray, same color as background, etc.);” This is evidence that Google has visually considered the fold for some time. I would also say that this is live right now, as Instant Previews show a cut-off at the point Google considers the fold.
- It is no secret that Google has been executing JavaScript to a degree for some time now, but “Searching Through Content Which is Accessible Through Web-based Forms” indicates that Google is using a headless browser to perform the transformations necessary to dynamically input forms. “Many web sites often use JavaScript to modify the method invocation string before form submission. This is done to prevent each crawling of their web forms. These web forms cannot be automatically invoked easily. In various embodiments, to get around this impediment, a JavaScript emulation engine is used. In one implementation, a simple browser client is invoked, which in turn invokes a JavaScript engine.” Hmmm…interesting.
Google also owns a considerable number of IBM patents as of June and August of 2011, and with those comes a lot of IBM's awesome research into remote systems, parallel computing and headless machines, for example the “Simultaneous network configuration of multiple headless machines” patent. Google has clearly done extensive research of its own in these areas, though.
Not to be left out, there’s a Microsoft patent entitled “High Performance Script Behavior Detection Through Browser Shimming” where there is not much room for interpretation; in so many words it says Bingbot is a browser. "A method for analyzing one or more scripts contained within a document to determine if the scripts perform one or more predefined functions, the method comprising the steps of: identifying, from the one or more scripts, one or more scripts relevant to the one or more predefined functions; interpreting the one or more relevant scripts; intercepting an external function call from the one or more relevant scripts while the one or more relevant scripts are being interpreted, the external function call directed to a document object model of the document; providing a generic response, independent of the document object model, to the external function call; requesting a browser to construct the document object model if the generic response did not enable further operation of the relevant scripts; and providing a specific response, obtained with reference to the constructed document object model, to the external function call if the browser was requested to construct the document object model." (emphasis mine) Curious, indeed.
Furthermore, Yahoo filed a patent on Feb 22, 2005 entitled "Techniques for crawling dynamic web content" which says, "The software system architecture in which embodiments of the invention are implemented may vary. FIG 1 is one example of an architecture in which plug-in modules are integrated with a conventional web crawler and a browser engine which, in one implementation, functions like a conventional web browser without a user interface (also referred to as a "headless browser")." Ladies and gentlemen, I believe they call that a "smoking gun." The patent then goes on to discuss automatic and custom form filling and methods for handling JavaScript.
Search engine crawlers are indeed like Pacman, but not the floating mouth without a face that my parents jerked across the screens of arcades and bars in the mid-80s. Googlebot and Bingbot are actually more like the ray-traced Pacman with eyes, nose and appendages that we’ve continued to ignore on console systems since the 90s. This Pacman can punch, kick, jump and navigate the web with lightning speed in four dimensions (the fourth is time; see the freshness update). That is to say, search engine crawlers can render pages as we see them in our own web browsers, and have achieved a level of programmatic understanding that allows them to emulate a user.
Have you ever read the EULA for Chrome? Yeah, me neither, but as with most Google products they ask you to opt in to a program in which your usage data is sent back to Google. I would surmise that this usage data is not just used to inform the ranking algorithm (slightly), but also as a means to train Googlebot’s machine learning algorithms to input certain fields in forms. For example, Google can use user form inputs to figure out what type of data goes into which field and then programmatically fill forms with generic data of that type. If 500 users put an age into a form field named “age,” it has a valid data set that tells it to input an age (a rough sketch of what that could look like follows below). Therefore Pacman no longer runs into doors and walls; he has keys and can scale the face of buildings.
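Purely as an illustration of the mechanics, and not a claim about Google's actual stack, here's how a headless browser could fill fields with learned generic values and submit a form; the URL, field names and values below are all hypothetical:

// fill-form.js (illustrative only). Run with: phantomjs fill-form.js
var page = require('webpage').create();

// Hypothetical mapping learned from usage data: field name -> generic value
var learnedValues = { age: '30', zip: '10001', email: 'test@example.com' };

page.open('http://www.example.com/signup', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    page.evaluate(function (values) {
        // Fill any input whose name matches a field type we have a generic value for
        var inputs = document.querySelectorAll('input[name]');
        for (var i = 0; i < inputs.length; i++) {
            var name = inputs[i].getAttribute('name');
            if (values[name]) { inputs[i].value = values[name]; }
        }
        // Submit the first form on the page to discover what sits behind it
        if (document.forms.length > 0) { document.forms[0].submit(); }
    }, learnedValues);
    // Give the submission a moment to complete, then inspect where we landed
    window.setTimeout(function () {
        var landingUrl = page.evaluate(function () { return window.location.href; });
        console.log('Post-submit URL: ' + landingUrl);
        phantom.exit();
    }, 3000);
});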
- Instant Previews - This is why you’re seeing annotated screenshots in Instant Previews on the SERPs. The instant previews are in fact an impressive feat in that they not only take a screenshot of a page, but also visually highlight and extract the text pertinent to your search query. This simply cannot be accomplished with a text-based crawler.
- Flash Screenshots - You may have also noticed screenshots of Flash sites in Google Webmaster Tools. Wait, I thought Google couldn’t see Flash?
- AJAX POST Requests Confirmed - Matt Cutts also confirmed that GoogleBot can in fact handle AJAX POST requests, coincidentally a matter of hours after the “Googlebot Is Chrome” article was tweeted by Rand, made its way to the front of Hacker News, and brought my site down. By definition, AJAX content is loaded by JavaScript when an action takes place after a page has loaded. It therefore cannot be crawled with a text-based crawler, because a text-based crawler does not execute JavaScript; it only pulls down the code as it exists at the initial load (see the short snippet after this list).
- Google Crawling Flash - Mat Clayton also showed me some server logs where GoogleBot has been accessing URLs that are only reachable via embedded Flash modules on Mixcloud.com:
66.249.71.130 "13/Nov/2011:11:55:41 +0000" "GET /config/?w=300&h=300&js=1&embed_type=widget_standard&feed=http%3A//www.mixcloud.com/chrisreadsubstance/bbe-mixtape-competition-2010.json&tk=TlVMTA HTTP/1.1" 200 695 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
66.249.71.116 "13/Nov/2011:11:51:14 +0000" "GET /config/?w=300&h=300&js=1&feed=http%3A//www.mixcloud.com/ZiMoN/electro-house-mix-16.json&embed_type=widget_standard&tk=TlVMTA HTTP/1.1" 200 694 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
Granted this is not new, but another post from 2008 explains that Google "explores Flash files in the same way that a person would, by clicking buttons, entering input, and so on." Oh, you mean like a person would with a browser?
- Site Speed - Although Google could potentially get site load times from toolbars and Chrome usage data, it’s far more dependable for them to measure it by crawling the web themselves. Without actually executing all the code on a page, no calculation of page load time would be realistically accurate.
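To illustrate the AJAX point above, here is a trivial snippet of the kind of post-load request a text-based crawler never sees; the endpoint and element ID are made up for the example:

// This content does not exist in the initial HTML source that a text-based
// crawler downloads; it only appears once a browser executes the script.
var xhr = new XMLHttpRequest();
xhr.open('POST', '/api/comments', true);  // hypothetical endpoint
xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
xhr.onload = function () {
    // Inject the fetched comments into the rendered DOM
    document.getElementById('comments').innerHTML = xhr.responseText;
};
xhr.send('post_id=42');

A crawler that only reads the raw HTML response sees an empty comments container; a headless browser that executes JavaScript sees the injected content.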
So far this might sound like Googlebot is only a few steps from Skynet, and after years of SEOs and Google telling us their search crawler is text-based, it might sound like science fiction to you. I assure you that it’s not, and that a lot of the things I’m talking about can be easily accomplished by programmers far short of the elite engineering team at Google.
PhantomJS is a headless WebKit browser that can be controlled via a JavaScript API. With a little bit of script automation, a browser can easily be turned into a web crawler. Ironically, the logo is a ghost similar to the ones in Pacman, and the concept is quite simple really: PhantomJS is used to load a webpage as a user would see it in Firefox, Chrome or Safari, extract features and follow the links (a minimal crawler sketch follows below). PhantomJS has countless applications for scraping and otherwise analyzing sites, and I encourage the SEO community to embrace it as we move forward.
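As a rough illustration of how little it takes (this is my own sketch, not Josh's code), here's PhantomJS rendering a page and extracting every link from the live DOM:

// crawl-links.js (a minimal sketch). Run with: phantomjs crawl-links.js http://www.example.com
var page = require('webpage').create();
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    }
    // Pull every anchor href out of the rendered DOM, including links
    // that were injected by JavaScript after the initial load
    var links = page.evaluate(function () {
        var anchors = document.querySelectorAll('a[href]');
        return Array.prototype.map.call(anchors, function (a) { return a.href; });
    });
    links.forEach(function (link) { console.log(link); });
    // A real crawler would now queue these URLs and repeat
    phantom.exit();
});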
Josh has used PhantomJS to prepare some proofs of concept that I shared at SearchLove.
I mentioned when I released GoFish that I’d had trouble scraping the breakout terms from Google Insights with a text-based crawler, due to the fact that they're rendered using AJAX. Richard Baxter suggested that it was easily scrapable using an XPath string, which leads me to believe that the ImportXML crawling architecture in Google Docs is based on a headless browser as well.
In any event here Josh pulls the breakout terms from the page using PhantomJS:
Creating screenshots with a text-based crawler is impossible, but with a headless WebKit browser it’s a piece of cake. Here’s an example that Josh has prepared to show screenshots being created programmatically using PhantomJS.
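For reference, the core of a PhantomJS screenshot script comes down to a single render() call; this is a generic sketch rather than Josh's actual code:

// screenshot.js. Run with: phantomjs screenshot.js http://www.example.com out.png
var page = require('webpage').create();
var system = require('system');

// Render at a typical desktop viewport so "above the fold" means something
page.viewportSize = { width: 1024, height: 768 };

page.open(system.args[1], function (status) {
    if (status !== 'success') { phantom.exit(1); }
    // Rasterize the fully rendered page: JavaScript, CSS, images and all
    page.render(system.args[2]);
    phantom.exit();
});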
Chromium is Google’s open source browser project built on WebKit, and I seriously doubt that Google’s motives for building a browser were purely altruistic. The aforementioned research suggests that GoogleBot is a multi-threaded headless browser based on that same code.
To be fair, Google does sort of admit to this, but they say the "instant preview crawler" is a completely separate entity. Think of the Instant Preview crawler as Ms. Pacman.
A poster on Webmaster Central complained that they were seeing "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.597 Safari/534.14" rather than "Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13" as the Google Web Preview user agent in their logs.
John Mu reveals "We use the Chrome-type user-agent for the Instant Previews testing tool, so that we're able to compare what a browser (using that user-agent) would see with what we see through Googlebot accesses for the cached preview image."
While the headless browser and Googlebot as we know it may be semantically separate, I believe they crawl in parallel and both inform indexation and, ultimately, rankings. In other words, it's like a two-player simultaneous version of Pacman, with a 3D Ms. Pacman and a regular Pacman playing the same levels at the same time. After all, it wouldn't make sense for the crawlers to crawl the whole web twice independently.
So why aren't they more transparent about these capabilities as they pertain to rankings? Two words: search quality. As long as search engines can hide behind the deficiencies of a text-based crawler, they can continue to use it as a scapegoat for their inability to serve up the best results. They can continue to move towards things like the speculated AuthorRank and lean on SEOs to literally optimize their search engines. They can continue to say vague things like “don’t chase the algorithm”, “improve your user experience” and “we’re weighing things above the fold” that force SEOs to scramble and make Google’s job easier.
Google’s primary product (and only product, if you’re talking to Eric Schmidt in court) is Search, and if it were publicly revealed that their capabilities are far beyond what they advertise, they would then be held responsible for a higher level of search quality, if not the indexation of “impossible” rich media like Flash.
In short, they don’t tell us because with great power comes great responsibility.
A lot of people have asked me, as Josh and I led up to unveiling this research, “What is the actionable insight?” and “How does it change what I do as far as SEO?” There are really three things as far as I’m concerned:
- You're Not Hiding Anything with JavaScript - Whatever content you thought you were hiding with post-load JavaScript: stop it. Bait and switch is now 100% ineffective. Pacman sees all.
- User Experience is Incredibly Important - Google can literally see your site now! As Matt Cutts said, they are looking at what's above the fold, and therefore they can consider how many ads are rendered on the page when determining rankings. Google can leverage usage data in concert with the design of the site as a proxy for how useful a site is to people. That's both exciting and terrifying, but it also means every SEO needs to pick up a copy of "Don't Make Me Think" if they haven't already.
- SEO Tools Must Get Smarter - Most SEO tools are built on text-based scrapers, and while many are quite sophisticated (SEOmoz is clearly leading the pack right now), they are still very much the 80s Pacman. If we are to understand what Google is truly considering when ranking pages, we must include more aspects in our own analyses:
- When discussing things such as Page Authority and the likelihood of spam, we should be visually examining pages programmatically rather than limiting ourselves to metrics like keyword density and the link graph. In other words, we need a UX Quality Score that is influenced by visual analysis and potentially spammy transformations.
- We should be comparing how much the rendered page differs from what would otherwise be expected from the code. We could call this a Delta Score (a rough sketch of one way to measure it follows this list).
- When measuring the distribution of link equity from a page, dynamic transformations must also be taken into account, as search engines are able to understand how many links are truly on a page. This could also be included within the Delta Score.
- On another note, Natural Language Processing should also be included in our analyses, as it is presumably a large part of what makes Google’s algorithm tick. This is not so much for scoring as for identifying the key concepts that a machine will associate with a given piece of content, and truly understanding what a link is worth in the context of what you are trying to rank for. In other words, we need contextual analysis of the link graph.
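As a starting point for that Delta Score idea, here is a rough sketch of my own (the metric and its interpretation are entirely made up for illustration) that compares the number of links in the page with JavaScript disabled against the number after PhantomJS has fully rendered it:

// delta-score.js (rough sketch). Run with: phantomjs delta-score.js http://www.example.com
var system = require('system');
var url = system.args[1];

function countLinks(jsEnabled, callback) {
    var page = require('webpage').create();
    page.settings.javascriptEnabled = jsEnabled;
    page.open(url, function (status) {
        if (status !== 'success') { callback(null); return; }
        // page.content is the serialized DOM: with JavaScript enabled it
        // reflects any links added or removed by scripts after load
        var matches = page.content.match(/<a\s[^>]*href=/gi) || [];
        page.close();
        callback(matches.length);
    });
}

countLinks(false, function (rawCount) {
    countLinks(true, function (renderedCount) {
        console.log('Links without JavaScript: ' + rawCount);
        console.log('Links after rendering:    ' + renderedCount);
        // A naive "Delta Score": how far the rendered page diverges from its source
        console.log('Link delta:               ' + Math.abs(renderedCount - rawCount));
        phantom.exit();
    });
});

The bigger the delta, the more a page is transforming itself after load, which is exactly the kind of thing a rendering crawler would notice and a text-based tool would miss.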
There are two things I will agree with Matt Cutts on: the only constant is change, and we must stop chasing the algorithm. However, we must also realize that Google will continue to feed us misinformation about their capabilities, or dangle just enough to make us jump to conclusions and hold on to them. Therefore we must also hold them accountable for their technology. Simply put, if they can definitively prove they are not doing any of this stuff, then at this point they should be; after all, these are some of the most talented engineers in the universe.
Google continues to make Search Marketing more challenging and to revoke the data that allows us to build better user experiences, but the simple fact is that our relationship is symbiotic. Search engines need SEOs and webmasters to make the web faster and easier for them to understand, and we need search engines to react to and reward quality content by making it visible. The issue is that Google holds all the cards, and I’m happy to have done my part to pull one.
Your move Matt.
Love the way that you pulled this all together, Mike. Especially the Pacman analogy.
The technical section of Google's webmaster guidelines tells us:
Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
I'm a little hesitant to attribute too much in the way of advanced capabilities to Googlebot, not so much because of that warning, but rather because of uncertainty about how deeply and how richly Google might crawl, analyze, and index pages.
For instance, Google has advanced the use of Optical Character Recognition in many ways with their book scanning projects, their papers and patents on how they can use it in StreetView videos, and in Google Goggles applications, and yet it still seems like they aren't using it to index text found in images on the Web.
But chances are that they could do that any day now. Still, I'm going to keep on advising people that if the text on their pages is important, they should make sure it's text that can be read by Googlebot.
To add to your takeaways, though, if someone is purposefully using text in images to hide something from Googlebot, it's a trick that may have a short shelf life.
Love the way that you presented "evidence" of Googlebot's capabilities through patents, through search engine behavior that shows certain capabilities, and through technology from Google. You then provided a motive for Google to keep quiet about their abilities to use this kind of technology (a little like a CSI investigation).
I agree completely with your takeaways, and think that there might be some more that we could add as well.
For instance, the way that Googlebot can interpret different parts of pages, through both an understanding of the DOM of a page and a simulated rendering of the page to capture visual aspects such as whitespace, tells us that we should take care in where we put what on a page. Chances are that words within the main content area of a page are going to be the ones given the most weight in a relevancy analysis, for instance.
Thanks for sharing this very thoughtful post with us.
Hey Bill,
Thanks again for pointing out key patents for this and for your response here.
I don't disagree with you. I think OCR is a great example of what you're saying, as Google has fine-tuned that ability for years, but OCR is also very computationally expensive, and it doesn't make sense to use it on every image throughout the whole web. I think it's far more likely that they will render a whole page visually and examine it for things like whitespace and JS transformations. If a site is templated, they can just render the differences as needed to limit how much computation is involved.
I'm not sure Googlebot interpreting different parts of the page is a new takeaway, as prominence has already been considered a factor, and at the end of the day people should be making the best websites they can for reasons that reach far beyond SEO. Also, since everyone started reverse engineering Panda, these are already things that we should consider.
I do, however, think that this is more reason for SEO and UX to get much closer together.
Josh has a post coming out on Distilled on Friday with some more examples that will help seal this a little more. I don't want to give it all away though ;)
-Mike
Hello Bill!
Thanks for jumping in the conversation, and thank you for offering up more patent goodness.
OCR is definitely being used within the indexing stack, and extracting content from a snapshot via Instant Preview isn't unlikely; it's just not as efficient as getting in there, rendering the DOM hierarchy and then analyzing the document. I imagine they'd much rather save that effort for converting PDFs that haven't been OCR'd and trying to read images or other content they know they can't work with via other, more efficient methods.
In the Instant Preview examples Mike has in his post, you can actually see manipulation of the preview to highlight the relevant content being identified and extracted. In my upcoming paper on Distilled, I show a few examples where Google is diving well below the fold to extract accurate sentences that summarize a page in relation to a query.
The level of accuracy with which they do this suggests one of two things:
1) They have the most accurate OCR system in the world and are holding out on us in a major way (which is still totally plausible, and even quite likely)
2) They're using a headless browser to efficiently render a page and then deploying natural language processing algorithms and basic CSS to identify a snippet and highlight it; then taking an Instant Preview.
Speaking of advanced OCR insanity and Streetview... anyone else seen pictures of building addresses popping up in the reCAPTCHA project lately?
- Joshua
Great post, but a couple of responses:
1. Javascript can still be used to hide content if it is remotely accessed from a directory protected by robots.txt, especially if you block the Google Web Preview as well as GoogleBot
2. I strongly doubt that GoogleBot and their Instant Preview "headless browser" crawl in parallel. Rendering the DOM to images for every URL GoogleBot spiders would be all sorts of impossible and a huge waste of processing power given the frequency with which the preview feature is used and the small fraction of web documents that actually end up ever being found in Google.
3. Just in general, a patent does not necessitate use of that invention.
I have often said this, but I think we tend to over-estimate Google. Every cool idea, every great patent that you hear come out of Google has to be scalable. Think about Panda, for example - it is such a difficult filter to compute that it takes weeks between updates, even given Google's insane computing powers.
Regardless, awesome post and well worth the read. It is always good to stay ahead of the game.
Hey Russ,
Fair assessment for #1, but I've definitely seen instances where G has blatantly ignored robots.txt, so my point is that it's not a future-proof way to go about doing things. Also, if they are crawling visually, I'm not sure how much they care about where the content is coming from, just that it is an eyebrow-raising transformation.
#2 Definitely not impossible, as they are doing it on demand for instant previews. It is, however, realistic that for indexing they only use it for pages they deem important enough. I agree with that.
#3 I agree with you in cases where there is no evidence. Here I don't.
Thanks for the insights Russ!
Question for Russvirante & IPullRank - what implications would you say this has for giving direction to dev teams on the best way to code a site that is AJAX-heavy?
What does this mean for options like graceful degradation or progressive enhancement? Are those necessary? It seems to me that Google would still prefer an HTML equivalent that aligns with the actual user experience, for efficiency's sake in the processing, but if a site does not currently employ those practices, providing that state can be intensive for dev teams with limited resources. If G can crawl AJAX/JS, is relying on that a viable/effective alternative?
Also, it sounds like we should keep AJAX/JS in alignment with what we show users, and not rely on robots.txt to disallow access to external JS files. Would you say that is a correct assumption?
Thanks for your thoughts.
Carrie
Hey Carrie,
It's hard to say what the implications for best practices are right now, as they are definitely not using this on everything they index. I would still say that you should follow established SEO best practices until there is more evidence that they have rolled this out on a wider scale.
In the meantime I would also run tests to see if whatever you're serving with AJAX/JS does get indexed.
You can trust that Josh and I will continue to look for more evidence and run tests, but at the end of the day we are SEOs because, regardless of Google's capabilities, they have proven themselves unreliable.
-Mike
Hi Russ,
I agree with you regarding Issue #3... patents are for the protection of intellectual property and don't necessitate implementation beyond the actual patent documentation.
In the original paper I address some of the architecture changes that were made to Webkit when it became Chromium that really seem to suggest scalability and redundancy for a crawling architecture.
First and foremost we have the V8 JavaScript engine, which is capable of being embedded or operating stand-alone. This capability has made projects like NodeJS possible. It's written in C++ and compiles JavaScript straight to machine code... it's lightning fast and lives in its own processing thread.
Next there's the fact that each Chrome "tab" is actually a separate process. If you open your task manager on a Windows PC you'll see several instances of Chrome running, one for each tab open. This provides a lot of processing benefit and is highly scalable, especially if you view Google's crawling architecture as a cloud designed to scale resources a la Amazon's EC2. It also provides amazing fault tolerance for a crawler, since if one process crashes, the other crawling processes are undisturbed.
Lastly, we have the headless component of the equation... a lot of the overhead in browsing is related to actually rendering the elements to the screen for a user. Headless browsers don't render a browser window; they're just a background process exposing the DOM. PhantomJS is a great example of this.
We could also dive into the Chromium Remote and other crazy stuff like that, but it's probably just easier to read the original on IPullRank.
In the followup that will be on the Distilled Blog Friday, I extend the original paper and Mike's awesome contributions here with some more proofs directly from the SERPs; so be sure to check for that too, as I think it will be of interest to you.
Thanks again for your thoughts!
Jesus, this is not a post, this is the Book of Revelation; if your conclusions are correct (and you provide quite a lot of proof), this is like taking a giant Blue Pill and discovering the G-Matrix in its reality. The UX Quality Score seems to be the feeder of the Panda algorithm, from what you write, Mike. And it gives a solid technical base to all the speculation about how Google uses data on what users do on the web. We all knew we were acting in a Matrix; now you've given us something that confirms it.
P.S.: the only negative part of your post > you talking about your parents playing Pacman in the arcades. I was doing the same in the mid-'80s! Feeling old at 8:30 am my time is not so cool ;)
Hahah I played Pac-Man with them in the 80s. Granted I could still count my age on my hands then.
Thanks for reading Gianluca!
-Mike
First, nice post and creative as always. :)
This is one of those things where it's nice to see evidence, but I'm also kind of left feeling like "yeah, not really too surprised." We've known for years they could do visual segmentation, beyond just boilerplate code analysis. Matt Cutts said once that links could be weighted differently based on how far down they appear visually on a page. Then previews in SERPs showed that they can render Flash and JavaScript ads to generate previews. Then we saw those AdWords emails back in February after Panda that talked about the visual space taken up by ads above the fold on a 1024 screen. They've been extracting text from images, doing advanced image analysis, and can even do search by image. So their visual analysis is exceptionally better than it was a few years ago.
I certainly believe, and the evidence shows, that they're moving towards a more visual crawler, but I'd be willing to bet that a huge portion of the analysis is still being treated as if it's all done with a text-based browser. (Not that they aren't gathering this visual information, but it doesn't mean it's reflected in results yet.)
I'd imagine visual analysis, JavaScript execution, filling out forms, and deep crawls are extremely expensive in terms of computation and resources. So what I imagine is the case is that this technology has existed for quite some time and what we're seeing are improvements in efficiencies and scalability.
I remarked in my recent posts that all of these things being uncovered have been sitting out there for years, even the social stuff. We're just now seeing them being put into practice. Google has likely been able to do this level of deep media and JavaScript crawling on a small sample fairly easily, but the limitation might have come when you look at the daunting demand of doing that at scale for the entire internet, for all countries, and in all languages.
Even Panda wasn't a change in understanding; it was a change in scaling machine learning. It was a process / technology improvement. The Freshness update wasn't a change in understanding either, but advancement built on top of the path laid by Caffeine.
I think it might also be safe to say that GoogleBot's capabilities are conditional. When the environment (code, PageRank, trust, etc.) is right, it might warrant different levels of crawling. If nothing in the code suggests AJAX, Flash, or JavaScript rendering, then resources could be saved by not executing a visual rendering of the page.
Then I think we get ourselves into another conversation, like crawl efficiency and crawl budget. Does the cap of resources allotted to a domain get burned through if the crawl must spend additional time dealing with Flash, JavaScript and AJAX? If so, it could still be a good recommendation to keep JS and Flash to a minimum, especially if you’re having indexation issues.
Justin,
Yeah, I don't disagree with you here and I brought up similar points in my reply to Bill.
The crawlers may be separate, and Google may just fire up the visual crawler as needed. In most cases it might be enough to just crawl a homepage and a few internal pages, because the rest of the site doesn't change much visually. Also, based on my experience with PhantomJS, visual crawling is actually incredibly fast, as there is no actual visual interface. The analysis will obviously be quite computationally expensive, but this is Google we're talking about here; I'm sure they have figured out a way to scale it.
I think an important point, though, is the machine learning aspect of the end user's version of Chrome and how it could potentially be informing Googlebot, thereby making the filling out of forms and such far less computationally expensive than you are suggesting, since their educated guesses are far more accurate than just randomly filling fields.
Def agree with GBot's powers being conditional and used as needed.
To your point on expending crawl allocation I think in general page speed is a good proxy for determining that.
Awesome insights Justin, thank you.
-Mike
I'm definitely only speculating on many things, but a few thoughts. (And not disagreeing, just intellectual)
I agree with Russ that we may give Google too much credit at times in terms of computational power. No doubt their computational abilities are absolutely amazing, but Panda does take 20 to 30 days to process. That isn't so much a problem with Google, but just the extensive amount of work / data needed. I've also seen a comment before that the Google bomb algo is a push button algo, which doesn't run all the time.
And I think we're also discussing two different types of computational work. There is the crawl work (all we're seeing with PhantomJS) and then the analysis computation. Even if visual crawling is exceptionally fast, doing something with all that data is another problem altogether. Any time they get into image analysis, the amount of computing power needed increases significantly. This is why I feel that they're certainly doing the first part (crawling), and the evidence shows that, but they haven't quite solved the second (scalable, advanced statistical analysis of robust media / file / technology types), which is reflected in search quality, but they're working on it.
And I think we have to be cautious using page speed as a proxy for the resource requirements of the type of crawl we're discussing. Page speed is the time it takes for the initial load, but subsequent AJAX calls, form submissions or the deep crawl of a Flash file are resource demands in addition to the page load.
Thanks for the post. I'm really enjoying all the conversation it started. Regardless of exactly how they're doing it all, it's certainly good to get people talking about it.
In the spirit of speculation...
The problem is quite complicated but I don't know that we give Google too much credit. After all these are the people that brought the crawler into the real world with self-driving cars and have data centers that float on the ocean. They built a program where I can look at your house from my desk. Granted they are not NASA but Search is their bread and butter. If it can be done and it makes sense for improving their main product it's realistic that they will eventually do it. In the case of browser-based crawling the future is the past.
Panda takes 30 days to roll out but we really don't know how long it takes for them to actually do the analysis. How much of that time is them QAing the new indices and re-running it? Obviously indexation and ranking is a big data problem of epic proportions but everything they do is an issue of scalability and every year they conquer it better.
Hmmm... that's a good point. Then I'd say page speed + size of rich media + speed of dynamic assets; that should be some sort of load score as well.
I loved reading through a recent Google patent that described how they may do a Panda-type analysis on videos, which could help them do things like identify when adult content might be inserted into otherwise innocuous content within a video (a point brought out in the patent). Running something like that in a timely manner on a site like YouTube seems a staggering undertaking to me, but so does doing something similar to all the sites on the World Wide Web.
I have noticed Google acquiring more and more hardware related patents, and seemingly investing in more and more data centers as well. They are working hard on building the capacity to do things like we are speculating upon.
And infrastructure updates like Caffeine that allow for incremental updates to indexes about specific documents have been working to reduce computational costs of such efforts significantly.
For instance, the recent Google freshness update was likely based upon approaches originally developed as described in divisional patents from Google's Historical Data patent from 2005 that only really became feasible once Caffeine was in place.
Some of the things we see and read in places like patents might not necessarily be technologically feasible, or at least not yet. But I like being aware of the possibilities, and being able to do things like avoid potential problems like Panda when they launch.
I should read before I comment :) Just made the same point about Caffeine. Too many people don't get that these infrastructure advances fuel SEO updates for years. Caffeine made possible the things Google has wanted to do for the last 3-5 years but couldn't. That's why I think it's plausible that ideas from 2004 could be coming into play, computationally.
To quote: "I think it might also be safe to say that GoogleBot's capabilities are conditional. When the environment (code, PageRank, trust, etc.) is right, it might warrant different levels of crawling"
Justin's point is KEY. iPullRank mentioned that some methods are "computationally expensive" and a headless browser falls into that category, at least when the enormity of indexing the net is concerned.
Were I to design a search engine, I would save my sophisticated, computationally expensive crawlers for sites that warranted that level of attention. To use this type of crawler on the whole web would be a waste of resources.
Therefore, I think the smart money is on using JS and flash in a limited manner...at least when it comes to a page's core content. I'd guess that, most of the time, Googlebot is dumb.
If there were Nobel Prizes in the search industry, then Joshua Giardino, Michael King and Bill Slawski would deserve them for this groundbreaking research, seriously. Hats off to you guys. This is the best post of 2011.
Well said.
Great food for thought here, Mike. Tying into what Russ said, I think the toughest part is always separating what Google CAN do from what Google DOES do. No question they've had the capability to parse pages visually for years, and they use it at times. What's hard to sort out is - how often do they use it and in what capacities?
I doubt that every page crawled is fully visually segmented, personally, but I strongly suspect they extract elements on at least a sitewide basis, to understand navigation, footers, etc. I do think the idea that we can just move some HTML around in 2011 and fool Google is pretty naive. That worked 5 years ago, but not so much these days (I haven't seen any good evidence of code-order effects in the past 2 years, honestly).
The other incredibly important point people overlook that this all touches on is the importance of infrastructure updates (like Caffeine and, before it, Big Daddy). Too many webmasters and SEOs consider that boring tech stuff - it's not an algo update, so who cares? I'd argue those infrastructure overhauls are 10X more important than most algo updates, because they fuel all updates for years to come. Caffeine made Panda possible and gave Google computational horsepower. That means they can do more faster and put technology they bought in 2004 to work in 2011. Caffeine will be powering advancements well into 2012 and probably beyond, IMO.
I want to meet the person who gave this a thumbs down, and kick them in the chest. Seriously? Thumbs down? You probably hate bacon and puppies too.
Excellent post Mike, very informative and well laid out.
LOL, they didn't much care for your comment either, apparently.
I very much appreciate the sentiment Corey, but they are certainly entitled to their opinions.
Man, you're on a roll Mike! What it all boils down to is simply providing awesome quality content, be it copy, images, videos, etc. Most people trying to game the system purely for monetary reasons will almost always fall behind the curve. I bet some of them are probably spending more time trying to find the next exploit, workaround, time-saver and shortcut than they would need to spend on quality content creation in order to earn more for themselves.
There's a lot of truth to focusing on building for people over building for the algorithm. If you're building for people, you're future proofing yourself, because the algorithm is all about identifying meaningful features of human engagement and interaction.
Deploying a headless browser as a crawler just makes this feature extraction that much more effective since they can map the data from the Google Toolbar, Human Search Quality Reviews, and the Chrome opt-in program right back to their search stack.
With that said, the technological aspect is a very important part of building for people. UX/UI concerns are very important, especially with the explosion of both social and mobile. Discovery is another aspect of building for people... if people can't find it, they can't interact with it and spread it... and Google owns a good chunk of new content discovery in any country. Facebook is probably the next biggest driver of new content discovery, but it's a walled garden; thus building for GoogleBot is in some respects building for people too.
Haha it's a crazy little circle!
wow, that's a killer post Mr. Mike.
About the advertisement stuff: I was asked to teach a class about search engines at a university here in Italy. Near the end I found myself talking about algorithm updates, Panda, content quality and so on, supporting as usual the thesis that good content always pays off.
Anyway, shuffling through the data, I saw that one of the sites rising in popularity and search traffic is YouTube. Nothing wrong with it, absolutely, YouTube is a great UGC and video sharing site. So I opened the browser to show it to my students, and...
...oh, gosh, it's PACKED with ads. Ultra-invasive, huge and with unbearable user experience issues! So, dear Mr. Google, while I still hate ad-packed sites and support your (supposed) struggle for quality, why are you becoming what you hate?
YouTube also has loads of thin and duplicate content. It remains a glaring contradiction to everything Google has been recommending post-Panda. For now.
Amen to that !!! it makes me angry as well...
Agreed. But I think YouTube is so "relevant" that even if it somehow got "penalized" it would still rank great.
Awesome post - particularly the evidence section and the point about site loading speed; how that is effectively measured has been a question of mine. I'm happy if GoogleBot can see sites visually - it should encourage making sites for the user, not the engines :)
Spot-on Charlotte. It's another Awesome post by Mike.
Will hopefully help to bring Search and UX teams closer together; superb in-depth research, analysis and explanations such as Mike's here are so valuable :-)
What a monster of a post. You make a very compelling case here. It stands to reason that GoogleBot would want to be as close as possible to a human interacting with the web, and since we see the web through browsers, why can't the Bot be one?
Anyway, great job :-)
Awesome post Mike!
The real question is, do people want to hear the message? Seems like the writing has been on the wall (in the SERPs) for a very long time, but very few have wanted to see it. I'm glad you came out and published this in the SEOmoz arena, as I fear it is going to take a lot to move some people from their "find a trick to fool googlebot" mentality. Hopefully the respect that you and the SEOmoz Blog have in the industry will make some of those people sit up and take notice.
As for Matt's public announcements, I think the key is always to watch out for the qualifications. Here's a classic (and intriguing) example from his recent blog post on algorithm changes:
" today we want to give you a flavor of specific algorithm changes by publishing a highlight list of many of the improvements we’ve made over the past couple weeks."
A little further on in the post, after a little more waffle about the more than 500 changes made in a year, he says "In that spirit, here’s a list of ten improvements from the past couple weeks:"
Since by now we are all feeling so warm, fuzzy and to be honest just plain shocked, at the idea of Google revealing such information, it is only natural that we might overlook the fact that "many" in fact refers to considerably more than the "ten" in the "highlight list".
Let's face it, the man is as talented with language and suggestion as he is as an engineer ;)
Where Search Engines are concerned, I am reminded more and more of that very old saying "Believe none of what you hear and only half of what you see".
Great post Mike, I really hope the message gets through.
Sha
If I could, I'd give you 1000 Thumbs Up. What a post (post?), God! Eh, Mike, you're an absolute pro.
It's exciting the fact that User Experience Design is going to be so important.
From now on, I'm sure Google has a problem with you. You're kicking its ass and telling it: Come on, guy, tell us the truth! What are you waiting for? Did you think we weren't that good at this?
Amazing, Mike. You've just got Google into trouble.
"Any content you thought you were hiding with post-load JavaScript -- stop it. Bait and switching is now 100% ineffective."
Amen, may it be forever left behind...
I'm sure some people are still playing with this. You're right, the times of black hat are gone.
Definitely one of the best posts of the year. Great research and very articulate thoughts. I would literally pay to read this.
That's totally, undoubtedly an awesome post. I never had the idea of Googlebot being a combination of a headless browser and the old text-based crawler. Now I understand why the Google folks gave so much attention to building Chrome: in spite of the competition in the browser industry, and despite already having the Google Toolbar, they still made it their focus to build the Chromium WebKit browser.
You are right... it's like Skynet. Googlebot is trying to reach the level of Skynet.
I have a new name for Googlebot: Ms. Pacman. And yes, this Ms. Pacman is much cleverer. I have to be careful around it. Thanks Mike!
A quick look at the log files from any website will show that the bots are definitely not headless browsers. If they were, they would act more like actual users, retrieving all the resources (JS, CSS, JPG...) required to construct the webpage. It is much more likely that the headless browser is part of a second-line analysis done using the cached content. Here Google would probably use a feedback loop from the search indexes to determine, say, the pages it will require for instant preview, etc. I think that the amount of resources required for headless browsing would require some rationing of resources. Why waste good processing power on spam... On the upside, the technology used for headless browsing will probably get fed back into the rendering engines used by Chrome and the like...
I owe you an apology. I have spotted the headless monster. It is alive and crawling. I made some changes to a webpage and placed some images in a subdirectory, and lo and behold, along comes Googlebot 2.1, sees the new structure and follows up by fetching the images from the subdirectory, with the referrer info set to the parent page. This was not the Google image bot.
Nice post; however, I feel the only answer it will get from Google, if it gets one at all, will be that just because they could have done something for years before this, it does not mean they did it.
Capability and actual use have always differed in the case of Google, so maybe Matt is only hinting that some capabilities are now going to be brought into play.
Also, don't forget that Google Analytics is on a large chunk of websites out there. I'm pretty sure they don't ignore this data. It's probably used in conjunction with crawlers to improve the quality even more.
GBot has been deploying spiders able to interpret and understand the FULL DOM since at least 2004.
The particular bots may NOT have been used for ALL of their web scraping needs, but they were definitely used for some.
I guess a point of contention here is not necessarily whether Google could understand the Document Object Model, but to what degree it would go through the process of rendering it out. GoogleBot may understand that "some text here<img src='mybigpic'>more text here" contains an image in context, but does it know that the image is only 5px by 5px, or that it might be 1000px by 1000px and push the second piece of text to the bottom of the page?
I do very much agree with you that in certain contexts Google may choose to deploy a deeper look at data. For example, I believe the top 10 results are treated with greater scrutiny than the next 10. This is not to say that the algorithm is different, but that there are different filters that get applied as a site gets closer and closer to the top 10.
Russ
Let me reword it
I believe G has been able to understand the fully rendered DOM and changes within the DOM since at least 2004.
Whether we call it a headless browser or not is moot - GBot and/or portions of code that are automated within Google operate as a browser and have since at least the earlier part of last decade.
All of your other points I completely agree with.
Hey all,
Check out Josh Giardino's follow up post "Google Stop Playing the Jig Is Still Up!" on the Distilled blog https://www.distilled.net/blog/seo/google-stop-playing-the-jig-is-still-up-guest-post/. It features further research and evidence.
-Mike
In my opinion, yes, they can see all this stuff, but it's not yet implemented to work in parallel with the text-based crawler.
I have cloaked sites (from the old days) still indexed and bringing in traffic that would certainly be banned if they compared the "instant preview" crawl with the text-based crawl. The two are totally different pages...
Hey Charles, would you be keen to share those sites? Or at least screenshots of the cached pages and the instant previews? I'm very curious.
-Mike
I wouldn't share the sites here, but if you keep it to yourself I can give you one; just PM me.
I might do the test with a new site to see if the same thing happens.
Great work, sir. You have laid out all your research in such a good manner that even a beginner SEO will understand what is happening in the SEO world. Google always tries to mislead us (SEOs) with ifs and buts. They don't tell us everything about search because, as SEOs, we are their biggest rivals, trying to compete with them to show the results we wish. As far as Googlebot is concerned, it is reaching the point where it literally behaves like a human net surfer. The most positive part of the advancement is that now we don't need to treat Googlebot (oops, Pacman) and humans differently.
But I am still confused: are they actually able to read Flash sites right now, or do they need some more time?
Thank you Mike, a very informative and well written post, with great take-aways... I especially like the idea for the new tool :)
I think the key thing to keep in mind is that there is a difference between seeing all the angles and knowing how and when to play those angles.
The post provided some compelling arguments that Google sees and understands what a site's up to -- but that doesn't mean that they're acting on it today.
It's somewhat akin to seeing and understanding what some of the most competitive SEOs are doing, and then replicating or one-upping that yourself or within your organization. It's doable, but it's a journey, and you're far more likely to chip away at implementing a new approach than you are to change the way you work overnight.
I think that Google might even be able to read text in graphics. Why are all the antispam numbers and letters under forms so distorted? I think it's because spam bots are able to read non-distorted letters and numbers. And if they can do that, why shouldn't Google be able to do the same? The only reason I can think of is that it takes too much calculation power (processor time), but on the other hand that becomes less and less of a problem with the advance of technology, doesn't it?
Check out Luis von Ahn's TED talk on massive-scale online collaboration for an interesting take on CAPTCHAs.
This is one of the most comprehensive and interesting articles I have read in a while. And I loved the Pac-Man analogy because without it, I am not sure I would have been able to get a full grasp. I believe this reinforces the fact that as SEOs, we need to be more concerned with providing users with an improved overall experience from content quality, content placement and content design.
What a post, sir! Super phenomenal... I can't say Google lies, but it is common for Google to manipulate/mislead from time to time, and you are right, this is something we should keep our eye on.
The super Googlebot (with its headless browser) is good in one way: the content we hide from the search engine through JavaScript and other means is no longer hidden, and one has to show Google exactly the same copy that is shown to users.
One part I would highly agree on is that SEO tools should get smarter!
WOW!! This is superb. I was reading your post on iPullRank yesterday, and now I've read this post twice to get the hardcore techy stuff into my head. It's really mind-shredding and informative; now SEOs must think beyond what we've currently been thinking.
As far as Googlebot crawling pages that can only be found using Flash, I think it's more likely that those pages were found through Chrome and Toolbar data coming back from end-users. Highlighting sections of content in previews was something I had not noticed, very interesting!
Awesome post. Thanks for all the comprehensive research and analysis. This should certainly be a wakeup call to those in our industry, in regards to site quality and user experience and their effects on rankings.
Man, you've got a ton of knowledge here, appreciated!! And I must say your last three conclusion points ring true; I think your post will be tweeted and shared by all the Mozzers ;)
Hey Mike, this is just an awesome post and awesome research. I am not that much of a technical SEO person, but I think this info can be useful for you.
I've just searched on google.co.uk for "tring office rental," and in the Instant Preview I can see a cached copy of my page (https://screencast.com/t/fiG2oLAP), but the actual page at https://www.searchofficespace.com/uk/office-space/tring-serviced-offices.html shows December. So can we suggest that if Instant Preview is caching websites rather than showing the real-time page, it acts as a crawler? Is this another proof of what you said in this article?
Kind Regards
Suren
You're dead on about not chasing Google. I have achieved my best results by placing transparent and relevant content in as many appropriate places as I can find. I have been leaving PR and "nofollow" considerations behind and spending more time on on-page optimization.
It is gratifying to think that search robots are out there paying attention.
Very good job done by you guys... really appreciate it, and let Matt Cutts know this as well. I would also like to say this is a revolutionary post, and I know many of us would agree with me. It's always very difficult to work out Google's processes, and you guys have done a tremendous job, I must say. Hats off to you guys... maybe Google will hire you soon ;)
haha..they already tried.
Hey, that's really great... now don't tell me that you are not interested!!! Or are you?
The wily Mr. Cutts and his team (and of course the rest of the search engines) should go on and make their robots smarter and smarter - I support that. There is still enough crap around, and I don't have the time to file spam reports the whole day through.
Regarding SEO tools - glad to hear your opinion that SEOmoz is leading the pack.
Great analysis Mike. I think these are all things we've seen bits & pieces here & there throughout the years; Google just never confirmed anything 100%. You've done a great job summarizing everything & the examples are superb. Excellent write-up ~ thanks for sharing!
No wonder CLOAKING is so despised by Google. How long have they been hating on that?
Hats off to you, Mike.
Just enjoyed your super-long analysis of Googlebot.
What a final conclusion: GOOGLEBOT + HEADLESS BROWSER = SUPER GOOGLEBOT.
You are awesome.
Michael, this post is BOSS. One thing I've looked into is scraping Google Insights with ImportXML, and it can't be done as far as my experience goes. I compared two URLs, the "home" of GInsights and one with search parameters (preceded by #, obviously): the scraping happens, but it returns the same result for both URLs, which suggests IMPORTXML can't go further than the #...? It would be great if Richard (who's notoriously an ImportXML expert!!) could share with us! Thank you!
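A likely explanation, sketched below on the assumption that the Insights URLs shown are just illustrative: everything after "#" is a URL fragment, which is never sent to the server, so any server-side fetcher such as IMPORTXML receives identical HTML for both addresses. Only something that executes the page's JavaScript (a headless browser, for instance) ever sees the "#"-driven state.

```python
# Minimal sketch (illustrative URLs, not real endpoints) of why IMPORTXML
# returns the same result for both Insights URLs: the fragment after "#"
# never goes over the wire, so the server returns the same HTML either way.
from urllib.parse import urldefrag

home = "https://www.google.com/insights/search/"
with_params = "https://www.google.com/insights/search/#q=seo&geo=US"  # illustrative

print(urldefrag(home).url)         # -> https://www.google.com/insights/search/
print(urldefrag(with_params).url)  # -> https://www.google.com/insights/search/
# Identical request URL -> identical response -> identical scrape result.
```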
I like metaphors, but unfortunately you can't use them for projects like, for example, gambling SEO; I'm afraid they would look ridiculous there. The post is OK.
You nailed it, Mike! This is one of the best articles I have read so far today, and I have learned something new from your post. Anyway, I really like the Pac-Man metaphor; it actually makes the post more interesting. Looking forward to reading more of your articles.
Nice work, Michael (and Josh and Bill). I'm glad to see that Russ broached the patent-doesn't-equal-usage caveat. Your response is also valid, but I think Russ's (and my) point is that this type of headless browser activity probably hasn't been occurring for years and years.
But as you said, there is quite a bit of evidence that it is now happening. The patents, the Webmaster videos and conference comments regarding Instant Preview, Javascript, and link/content placement on a page are all compelling arguments that Google has left the text-only Penguin behind and they are viewing web documents through a user's eyes.
Brilliant article and research!
I especially love the direct quote "with great powers..." You're absolutely right; it seems like the most logical conclusion.
Great read, Mike; it's like a small ebook to me. This is really useful info for techy people all around the world. Search engines define their parameters based on various factors; Mike has picked up on the crucial ones, and I agree he pulled out the best ones.
I love this because it is useful to both super advanced SEOs and everyone else. I've always thought "look at the page visually," because in the end Google has the means to do this. But you gave me a great, concise way of understanding that.
A powerful read, thank you Michael!
Kudos Mike! Every now and again I get lucky and find an SEO guy who authors an excellent article like this that I can share with my clients to .edu them on the many underlying mechanics that need to be understood if they want their content and site to get the props they think they deserve. BTW, your moniker iPullRank is pretty cool!
Wow. That literally blew my mind. I had no idea Google's ability to "see a website" was so advanced. This is going to give me so many ideas for ways to improve SEO.
Thanks for writing.
I like the Pac-Man metaphor in this article; it makes the piece more interesting and fun to read, Mike.
I hope the information in this article is a wake-up call for SEOs who may still believe Googlebot is a text-only crawler. I agree with what you write: it hasn't been one for some time.
I'm excited about user experience design and how important it will be going forward. I agree with you that SEO tools must get smarter, and I'm looking forward to a metric like a UX quality score and some correlation data around it.
Nicely done, Michael! I dig the analogy (could also count my age on my hands when Pacman came around).
Frankly, I'm surprised that so many folks out there didn't understand that this is where Google is at (and has been for years). This is why it's increasingly important to optimize for humans more so than for some antiquated notion of "Robots" or "Spiders".
Soon, Googlebot will be more human than humans...
IMO Google can use the headless version of Googlebot, and will use it whenever they determine it's necessary, based on toolbar and Chrome data. The flag that activates this crawl may also be based on site traffic; you can see new AJAX URLs crawled in WMT when a site's traffic goes up.
How about style sheets? Can Googlebot identify whether an element is outside the "visible zone" of my page? How deep does the CSS processing go?
Matt Cutts has said in the past that using the huge negative margins for CSS image replacement is frowned upon. Things like color-on-color backgrounds have been detected for a long time now. I think it's safe to say they have a great understanding of your CSS with or without a headless browser.
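For anyone curious what that kind of check could look like, here is a rough, purely illustrative sketch using headless Chrome via Selenium. The URL is a placeholder and the heuristics are deliberately simplistic, so treat it as a guess at the idea rather than a claim about how Googlebot actually does it:

```python
# A deliberately simple sketch of visibility checks that a headless browser
# makes trivial: flag text pushed off-screen (e.g. huge negative offsets) and
# text whose computed colour matches its own background colour.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://example.com/")  # placeholder page

viewport_width = driver.execute_script("return window.innerWidth;")

for el in driver.find_elements(By.CSS_SELECTOR, "p, span, a, div"):
    x = el.location["x"]
    width = el.size["width"]
    color = el.value_of_css_property("color")
    background = el.value_of_css_property("background-color")

    # Element rendered entirely outside the horizontal viewport
    if x + width < 0 or x > viewport_width:
        print("off-screen:", el.tag_name, el.text[:40])

    # Crude "colour-on-colour" check against the element's own background
    if color == background:
        print("invisible text:", el.tag_name, el.text[:40])

driver.quit()
```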
Ah, and what I don't like about this advertisement detection is that great content often comes with advertising. So we might have less great content in the future if the people producing it earn less from advertising because Google has punished them for having ads on their site.
Great post! It's all about building content and sites for people, I think the industry has been doing a good job at this for a couple of years now (although there is still some way to go). I think Matt is probably letting us know that Googlebot is getting much smarter, works much more like a browser and is able to look out for things that people are looking for in web content.
It's not really back to the drawing board, or even a wake up call, just a gentle reminder.
For too long there has been an attitude of us against them, both from the search marketing community and from Google's point of view. It's firmly rooted in the history of Google's relationship with SEOs; however, we are far closer now than ever before to being a purely marketing-based industry, and marketing relies on data to do what it does best. This is no longer about cheating Google; it's about best practices, promotion, analysing data and working out what's best to do. Although the search marketing community has on the whole come to terms with its status, I think Google still views us as the enemy.
I once read an article in which the author referred to SEOs as parasites, and I think this is the way Google views us. But if the industry represents Google's parasites, we are the ones that help Google digest its food.
Love the use of Pacman through this post!
Nice one Mike! My favourite bit:
While the headless browser and Googlebot as we know it may be separate in semantic explanation I believe that they always crawl in parallel and inform indexation and ultimately rankings. In other words it's like a 2-player simultaneous version of Pacman with a 3D Ms. Pacman and a regular Pacman playing the same levels at the same time. After all it wouldn't make sense for the crawlers to crawl the whole web twice independently.
Truedat
This is a great post, Mike. I asked this via Twitter, but I'll try here too: say Googlebot went away years ago and Google replaced it with web user data from Chrome. Would this be consistent with the headless browser you talk about here? Wouldn't the results be consistent with a headless browser, at least from an engine or searcher perspective? Would the end result be the same from an SEO point of view?
Although what you've shared is speculation, since you don't work at Google... I've heard even Google engineers don't fully know what they're contributing to and creating.
However, everything mentioned is highly plausible and definitely possible. If you are an SEO focused only on building links, without caring about managing expectations once the visitor hits your site, you're in for a surprise.
UX design correlates with managing user expectations. It's not how much traffic you bring, but what you're actually doing with your traffic.
Or in this case, let's see what we're doing with GoogleBot.
By far the best line: "Any content you thought you were hiding with post-load JavaScript -- stop it. "
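If you want to see why that line holds, here is a minimal sketch of my own (placeholder URL, nothing confirmed by Google) comparing the raw HTML a text-only fetch sees with the DOM a headless browser builds after the page's JavaScript runs:

```python
# A rough illustration of why post-load JavaScript no longer hides content:
# a text-only fetch never sees what the script injects, but a headless
# browser that executes the page's JavaScript does.
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://example.com/"  # placeholder page that injects text on load

# 1. The old "Lynx-style" view: raw HTML only, scripts never run.
raw_html = requests.get(url, timeout=10).text

# 2. The headless-browser view: scripts run, the DOM is fully built.
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get(url)
rendered_html = driver.page_source
driver.quit()

# Anything present in rendered_html but missing from raw_html was "hidden"
# behind JavaScript; that is exactly what a rendering crawler can now index.
print(len(raw_html), len(rendered_html))
```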
This is some really robust data in this post. I am sure Google has been refining the heck out of their search tool in order to fend off the evil spammers. I think they have some nice technology refinements happening right now and it will be interesting to see where things go from here.
Wow, another great one Mike. Thanks to all you guys for doing this research - making my job easier.
Really interesting point about the cached Flash previews Mike!
Got me thinking, since all I've seen is Google compiling Flash through Adobe's SDK. I always thought it was only the links, and their approximate locations on the page, that were parsed from the ActionScript, not full CSS-style colours etc.
Actually getting the layouts compiled, and skipping past the loading splash screen (intro homepage) for the screenshot, that's quite the feat!
They've come quite a way since 2008. I'm sure Bing must have too since they started working with Adobe around the same time.
How well do you think they handle ActionScript animations over time (obscuring links with other animated elements, etc.)?
Great post, Mike! +1 for the awesome graphics used!! SEO tools will get smarter and more impressive, especially once the new Google Analytics platform is fully released!
Mike, awesome... just gratifying! What you describe may be meaningful, but I'd like to share a recent experience from the last Google update: I have an AJAX-based website (one of my clients'), and it is still being cached regularly; within 26 days every inner page got PR3, and the home page is now PR5. So how can we say Google lies?
I'll be honest, this article was more successful at making me want to play some Pac-Man than at destroying my established views. I /really/ want some Pac-Man now.
For the most part, if you're generally not trying to game the system, it's just going to mean Google is actually considering whether your site is fugly and saving your customers the pain of leaving it by dropping it down the rankings.
I disagree (not with wanting to play Pacman): If this is true, and the argument backing up the thesis is certainly excellent, then we need to modify our understanding of how Google works, whether we're gaming the system or not. If we don't change our established views about how Google works, we'll think of Google in 2004 / 2009 / 2012 terms for far too long.
After all, we'd be horribly ineffectual SEOs if we did our job to satisfy Google circa 2002.
This is why I don't get why people get so upset when someone writes or says something about "black hat" SEO. You can learn a lot from just understanding something, whether you choose to use it or not.
I must admit I worded that poorly. It has changed how I'll be viewing Google from now on; I don't really /have/ much of an established view, given how fluid things are. It was more meant to be a comment on how much I liked the Pac-Man analogy.
It's actually meant to be in agreement with the kind of thing you're saying as well.
I'm certain black-hat techniques will evolve to deal with this. White-hat will have a few extra things to consider. If people do what they used to and just SEO the code, Google will start 'seeing' it, and their SEO will suffer.
I hope that's clearer. I must admit my first comment is slightly embarrassingly open to interpretation.
There you go: https://www.google.com/pacman/ ... ah, the irony!
"Basically they have trained us to believe that Googlebot, Slurp and Bingbot are a lot like Pacman in that you point it in a direction and it gobbles up everything it can without being able to see where it’s going or what it’s looking at."
What on earth are you on about? If that were true, how would Google know which sites rank where in the SERPs? That really makes no sense at all, dude!
What Googlebot does is collect (eat) and, to some extent, interpret the data; it doesn't do the analysis.
SEO is a science that is constantly changing and evolving from day to day. SEO should evolve into something more than cheating the system.
Thanks for a great post.
Awesome info, Mike. No, it's not like a science; a science doesn't change as often as SEO does. SEO has become more like a religion, at least for internet marketers, and a very fickle religion at that.
Thank god the stupid bot is getting smarter! Yes, you can make it, dude!!
Good post, but the metaphors get boring; you should limit them.
I disagree about the metaphors; they make the post accessible to SEOs like myself who don't come from a development or programming background (non-technical SEOs, if you will).
I like the metaphors; they help with explaining and remembering things... that's what metaphors are for.