Matt Cutts announced at Pubcon that Googlebot is “getting smarter.” He also announced that Googlebot can crawl AJAX to retrieve Facebook comments, coincidentally only hours after I unveiled Joshua Giardino's research at SearchLove New York suggesting that Googlebot is actually a headless browser built on the Chromium codebase. I'm going to challenge Matt Cutts' statement: Googlebot hasn't just recently gotten smarter; it hasn’t been a text-based crawler for some time now, and neither have BingBot or Slurp for that matter. There is evidence that search robots are headless web browsers, and that the search engines have had this capability since 2004.
Disclaimer: I do not work for any Search Engine. These ideas are speculative, based on patent research done by Joshua Giardino and myself, some direction from Bill Slawski, and what can be observed on Search Engine Results Pages.
A headless browser is simply a full-featured web browser with no visual interface. Much like the TSR (Terminate and Stay Resident) programs that live in your Windows system tray, they run without you seeing anything on your screen, but other programs can interact with them. You can interface with a headless browser via a command line or scripting language, and therefore load a webpage and programmatically examine the same output a user would see in Firefox, Chrome or (gasp) Internet Explorer. Vanessa Fox alluded to the possibility that Google may be using these to crawl AJAX back in January of 2010.
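To make that concrete, here is a minimal sketch of my own (not anything Google has published) showing how a headless WebKit browser like PhantomJS can load a page from the command line and inspect its fully rendered output; the URL is whatever you pass in:

// render-title.js (a minimal sketch). Run with: phantomjs render-title.js http://www.example.com
var page = require('webpage').create();
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    }
    // evaluate() runs inside the page itself, against the fully rendered
    // DOM (i.e. after any JavaScript on the page has executed)
    var title = page.evaluate(function () {
        return document.title;
    });
    console.log('Rendered title: ' + title);
    phantom.exit();
});

Nothing ever gets painted to a screen, yet the script sees exactly what a user's browser would.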
However, the Search Engines would have us believe that their crawlers are still similar to Unix’s Lynx browser and can only see and understand text and its associated markup. Basically, they have trained us to believe that Googlebot, Slurp and Bingbot are a lot like Pacman: you point it in a direction and it gobbles up everything it can without being able to see where it’s going or what it’s looking at. Think of the dashes that Pacman eats as webpages. Every once in a while he hits a wall and is forced in another direction. Think of SEOs as the power pills, and of ghosts as the technical SEO issues that might trip up Pacman and cause him to not complete the level that is your page. When an SEO gets involved with a site, they help the search engine spider eat the ghosts; when they don’t, Pacman dies and starts another life on another site.
That’s what they have been selling us for years; the only problem is that it’s simply not true anymore and hasn’t been for some time. To be fair, though, Google normally only lies by omission, so it’s our fault for taking so long to figure it out.
I encourage you to read Josh’s paper in full, but some highlights that indicate this are:
- A patent filed in 2004 entitled “Document Segmentation Based on Visual Gaps” discusses methods Google uses to render pages visually and traverse the Document Object Model (DOM) to better understand the content and structure of a page. A key excerpt from that patent says, “Other techniques for generating appropriate weights may also be used, such as based on examination of the behavior or source code of Web browser software or using a labeled corpus of hand-segmented web pages to automatically set weights through a machine learning process.”
- The wily Mr. Cutts suggested at Pubcon that GoogleBot will soon be taking into account what is happening above the fold as an indication of user experience quality, as though it were a new feature. That’s curious because, according to the “Ranking Documents Based on User Behavior and/or Feature Data” patent from June 17, 2004, they have been able to do this for the past seven years. A key excerpt from that patent describes “Examples of features associated with a link might include the font size of the anchor text associated with the link; the position of the link (measured, for example, in a HTML list, in running text, above or below the first screenful viewed on an 800.times.600 browser display, side (top, bottom, left, right) of document, in a footer, in a sidebar, etc.); if the link is in a list, the position of the link in the list; font color and/or attributes of the link (e.g., italics, gray, same color as background, etc.);” This is evidence that Google has visually considered the fold for some time. I would also say that this is live right now, as Instant Previews show a cut-off at the point Google considers the fold.
- It is no secret that Google has been executing JavaScript to a degree for some time now, but “Searching Through Content Which is Accessible Through Web-based Forms” indicates that Google is using a headless browser to perform the transformations necessary to dynamically input forms. “Many web sites often use JavaScript to modify the method invocation string before form submission. This is done to prevent each crawling of their web forms. These web forms cannot be automatically invoked easily. In various embodiments, to get around this impediment, a JavaScript emulation engine is used. In one implementation, a simple browser client is invoked, which in turn invokes a JavaScript engine.” Hmmm…interesting.
Google also owns a considerable number of IBM patents as of June and August of 2011, and with those comes a lot of IBM's awesome research into remote systems, parallel computing and headless machines, for example the “Simultaneous network configuration of multiple headless machines” patent. Google has clearly done extensive research of its own in these areas, though.
Not to be left out, there’s a Microsoft patent entitled “High Performance Script Behavior Detection Through Browser Shimming” where there is not much room for interpretation; in so many words it says Bingbot is a browser. "A method for analyzing one or more scripts contained within a document to determine if the scripts perform one or more predefined functions, the method comprising the steps of: identifying, from the one or more scripts, one or more scripts relevant to the one or more predefined functions; interpreting the one or more relevant scripts; intercepting an external function call from the one or more relevant scripts while the one or more relevant scripts are being interpreted, the external function call directed to a document object model of the document; providing a generic response, independent of the document object model, to the external function call; requesting a browser to construct the document object model if the generic response did not enable further operation of the relevant scripts; and providing a specific response, obtained with reference to the constructed document object model, to the external function call if the browser was requested to construct the document object model." (emphasis mine) Curious, indeed.
Furthermore, Yahoo filed a patent on Feb 22, 2005 entitled "Techniques for crawling dynamic web content" which says, "The software system architecture in which embodiments of the invention are implemented may vary. FIG 1 is one example of an architecture in which plug-in modules are integrated with a conventional web crawler and a browser engine which, in one implementation, functions like a conventional web browser without a user interface (also referred to as a "headless browser")." Ladies and gentlemen, I believe they call that a "smoking gun." The patent then goes on to discuss automatic and custom form filling and methods for handling JavaScript.
Search engine crawlers are indeed like Pacman, but not the floating mouth without a face that my parents jerked across the screens of arcades and bars in the mid-80s. Googlebot and Bingbot are actually more like the ray-traced Pacman with eyes, nose and appendages that we’ve continued to ignore on console systems since the 90s. This Pacman can punch, kick, jump and navigate the web with lightning speed in four dimensions (the fourth is time; see the freshness update). That is to say, search engine crawlers can render pages as we see them in our own web browsers, and have achieved a level of programmatic understanding that allows them to emulate a user.
Have you ever read the EULA for Chrome? Yeah, me neither, but as with most Google products they ask you to opt in to a program in which your usage data is sent back to Google. I would surmise that this usage data is not just used to inform the ranking algorithm (slightly), but also as a means to train Googlebot’s machine learning algorithms to input certain fields in forms. For example, Google can use user form inputs to figure out what type of data goes into which field and then programmatically fill forms with generic data of that type. If 500 users put an age into a form field named “age,” it has a valid data set that tells it to input an age (a rough sketch of what that could look like follows below). Therefore Pacman no longer runs into doors and walls; he has keys and can scale the face of buildings.
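Purely as an illustration of the mechanics, and not a claim about Google's actual stack, here's how a headless browser could fill fields with learned generic values and submit a form; the URL, field names and values below are all hypothetical:

// fill-form.js (illustrative only). Run with: phantomjs fill-form.js
var page = require('webpage').create();

// Hypothetical mapping learned from usage data: field name -> generic value
var learnedValues = { age: '30', zip: '10001', email: 'test@example.com' };

page.open('http://www.example.com/signup', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    page.evaluate(function (values) {
        // Fill any input whose name matches a field type we have a generic value for
        var inputs = document.querySelectorAll('input[name]');
        for (var i = 0; i < inputs.length; i++) {
            var name = inputs[i].getAttribute('name');
            if (values[name]) { inputs[i].value = values[name]; }
        }
        // Submit the first form on the page to discover what sits behind it
        if (document.forms.length > 0) { document.forms[0].submit(); }
    }, learnedValues);
    // Give the submission a moment to complete, then inspect where we landed
    window.setTimeout(function () {
        var landingUrl = page.evaluate(function () { return window.location.href; });
        console.log('Post-submit URL: ' + landingUrl);
        phantom.exit();
    }, 3000);
});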
- Instant Previews - This is why you’re seeing annotated screenshots in Instant Previews on the SERPs. The instant previews are in fact an impressive feat in that they not only take a screenshot of a page, but also visually highlight and extract the text pertinent to your search query. This simply cannot be accomplished with a text-based crawler.
- Flash Screenshots - You may have also noticed screenshots of Flash sites in Google Webmaster Tools. Wait, I thought Google couldn’t see Flash?
- AJAX POST Requests Confirmed - Matt Cutts also confirmed that GoogleBot can in fact handle AJAX POST requests, coincidentally a matter of hours after the “Googlebot Is Chrome” article was tweeted by Rand, made its way to the front of Hacker News, and brought my site down. By definition, AJAX content is loaded by JavaScript when an action takes place after a page has loaded. It therefore cannot be crawled with a text-based crawler, because a text-based crawler does not execute JavaScript; it only pulls down the code as it exists at the initial load (see the short snippet after this list).
- Google Crawling Flash - Mat Clayton also showed me some server logs where GoogleBot has been accessing URLs that are only reachable via embedded Flash modules on Mixcloud.com:
66.249.71.130 "13/Nov/2011:11:55:41 +0000" "GET /config/?w=300&h=300&js=1&embed_type=widget_standard&feed=http%3A//www.mixcloud.com/chrisreadsubstance/bbe-mixtape-competition-2010.json&tk=TlVMTA HTTP/1.1" 200 695 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
66.249.71.116 "13/Nov/2011:11:51:14 +0000" "GET /config/?w=300&h=300&js=1&feed=http%3A//www.mixcloud.com/ZiMoN/electro-house-mix-16.json&embed_type=widget_standard&tk=TlVMTA HTTP/1.1" 200 694 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
Granted this is not new, but another post from 2008 explains that Google "explores Flash files in the same way that a person would, by clicking buttons, entering input, and so on." Oh, you mean like a person would with a browser?
- Site Speed - Although Google could potentially get site load times from toolbars and Chrome usage data, it’s far more dependable for them to measure it by crawling the web themselves. Without actually executing all the code on a page, no calculation of page load time would be realistically accurate.
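To illustrate the AJAX point above, here is a trivial snippet of the kind of post-load request a text-based crawler never sees; the endpoint and element ID are made up for the example:

// This content does not exist in the initial HTML source that a text-based
// crawler downloads; it only appears once a browser executes the script.
var xhr = new XMLHttpRequest();
xhr.open('POST', '/api/comments', true);  // hypothetical endpoint
xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
xhr.onload = function () {
    // Inject the fetched comments into the rendered DOM
    document.getElementById('comments').innerHTML = xhr.responseText;
};
xhr.send('post_id=42');

A crawler that only reads the raw HTML response sees an empty comments container; a headless browser that executes JavaScript sees the injected content.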
So far this might sound like Googlebot is only a few steps from Skynet, and after years of SEOs and Google telling us their search crawler is text-based, it might sound like science fiction to you. I assure you that it’s not, and that a lot of the things I’m talking about can be easily accomplished by programmers far short of the elite engineering team at Google.
PhantomJS is a headless WebKit browser that can be controlled via a JavaScript API. With a little bit of script automation, a browser can easily be turned into a web crawler. Ironically, the logo is a ghost similar to the ones in Pacman, and the concept is quite simple really: PhantomJS is used to load a webpage as a user would see it in Firefox, Chrome or Safari, extract features and follow the links (a minimal crawler sketch follows below). PhantomJS has countless applications for scraping and otherwise analyzing sites, and I encourage the SEO community to embrace it as we move forward.
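As a rough illustration of how little it takes (this is my own sketch, not Josh's code), here's PhantomJS rendering a page and extracting every link from the live DOM:

// crawl-links.js (a minimal sketch). Run with: phantomjs crawl-links.js http://www.example.com
var page = require('webpage').create();
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    }
    // Pull every anchor href out of the rendered DOM, including links
    // that were injected by JavaScript after the initial load
    var links = page.evaluate(function () {
        var anchors = document.querySelectorAll('a[href]');
        return Array.prototype.map.call(anchors, function (a) { return a.href; });
    });
    links.forEach(function (link) { console.log(link); });
    // A real crawler would now queue these URLs and repeat
    phantom.exit();
});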
Josh has used PhantomJS to prepare some proofs of concept that I shared at SearchLove.
I mentioned when I released GoFish that I’d had trouble scraping the breakout terms from Google Insights with a text-based crawler, due to the fact that they're rendered using AJAX. Richard Baxter suggested that it was easily scrapable using an XPath string, which leads me to believe that the ImportXML crawling architecture in Google Docs is based on a headless browser as well.
In any event here Josh pulls the breakout terms from the page using PhantomJS:
Creating screenshots with a text-based crawler is impossible, but with a headless WebKit browser it’s a piece of cake. Here’s an example that Josh has prepared to show screenshots being created programmatically using PhantomJS.
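For reference, the core of a PhantomJS screenshot script comes down to a single render() call; this is a generic sketch rather than Josh's actual code:

// screenshot.js. Run with: phantomjs screenshot.js http://www.example.com out.png
var page = require('webpage').create();
var system = require('system');

// Render at a typical desktop viewport so "above the fold" means something
page.viewportSize = { width: 1024, height: 768 };

page.open(system.args[1], function (status) {
    if (status !== 'success') { phantom.exit(1); }
    // Rasterize the fully rendered page: JavaScript, CSS, images and all
    page.render(system.args[2]);
    phantom.exit();
});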
Chromium is Google’s open source browser project built on WebKit, and I seriously doubt that Google’s motives for building a browser were purely altruistic. The aforementioned research suggests that GoogleBot is a multi-threaded headless browser based on that same code.
To be fair, Google does sort of admit to this, but they say the "instant preview crawler" is a completely separate entity. Think of the Instant Preview crawler as Ms. Pacman.
A poster on Webmaster Central complained that they were seeing "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.597 Safari/534.14" rather than "Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13" as the Google Web Preview user agent in their logs.
John Mu reveals "We use the Chrome-type user-agent for the Instant Previews testing tool, so that we're able to compare what a browser (using that user-agent) would see with what we see through Googlebot accesses for the cached preview image."
While the headless browser and Googlebot as we know it may be semantically separate, I believe they crawl in parallel and both inform indexation and, ultimately, rankings. In other words, it's like a two-player simultaneous version of Pacman, with a 3D Ms. Pacman and a regular Pacman playing the same levels at the same time. After all, it wouldn't make sense for the crawlers to crawl the whole web twice independently.
So why aren't they more transparent about these capabilities as they pertain to rankings? Two words: search quality. As long as search engines can hide behind the deficiencies of a text-based crawler, they can continue to use it as a scapegoat for their inability to serve up the best results. They can continue to move towards things like the speculated AuthorRank and lean on SEOs to literally optimize their search engines. They can continue to say vague things like “don’t chase the algorithm”, “improve your user experience” and “we’re weighing things above the fold” that force SEOs to scramble and make Google’s job easier.
Google’s primary product (and only product, if you’re talking to Eric Schmidt in court) is Search, and if it were publicly revealed that their capabilities are far beyond what they advertise, they would then be held responsible for a higher level of search quality, if not the indexation of “impossible” rich media like Flash.
In short, they don’t tell us because with great power comes great responsibility.
A lot of people have asked me, as Josh and I led up to unveiling this research, “What is the actionable insight?” and “How does it change what I do as far as SEO?” There are really three things as far as I’m concerned:
- You're Not Hiding Anything with JavaScript - Whatever content you thought you were hiding with post-load JavaScript: stop it. Bait and switch is now 100% ineffective. Pacman sees all.
- User Experience is Incredibly Important - Google can literally see your site now! As Matt Cutts said, they are looking at what's above the fold, and therefore they can consider how many ads are rendered on the page when determining rankings. Google can leverage usage data in concert with the design of the site as a proxy for how useful a site is to people. That's both exciting and terrifying, but it also means every SEO needs to pick up a copy of "Don't Make Me Think" if they haven't already.
- SEO Tools Must Get Smarter - Most SEO tools are built on text-based scrapers, and while many are quite sophisticated (SEOmoz is clearly leading the pack right now), they are still very much the 80s Pacman. If we are to understand what Google is truly considering when ranking pages, we must include more aspects in our own analyses:
- When discussing things such as Page Authority and the likelihood of spam, we should be visually examining pages programmatically rather than limiting ourselves to metrics like keyword density and the link graph. In other words, we need a UX Quality Score that is influenced by visual analysis and potentially spammy transformations.
- We should be comparing how much the rendered page differs from what would otherwise be expected from the code. We could call this a Delta Score (a rough sketch of one way to measure it follows this list).
- When measuring the distribution of link equity from a page, dynamic transformations must also be taken into account, as search engines are able to understand how many links are truly on a page. This could also be included within the Delta Score.
- On another note, Natural Language Processing should also be included in our analyses, as it is presumably a large part of what makes Google’s algorithm tick. This is not so much for scoring as for identifying the key concepts that a machine will associate with a given piece of content, and truly understanding what a link is worth in the context of what you are trying to rank for. In other words, we need contextual analysis of the link graph.
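As a starting point for that Delta Score idea, here is a rough sketch of my own (the metric and its interpretation are entirely made up for illustration) that compares the number of links in the page with JavaScript disabled against the number after PhantomJS has fully rendered it:

// delta-score.js (rough sketch). Run with: phantomjs delta-score.js http://www.example.com
var system = require('system');
var url = system.args[1];

function countLinks(jsEnabled, callback) {
    var page = require('webpage').create();
    page.settings.javascriptEnabled = jsEnabled;
    page.open(url, function (status) {
        if (status !== 'success') { callback(null); return; }
        // page.content is the serialized DOM: with JavaScript enabled it
        // reflects any links added or removed by scripts after load
        var matches = page.content.match(/<a\s[^>]*href=/gi) || [];
        page.close();
        callback(matches.length);
    });
}

countLinks(false, function (rawCount) {
    countLinks(true, function (renderedCount) {
        console.log('Links without JavaScript: ' + rawCount);
        console.log('Links after rendering:    ' + renderedCount);
        // A naive "Delta Score": how far the rendered page diverges from its source
        console.log('Link delta:               ' + Math.abs(renderedCount - rawCount));
        phantom.exit();
    });
});

The bigger the delta, the more a page is transforming itself after load, which is exactly the kind of thing a rendering crawler would notice and a text-based tool would miss.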
There are two things I will agree with Matt Cutts on: the only constant is change, and we must stop chasing the algorithm. However, we must also realize that Google will continue to feed us misinformation about their capabilities, or dangle just enough to make us jump to conclusions and hold on to them. Therefore we must also hold them accountable for their technology. Simply put, if they can definitively prove they are not doing any of this stuff, then at this point they should be; after all, these are some of the most talented engineers in the universe.
Google continues to make Search Marketing more challenging and to revoke the data that allows us to build better user experiences, but the simple fact is that our relationship is symbiotic. Search engines need SEOs and webmasters to make the web faster and easier for them to understand, and we need search engines to react to and reward quality content by making it visible. The issue is that Google holds all the cards, and I’m happy to have done my part to pull one.
Your move Matt.
Love the way that you pulled this all together, Mike. Especially the Pacman analogy.
The technical section of Google's webmaster guidelines tells us:
Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
I'm a little hesitant to attribute too much in the way of advanced capabilities to Googlebot, not so much because of that warning, but rather because of uncertainty about how deeply and how richly Google might crawl, analyze, and index pages.
For instance, Google has advanced the use of Optical Character Recognition in many ways with their book scanning projects, their papers and patents on how they can use it in StreetView videos, and in Google Goggles applications, and yet it still seems like they aren't using it to index text found in images on the Web.
But chances are that they could do that any day now. Still, I'm going to keep on advising people that if the text on their pages is important, they should make sure it's text that can be read by Googlebot.
To add to your takeaways, though, if someone is purposefully using text in images to hide something from Googlebot, it's a trick that may have a short shelf life.
Love the way that you presented "evidence" of Googlebot's capabilities through patents, through search engine behavior that shows certain capabilities, and through technology from Google. You then provided a motive for Google to keep quiet about their abilities to use this kind of technology (a little like a CSI investigation).
I agree completely with your takeaways, and think that there might be some more that we could add as well.
For instance, the way that Googlebot can interpret different parts of pages, through both an understanding of the DOM of a page and a simulated rendering of the page to capture visual aspects such as whitespace, tells us that we should take care in where we put what on a page. Chances are that words within the main content area of a page are going to be the ones given the most weight in a relevancy analysis, for instance.
Thanks for sharing this very thoughtful post with us.
Hey Bill,
Thanks again for pointing out key patents for this and for your response here.
I don't disagree with you. I think OCR is a great example of what you're saying, as Google has fine-tuned that ability for years, but OCR is also very computationally expensive, and it doesn't make sense to use it on every image throughout the whole web. I think it's far more likely that they will render a whole page visually and examine it for things like whitespace and JS transformations. If a site is templated, they can just render the differences as needed to limit how much computation is involved.
I'm not sure Googlebot interpreting different parts of the page is a new takeaway, as prominence has already been considered a factor, and at the end of the day people should be making the best websites they can for reasons that reach far beyond SEO. Also, since everyone started reverse engineering Panda, these are already things that we should consider.
I do, however, think that this is more reason for SEO and UX to get much closer together.
Josh has a post coming out on Distilled on Friday with some more examples that will help seal this a little more. I don't want to give it all away though ;)
-Mike
Hello Bill!
Thanks for jumping in the conversation, and thank you for offering up more patent goodness.
OCR is definitely being used within the indexing stack, and extracting content from a snapshot via Instant Preview isn't unlikely; it's just not as efficient as getting in there, rendering the DOM hierarchy and then analyzing the document. I imagine they'd much rather save that effort for converting PDFs that haven't been OCR'd and trying to read images or other content they know they can't work with via other, more efficient methods.
In the Instant Preview examples Mike has in his post, you can actually see manipulation of the preview to highlight the relevant content being identified and extracted. In my upcoming paper on Distilled, I show a few examples where Google is diving well below the fold to extract accurate sentences that summarize a page in relation to a query.
The level of accuracy with which they do this suggests one of two things:
1) They have the most accurate OCR system in the world and are holding out on us in a major way (which is still totally plausible, and even quite likely)
2) They're using a headless browser to efficiently render a page and then deploying natural language processing algorithms and basic CSS to identify a snippet and highlight it; then taking an Instant Preview.
Speaking of advanced OCR insanity and Streetview... anyone else seen pictures of building addresses popping up in the reCAPTCHA project lately?
- Joshua
Great post, but a couple of responses:
1. Javascript can still be used to hide content if it is remotely accessed from a directory protected by robots.txt, especially if you block the Google Web Preview as well as GoogleBot
2. I strongly doubt that GoogleBot and their Instant Preview "headless browser" crawl in parallel. Rendering the DOM to images for every URL GoogleBot spiders would be all sorts of impossible and a huge waste of processing power given the frequency with which the preview feature is used and the small fraction of web documents that actually end up ever being found in Google.
3. Just in general, a patent does not necessitate use of that invention.
I have often said this, but I think we tend to over-estimate Google. Every cool idea, every great patent that you hear come out of Google has to be scalable. Think about Panda, for example - it is such a difficult filter to compute that it takes weeks between updates, even given Google's insane computing powers.
Regardless, awesome post and well worth the read. It is always good to stay ahead of the game.
Hey Russ,
Fair assessment for #1, but I've definitely seen instances where G has blatantly ignored robots.txt, so my point is that it's not a future-proof way to go about doing things. Also, if they are crawling visually, I'm not sure how much they care about where the content is coming from, just that it is an eyebrow-raising transformation.
#2 Definitely not impossible, as they are doing it on demand for instant previews. It is, however, realistic that for indexing they only use it for pages they deem important enough. I agree with that.
#3 I agree with you in cases where there is no evidence. Here I don't.
Thanks for the insights Russ!
Question for Russvirante & IPullRank - what implications would you say this has for giving direction to dev teams on the best way to code a site that is AJAX-heavy?
What does this mean for options like graceful degradation or progressive enhancement? Are those necessary? It seems to me that Google would still prefer an HTML equivalent that aligns with the actual user experience, for efficiency's sake in the processing, but if a site does not currently employ those practices, providing that state can be intensive for dev teams with limited resources. If G can crawl AJAX/JS, is relying on that a viable/effective alternative?
Also, it sounds like we should keep AJAX/JS in alignment with what we show users, and not rely on robots.txt to disallow access to external JS files. Would you say that is a correct assumption?
Thanks for your thoughts.
Carrie
Hey Carrie,
It's hard to say what the implications for best practices are right now, as they are definitely not using this on everything they index. I would still say that you should follow established SEO best practices until there is more evidence that they have rolled this out on a wider scale.
In the meantime I would also run tests to see if whatever you're serving with AJAX/JS does get indexed.
You can trust that Josh and I will continue to look for more evidence and run tests, but at the end of the day we are SEOs because, regardless of Google's capabilities, they have proven themselves unreliable.
-Mike
Hi Russ,
I agree with you regarding Issue #3... patents are for the protection of intellectual property and don't necessitate implementation beyond the actual patent documentation.
In the original paper I address some of the architecture changes that were made to Webkit when it became Chromium that really seem to suggest scalability and redundancy for a crawling architecture.
First and foremost we have the V8 JavaScript engine, which is capable of being embedded or operating stand-alone. This capability has made projects like NodeJS possible. It's written in C++ and compiles JavaScript straight to machine code... it's lightning fast and lives in its own processing thread.
Next there's the fact that each Chrome "tab" is actually a separate process. If you open your task manager on a Windows PC you'll see several instances of Chrome running, one for each tab open. This provides a lot of processing benefit and is highly scalable, especially if you view Google's crawling architecture as a cloud designed to scale resources a la Amazon's EC2. It also provides amazing fault tolerance for a crawler, since if one process crashes, the other crawling processes are undisturbed.
Lastly, we have the headless component of the equation... a lot of the overhead in browsing is related to actually rendering the elements to the screen for a user. Headless browsers don't render a browser window; they're just a background process exposing the DOM. PhantomJS is a great example of this.
We could also dive into the Chromium Remote and other crazy stuff like that, but it's probably just easier to read the original on IPullRank.
In the followup that will be on the Distilled Blog Friday, I extend the original paper and Mike's awesome contributions here with some more proofs directly from the SERPs; so be sure to check for that too, as I think it will be of interest to you.
Thanks again for your thoughts!
Jesus, this is not a post, this is the Book of Revelation; if your conclusions are correct (and you provide quite a lot of proof), this is like taking a giant Blue Pill and discovering the G-Matrix in its reality. The UX Quality Score seems to be the feeder of the Panda algorithm, from what you write, Mike. And it gives a solid technical base to all the speculation about how Google uses data on what users do on the web. We all knew we were acting in a Matrix; now you've given us something that confirms it.
P.S.: the only negative part of your post > you talking about your parents playing Pacman in the arcades. I was doing the same in the mid-'80s! Feeling old at 8:30 am my time is not so cool ;)
Hahah I played Pac-Man with them in the 80s. Granted I could still count my age on my hands then.
Thanks for reading Gianluca!
-Mike
First, nice post and creative as always. :)
This is one of those things where it's nice to see evidence, but I'm also kind of left feeling like "yeah, not really too surprised." We've known for years they could do visual segmentation, beyond just boilerplate code analysis. Matt Cutts said once that links could be weighted differently based on how far down they appear visually on a page. Then previews in SERPs showed that they can render Flash and JavaScript ads to generate previews. Then we saw those AdWords emails back in February after Panda that talked about the visual space taken up by ads above the fold on a 1024 screen. They've been extracting text from images, doing advanced image analysis, and can even do search by image. So their visual analysis is exceptionally better than it was a few years ago.
I certainly believe, and the evidence shows, that they're moving towards a more visual crawler, but I'd be willing to bet that a huge portion of the analysis is still being treated as if it's all done with a text-based browser. (Not that they aren't gathering this visual information, but it doesn't mean it's reflected in results yet.)
I'd imagine visual analysis, JavaScript execution, filling out forms, and deep crawls are extremely expensive in terms of computation and resources. So what I imagine is the case is that this technology has existed for quite some time and what we're seeing are improvements in efficiencies and scalability.
I remarked in my recent posts that all of these things being uncovered have been sitting out there for years, even the social stuff. We're just now seeing them being put into practice. Google has likely been able to do this level of deep media and JavaScript crawling on a small sample fairly easily, but the limitation might have come when you look at the daunting demand of doing that at scale for the entire internet, for all countries, and in all languages.
Even Panda wasn't a change in understanding; it was a change in scaling machine learning. It was a process / technology improvement. The Freshness update wasn't a change in understanding either, but advancement built on top of the path laid by Caffeine.
I think it might also be safe to say that GoogleBot's capabilities are conditional. When the environment (code, PageRank, trust, etc.) is right, it might warrant different levels of crawling. If nothing in the code suggests AJAX, Flash, or JavaScript rendering, then resources could be saved by not executing a visual rendering of the page.
Then I think we get ourselves into another conversation, like crawl efficiency and crawl budget. Does the cap of resources allotted to a domain get burned through if the crawl must spend additional time dealing with Flash, JavaScript and AJAX? If so, it could still be a good recommendation to keep JS and Flash to a minimum, especially if you’re having indexation issues.
Justin,
Yeah, I don't disagree with you here and I brought up similar points in my reply to Bill.
The crawlers may be separate, and Google may just fire up the visual crawler as needed. In most cases it might be enough to just crawl a homepage and a few internal pages, because the rest of the site doesn't change much visually. Also, based on my experience with PhantomJS, visual crawling is actually incredibly fast, as there is no actual visual interface. The analysis will obviously be quite computationally expensive, but this is Google we're talking about here; I'm sure they have figured out a way to scale it.
I think an important point, though, is the machine learning aspect of the end user's version of Chrome and how it could potentially be informing Googlebot, thereby making the filling out of forms and such far less computationally expensive than you are suggesting, since their educated guesses are far more accurate than just randomly filling fields.
Def agree with GBot's powers being conditional and used as needed.
To your point on expending crawl allocation I think in general page speed is a good proxy for determining that.
Awesome insights Justin, thank you.
-Mike
I'm definitely only speculating on many things, but a few thoughts. (And not disagreeing, just intellectual)
I agree with Russ that we may give Google too much credit at times in terms of computational power. No doubt their computational abilities are absolutely amazing, but Panda does take 20 to 30 days to process. That isn't so much a problem with Google, but just the extensive amount of work / data needed. I've also seen a comment before that the Google bomb algo is a push button algo, which doesn't run all the time.
And I think we're also discussing two different types of computational work. There is the crawl work (all we're seeing with PhantomJS) and then the analysis computation. Even if visual crawling is exceptionally fast, doing something with all that data is another problem altogether. Any time they get into image analysis, the amount of computing power needed increases significantly. This is why I feel that they're certainly doing the first part (crawling), and the evidence shows that, but they haven't quite solved the second (scalable, advanced statistical analysis of robust media / file / technology types), which is reflected in search quality, but they're working on it.
And I think we have to be cautious using page speed as a proxy for the resource requirements of the type of crawl we're discussing. Page speed is the time it takes for the initial load, but subsequent AJAX calls, form submissions or the deep crawl of a Flash file are resource demands in addition to the page load.
Thanks for the post. I'm really enjoying all the conversation it started. Regardless of exactly how they're doing it all, it's certainly good to get people talking about it.
In the spirit of speculation...
The problem is quite complicated but I don't know that we give Google too much credit. After all these are the people that brought the crawler into the real world with self-driving cars and have data centers that float on the ocean. They built a program where I can look at your house from my desk. Granted they are not NASA but Search is their bread and butter. If it can be done and it makes sense for improving their main product it's realistic that they will eventually do it. In the case of browser-based crawling the future is the past.
Panda takes 30 days to roll out but we really don't know how long it takes for them to actually do the analysis. How much of that time is them QAing the new indices and re-running it? Obviously indexation and ranking is a big data problem of epic proportions but everything they do is an issue of scalability and every year they conquer it better.
Hmmm... that's a good point. Then I'd say page speed + size of rich media + speed of dynamic assets; that should be some sort of load score as well.
I loved reading through a recent Google patent that described how they may do a Panda-type analysis on videos, which could help them do things like identify when adult content might be inserted into otherwise innocuous content within a video (a point brought out in the patent). Running something like that in a timely manner on a site like YouTube seems a staggering undertaking to me, but so does doing something similar to all the sites on the World Wide Web.
I have noticed Google acquiring more and more hardware related patents, and seemingly investing in more and more data centers as well. They are working hard on building the capacity to do things like we are speculating upon.
And infrastructure updates like Caffeine that allow for incremental updates to indexes about specific documents have been working to reduce computational costs of such efforts significantly.
For instance, the recent Google freshness update was likely based upon approaches originally developed as described in divisional patents from Google's Historical Data patent from 2005 that only really became feasible once Caffeine was in place.
Some of the things we see and read in places like patents might not necessarily be technologically feasible, or at least not yet. But I like being aware of the possibilities, and being able to do things like avoid potential problems like Panda when they launch.
I should read before I comment :) Just made the same point about Caffeine. Too many people don't get that these infrastructure advances fuel SEO updates for years. Caffeine made possible the things Google has wanted to do for the last 3-5 years but couldn't. That's why I think it's plausible that ideas from 2004 could be coming into play, computationally.
To quote: "I think it might also be safe to say that GoogleBot's capabilities are conditional. When the environment (code, PageRank, trust, etc.) is right, it might warrant different levels of crawling"
Justin's point is KEY. iPullRank mentioned that some methods are "computationally expensive" and a headless browser falls into that category, at least when the enormity of indexing the net is concerned.
Were I to design a search engine, I would save my sophisticated, computationally expensive crawlers for sites that warranted that level of attention. To use this type of crawler on the whole web would be a waste of resources.
Therefore, I think the smart money is on using JS and flash in a limited manner...at least when it comes to a page's core content. I'd guess that, most of the time, Googlebot is dumb.
If there were Nobel Prizes in the search industry, then Joshua Giardino, Michael King and Bill Slawski would deserve them for this groundbreaking research, seriously. Hats off to you guys. This is the best post of 2011.
Well said.
Great food for thought here, Mike. Tying into what Russ said, I think the toughest part is always separating what Google CAN do from what Google DOES do. No question they've had the capability to parse pages visually for years, and they use it at times. What's hard to sort out is - how often do they use it and in what capacities?
I doubt that every page crawled is fully visually segmented, personally, but I strongly suspect they extract elements on at least a sitewide basis, to understand navigation, footers, etc. I do think the idea that we can just move some HTML around in 2011 and fool Google is pretty naive. That worked 5 years ago, but not so much these days (I haven't seen any good evidence of code-order effects in the past 2 years, honestly).
The other incredibly important point people overlook that this all touches on is the importance of infrastructure updates (like Caffeine and, before it, Big Daddy). Too many webmasters and SEOs consider that boring tech stuff - it's not an algo update, so who cares? I'd argue those infrastructure overhauls are 10X more important than most algo updates, because they fuel all updates for years to come. Caffeine made Panda possible and gave Google computational horsepower. That means they can do more faster and put technology they bought in 2004 to work in 2011. Caffeine will be powering advancements well into 2012 and probably beyond, IMO.
I want to meet the person who gave this a thumbs down, and kick them in the chest. Seriously? Thumbs down? You probably hate bacon and puppies too.
Excellent post Mike, very informative and well laid out.
LOL, they didn't much care for your comment either, apparently.
I very much appreciate the sentiment Corey, but they are certainly entitled to their opinions.
Man, you're on a roll Mike! What it all boils down to is simply providing awesome quality content, be it copy, images, videos, etc. Most people trying to game the system purely for monetary reasons will almost always fall behind the curve. I bet some of them are probably spending more time trying to find the next exploit, workaround, time-saver and shortcut than they would need to spend on quality content creation in order to earn more for themselves.
There's a lot of truth to focusing on building for people over building for the algorithm. If you're building for people, you're future proofing yourself, because the algorithm is all about identifying meaningful features of human engagement and interaction.
Deploying a headless browser as a crawler just makes this feature extraction that much more effective since they can map the data from the Google Toolbar, Human Search Quality Reviews, and the Chrome opt-in program right back to their search stack.
With that said, the technological aspect is a very important part of building for people. UX/UI concerns are very important, especially with the explosion of both social and mobile. Discovery is another aspect of building for people... if people can't find it, they can't interact with it and spread it... and Google owns a good chunk of new content discovery in any country. Facebook is probably the next biggest driver of new content discovery, but it's a walled garden; thus building for GoogleBot is in some respects building for people too.
Haha it's a crazy little circle!
wow, that's a killer post Mr. Mike.
About the advertisement stuff: I was asked to teach a class about search engines at a university here in Italy. Near the end I found myself talking about algorithm updates, Panda, content quality and so on, supporting as usual the thesis that good content always pays off.
Anyway, shuffling through the data, I saw that one of the sites rising in popularity and search traffic is YouTube. Nothing wrong with it, absolutely, YouTube is a great UGC and video sharing site. So I opened the browser to show it to my students, and...
...oh, gosh, it's PACKED with ads. Ultra-invasive, huge and with unbearable user experience issues! So, dear Mr. Google, while I still hate ad-packed sites and support your (supposed) struggle for quality, why are you becoming what you hate?
YouTube also has loads of thin and duplicate content. It remains a glaring contradiction to everything Google has been recommending post-Panda. For now.
Amen to that !!! it makes me angry as well...
Agreed. But I think YouTube is so "relevant" that even if it somehow got "penalized" it would still rank great.
Awesome post - particularly the evidence section and the point about site loading speed; how that is effectively measured has been a question of mine. I'm happy if GoogleBot can see sites visually - it should encourage making sites for the user, not the engines :)
Spot-on Charlotte. It's another Awesome post by Mike.
Will hopefully help to bring Search and UX teams closer together; superb in-depth research, analysis and explanations such as Mike's here are so valuable :-)
What a monster of a post. You make a very compelling case here. It stands to reason that GoogleBot would want to be as close as possible to a human interacting with the web, and since we see the web through browsers, why can't the Bot be one?
Anyway, great job :-)
Awesome post Mike!
The real question is, do people want to hear the message? Seems like the writing has been on the wall (in the SERPs) for a very long time, but very few have wanted to see it. I'm glad you came out and published this in the SEOmoz arena, as I fear it is going to take a lot to move some people from their "find a trick to fool googlebot" mentality. Hopefully the respect that you and the SEOmoz Blog have in the industry will make some of those people sit up and take notice.
As for Matt's public announcements, I think the key is always to watch out for the qualifications. Here's a classic (and intriguing) example from his recent blog post on algorithm changes:
" today we want to give you a flavor of specific algorithm changes by publishing a highlight list of many of the improvements we’ve made over the past couple weeks."
A little further on in the post, after a little more waffle about the more than 500 changes made in a year, he says "In that spirit, here’s a list of ten improvements from the past couple weeks:"
Since by now we are all feeling so warm, fuzzy and to be honest just plain shocked, at the idea of Google revealing such information, it is only natural that we might overlook the fact that "many" in fact refers to considerably more than the "ten" in the "highlight list".
Let's face it, the man is as talented with language and suggestion as he is as an engineer ;)
Where Search Engines are concerned, I am reminded more and more of that very old saying "Believe none of what you hear and only half of what you see".
Great post Mike, I really hope the message gets through.
Sha
If I could, I'd give you 1000 Thumbs Up. What a post (post?), God! Eh, Mike, you're an absolute pro.
It's exciting the fact that User Experience Design is going to be so important.
From now on, I'm sure Google has a problem with you. You're kicking its ass and telling it: Come on, guy, tell us the truth! What are you waiting for? Did you think we weren't that good at this?
Amazing, Mike. You've just got Google into trouble.
"Any content you thought you were hiding with post-load JavaScript -- stop it. Bait and switching is now 100% ineffective."
Amen, may it be forever left behind...
I'm sure some people are still playing with this. You're right, the times of black hat are gone.
Definitely one of the best posts of the year. Great research and very articulate thoughts. I would literally pay to read this.
That's totally, undoubtedly an awesome post. I never had the idea of Googlebot being a combination of a headless browser and the old text-based crawler. Now I understand why the Google folks gave so much attention to building Chrome: in spite of the competition in the browser industry, and despite already having the Google Toolbar, they still made it their focus to build the Chromium WebKit browser.
You are right... it's like Skynet. Googlebot is trying to reach the level of Skynet.
I have a new name for Googlebot: Ms. Pacman. And yes, this Ms. Pacman is much cleverer. I have to be careful around it. Thanks Mike!
A quick look at the log files from any website will show that the bots are definitely not headless browsers. If they were, they would act more like actual users, retrieving all the resources (JS, CSS, JPG...) required to construct the webpage. It is much more likely that the headless browser is part of a second-line analysis done using the cached content. Here Google would probably use a feedback loop from the search indexes to determine, say, the pages it will require for instant preview, etc. I think that the amount of resources required for headless browsing would require some rationing of resources. Why waste good processing power on spam... On the upside, the technology used for headless browsing will probably get fed back into the rendering engines used by Chrome and the like...
I owe you an apology. I have spotted the headless monster. It is alive and crawling. I made some changes to a webpage and placed some images in a subdirectory, and lo and behold, along comes Googlebot 2.1, sees the new structure and follows up by fetching the images from the subdirectory, with the referrer info set to the parent page. This was not the Google image bot.
Nice post; however, I feel the only answer it will get from Google, if it gets one at all, will be that just because they could have done something for years before this, it does not mean they did it.
Capability and actual use have always differed in the case of Google, so maybe Matt is only hinting that some capabilities are now going to be brought into play.
Also, don't forget that Google Analytics is on a large chunk of websites out there. I'm pretty sure they don't ignore this data. It's probably used in conjunction with crawlers to improve the quality even more.
GBot has been deploying spiders able to interpret and understand the FULL DOM since at least 2004.
The particular bots may NOT have been used for ALL of their web scraping needs, but they were definitely used for some.
I guess a point of contention here is not necessarily whether Google could understand the Document Object Model, but to what degree it would go through the process of rendering it out. GoogleBot may understand that "some text here<img src='mybigpic'>more text here" contains an image in context, but does it know that the image is only 5px by 5px, or that it might be 1000px by 1000px and push the second piece of text to the bottom of the page?
I do very much agree with you that in certain contexts Google may choose to deploy a deeper look at data. For example, I believe the top 10 results are treated with greater scrutiny than the next 10. This is not to say that the algorithm is different, but that there are different filters that get applied as a site gets closer and closer to the top 10.
Russ
Let me reword it
I believe G has been able to understand the fully rendered DOM and changes within the DOM since at least 2004.
Whether we call it a headless browser or not is moot - GBot and/or portions of code that are automated within Google operate as a browser and have since at least the earlier part of last decade.
All of your other points I completely agree with.
Hey all,
Check out Josh Giardino's follow up post "Google Stop Playing the Jig Is Still Up!" on the Distilled blog https://www.distilled.net/blog/seo/google-stop-playing-the-jig-is-still-up-guest-post/. It features further research and evidence.
-Mike
In my opinion, yes, they can see all this stuff, but it's not yet implemented to work in parallel with the text-based crawler.
I have cloaked sites (from the old days) still indexed and bringing in traffic that would certainly be banned if they compared the "instant preview" crawl with the text-based crawl. The two are totally different pages...
Hey Charles, would you be keen to share those sites? Or at least screenshots of the cached pages and the instant previews? I'm very curious.
-Mike
I wouldn't share the sites here, but if you keep it to yourself I can give you one; just PM me.
I might do the test with a new site to see if the same thing happens.
Great work, sir. You have laid out all your research in such a good manner that even a beginner SEO will understand what is happening in the SEO world. Google always tries to mislead us (SEOs) with ifs and buts. They don't tell us everything about search because, as SEOs, we are their biggest rivals, trying to compete with them to show the results we wish. As far as Googlebot is concerned, it is reaching the point where it literally behaves like a human net surfer. The most positive part of the advancement is that now we don't need to treat Googlebot (oops, Pacman) and humans differently.
But I am still confused: are they actually able to read Flash sites right now, or do they need some more time?
Thank you Mike, a very informative and well written post, with great take-aways... I especially like the idea for the new tool :)
I think the key thing to keep in mind is that there is a difference between seeing all the angles and knowing how and when to play those angles.
The post provided some compelling arguments that Google sees and understands what a site's up to -- but that doesn't mean that they're acting on it today.
It's somewhat akin to seeing and understanding what some of the most competitive SEOs are doing, and then replicating or one-upping that yourself or within your organization. It's doable, but it's a journey, and you're far more likely to chip away at implementing a new approach than you are to change the way you work overnight.
I think that Google might even be able to read text in graphics. Why are all the antispam numbers and letters under forms so distorted? I think it's because spam bots are able to read non-distorted letters and numbers. And if they can do that, why shouldn't Google be able to do the same? The only reason I can think of is that it takes too much calculation power (processor time), but on the other hand that becomes less and less of a problem with the advance of technology, doesn't it?
Check out Luis von Ahn's TED talk on massive-scale online collaboration for an interesting take on CAPTCHAs.
This is one of the most comprehensive and interesting articles I have read in a while. And I loved the Pac-Man analogy because without it, I am not sure I would have been able to get a full grasp. I believe this reinforces the fact that as SEOs, we need to be more concerned with providing users with an improved overall experience from content quality, content placement and content design.
What a post, sir! Super phenomenal... I can't say Google lies, but it is common for Google to manipulate/mislead from time to time, and you are right, this is something we should keep our eye on.
The super Googlebot (with its headless browser) is good in one way: the content we hide from the search engine through JavaScript and other means is no longer hidden, and one has to show Google exactly the same copy that is shown to users.
One part I would highly agree on is that SEO tools should get smarter!
WOW!! This is superb. I was reading your post on iPullRank yesterday, and now I've read this post twice to get the hardcore techy stuff into my head. It's really mind-shredding and informative; now SEOs must think beyond what we've currently been thinking.
As far as Googlebot crawling pages that can only be found using Flash, I think it's more likely that those pages were found through Chrome and Toolbar data coming back from end-users. Highlighting sections of content in previews was something I had not noticed, very interesting!
Awesome post. Thanks for all the comprehensive research and analysis. This should certainly be a wakeup call to those in our industry, in regards to site quality and user experience and their effects on rankings.
Man, you've got a ton of knowledge here, appreciated!! And I must say your last three conclusion points ring true; I think your post will be tweeted and shared by all the Mozzers ;)
Hey Mike, this is just an awesome post and awesome research. I am not that much of a technical SEO person, but I think this info can be useful for you.
I've just searched on google.co.uk for "tring office rental," and in the Instant Preview I can see a cached copy of my page (https://screencast.com/t/fiG2oLAP), but the actual page at https://www.searchofficespace.com/uk/office-space/tring-serviced-offices.html shows December. So can we suggest that if Instant Preview is caching websites rather than showing the real-time page, it acts as a crawler? Is this another proof of what you said in this article?
Kind Regards
Suren
You're dead on about not chasing Google. I have achieved my best results by placing transparent and relevant content in as many appropriate places as I can find. I have been leaving PR and "nofollow" considerations behind and spending more time on on-page optimization.
It is gratifying to think that search robots are out there paying attention.
Very good job done by you guys... really appreciate it, and let Matt Cutts know this as well. I would also like to say this is a revolutionary post, and I know many of us would agree with me. It's always very difficult to work out Google's processes, and you guys have done a tremendous job, I must say. Hats off to you guys... maybe Google will hire you soon ;)
haha..they already tried.
Hey, that's really great... now don't tell me that you are not interested!!! Or are you?
The wily Mr. Cutts and his team (and of course the rest of the search engines) should go on and make their robots smarter and smarter - I support that. There is still enough crap around, and I don't have the time to file spam reports the whole day through.
Regarding SEO tools - glad to hear your opinion that SEOmoz is leading the pack.
Great analysis Mike. I think these are all things we've seen bits & pieces here & there throughout the years; Google just never confirmed anything 100%. You've done a great job summarizing everything & the examples are superb. Excellent write-up ~ thanks for sharing!
No wonder CLOAKING is so despised by Google. How long have they been hating on that?
Hats off to you, Mike.
Just enjoyed your super-long analysis of Googlebot.
What a final conclusion: GOOGLEBOT + HEADLESS BROWSER = SUPER GOOGLEBOT.
You are awesome.
Michael, this post is BOSS. One thing I've looked into is scraping Google Insights with ImportXML, and it can't be done as far as my experience goes. I compared two URLs, the "home" of GInsights and one with search parameters (preceded by #, obviously): the scraping happens, but it returns the same result for both URLs, which suggests IMPORTXML can't go further than the #...? It would be great if Richard (who's notoriously an ImportXML expert!!) could share with us! Thank you!
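A likely explanation, sketched below on the assumption that the Insights URLs shown are just illustrative: everything after "#" is a URL fragment, which is never sent to the server, so any server-side fetcher such as IMPORTXML receives identical HTML for both addresses. Only something that executes the page's JavaScript (a headless browser, for instance) ever sees the "#"-driven state.

```python
# Minimal sketch (illustrative URLs, not real endpoints) of why IMPORTXML
# returns the same result for both Insights URLs: the fragment after "#"
# never goes over the wire, so the server returns the same HTML either way.
from urllib.parse import urldefrag

home = "https://www.google.com/insights/search/"
with_params = "https://www.google.com/insights/search/#q=seo&geo=US"  # illustrative

print(urldefrag(home).url)         # -> https://www.google.com/insights/search/
print(urldefrag(with_params).url)  # -> https://www.google.com/insights/search/
# Identical request URL -> identical response -> identical scrape result.
```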
I like metaphors, but unfortunately you can't use them for projects like, for example, gambling SEO; I'm afraid they would look ridiculous there. The post is OK.
You nailed it, Mike! This is one of the best articles I have read so far today, and I have learned something new from your post. Anyway, I really like the Pac-Man metaphor; it actually makes the post more interesting. Looking forward to reading more of your articles.
Nice work, Michael (and Josh and Bill). I'm glad to see that Russ broached the patent-doesn't-equal-usage caveat. Your response is also valid, but I think Russ's (and my) point is that this type of headless browser activity probably hasn't been occurring for years and years.
But as you said, there is quite a bit of evidence that it is now happening. The patents, the Webmaster videos and conference comments regarding Instant Preview, Javascript, and link/content placement on a page are all compelling arguments that Google has left the text-only Penguin behind and they are viewing web documents through a user's eyes.
Brilliant article and research!
I especially love the direct quote "with great powers..." You're absolutely right; it seems like the most logical conclusion.
Great read, Mike; it's like a small ebook to me. This is really useful info for techy people all around the world. Search engines define their parameters based on various factors; Mike has picked up on the crucial ones, and I agree he pulled out the best ones.
I love this because it is useful to both super advanced SEOs and everyone else. I've always thought "look at the page visually," because in the end Google has the means to do this. But you gave me a great, concise way of understanding that.
A powerful read, thank you Michael!
Kudos Mike! Every now and again I get lucky and find an SEO guy who authors an excellent article like this that I can share with my clients to .edu them on the many underlying mechanics that need to be understood if they want their content and site to get the props they think they deserve. BTW, your moniker iPullRank is pretty cool!
Wow. That literally blew my mind. I had no idea Google's ability to "see a website" was so advanced. This is going to give me so many ideas for ways to improve SEO.
Thanks for writing.
I like the Pac-Man metaphor in this article; it makes the piece more interesting and fun to read, Mike.
I hope the information in this article is a wake-up call for SEOs who may still believe Googlebot is a text-only crawler. I agree with what you write: it hasn't been one for some time.
I'm excited about user experience design and how important it will be going forward. I agree with you that SEO tools must get smarter, and I'm looking forward to a metric like a UX quality score and some correlation data around it.
Nicely done, Michael! I dig the analogy (could also count my age on my hands when Pacman came around).
Frankly, I'm surprised that so many folks out there didn't understand that this is where Google is at (and has been for years). This is why it's increasingly important to optimize for humans more so than for some antiquated notion of "Robots" or "Spiders".
Soon, Googlebot will be more human than humans...
IMO Google can use the headless version of Googlebot, and will use it whenever they determine it's necessary, based on toolbar and Chrome data. The flag that activates this crawl may also be based on site traffic; you can see new AJAX URLs crawled in WMT when a site's traffic goes up.
How about style sheets? Can Googlebot identify whether an element is outside the "visible zone" of my page? How deep does the CSS processing go?
Matt Cutts has said in the past that using the huge negative margins for CSS image replacement is frowned upon. Things like color-on-color backgrounds have been detected for a long time now. I think it's safe to say they have a great understanding of your CSS with or without a headless browser.
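For anyone curious what that kind of check could look like, here is a rough, purely illustrative sketch using headless Chrome via Selenium. The URL is a placeholder and the heuristics are deliberately simplistic, so treat it as a guess at the idea rather than a claim about how Googlebot actually does it:

```python
# A deliberately simple sketch of visibility checks that a headless browser
# makes trivial: flag text pushed off-screen (e.g. huge negative offsets) and
# text whose computed colour matches its own background colour.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://example.com/")  # placeholder page

viewport_width = driver.execute_script("return window.innerWidth;")

for el in driver.find_elements(By.CSS_SELECTOR, "p, span, a, div"):
    x = el.location["x"]
    width = el.size["width"]
    color = el.value_of_css_property("color")
    background = el.value_of_css_property("background-color")

    # Element rendered entirely outside the horizontal viewport
    if x + width < 0 or x > viewport_width:
        print("off-screen:", el.tag_name, el.text[:40])

    # Crude "colour-on-colour" check against the element's own background
    if color == background:
        print("invisible text:", el.tag_name, el.text[:40])

driver.quit()
```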
Ah, and what I don't like about this advertisement detection is that great content often comes with advertising. So we might have less great content in the future if the people producing it earn less from advertising because Google has punished them for having ads on their site.
Great post! It's all about building content and sites for people, I think the industry has been doing a good job at this for a couple of years now (although there is still some way to go). I think Matt is probably letting us know that Googlebot is getting much smarter, works much more like a browser and is able to look out for things that people are looking for in web content.
It's not really back to the drawing board, or even a wake up call, just a gentle reminder.
For too long there has been an attitude of us against them, both from the search marketing community and from Google's point of view. It's firmly rooted in the history of Google's relationship with SEOs; however, we are far closer now than ever before to being a purely marketing-based industry, and marketing relies on data to do what it does best. This is no longer about cheating Google; it's about best practices, promotion, analysing data and working out what's best to do. Although the search marketing community has on the whole come to terms with its status, I think Google still views us as the enemy.
I once read an article in which the author referred to SEOs as parasites, and I think this is the way Google views us. But if the industry represents Google's parasites, we are the ones that help Google digest its food.
Love the use of Pacman through this post!
Nice one Mike! My favourite bit:
While the headless browser and Googlebot as we know it may be separate in semantic explanation I believe that they always crawl in parallel and inform indexation and ultimately rankings. In other words it's like a 2-player simultaneous version of Pacman with a 3D Ms. Pacman and a regular Pacman playing the same levels at the same time. After all it wouldn't make sense for the crawlers to crawl the whole web twice independently.
Truedat
This is a great post, Mike. I asked this via Twitter, but I'll try here too: say Googlebot went away years ago and Google replaced it with web user data from Chrome. Would this be consistent with the headless browser you talk about here? Wouldn't the results be consistent with a headless browser, at least from an engine or searcher perspective? Would the end result be the same from an SEO point of view?
Although what you've shared is speculation, since you don't work at Google... I've heard even Google engineers don't fully know what they're contributing to and creating.
However, everything mentioned is highly plausible and definitely possible. If you are an SEO focused only on building links, without caring about managing expectations once the visitor hits your site, you're in for a surprise.
UX design correlates with managing user expectations. It's not how much traffic you bring, but what you're actually doing with your traffic.
Or in this case, let's see what we're doing with GoogleBot.
By far the best line: "Any content you thought you were hiding with post-load JavaScript -- stop it. "
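If you want to see why that line holds, here is a minimal sketch of my own (placeholder URL, nothing confirmed by Google) comparing the raw HTML a text-only fetch sees with the DOM a headless browser builds after the page's JavaScript runs:

```python
# A rough illustration of why post-load JavaScript no longer hides content:
# a text-only fetch never sees what the script injects, but a headless
# browser that executes the page's JavaScript does.
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://example.com/"  # placeholder page that injects text on load

# 1. The old "Lynx-style" view: raw HTML only, scripts never run.
raw_html = requests.get(url, timeout=10).text

# 2. The headless-browser view: scripts run, the DOM is fully built.
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get(url)
rendered_html = driver.page_source
driver.quit()

# Anything present in rendered_html but missing from raw_html was "hidden"
# behind JavaScript; that is exactly what a rendering crawler can now index.
print(len(raw_html), len(rendered_html))
```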
This is some really robust data in this post. I am sure Google has been refining the heck out of their search tool in order to fend off the evil spammers. I think they have some nice technology refinements happening right now and it will be interesting to see where things go from here.
Wow, another great one Mike. Thanks to all you guys for doing this research - making my job easier.
Really interesting point about the cached Flash previews Mike!
Got me thinking, since all I've seen is Google compiling Flash through Adobe's SDK. I always thought it was only the links, and their approximate locations on the page, that were parsed from the ActionScript, not full CSS-style colours etc.
Actually getting the layouts compiled, and skipping past the loading splash screen (intro homepage) for the screenshot, that's quite the feat!
They've come quite a way since 2008. I'm sure Bing must have too since they started working with Adobe around the same time.
How well do you think they handle ActionScript animations over time (obscuring links with other animated elements, etc.)?
Great post, Mike! +1 for the awesome graphics used!! SEO tools will get smarter and more impressive, especially once the new Google Analytics platform is fully released!
Mike, awesome... just gratifying! What you describe may be meaningful, but I'd like to share a recent experience from the last Google update: I have an AJAX-based website (one of my clients'), and it is still being cached regularly; within 26 days every inner page got PR3, and the home page is now PR5. So how can we say Google lies?
I'll be honest, this article was more successful at making me want to play some Pac-Man than at destroying my established views. I /really/ want some Pac-Man now.
For the most part, if you're generally not trying to game the system, it's just going to mean Google is actually considering whether your site is fugly and saving your customers the pain of leaving it by dropping it down the rankings.
I disagree (not with wanting to play Pacman): If this is true, and the argument backing up the thesis is certainly excellent, then we need to modify our understanding of how Google works, whether we're gaming the system or not. If we don't change our established views about how Google works, we'll think of Google in 2004 / 2009 / 2012 terms for far too long.
After all, we'd be horribly ineffectual SEOs if we did our job to satisfy Google circa 2002.
This is why I don't get why people get so upset when someone writes or says something about "black hat" SEO. You can learn a lot from just understanding something, whether you choose to use it or not.
I must admit I worded that poorly. It has changed how I'll be viewing Google from now on; I don't really /have/ much of an established view, given how fluid things are. It was more meant to be a comment on how much I liked the Pac-Man analogy.
It's actually meant to be in agreement with the kind of thing you're saying as well.
I'm certain black-hat techniques will evolve to deal with this. White-hat will have a few extra things to consider. If people do what they used to and just SEO the code, Google will start 'seeing' it, and their SEO will suffer.
I hope that's clearer. I must admit my first comment is slightly embarrassingly open to interpretation.
There you go: https://www.google.com/pacman/ ... ah, the irony!
"Basically they have trained us to believe that Googlebot, Slurp and Bingbot are a lot like Pacman in that you point it in a direction and it gobbles up everything it can without being able to see where it’s going or what it’s looking at."
What on earth are you on about? If that were true, how would Google know which sites rank where in the SERPs? That really makes no sense at all, dude!
What Googlebot does is collect (eat) and, to some extent, interpret the data; it doesn't do the analysis.
SEO is a science that is constantly changing and evolving from day to day. SEO should evolve into something more than cheating the system.
Thanks for a great post.
Awesome info, Mike. No, it's not like a science; a science doesn't change as often as SEO does. SEO has become more like a religion, at least for internet marketers, and a very fickle religion at that.
Thank god the stupid bot is getting smarter! Yes, you can make it, dude!!
Good post, but the metaphors get boring; you should limit them.
I disagree about the metaphors; they make the post accessible to SEOs like myself who don't come from a development or programming background (non-technical SEOs, if you will).
I like the metaphors; they help with explaining and remembering things... that's what metaphors are for.