Google has found an intelligent way to arrange the results for a search query. But an interesting question is: where can we find that intelligence? A lot of people have researched the indexing process, and even more have tested the weight of individual ranking factors, but we wondered how smart Googlebot itself is. To make a start, we took some common statements and principles and tested how Googlebot handles them. Some results are questionable and should be verified across a few hundred domains to be sure, but they can give you some ideas.
Speed of the Crawler
The first thing we tested was Matt Cutts’s statement: “... the number of pages that we crawl is roughly proportional to your PageRank”.
This touches on one of the challenges large content sites face: getting all of their pages indexed. Imagine Amazon.com were a new website; it would take Google a while to crawl all 48 million pages, and if Matt Cutts’s statement is true, it would be impossible without any incoming links.
To test it, we took a domain with no history (never registered, no backlinks) and made a page with 250 links on it. Those links refer to pages that also have 250 links (and so on…). The links and URLs were numbered from 1 to 250, in the same order as they appeared in the source code. We submitted the URL via “addurl” and waited. Because the domain has no incoming links, it has no (or at least a negligible) PageRank. If Matt Cutts’s statement is correct, Googlebot should soon stop crawling.
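For those who want to reproduce the set-up, here is a minimal sketch of the idea: a tiny dynamic site where every URL returns 250 links to the next level down. Flask and the exact URL scheme are my own choices for illustration, not necessarily what ran on the test domains.

```python
# Minimal sketch: serve an "endless" link tree where every page contains 250
# numbered links to the next level. Flask and the URL scheme are assumptions
# for illustration, not the exact set-up used on the test domains.
from flask import Flask

app = Flask(__name__)
LINKS_PER_PAGE = 250

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>/")
def node(path):
    prefix = f"/{path}/" if path else "/"
    # Links are numbered 1..250, in the same order as they appear in the source.
    links = "\n".join(
        f'<a href="{prefix}{i}/">{i}</a>' for i in range(1, LINKS_PER_PAGE + 1)
    )
    return f"<html><body>{links}</body></html>"

if __name__ == "__main__":
    app.run()
```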
As you can see in the graph, Googlebot started crawling the site at a rate of approximately 2,500 pages per hour. After three hours it slowed down to approximately 25 pages per hour and maintained that rate for months. To verify this result we ran the same test on two other domains. Both came up with nearly the same results; the only difference was a lower peak at the beginning of Googlebot's visit.
Impact of Sitemaps
During the tests, the sitemap proved to be a very useful tool for influencing the crawl rate. We added a sitemap with 50,000 uncrawled pages in it (indexation level 0). Googlebot placed the pages added via the sitemap at the top of the crawl queue, which means those pages got crawled before the F-level pages. What's really remarkable, though, is the extreme increase in crawl rate. At first the number of visits had stabilized at 20-30 pages per hour. As soon as the sitemap was uploaded through Webmaster Central, the crawler accelerated to approximately 500 pages per hour, and in just a few days it reached a peak of 2,224 pages per hour. Where the crawler at first visited 26.59 pages per hour on average, it grew to an average of 1,257.78 pages per hour, an increase of no less than 4,630.27%. The increase in crawl rate isn't limited to the pages included in the sitemap; the other F- and 0-level pages benefit from it as well.
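As an aside, generating a sitemap of that size is trivial; a minimal sketch (the domain and URL pattern are placeholders) looks like this:

```python
# Minimal sketch of building a 50,000-URL XML sitemap for submission through
# Webmaster Central. The domain and URL pattern are placeholders.
from xml.etree.ElementTree import Element, SubElement, ElementTree

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for i in range(1, 50001):  # 50,000 URLs, the maximum allowed per sitemap file
    SubElement(SubElement(urlset, "url"), "loc").text = f"https://example.com/{i}/"

ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```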
It’s quite remarkable that Google suddenly uses more of its crawl capacity on the website. At the point where we submitted the sitemap, the crawl queue was filled with F-pages. Google apparently attaches a lot of value to a submitted sitemap.
This brings us back to Matt Cutts’s statement. After only 31 days, Googlebot had crawled about 375,000 pages of the website. If that is proportional to its PageRank (which is 0), it would mean Google will crawl 140,625,000,000 pages of a PageRank 1 website in just 31 days; remember that PageRank is exponential. In other words, you would never have to worry about your PageRank, even if you owned the largest website on the web. So don’t simply accept everything Matt says.
Number of Links
Rand Fishkin says: “…you really can go above Google’s recommended 100 links per page; with a PageRank of 7.5 you can think about 250-300 links” ( https://moz.com/blog/whiteboard-friday-flat-site-architecture )
The 100-links-per-page advice has always been a hot topic, especially for websites with a lot of pages. The reason the advice was originally given is that Google used to index only 100 kilobytes per page; on a 100 KB page, 100 links seemed reasonable, and if a page was any longer, there was a chance Google would truncate it and not index the whole thing. These days Google will index more than 1.5 MB, and user experience is the main reason Google keeps the “100 links” recommendation in its guidelines.
As described in the previous section, Google does crawl 250 links per page, even on sites with no incoming links. But is there a limit? We used the same set-up as the 250-link websites described above, but with 5,000 links per page. When Googlebot visited that website, something remarkable happened: it requested the following pages:
- https://example.com/1/
- https://example.com/10/
- https://example.com/100/
- https://example.com/1000/
On every level Google visits, we see the same page requests. It seems Googlebot doesn't know how to handle such a large number of links and falls back on a mechanical approach, requesting the first link of each URL length.
Semantic Intelligence
One of the SEO myths applied on almost every optimised website is placing links in heading tags. Recently it was mentioned again as one of the factors in the “reasonable surfer” patent. If Google respects semantics, it should attach more value to those “heading” links. We had our doubts and put it to the test: we took a page with 250 links on it, wrapped some of them in heading tags, and did this a few levels deep. After a few weeks of waiting, nothing pointed in the direction that Googlebot preferred the “heading” links. This doesn't mean Googlebot doesn't use semantics in its algorithm; it just doesn't appear to use headings to give links more weight than others.
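For clarity, the mark-up difference we are talking about boils down to something like the sketch below; the 250 links and the “every 50th link” ratio are illustrative, not the exact test pages.

```python
# Illustrative sketch of a test page where most links are plain anchors and a
# few are wrapped in heading tags. The ratio and URLs are assumptions.
def render_links(n_links=250, heading_every=50):
    rows = []
    for i in range(1, n_links + 1):
        anchor = f'<a href="/{i}/">page {i}</a>'
        # Wrap every 50th link in an <h2> to see whether Googlebot crawls
        # "heading" links sooner or more often than the plain ones.
        rows.append(f"<h2>{anchor}</h2>" if i % heading_every == 0 else anchor)
    return "\n".join(rows)

print(render_links())
```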
Crawling JavaScript
Google says it keeps getting better at recognizing and executing JavaScript. Although JavaScript is not a good technique to use if you want to be sure Google follows your links, it's used quite a lot for the opposite goal: when used for PageRank sculpting, the point of JavaScript links is to make them visible to users only. If you use the technique for that purpose, it's good to keep up to date on what Google can and can't recognize and execute. To test Googlebot's JavaScript capabilities, we took the JavaScript snippets described in “The professional’s guide to PageRank optimization” and put them to the test.
The only code Googlebot executed and followed during our test was the link in a simple “document.write” line. This doesn't rule out the possibility that Googlebot is capable of recognizing and executing the more advanced scripts; it may be that Google needs an extra trigger (like incoming links) before it puts more effort into crawling JavaScript.
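To illustrate the difference (these are assumed examples, not the exact snippets from the guide): a plain document.write link exposes the whole anchor in one string, while a more advanced variant assembles the link at runtime. Only the first kind was followed in our test.

```python
# Assumed examples for illustration; not the exact snippets from the cited guide.
SIMPLE_JS_LINK = """
<script type="text/javascript">
  // The full anchor appears literally in the page source.
  document.write('<a href="/js-simple/">simple JS link</a>');
</script>
"""

ADVANCED_JS_LINK = """
<script type="text/javascript">
  // The href is assembled at runtime, so the URL never appears as one string.
  var parts = ['/js-', 'advanced', '/'];
  var a = document.createElement('a');
  a.href = parts.join('');
  a.appendChild(document.createTextNode('advanced JS link'));
  document.body.appendChild(a);
</script>
"""

print(SIMPLE_JS_LINK + ADVANCED_JS_LINK)
```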
Crawling Breadcrumbs
Breadcrumbs are a typical page element created primarily for users, but sometimes they are used to support the site structure as well. Last month we ran into some situations where Googlebot was not able to crawl its way up the structure, so we did some tests.
We made a page a few levels deep with some content and links to the higher levels on it ( https://example.com/lvl1/lvl2/lvl3/ ). We gave the page some incoming links and waited for Googlebot. Although the deep page itself was visited three times by the crawler, the higher pages didn't get a single visit.
To verify this result, we did the same test on another domain. This time the test page was a few levels deeper in the site structure ( https://example.com/lvl1/lvl2/lvl3/lvl4/lvl5/ ), and this time Googlebot did follow some of the links pointing to pages higher in the structure. Even so, it doesn't seem to be a good method for supporting a site structure: after a few weeks Google still hadn't crawled all the higher pages. It looks like Googlebot would rather crawl deeper into the site structure than crawl upwards.
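For reference, the bottom-up links on the deep test pages were essentially a breadcrumb trail like the sketch below (the “lvl” names are the placeholders from the URLs above; the real pages also contained some content).

```python
# Sketch of a breadcrumb trail that links from a deep page back up the structure.
# The "lvl" segment names match the placeholder URLs used in the text.
def breadcrumb(path="/lvl1/lvl2/lvl3/lvl4/lvl5/"):
    segments = [s for s in path.split("/") if s]
    crumbs, href = [], ""
    for seg in segments[:-1]:      # every ancestor level gets a link upward
        href += f"/{seg}"
        crumbs.append(f'<a href="{href}/">{seg}</a>')
    crumbs.append(segments[-1])    # the current page stays plain text
    return " &raquo; ".join(crumbs)

print(breadcrumb())
```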
Takeaways
In short, the lesson learned is that you can influence the crawl rate with a sitemap. This doesn't mean you should always upload a sitemap for your websites: you only want to increase the crawl rate if the bulk of your crawled pages actually get indexed. It takes longer for the crawler to return to an “F”-level page than to an indexed page, so if most of your pages get crawled but are then dropped from the index, you might want to get more incoming links before using a sitemap. The best thing to do is to monitor, for every page, when Googlebot last visited it; that way you can always identify problems in your site structure.
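A minimal sketch of that kind of monitoring, assuming a standard Apache/Nginx “combined” access log (the file name and regular expression are assumptions):

```python
# Record the last Googlebot visit per URL from a "combined"-format access log,
# so pages that never get re-crawled stand out. File name and format are assumed.
import re

LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<url>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

last_visit = {}
with open("access.log") as fh:
    for line in fh:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("ua"):
            # Log lines are chronological, so the latest match wins per URL.
            last_visit[m.group("url")] = m.group("ts")

for url, ts in sorted(last_visit.items()):
    print(ts, url)
```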
The number of links Googlebot will follow isn't limited to 250 per page (even if you have no incoming links), although 5,000 seems to be too many. We haven't found the exact limit yet, but if we do, we'll give you an update.
Putting links in heading tags for crawl purposes seems to be a waste of time. You can still use them for usability, because you're used to them, or because WordPress does it anyway, and maybe, if you're lucky, it's still a ranking factor.
Another conclusion is that Googlebot isn't very good at crawling breadcrumbs, so don't rely on them for site structure purposes; Google just doesn't crawl up as well as it crawls down. In contrast to breadcrumbs, you can use JavaScript for sculpting purposes: Googlebot isn't top of the bill when it comes to recognizing and executing JavaScript links. Keep yourself updated on this subject, but for now you can definitely use some “advanced” JavaScript to do sculpting.
A last result that came up while researching the crawl process was the influence of URL length: a short URL gets crawled earlier than a long one, so always consider the need for indexation and crawling when you choose your URLs.
Although I may not agree with all of your conclusions, I really liked the data collection attempts and the close look at crawler behavior. It's very important to keep in mind the 3 major steps in the process:
(1) Crawling
(2) Indexation
(3) Ranking
I will say that something I see a lot with XML sitemaps is that they seem to jump-start crawling, but, if the site is huge without the PR to support it, pages get crawled and indexed in the short-term but then quickly dropped. I don't think Matt's statement is all or none. The indexation cap is proportional to PR/authority, but it isn't directly proportional. Even new sites can get some love.
Regarding breadcrumbs, I have a hunch that what you might be seeing is that Google doesn't always crawl back up (at least, not in a way you'd see in your log files), because it recognizes that those pages have already been crawled. There may still be a PR-flow component to that link, though. Now that Google is using site breadcrumbs to produce mini-breadcrumbs in the SERPs, it's clear they're paying attention to these architecture cues.
Great point, Dr. Pete! I would love to see a follow-up where Google's index is checked for the inclusion of the pages that were visited once (whether by the initial crawl or by adding the sitemap) and then never crawled again.
At some point, I hope to test the value of a Tweet-link for crawling purposes. It's fairly low on my "must answer" list, but I have a hypothesis that Tweeting a link to new content will be one of the fastest ways to get a page included in the index - faster than a sitemap, submitting a URL, getting a link from another site, etc. Even though Tweet-links are nofollow, I think there can be some serious value in them... but I have to prove it first :)
First, great post - I think it should be promoted. That said, I have a few questions: what content was put on the pages? Was it just the links? I think the behavior Google is showing here is it testing for "infinite space" https://j.mp/agCQPU (WMT blog post).
As far as breadcrumbs go, I think they may be like site-links... I have a hunch that list-based navs have a lot to do with Google being able to identify navigational elements (might not be the only signal, but a strong one that's easy to identify), but I haven't had time to crawl some results and compare their code for similarities.
I'm lame and have no domains earning site-links, so please let me in on whatever insight y'all have :)
There was a little bit more content on the pages than just the links; this was done to make every page unique. The only content on each page was the number of the page fully written out. So, for example, for https://example.com/200/3/25/ the content is: two hundred / three / twenty-five.
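For illustration, that kind of spelled-out content can be generated with something like the sketch below (num2words is used here just as an example library, not necessarily what runs on the test domains).

```python
# Turn each numeric URL segment into words to give every page unique content.
# The num2words package is an assumption for illustration purposes.
from num2words import num2words

def page_content(path="/200/3/25/"):
    numbers = [int(s) for s in path.split("/") if s]
    return " / ".join(num2words(n) for n in numbers)

print(page_content())  # -> "two hundred / three / twenty-five"
```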
I don’t think Google was doing an infinite space test at that point; I think it was just being stupid and taking the first link of every URL length (see the remark above about the influence of URL length).
I do think Google did an infinite space test by requesting a bunch of made-up pages (pages like https://example.com/tpylnqbqyjlo.html ), although it's possible Google uses that test to look at the status codes; I'm not sure.
In all tests Google got a 404 in return when it requested those pages, but we also ran a test where it always got a 200 back, even for those made-up pages. The only difference was that Google didn't stop making those strange requests; Googlebot got at least 18 levels deep on that test. So even if it is doing an infinite space test, it just keeps crawling anyway.
"two hundred / three / twenty-five"
Only returns this current SEOmoz page for me.
Sorry, it was a fictitious example; I'd like to use the domains for some other tests for a while.
You are right about the three steps in the process. For this post we only looked at the crawling part, but I'd like to give you some information about the indexing part:
Domain 1 now has 4,310 indexed pages (no sitemap)
Domain 3 now has 241,000 indexed pages (with sitemap)
The number of indexed pages on both is still growing. I think that's a whole lot of love for a new website, and not really proportional to the content it has.
About the breadcrumb pages: I think Google recognizes that those pages are higher in the site structure, not that they were crawled before (Google had never seen a link to those higher pages before; we only linked from bottom to top instead of top-down). Of course there's probably PR flow through those links, but then again the focus was crawling, not indexing or PageRank.
Very interesting study! In my YOUmoz post, I found that PageRank shared only a 0.30 correlation with the number of pages indexed by Google (although a 0.52 correlation with the number indexed by Bing). This relationship was approximated by an exponential equation with a base of 1.78 raised to PageRank. Much of what I've seen suggests that PageRank transforms with a base somewhere between 1.5 and 2.5. I think you may be overestimating the difference in pages crawled between a PR = 0 site and a PR = 1 site. My study goes into detail about why PageRank may be a metric worth considering.
I definitely enjoyed your post and hope that you will continue to contribute to the community.
LOL. I love reading your comments Sean. Even though I can usually only understand the opening and closing sentences! You are a statistics animal dude!
Haha, thanks GNC. It's tricky to share statistical information in a way that satisfies the hardcore, methodology-critiquing data enthusiasts, while also being accessible to those who aren't as passionate about such things. I'm still trying to figure out the best way to satisfy both groups, while staying focused on the utility and practical applicability of research.
Not to bash academia, but it's part of why I left that world after grad. school. If we spend all of our time arguing and none of our time sharing and communicating the results, then what's the point? I have a much greater appreciation for the people who attempt to explain these things, even if they aren't 100% perfect.
It will be interesting to see how academia, and in particular, the journal/publishing/peer-review process evolves with the ever-increasing free flow of information on the web. I think that there is a huge opportunity for the current system to evolve into something that will truly benefit all mankind. Then again, I could say the same thing about journalism, and most of them are still dragging their feet...
As someone doing an MA in social media and working for an academic software company in that space, I can assure you change will only come at a slow pace, with rivers of tears and much screaming.
Having been part of the community here for a while, I can say I learn more on a daily basis here than clueless academics can teach in a year.
Try writing a social media paper when the published literature is 5 years old and written by someone with little understanding of the subject in the first place.
/rant :P
Great post, found some gems in the comments as well.
Thanks for the subject matter.
Dr. Pete, I'd say there is an argument for adding seasonality as a (4) to your list.
I have been trying to find out whether, and how fast, the Google index moves with respect to seasonality.
When a site is crawled and pages very soon drop out of the rankings, that can be because:
a) your content sucks
or possibly
b) there isn't enough traffic on your keywords for Google to keep your pages in the main index at the low point of the season
That's really interesting, especially if you track it over a year or more (and it's not just an algorithm change). I think generally Google is getting harsher about low-value, long-tail pages, and that Mayday was part of that change in philosophy, but it would be interesting to see if traffic and other cues affected indexation.
TBH, my data in a growing site is very noisy (especially when I often have no idea what the noise is).
We certainly did get hit by the May update. I've been tracking for 4 months now; I'll let you know in 20 months' time ;)
Thanks loads for sharing the results of your tests with us. This is the kind of post that I thought of when reading Dr. Pete's post about 7 types of SEO evidence. This would fall into the Secondhand Evidence section for me.
While not exhaustively researched, it gives me more than enough direction to head towards while doing my own testing and research.
Thanks rolfbroer. Excellent post.
Thank you very much for this info. It certainly shows some insight.
Great post, I love these test data examples, thanks.
Whoah, I didn't mean to comment 3 times. I clicked 'post comment' and the Chrome wheels went turning. I clicked some more out of impatience and ended up commenting 3 times. Anyone know how to delete comments? If so, please remove my redundancy and repetition. :)
I got ya covered :)
Hi, I have a website and I have created a links directory page. However, adding the links to the directory will take hours or DAYS or YEARS!! So anyway, I want to know if I can download or get some HTML code or something for a thing a bit like Googlebot which automatically pulls in links. Is this possible? PS. I don't need a directory filled with hundreds of millions of links - just a few thousand would do.
williamsjohn333
Dear rolfbroer,
Thank you for the very interesting article. I have found some very good recommendations concerning better site indexation, and I am going to check how it works with Russian-language sites.
Excellent post, really nice way to explain how Googlebot crawls web pages.
Excellent analysis of how Googlebot crawls web pages. I really liked your insight about Googlebot crawling links within JavaScript. So with more incoming links, Google will put more focus on crawling JavaScript, eh?
You explain this in a very nice way; anyone can understand it easily.
Thanks, buddy.
This is a great share. I think it sheds some light on why some blog sites interlink as deeply as they do.
Really great information here - thank you for sharing your clearly extensive research with the rest of the SEO community.
It should be noted that not all Googlebot visits in your logs are actually from Googlebot.* We recently conducted a study on this subject, and it showed that 16.3% of all Googlebot visits were fake; 75% of those were also harmful. You can read more here: Is this really a Googlebot crawling my site?
* Webmaster Tools is still reliable, of course.
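For anyone who wants to check their own logs: a reverse DNS lookup followed by a forward confirmation is the usual way to separate real Googlebot visits from fakes. A minimal sketch (the IP below is just a placeholder from a typical Googlebot range):

```python
# Reverse-plus-forward DNS check to tell real Googlebot visits from fakes.
# The example IP is only a placeholder.
import socket

def is_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]                 # reverse DNS lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]      # forward confirmation
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))
```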
I appreciate the section on the sitemap. We just added a second sitemap to help index the real estate properties we have recently included as content to our website. It's good to know Google will be influenced by this.
Food for Thought
Great summary, thanks
Thanks for the research on the sitemaps. Good to see it is a useful thing to do.
Not sure I agree about using JavaScript for sculpting purposes though. We need to think about usability; write for the audience, not the search engines, as we always hear.
This would seem to make it clear that an XML sitemap is one of those "more than 200" signals Google's ranking algorithm is using. It would be interesting if SEOmoz (or one of the expert posters here) were brave enough to come up with a best guess at the entire list of them ;)
Someone could probably come up with a list... then tomorrow, it'd be a different list.
OK, I probably oversimplified the statement. I agree it'd be different over time, but I doubt they ever really drop anything; it'd just get longer (as it evidently just has with speed).
I really like this post - but I have to push back on your recommendation not to use breadcrumbs. You should take into account that breadcrumbs are very useful for users... even if they don't help get your pages indexed (although I had a different experience with one of my sites), they certainly help users understand your site architecture.
What I'm trying to say is that you shouldn't rely only on breadcrumbs for your site structure.
Sure, they can help with crawling your site, and yes, you should definitely always offer them to your users. But what we've experienced on a few large sites is that Google wasn't using the breadcrumbs very well during the crawl process.
That's the reason I tried to reproduce it. I'm not saying "don't use breadcrumbs"; I'm trying to say: be aware of this and make sure you always link top-down as well.
Hello,
I would say it's a decent case study, but I have some different experience with breadcrumbs: when we make inner pages, Googlebot always fetches those pages through its "real-time search" method, and within a minimal span of time it fetches the higher pages as well. So based on my experience, I would say Googlebot indexation depends on fresh content, images and videos...
Apart from that, the case study is appreciated.
Really informative post. If you find the total number of links that Google allows, do post an update.
Thanks.
Asim
Appreciate the way this article was written and presented - thanks for taking the time to share your test data. Interesting about the increase in spidering per Sitemap submission :-)
Hi,
First, thank you for this lovely data and piece of work. Congratulations!
Despite this close watch of the Google crawler, I am still a little skeptical about this data and your assumption......
https://example.com/1
https://example.com/10/
https://example.com/100/
https://example.com/1000/
Because I have a site, https://www.fragranceville.com/
The root domain of this site is crawled every 6-7 days. I have some HTML pages on the site, like https://www.fragranceville.com/women-perfumes.html
and https://www.fragranceville.com/miniature-perfumes.html
Some of them haven't been crawled in the last 2 months, while other deep pages are crawled by the bot. So what would you say about that, and what should I do about it?
Thanks,
Allen Pradhan