Over the past couple of days I have been putting together some internal guidelines on various aspects of our jobs. This should ensure that we are giving consistent information to our various clients. Most of these guidelines have been fairly straightforward, with nothing in them to write home about. However, one of the hardest guidelines to write has been the one about XML sitemaps. So, rather than hoard my thoughts, I'm going to open them up to all of you.

What are XML sitemaps?

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL... https://www.sitemaps.org
On the surface this seems to be a great addition to any website's armoury. However, before you rush away and create your sitemap, there are a number of pros and cons you should be aware of.

Benefits of using an XML sitemap

The first set of benefits revolves around being able to pass extra information to the search engines.
  • Your sitemap can list all URLs from your site. This could include pages that aren't otherwise discoverable by the search engines.
  • Giving the search engines priority information. There is an optional tag in the sitemap for the priority of the page. This is an indication of how important a given page is relative to all the others on your site. This allows the search engines to order the crawling of your website based on priority information.
  • Passing temporal information. Two other optional tags (lastmod and changefreq) pass more information to the search engines that should help them crawl your site in a more optimal way. lastmod tells them when a page last changed, and changefreq indicates how often the page is likely to change.
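To make the tags above concrete, here is a minimal sketch that reads them out of a small sitemap using Python's standard library. The sitemap itself is hypothetical (www.example.com and every value in it are made up for illustration); the tag names, the namespace, and the 0.5 default priority come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap: the URLs and values are invented for this example.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/reports/widgets</loc>
    <priority>0.3</priority>
  </url>
</urlset>
"""


def read_entries(xml_text):
    """Return one dict per <url> entry; optional tags that are absent become None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            # The protocol treats a missing priority as 0.5.
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    return entries


entries = read_entries(SITEMAP_XML)
for entry in entries:
    print(entry["loc"], entry["priority"], entry["lastmod"], entry["changefreq"])
```

Only loc is required for each entry; the second entry shows that lastmod and changefreq can simply be left out.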
Being able to pass extra information to the search engines *should* result in them crawling your site in a more optimal way. Google itself points out that the information you pass is treated only as hints, though it would appear to benefit both webmasters and the search engines if they used this data to crawl your site according to the pages you think have a high priority. There is a further benefit, which is that you get information back.
  • Google Webmaster Central gives some useful information when you have a sitemap. For example, the following graph shows Googlebot activity over the last 90 days. This is actually taken from a friend of ours in our building who offers market research reports.


Negative aspects of XML sitemaps

  • Rand has already covered one of the major issues with sitemaps, which is that they can hide site architecture issues by getting pages indexed that a normal web crawl can't find.
  • Competitive intelligence. If you are telling the search engines the relative priority of all of your pages, you can bet this information will be of interest to your competitors. I know of no way of protecting your sitemap so only the search engines can access it.
  • Generation. This is not actually a problem with sitemaps, but rather a problem with the way a lot of sitemaps are generated. Any time you generate a sitemap by sending a program to crawl your site, you are asking for trouble. I'd put money on the search engines having a better crawling algorithm than any of the tools out there that generate sitemaps. The other issue with sitemaps that aren't dynamically generated from a database is that they will become out of date almost immediately.
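As a rough illustration of the "dynamically generated from a database" approach, the sketch below builds the sitemap straight from page records instead of crawling the site. The pages table, its url and updated columns, and the example.com rows are all assumptions made up for this example; a real site would query whatever its CMS actually stores.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(conn):
    """Build sitemap XML from a hypothetical `pages` table (url, updated)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    rows = conn.execute("SELECT url, updated FROM pages ORDER BY url")
    for url, updated in rows:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # lastmod comes from the record itself, so it cannot drift out of date.
        ET.SubElement(entry, "lastmod").text = updated
    return ET.tostring(urlset, encoding="unicode")


# Stand-in for the site's real database: an in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("http://www.example.com/", "2008-01-15"),
        ("http://www.example.com/reports/widgets", "2008-01-02"),
    ],
)

sitemap_xml = build_sitemap(conn)
print(sitemap_xml)
```

Because the sitemap is rebuilt from the same records that drive the pages, serving it on request (or regenerating it whenever a page is saved) keeps it in step with the site, which a one-off crawl cannot do.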

XML sitemap guidelines

With all of the above in mind, I would avoid putting a sitemap on a site, especially a new site or one that has recently changed structure. By not submitting a sitemap, you can use the information gathered from seeing which pages Google indexes, and how quickly they are indexed, to validate that your site architecture is correct.

There is one set of circumstances in which I would recommend using a sitemap. If you have a very large site, have spent the time looking at the crawl stats, and are completely happy that you understand why pages are in and out of the index, then adding a sitemap can lead to an increase in the number of pages in the index. It's worth saying that these pages are going to be the poorest of the poor in terms of link juice. These pages are the fleas on the runt of the litter. They aren't going to rank for anything other than the long tail. However, I'm sure you don't need me to tell you that even the longest of the long tail can drive significant traffic when thousands of extra pages are suddenly added to the index.

One question still in my mind is the impact of removing an XML sitemap from a site that previously had one. Should we recommend all new clients remove their sitemap in order to see issues in the site architecture? I'm a big fan of using the search engines to diagnose site architecture issues. I'm not convinced that removing a sitemap would remove pages that are only indexed due to the XML sitemap. If that is the case, that's a very nice bit of information. *Wishes he'd kept that tidbit under his oh so very white hat*

So I guess let the discussions start: do you follow amazon.co.uk (which does have a sitemap), or are you more of an ebay.co.uk (which doesn't)?