A couple of large client projects we are working on at the moment have had me thinking about a tricky issue that rears its head especially in enterprise SEO projects. When large clients are making extensive website changes, our experience is that the section entitled '301 redirects' is inevitably the one that gets read quickly and then quietly shuffled out of scope. We have found we have to push hard to get large businesses to see the importance of permanent redirects.
We are currently working on two projects where we have been forced to think about the cheapest, most efficient ways of putting redirects in place on high-traffic websites:
- In helping with the relaunch of a newspaper website, we pushed hard enough for some pretty serious scope changes to get 301 redirects for all old pages included for the launch date
- While thinking about the migration of a large subsection of a Fortune 100 company's website, we discovered that they didn't have the resources to carry out the migration and build redirects from the old section
In the first case, the developers were on a very tight schedule and were solving scaling issues with a clever mix of standard frameworks, content delivery networks (CDNs) and JavaScript for some personalisation. In the second case, we (half-jokingly) considered suggesting that the client point the old sub-domain at our servers and we would build the redirection engine (which would have similar scaling issues).
The problem with all of this is the lack of caching solutions for redirections. Each step in the caching chain is poorly designed for caching anything other than pages of HTML.
Before I get into the details, a very quick CDN primer. For high-availability websites, or those delivering large amounts of video / streaming content, building a website is a separate task from making it available everywhere to everyone who needs it. CDNs attempt to abstract this problem and offer a bolt-on front end for content delivery, independent of the beefiness of the underlying servers. Effectively what happens is that a company makes its website available to the CDN, which then replicates it and serves it to users - giving the ability to scale to millions of users without changes to the underlying hosting. Duncan wrote an article a while ago about some of the issues this can cause from an SEO perspective when CDNs geo-deliver content to serve users from local servers.
How CDNs Break SEO Efforts
We have discovered over the past week that (at least some) CDNs only cache "200 OK"-status HTML pages. This means that the CDN is not much use when a massive architecture change takes place because (in the SEO-friendly case) there will be a large volume of 301 redirects to serve from your tiny root server, or (as seems to be more likely) there will be a lot of 404 errors where the CDN thinks there is no page because it hasn't cached a 200-status page. The end result is a failure to serve 301 redirects.
It makes some sense for this to happen - you can certainly see how a CDN's requirements would include caching only successfully-delivered pages and checking with the origin server on every error status. During roll-out especially, you would hope that 404 errors would be gradually caught and fixed (by tracking 404 error pages in Google Analytics, for example), and you would want those fixes reflected in the CDN-delivered version. It is relatively easy to see how this evolves into refusing to cache any non-200 status pages, but it is definitely not ideal for SEO purposes.
It is even worse than this though - because as you look at the caching solutions built into major frameworks, you realise that most of them are poorly designed for caching redirects as well. Whenever a request for a non-existent page is made, not only does it miss the cache, but it typically cascades all the way up the stack until there is an authoritative "nope, definitely doesn't exist" (or "permanently redirected over there") response, which involves quite a few database queries.
When you are talking about redirects on this scale, you are inevitably talking about serving them from a database-driven CMS - while Apache can be quick at hard-coded redirects, config changes are never going to be the answer for large-scale architecture changes. At the very least in these scenarios, you are going to want rules-based redirection (and the thought of doing it with regular expressions within Apache configs gives me a cold sweat).
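For anyone who hasn't seen it, this is roughly what rules-based redirection looks like at the Apache level - a couple of illustrative mod_rewrite rules with made-up URL patterns, not anything from the actual projects:

# httpd.conf - pattern-based 301s (illustrative URLs only)
RewriteEngine On

# old date-based article URLs move under /archive/, keeping year/month/slug
RewriteRule ^/news/(\d{4})/(\d{2})/(.+)$ /archive/$1/$2/$3 [R=301,L]

# a retired section maps wholesale onto its replacement
RewriteRule ^/old-products/(.+)$ /products/$1 [R=301,L]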
So, while we aren't (yet) needing to specify the system that could handle millions of visitors all seeking non-existent pages (that would be needed to host the soon-to-be-defunct sub-domain described above), it has got us thinking.
Does anyone know of a good way of caching non-200 status-code pages as efficiently as the systems for caching successfully-delivered HTML pages?
In the meantime, hopefully our story will give you pause for thought about what needs to be done to bring the one-line specification "permanently redirect all old pages to their new equivalents" to life on large, complex systems delivered via frameworks and over CDNs.
At Zappos we have a pretty robust URL rewriting solution running alongside Akamai quite nicely. Hit me up offline if you want details.
Thumbed up for this quote:
My god ain't that the truth!
A lot of your post is over my head (I don't really have experience with CDNs) but don't let that discourage you from writing more technical/advanced posts on SEOmoz!
I think therein lies your problem. You know the solution is to modify the Apache config, but you dread it and blame it on the CDN.
Of course, you should generate the 301 redirects upfront instead of using the database. ;) That will speed things up, and improve concurrency, perhaps a few hundred times.
You don't mention the web programming language that your client uses, but presuming it is Apache, most likely it is PHP or RoR.
A change in URLs means you have to tell the CDN to immediately fetch the new content instead of delivering it from their cache.
301 redirects are efficient if you issue them at the server level. If your client configures Apache right, it will also be efficient, i.e. not loading the entire PHP engine just to return a 301 redirect.
Alternatively, install nginx or other fast web server. Nginx is my choice because I can install it in front of the Apache server and use it as a reverse proxy.
WordPress.com uses nginx to serve more than 1.3Gbps of data, or up to 8,000 requests per second from one box. They are serving images, which involves I/O, while 301 redirects are served from memory.
If you still worry about scalability, nginx is able to load balance too.
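To make that concrete, here is a minimal sketch of nginx sitting in front of Apache and answering known 301s from memory (the hostname, port and URLs are made up):

# nginx.conf (http context) - illustrative only
map $request_uri $new_uri {
    default               "";
    /old-page             /new-page;
    /old-section/about    /about-us;
}

server {
    listen 80;
    server_name www.example.com;

    # answer known redirects straight from nginx
    if ($new_uri) {
        rewrite ^ $new_uri permanent;
    }

    # everything else is proxied back to Apache
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
    }
}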
By my reckoning an OC12 could handle 864 million 301s per hour ... how many you got? :0)
Apache allows you to map the redirects with a db, so your main db doesn't need to touch these: https://httpd.apache.org/docs/2.2/mod/mod_rewrite.html#RewriteMap (rough sketch below).
Never done this but I imagine you could parse your logs for resource misses (i.e. the opposite of "hits"), use the keywords to find the best hits in the content db, trace the URL for that content and add it to your rewrite map against the missed URL. Your first user per resource will be a long wait that way (but if they don't follow on they may still use site search to get where they wanted to).
Sounds like fun!
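For anyone who hasn't used RewriteMap, it goes roughly like this (file location made up; the dbm file is built from a plain-text map with the httxt2dbm utility):

# httpd.conf - look each requested path up in an external map, 301 if found
RewriteEngine On
RewriteMap redirects dbm:/etc/apache2/redirects.dbm

# keys in the map are the old paths (without the leading slash), values the new URLs
RewriteCond ${redirects:$1|NOT_FOUND} !NOT_FOUND
RewriteRule ^/(.*)$ ${redirects:$1} [R=301,L]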
1. Wait, are we talking about a web site taking a full OC12 capacity?
2. 8,000 requests / sec * 86400 = close to 700 million.
And 8,000 requests / sec is for requests for images (Gravatar). Their stats, not mine.
That's for one server.
I was starting from a 320-byte header and guesstimating (allowing for overhead) the number of 301s that you could stuff back in the pipe.
I completely agree with Hendry.
The CDN should not be used to handle 301s. Unless the URL structure of the site is completely fubar'd you should be able to write a handful of regular expressions to handle the redirects.
Apache config is definitely the way to go, and don't even think about using .htaccess. Or as Hendry pointed out, run Nginx in front, which is a simple task for most SAs.
Since 301s only pass response headers and there isn't an actual payload, it should be a very fast and trivial task for any major site that has more than 1 server in their load balancer.
Lastly, CDNs are meant to replicate large payloads to the edge for faster access. There's no need to replicate small header-only responses out to the edge. The time over the wire is minimal since the data payload is typically only ~500 bytes.
Yeah - well therein lies the problem. That assumption is the root of the problem!
Incidentally, all of the development here is being done by a third-party so yeah - if we were developing in-house we would have built something (most likely using Apache capabilities - just because it gives me a cold sweat doesn't mean our devs can't do it - many kinds of programming give me a cold sweat!). The simplest way of thinking about the problem is that the underlying site works well, but the CDN breaks it. It seems to me that is a problem at the CDN level...
I obviously trust your assessment b/c I'm not too familiar with the special implementation.
But it must be a strange / exotic way that they're using the CDN. Meaning, requesting that CDNs cache 301 redirects should not be a common solution.
CDNs should quickly replicate, or in this case quickly remove, all the content from the edge servers.
As I mentioned above, the 301 response payload is ~500 bytes. So that data really doesn't need to be pushed out to the CDN. Also it should be a trivial task for most consumers to hit the main datacenter or load balancer.
As Hendry mentioned above, you really shouldn't be doing a DB lookup when the page request comes in. Although I'd advise against it for any enterprise-level site, even an .htaccess file with 100 regexes would be faster than allowing the redirect to be handled by a CMS that performs a DB lookup to verify the absence of the record in the DB.
I would suggest talking with your client again about the various options that don't include the CDN. Then determine the # of visitors that have the URLs bookmarked + the SEO traffic. That should give you the bulk of the load to use for your initial stress test.
I'm going to assume they've got a dev server.
In Dev, create the regexes, and have them implement them at the CMS level AND at the server level. Then have them benchmark the response times and stress test the number of redirects they can handle without impacting performance (e.g. sucking down too much memory or leaving too many open connections). Compare the results.
If the Direct + SEO traffic on a daily / monthly basis is less than that benchmark, you should be fine. You can even divide that by the # of production servers they're using and assume that the load balancer will distribute the 301 load evenly or via cookie (which is more unlikely, but I've seen this implemented many times).
If by chance the traffic is well above the max 301 load, then they've got more issues than just SEO. At that point you can suggest they take a serious look at Nginx .... it's low cost (open source) and relatively easy to install (even compared to Apache). The only cost is capex, which might be a chore to requisition, but you won't need too many additional boxes to handle the load, as demonstrated by Hendry and pbhj above.
Cheers, and good luck ... I know working with client dev isn't always the easiest thing to do. ;)
This actually sounds like a smarter time to go ahead and simply use the new canonical link tag. The scope and cost of modifying the way CDNs handle redirects, or the risk of feeding the search engines a separate redirect method that bypasses the CDNs, is not worth it when a simple, albeit new, standard exists for handling exactly this type of issue.
While normally I prefer 301s (as the end-user lands on the page which we would prefer them to use if they choose to link back to the story), there is something to be said about loss of users due to opaque redirects. Some people just don't like their browser to be tossed around from URL to URL.
Yep - we have thought of that. The main problem is that you have to keep the old CMS running alongside the new one in order to serve up the old page with the canonical tag...
I'm also yet to be convinced that the canonical tag is going to pass link juice correctly about the place.
I second the concern that the canonical link tag won't pass on all of the link juice, although I am fairly certain that this is the case with 301 redirects as well. Your point about running the old CMS concurrently does cause concern, though. If you have a thorough redirect map in place, you could ditch the old CMS and have the new CMS check the old URL list for a match and, if so, just grab a copy of the new URL and insert it in the canonical tag on the fly. Still kinda hackish, but I think caching 301s will be a nightmare of epic proportions.
--- talk to the Zappos guy below, sounds like they might have a solution.
The canonical tag is completely the wrong thing to be using here.
It's for resolving duplicates, not for using as some sort of quasi-pseudo-redirect system. That thought scares me. A lot.
I certainly agree that it is a hackish / less than ideal solution - but so is the idea of mapping tens if not hundreds of thousands of URLs and redirecting them all. Users really don't like to be shuffled around via redirects and the canonical link method allows users to continue to use their old bookmarked links without feeling as if the site has taken over their browser.
Think about it from a user perspective. Which would you prefer, to find that all your bookmarks redirect somewhere else? or that the pages are right where you left them? The canonical link tag, much like the nofollow tag, is transparent to the end user. Redirects (either 301ing or link-redirects to prevent link-juice loss) are not.
Upvote for the "quasi-pseudo-redirect" statement though. It's not every day that I hear about fake fake fake urls.
There are four main ways of redirecting old to new:
- put a single line in .htaccess or httpd.conf for each affected URL (very much not recommended for more than a few dozen moves).
- identify all the affected URLs with a simple regex, and have a small number of rules that perform those redirects within the .htaccess or httpd.conf file (this is often the most efficient method).
- identify all the affected URLs with a simple regex and rewrite them to a script that then does a database lookup to obtain the new URL - with the script then sending the redirect header (see the sketch after this comment).
- rewrite all requests for all URLs to a handler that looks up what to do (this can get very messy, and will lead to server inefficiency for all valid requests).
Which one is best very much depends on the format of both the old and the new URLs, how many are involved, and so on.
It's not often a very easy solution, unless the original designers had a simple and clear URL structure from the beginning - one that you can match with simple rules.
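A rough sketch of that third option, with made-up paths - only requests for the retired section get rewritten to the lookup script, so valid pages never pay for the extra database query:

# httpd.conf - hand old-section requests to a lookup script (illustrative)
RewriteEngine On
RewriteRule ^/old-section/(.*)$ /redirect-lookup.php?old=$1 [L,QSA]
# redirect-lookup.php (hypothetical) queries the old-to-new table and
# sends the "301 Moved Permanently" status plus the Location header itself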
Yeah, we had to do this with a site with over 300,000 pages and little to no discernible patterns (old item IDs did not match new item IDs) --- headache to say the least.
I loved this article, well done! I like it when my mind gets stretched a little :-)
I wanted to take a moment to address the canonical tag idea because I have some recent experience with the tag working.
1) I think it would be very hackish to keep the old CMS running just to use the canonical tag. You would have to keep it up indefinitely in order to maintain any link juice pass-through, which just seems crazy.
2) I have a client that is on the Yahoo Shopping platform which is allergic to 301 redirects... unbelievable I know - they don't allow them!! Anyway, he switched his URLs and essentially mirrored his content while using the canonical tag to get Google to register the new URL as the main one.
Well, there were a lot of nerves around whether this would actually work because many of the older pages were top-ranking, top-delivering pages, so he couldn't afford to have them lose traffic. When the canonical tag was implemented there was a space of about a week, I believe, where rankings simply disappeared... our stomachs were at our knees. Then just a week later the rankings had reappeared and they were all referencing the new URLs... for some very competitive terms. By all accounts the link juice WAS passed effectively. I honestly do not think they would have ranked as well if the juice had not been passed.
So, the canonical tag turned out to be the ultimate solution to Yahoo's lack of support for 301 redirects. Unfortunately the other system will have to remain online to keep the canonical tags in place and keep passing the juice but we are about to change that as well through a convoluted domain switch (sigh).
Just thought you would be interested in that
Cheers, Ross
hi!
1. I would say there is no reason that a CDN should not cache 301 redirects, because they are permanent - even more permanent than a 200 URL. So talking to the CDN's support is the first thing I would do.
2. It should be no problem to custom-code a redirection cache for the CMS in use, which means having a database table that does the mapping and is called on every request BEFORE the CMS core is loaded (rough sketch below). Of course it would be best if you could specify some cases where you could bypass the additional db query (maybe subfolders only used by the new/old version), but a 200-only CDN would catch many of the requests where a redirect is not needed.
And one other thing: the primary purpose of a CDN is not to deliver a high-traffic site from a single root server. Its purpose is to deliver sites geographically near the customer, without any internet bottlenecks which could decrease ping & throughput. If your client fears performance issues due to 301 redirects, they should invest in their infrastructure.
Could you please post which CDN and CMS you are using?
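Roughly what I mean by point 2, assuming a PHP CMS (the table, connection details and paths are invented):

<?php
// runs before the CMS core is loaded
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

// only the migrated section needs the extra lookup
if (strpos($path, '/old-section/') === 0) {
    $db = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
    $stmt = $db->prepare('SELECT new_url FROM redirects WHERE old_url = ?');
    $stmt->execute(array($path));
    if ($new_url = $stmt->fetchColumn()) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: ' . $new_url);
        exit;
    }
}

// no redirect found - boot the CMS as normal
require 'cms-bootstrap.php'; // hypothetical front controller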
Just did a major URL rewrite for a client and ran into this exact problem - the 301 requests came all the way back to a single server and eventually crashed the server - lesson learned there :(
No experience with CDNs but the client's IT guys did find some software to help manage this problem - will have to find out what they used and let you know.
Yes, it happened to me also.
Whoo hoo Will et al! I had to read and reread this entire post and its comments twice just to understand the gist of what y'all were saying.
I LOVE being stretched like this. Keep these kinda posts coming!
One trick I did is to modify our 404 error page. Before sending the actual 404 code, I check the requested URL to see if it is included in an array of old URL -> new URL mappings. If so, I do a 301 redirection to the new URL and return a 301 code instead. It's easier to modify the array from my CMS (one page to modify) than to modify the Apache config file and restart the web server, or to create a new .htaccess file.
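In rough terms it looks like this (PHP assumed, and the array contents are just placeholders):

<?php
// at the top of the 404 template, before the 404 header is sent
$old_to_new = array(
    '/old-category/widget-review' => '/reviews/widget',
    '/2008/06/some-old-post'      => '/blog/some-old-post',
);

$requested = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if (isset($old_to_new[$requested])) {
    // known old URL: send a 301 instead of the 404
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $old_to_new[$requested]);
    exit;
}

// otherwise carry on and serve the normal 404 page
header('HTTP/1.1 404 Not Found');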
taking a step back to break down the problem:
* lots of different organisations/departments/applications handle the request/response at various points in the stream
* lots of different stakeholders (organisations/departments/applications) need to be able to specify 301s
So my solution ...
* Have a system to submit potential 301 redirect 'entries' - manually by staff / dynamically via miss logs etc
* Have a single system to manage validation of potential 301s into validated redirects (admin panel on an intranet / svn repository)
* Reporting app outputs list of valid 301 redirects
* Publish it to all interested/authenticated parties
* Each stakeholder along the stream can 'claim responsibility' for a particular redirect entry (or set of them)
* Each stakeholder along the stream deals with it in their own way
* Over time, all 301s should gradually move as far up the stream as possible
* Everybody knows what's gwan on.
---
I am a one-man-band - so my current solution is to have my single place be a simple block of text, in the format of an Apache mod_rewrite RewriteMap for the 301s, on a Trac wiki page - with comments to explain.
https://trac.edgewall.org/
https://httpd.apache.org/docs/2.0/mod/mod_rewrite.html#rewritemap
Then, after a change, copy and paste it into a processing tool that removes the comments and picks out the regex patterns - for speed.
Then just upload it into a defined location on the server/servers
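The wiki block itself is just the plain-text RewriteMap format - something like this (URLs invented; whether the keys carry a leading slash depends on how your RewriteRule captures the path):

# -- product pages renamed June 2009
/old-products/widget      /products/widget
/old-products/gadget      /products/gadget
# -- press section merged into /news/
/press/2008-launch        /news/2008-launch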
just my thoughts
I have seen some online SEO audit websites report that a site has an IP canonicalisation issue. Does IP canonicalisation also affect a website's ranking & traffic, and if so, how do we solve it?
Hi,
I've worked at a major publisher for a couple of years and this kind of thing always comes up. People are always buying a brand new shiny CMS or deciding to rebrand onto another domain. It's a fact of life.
My take on this is that there is a pretty cool commercial opportunity here. Providing a redirect service in a proper SOA way, with calling modules which could plug into a variety of CMSs, would be a great service for agencies to provide and would make the migration of domains/platforms much easier to handle when talking to a client. An SOA approach would allow consistent redirects between all the CDNs, and would also allow for a central rationalisation of 301 redirects - i.e. redirects A->B and B->C become A->C and B->C. We built a prototype for this kind of system, but never really had the time/funds to go further.
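That rationalisation step is simple enough in principle - a quick sketch (PHP, with the function name and data purely illustrative):

<?php
// collapse chains so A->B, B->C is stored as A->C and B->C
function collapse_redirects(array $map) {
    foreach ($map as $from => $to) {
        $seen = array($from => true);
        // follow the chain until we hit a URL that isn't itself redirected
        while (isset($map[$to]) && !isset($seen[$to])) {
            $seen[$to] = true;
            $to = $map[$to];
        }
        $map[$from] = $to;
    }
    return $map;
}

// e.g. array('/a' => '/b', '/b' => '/c') becomes array('/a' => '/c', '/b' => '/c')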
I can identify partially to your situation as I am currently facing a client's webmaster who refuses to create 301 redirects in order to keep the server configuration clean. It's like talking to a brick wall.
Wow... I think I just got smarter...
Great post. It would be a lie to say I got all that, but you clearly expressed a problem I have been fast approaching. Thanks for taking the time to post!
Good stuff Will. I have been working with a large news site and the CDN issues have been tricky to say the least. We are using the canonical tag to deal with the mirror url issues and it seems to be working but the jury is still out.
Great post Will. An amazing resource of information and very helpful - I'm sure I will make it a reference in dealing with any clients who use CDNs.
Thanks for sharing!!! :)
Even though this post is over my head, I am glad you wrote it. The IT guy at my company hates 301s, and he doesn't see them as permanent. He thinks they should only be there for a few months. He talks about space and clutter being a problem. So, I'm keeping a file of everything I can find about 301s and attempting to understand all of the technical jargon. This post, while it concentrates on a massive re-write solution, further proves my point that we need to keep the 301s up forever.
Yeah - I've had that conversation a few times. Client: "how long do I need to keep 301s in place?". Me: "the clue's in the 'permanent' bit of the name".
It's a nice line but - for those others reading that may not know - the permanent bit means that the resource bearing the link should update to link to the new location. As opposed to temporary (302) where the link should not be updated as the "final" location of the content has not been established.
301 is moving house
302 is moving into a bed&breakfast whilst you're looking for somewhere permanent to stay
If all resources updated their links then you could genuinely drop your 301 at some time in the future. Though you'll always miss a Christmas card from your long-lost cousin somewhere down the line.
I've seen requests for redirected URLs still being made at least three years after the redirect was put in place.
I am guessing that once you are absolutely sure that no other sites link to an old URL, and no-one has the URL bookmarked in their browser, the redirect could be dropped in favour of a plain 404 for the URL - but I wouldn't want to take that chance.
In any case, most times the set of redirects for a whole site amount to only a few dozen lines. As long as they are very specific in their scope, there's no real performance loss by leaving them there.
Thanks for all of the comments on the issue. It only backs up what I already knew to be true. There are a few people on my side about this, so I think it is a battle we will win. =)