Late Sunday night through Monday, the hard drives responsible for most of our rankings system failed. These failures have resulted in slower rankings collection, but have not resulted in lost data.
Update for 5/7/2012-
Rankings are back on-track and up-to-date.
April monthly reports are re-generated, and last week's weekly reports have been re-generated.
We are back to normal. Thanks for your patience over the last week.
What does this mean for customers?
If you’ve set up a customized report, you might not have received it as soon as you expected. If you did receive a report and it included a rankings section, one or more of the subsections might have been blank.
These features were intermittently failing because the data was on the affected drives. As of right now, they should be functioning again, albeit possibly more slowly than normal:
- Campaign ranking history
- Campaign ranking history CSV exports
- The Keyword Analysis Tool
- Custom Reports that include rankings data
For more technical details, please see this post on the dev blog.
When will it be fixed?
We are currently prioritizing the collection of keywords in the order they would normally be collected, which means all rankings collection will be delayed for customers this week. We expect this catch-up process to complete by the end of the week, returning us to normal at that point. Our current expectation is that rankings will be delayed 2-3 days for all campaigns, until we can catch up this weekend.
We are still working to resolve this issue and will update this post every morning until the issue is resolved. We sincerely apologize if these failures affect your work as we know how much you depend on our tools. If you have any questions, please leave a comment below and we’ll try to respond quickly.
Thanks for the info and update.
Just one question, which I know is not just mine: the CSV exports from OSE have been getting really slow over the last few days, sometimes so much that, personally, I find myself deleting my requests.
Is this issue, which is honestly quite painful, especially when you have to audit sites, related to this problem, or is it caused by something else? Something like this also happened a couple of months ago, but I was told it was related to an unusually high volume of link export requests.
Thank you
(Since this is a problem others have had too, not just me, I preferred to use this post instead of sending feedback.)
Hey Gianluca! The OSE export slowness is an unrelated known issue that we're looking into. It's been going on for a couple of days and we're not yet sure of the root cause, but once we find out, speeding it back up will be a top priority! Thanks for asking. I know it's really painful to have to wait a long time.
Oh, so there is an OSE export issue in there? I was trying to export data for a site yesterday and it hung for quite a long time. I started to doubt my browser and thought there might be some issue with my net connection.
I hope you find the fault and fix it. Best of luck!
Couple of days? Since the new interface more like
Adding an update:
The slowness on OSE was caused by an unbalanced load across our API cluster after launching the new index on Tuesday. The spike in traffic brought to light a small configuration change that needed to be made.
Our engineers were able to resolve this around 5pm PST last night, so you should see better page load and export times today!
There was a separate, unrelated issue on Sunday night when our machine writing and exporting the CSVs fell over. Our ops team recovered it quickly, but, unfortunately, there was a big backlog to work through, which caused some weird behavior on reports in flight.
If you're still having any issues with reports not finishing up, let us know and we will look into them!
Thanks for the quick updates!
Your experience with SSD is very interesting. I hadn't realised that they were so bonded to the number of read write cycles, I'd always taken them to be a 'rough guide'. Your mistake will stop me making the same in a few months when we change our network drive over to a RAID NAS box. SEOmoz teaching me stuff I didn't know again :-)
I genuinely find it incredible that a mechanical drive can outlast an SSD! But anyway, thanks for the update.
I've been experiencing extreme problems with OSE as well - I end up deleting about 90% of the reports I'm running because they're just hanging for 1+ days.
Hope you get it fixed soon!
Update for 5/3/2012- Monday's rankings are collected and Tuesday's are progressing as expected. We have started reprocessing the custom reports, and we expect them to be completed and back to normal by Saturday. We will update this post again tomorrow morning.
Update for 5/4/2012- Monday's and Tuesday's rankings are collected and Wednesday's are over 80% complete. April monthly reports and last week's reports are being re-created more slowly than normal, and we expect them to be completed and back to normal by Sunday night.
We will update this post again Monday.
Update for 5/7/2012- Rankings are back on-track and up-to-date. April monthly reports are re-generated, and last week's weekly reports have been re-generated.
We are back to normal. Thanks for your patience over the last week.
Drive failure? Hey.. that's one of our keywords!
Congrats on the funding. It's great to know that SEOMoz will continue to grow.
Lovely way to recover from the hangover of celebrating the investment round! ;) You guys are really under it at the moment and from what I can see the root cause of this is storage that keeps failing...
Out of interest, why isn't it just a case of using RAID storage to make sure you have ample redundancy? (i.e. is it down to cost, i/o, etc.?).
Indeed. :) One of the major projects for our engineering team right now is fixing our current storage problem.
Great question! I had the same thought when the first problem initially occurred. We hadn't been using RAID in these machines because the data is redundant across the cluster, and we don't need RAID for performance because we are using SSDs. Interestingly enough, had we been RAIDing the drives together, we would likely have had the same problem. :-(
The issue lies in using many SSDs with the same load as redundant drives for each other: they will all fail at the same time. SSDs have a very bounded number of R/W cycles until they fail. In a normal RAID configuration, they would get the exact same number of R/W usages, leading to failures at nearly the same time (which was our problem).
Fundamentally, the storage architecture we were using was one made for spinning disk drives. We wanted it to be faster, so we used SSDs. However, that actually introduced a reliability risk: because our load is very balanced among the servers, the SSDs (unlike spinning disks) all "wore out" at the same time.
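To make the wear-out point concrete, here is a minimal sketch in Python. It is only a toy model, not our actual monitoring code: it assumes each drive has a fixed write budget and that wear-out is the only way a drive fails. The function names, the endurance figure, and the daily write loads are all made-up numbers for illustration.

```python
def days_until_failure(endurance_writes, daily_writes):
    """Whole days of use before a drive exhausts its (assumed fixed) write budget."""
    return endurance_writes // daily_writes


def failure_spread(daily_loads, endurance_writes=1_000_000):
    """Gap in days between the first and the last drive wearing out."""
    days = [days_until_failure(endurance_writes, load) for load in daily_loads]
    return max(days) - min(days)


# Hypothetical loads: four drives written evenly vs. four drives written unevenly.
balanced = [1000, 1000, 1000, 1000]
uneven = [400, 800, 1000, 1600]

print(failure_spread(balanced))  # 0: every drive wears out on the same day
print(failure_spread(uneven))    # 1875: failures are spread far apart
```

With perfectly balanced writes, the redundant drives all reach their limit together, so there is no window in which to replace one while the others are still healthy.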
The end-of-month reports should come out on the first day of the new month. Stop selling new subscriptions and clean up your systems before you grow any more. It's all well and good to keep bringing on new customers, but servicing existing customers should be done first.
Thanks for the heads up, just noticed this as I had my email informing me of my new keyword rankings but once logged in was seeing old rankings from 26th April. Hope you fix it soon!
You're welcome. It should be fixed by the end of this weekend.
I am hereby blaming this on Penguin (it seems like a good bandwagon to jump on at the moment). Curse you, Google, for taking away our SEO tools!
For those wondering about SSDs and failures, you need to understand that SSDs are just large Flash RAM arrays, just like USB thumb drives. They are designed to be semipermanent (meaning that, unlike their regular RAM brethren, they don't need power to store data) and, because they're not spinning platters, they are vastly faster than HDDs. The tradeoff is that they have a limited life span. So for something that is high I/O but needs speed, you're looking at replacing SSDs on a regular basis.
Technical stuff https://www.storagesearch.com/bitmicro-art1.html
Appreciate the transparency, problems happen but thumbs up for your efforts to resolve them!
What's happening with the anchor text in OSE? It still shows a notice that the data is from Feb.
In order to fix a problem and ship the index sooner rather than later, we decided not to update the anchor text this time around. We figured that our community would rather have fresh data with old anchor text than no update at all. Currently, the anchor text is scheduled to update between 4/30 and 5/9. I apologize for any problems this may have caused you.
Perfectly understood!
Thanks for the update - will get round to looking over the massive index soon.
Thanks for the update. I was trying but with no joy. I had no worries though, I knew you folks were on top of it. It's one of the bumps on the road when you're dreaming big, don't stop.
Hi Thomas,
Thanks for keeping us up to date on what is happening.
To be honest, with all the Google rollouts and volatility in the SERPs during the past couple of weeks, the new timing of reports for one particular client was actually a Godsend :)
Sorry it caused a lot of extra work for you guys and anxiety for some, but I'll happily grab the silver lining which will give me updated rankings exactly when I need them this week!
Sha
Hello,
I was wondering if this issue extends to the crawler diagnostic tests as well?
I created a campaign last week and made the necessary changes based on what the crawler found. However, when the crawler ran again this week, there was no change whatsoever in the reports. It is totally flat.
More specifically, with the duplicate content and duplicate title errors, is there something I have to do so that rogerbot sees these changes? Most if not all missing page titles and duplicate page titles are definitely fixed. Does rogerbot crawl fresh each week, or does it crawl old pages that it indexed before? I was expecting the number to change a little at least.
Just wondering if these reports are delayed as well, or if there is something I am missing so that rogerbot does not continuously crawl old or removed pages from the site.
Thanks!
Hi Erik,
Best to send an email direct to the Help Team if you have any problems with your campaigns. Send email to help at seomoz.org and be sure to give them the Campaign number (in the URL when inside your campaign) or the domain so they can take a look at it for you.
Hope that helps
Sha
Thank you Sha. I appreciate the info.
With the recent round of funding and the growth you guys are experiencing, I'd love to see some serious money invested in technology. Recently it seems like a lot of services are down, delayed or slow. It would be really awesome to see fewer hiccups.
We are in complete agreement. Pre-funding, we started adding staff to reduce customer-impacting issues. Further, we are in the process of building our own data center at a co-location facility, which will give us better control over the quality of hardware.
This is our first week since signing up, and I am not impressed. I cannot understand why you have no backup facility for such an event. All serious businesses run a risk assessment and what-if scenarios, which you seem to have failed to do.
Thanks for your comment FFTCOUK. Our engineering team is working hard to improve the process by which we collect and store rankings for all our customers. We understand that an outage like this can affect our customers, and that's why we are working hard to solve the issue and prevent it from happening again.