Hello, I am Anthony Skinner, the CTO of MozLand!
Many of you were affected by several SEOmoz tool issues that happened last week, unfortunately all colliding into one colossal day of craziness on May 3rd. We want to first apologize for any inconveniences or problems that these issues caused you.
The good news is that our awesome engineers fixed these problems quickly, but we want to share an update of what happened, how we fixed it, and what we’re doing to prevent "colossal days of craziness" from ever happening again. So, here’s the inside scoop (y’all know we like that whole transparency thing 'n stuff ;-)).
So, down to the nitty gritty of what happened last week and where we are now....
Status | Issue |
Fixed | Rankings – Rankings were delayed by a couple of days for all customers due to intermittent outages in our database. This delay caused custom reports to go out without rankings data. Fix: After trying it the hard way, we had a eureka moment (in the shower, no less) and promoted our back-up disks to primary, resolving the problem almost immediately. Why it won’t happen again: We had planned for SSD failures, but did not expect to see a full cluster failure at one time. Going forward, we’ll be looking at making sure we’re using SSDs appropriately, and, when we do use SSDs, having more robust failover plans in place. We’re also changing the way custom reports are built to speed up the process, and enhancing custom reports to wait on dependencies. |
Fixed | Slow Open Site Explorer CSV reports and failing Mozscape API calls – The Mozscape API was running noticeably slower and reports weren’t finishing. We found two export jobs that were continually re-queuing themselves, severely backing up the CSV reports queue. Fix: We fixed the condition causing the re-queuing and made some adjustments to the load balancing on the servers. Why it won’t happen again: To prevent the queues backing up in the future, we’ve added a hook to prevent failed jobs from re-queuing. Monitoring and alarms have also been added to notify our ops team if these queues start backing up. |
Fixed | Campaign Setup and Custom Crawl – Users were running into an error message when trying to create new campaigns, and some users were seeing a dramatic reduction in the number of pages crawled. Fix: With some creative ops magic, our engineers were able to configure the proper permissions and get campaign creation working again. Truncated crawls were caused by a race condition. We made the transition between finalizing the crawl of a campaign and scheduling the next crawl smoother, which resolved this race condition. Affected campaigns were re-crawled so users could receive a full weekly crawl. Why it won’t happen again: We’re working to do better testing at scale and to create more defined unit tests to catch these types of race conditions, which don't appear in small-scale testing. We’re also working on better monitoring around the campaign crawl service and decoupling campaign creation from the custom crawl service, so back-end crawler problems will not have such a dramatic effect on the usability of the rest of the SEOmoz PRO app. |
Fixed | Delay in SEOmoz PRO Web App picking up the new index – Our latest index update wasn’t reflected in the SEOmoz PRO web app right away. Fix: We redeployed an old endpoint in our API that we had been using for campaigns to pick up the new index metrics. We also updated the PRO software to use the new endpoints that the Mozscape API now supports. Why it won’t happen again: We updated our release procedures, and also updated the PRO app to use a new Mozscape API endpoint that publicizes the index launch date. This improvement will mean much smoother updates to Mozscape API campaign metrics in the future. |
Fixed | Social – PRO users trying to connect their Facebook accounts were receiving an error message. We were getting odd data back from the Facebook API indicating that users' authentication data had expired - like 25 years ago :). Fix: We’ve updated the Facebook connection to return the correct time format. Why it won’t happen again: To be honest, we’re not sure it won’t... We’ll try to stay on top of changes in Facebook and update our app before the changes affect our users. |
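For the curious, here is a minimal sketch of the general idea behind the CSV export fix: a worker that parks failed jobs instead of putting them back on the queue, plus a simple depth check that raises an alarm when the queue starts backing up. This is an illustration only, not our production code. The names `Job`, `ExportQueue`, `failed`, and `alarm_depth` are hypothetical, and the in-memory queue stands in for whatever queueing system is actually in use.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Job:
    """A hypothetical export job: a name plus the work to run."""
    name: str
    run: Callable[[], None]


@dataclass
class ExportQueue:
    """Toy in-memory queue (assumed design, not the real system)."""
    alarm_depth: int = 100  # assumed threshold for "backing up"
    pending: Deque[Job] = field(default_factory=deque)
    failed: List[Job] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)

    def enqueue(self, job: Job) -> None:
        self.pending.append(job)
        # Monitoring: record an alarm once the backlog reaches the threshold.
        if len(self.pending) >= self.alarm_depth:
            self.alarms.append(
                f"queue depth {len(self.pending)} >= {self.alarm_depth}"
            )

    def work_one(self) -> None:
        """Run the next job; a failed job is parked, never re-queued."""
        if not self.pending:
            return
        job = self.pending.popleft()
        try:
            job.run()
        except Exception:
            # The "hook": failures go to a side list for a human to
            # inspect instead of looping back onto the queue forever.
            self.failed.append(job)


def always_fails() -> None:
    raise RuntimeError("export blew up")


queue = ExportQueue(alarm_depth=2)
queue.enqueue(Job("bad-export", always_fails))
queue.enqueue(Job("good-export", lambda: None))
queue.work_one()  # bad-export fails and is parked
queue.work_one()  # good-export runs normally
```

In a real system the alarm would page the ops team rather than append to a list, and parked jobs would probably get a capped number of retries before being set aside.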
We’re also going to be putting some of the new funding (read the memenouncement here) towards making sure things like this do not happen again. We’re investing in infrastructure improvements (blog post to come) both to help keep things running smoothly and to bring you new features and better stability all around. We’re also hiring... if you’re a brilliant, motivated SEO-lover, apply here.
Again, many apologies for the inconvenience this caused all of you. We’ve learned a lot in this process and will keep doing our darnedest to keep things running smoothly.
It’s good to see the transparency from your side, and I am glad to know how much value you provide to customers. Good work… Keep it up!
Thank you for sharing the back story, exports are much smoother now! Appreciate the honesty :)
So how are things looking for May 29th? Expecting any delays with the next update?
Hey Kevin,
We are on track. Knock on wood and Three Hail Marys!
Good to hear. Thanks!
Thanks, Anthony. I'm going to save this post as a positive example of how to go about addressing brand community/consumers in the wake of an 'issue.' What's better than knowing something won't happen again is knowing it will be transparently, speedily, passionately and efficiently addressed. Another hat tip to the Mozsquad.
This is great. I am very glad to see that I am using a platform where nothing is hidden. It really creates a lot of faith in the community.
Honesty is the best policy!
Thanks for the update guys, very transparent (and helpful).
I have always liked the way SEOMOZ answers things truthfully. They don't try to hide things. This isn't the first time something went wrong, but at least they admit it and take action. Good job guys.
"The truth will set you free."
This is great. I appreciate your honesty and forthrightness.
Thanks for the update Anthony! I was wondering what was happening with the campaigns.
Just letting us know ... this is why you guys ROCK!
You're right about us liking the transparency. While it promotes confidence, to me it is also enlightening to have an insight into the problem-solving process and the development of safeguards for the future.
While we all have a great appreciation for the massive data stream SEOMoz is able to harvest and provide, it is fascinating (at least for me) to have a glimpse into how that magic is accomplished.
Thanks for the update. Glad to know it's not just my projects that have hiccups
I was one of those affected by the ranking issues. I had a heck of a time explaining to my clients why the reports would be late, but on the other hand, it gave me a chance to expound on the fact that rankings are not a great API, that we should look at increases in traffic instead.
Thanks for the explanation!
Hello Zeph,
We do not want to put you in the situation of having to explain. The team is working damn hard to ensure it doesn't happen again.
I know from experience that the kind of work you do is not easy. You guys do a great job. BTW I meant to write KPI, not API (facepalm). Looking forward to meeting you all in July!
Now that proves that Moz players are smart!
I am glad that you noticed and fixed the really slow OSE report generation. In fact, I left a ticket about that with the help desk…
Great work MozTeam.
Yeah, that's one thing I really appreciate about SEOmoz: transparency when something goes "wrong". Thanks for the update, Anthony, and keep up the good work. :-)
Cheers from Germany
Sven
Thanks a lot for sharing
I just can't get over the transparency of this company. I only wish more companies were honest about their mistakes and fixed them in a timely manner.
Thanks everyone for all the positive feedback! We will always be forthright and will update everyone on our progress. #TAGFEE
Thanks for the update Anthony.
Transparency is always appreciated. Thanks.
Thanks. I still think there is a glitch though: page authority for fairly new pages is not being updated. I don't mean a page created last week, I am talking about pages created maybe 3 months ago. Has anyone else got this issue, and is there any update from Moz? I've not seen this before: pages which are 2-3 months old, with links, still showing PA1, and lots of them.
Hey! Interesting - would you be able to provide a few examples? If you don't feel comfortable posting here, feel free to email me at [email protected]!
Thanks!
Carin
Will do Carin, no one else mentioned this then?
I haven't heard anyone contact us with that issue - I'm interested to investigate your examples!
Hi Carin
This page https://www.darlingtons.com/blog/ has been around for over 3 months and is well linked from other pages on the site, but is still showing PA1.
Same on another site I work on, which has been live for 2 months now, https://www.forthepeople.co.uk: all pages are showing as PA1.
Look forward to hearing from you, can give lots more examples.
Thanks
I searched for www.darlingtons.com/blog/ in Open Site Explorer and you can see that we have not actually crawled the page. Pages that have not been crawled will have a PA of 1.
Looks like the same is happening for www.forthepeople.co.uk - check out the Open Site Explorer link.
Unfortunately, our crawlers have not found these pages to crawl - this latest index was our longest processing index, taking three months total. The crawl data will be from mid-December to mid-February, meaning the data can be stale. We do have a much fresher index in processing right now. It's scheduled to launch 5/29, but we'll release it as soon as it's ready!
Thanks Carin, I appreciate the honesty, though I am a bit concerned that these sites and pages are taking months to get crawled by your tool. It didn't use to be like that in my experience, and both sites and others have links going to them, so I'm not sure why this is happening.