When we’re working on fixing an immediate problem, especially one that’s affecting customers, it’s difficult to stop and take a breather. But sometimes, a breather is exactly what is needed to solve the issue.
One Step Back
Last month was a bit rough for our Big Data team. We spent most of the month heads-down fixing issues with Rankings and Keyword Difficulty, and our technical debt was creeping up on us. I wanted to give in to my natural urge to hunker down, chew on the issues, and come up with a plan that would fix as much as I could. However, I had a weekly 1 on 1 meeting scheduled that seemed to be getting in the way of my plan to lie low and problem solve.
Here at Moz, each employee attends weekly or bi-weekly 1 on 1 meetings with managers or teammates to help keep our goals on track. 1 on 1 meetings are a chance for teammates to act as sounding boards for project ideas and idea generators for solutions to issues. These meetings are an important part of our culture, but on this particular day my focus was elsewhere and I didn’t feel I had time for my 1 on 1 with Matt Peters, our rock star data scientist. Realizing that we had missed our last meeting, I begrudgingly made time to fit the meeting in. After our usual good talk on algorithms, correlations, and next steps for growing his team, we started bouncing ideas off each other on how to save money on processing. We were spending $800,000 a month on processing and not really getting anything for it. The current plan was simply unsustainable.
Matt, in his very scientific way, broke down the problem in exact numbers. I, however, will break them down for you in a very Anthony way:
- Long-term, we knew we needed to fix the issues we were having with Amazon, but we were reacting to missing our index release date instead.
- Short-term, it seemed sensible to spin up more servers and get the index done more quickly.
- In reality, spinning up more servers at Amazon was only increasing our costs and our server failures. The current solution was not only failing to address the problem, but in some ways it was making the problem worse by taking time away from the team’s efforts to fix the long-term issues.
Taking a step back from the immediate problem made it clear that our current approach wasn’t working.
*Server photo by Kim Scarborough used through creative commons license.
Coming Up with a Better Plan
After the insight I gained in my 1 on 1 with Matt, it was clear we needed to change our approach. Matt and I outlined a high-level plan for lowering our costs with the added potential bonus of getting indices out on time. We figured it might be a hard sell after telling the team, “Don’t miss the date at all cost,” for the last two months. They'd spent hundreds of hours trying to keep all of those servers up, and we weren't sure how open to this change they would be.
However, Carin, our stellar Manager of Big Data, brought the team together and we all agreed on the plan. Carin outlined the issues and then proposed the new approach in this snippet from her email to Rand:
The New Plan:
- Run two indexes at most in AWS:
  - One cluster on 80 cc2.8xlarge machines - these are HUGE and more expensive, but should complete an index in less time, making them cheaper over the month.
  - If necessary, run a backup index on 200 smaller c1.xlarge machines (current setup).
- Continue to maintain an index size of 60 - 70 billion URLs to keep processing time reasonable.
This plan allows for engineering time to tackle the larger problems: develop a testing environment and improve the Mozscape code base. Most importantly, though, we can distribute PLDs across processing shards in a more efficient manner, which could lead to significant time savings in processing.
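The reasoning behind "more expensive, but cheaper over the month" can be sketched with a little arithmetic. This is only a rough illustration: the machine counts (80 and 200) come from the plan above, but the hourly rates and the run time are made-up placeholders, not the prices we actually paid, and the function names are hypothetical.

```python
def cluster_cost(machines: int, hourly_rate: float, hours: float) -> float:
    """Total cost of running `machines` instances for `hours` at `hourly_rate` each."""
    return machines * hourly_rate * hours


def break_even_hours(big_machines: int, big_rate: float,
                     small_machines: int, small_rate: float,
                     small_hours: float) -> float:
    """Run time at which the big cluster costs exactly as much as the small one.

    If the big cluster finishes an index in fewer hours than this, it is cheaper.
    """
    return cluster_cost(small_machines, small_rate, small_hours) / (big_machines * big_rate)


# Machine counts are from the plan; everything else is a placeholder assumption.
BIG_MACHINES = 80        # cc2.8xlarge cluster
SMALL_MACHINES = 200     # c1.xlarge cluster (current setup)
BIG_RATE = 4.0           # hypothetical dollars per machine-hour
SMALL_RATE = 1.0         # hypothetical dollars per machine-hour
SMALL_HOURS = 400.0      # hypothetical hours for one index on the small cluster

limit = break_even_hours(BIG_MACHINES, BIG_RATE, SMALL_MACHINES, SMALL_RATE, SMALL_HOURS)
print(f"The big cluster is cheaper if an index finishes in under {limit:.0f} hours")
```

The point is that a pricier machine only has to finish enough sooner to pay for itself. Failed machines and re-runs, which were our real pain, would push the small cluster's effective hours up further.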
Two Steps Forward
Luckily, Rand approved the plan, and the time and energy spent to take a step back really paid off. Newer, better, bigger equipment did the job, with no server failures and no operational headaches. The October index release is the result of the change. It finished in record time and only cost $100,000, compared to the $800,000 spent last month.
We learned quite a few things from this experience, but this was our most important takeaway: the times when you feel like you don’t have time to step back and reassess are exactly the times when you should. It may not always save you $700,000, but there is a chance that it might. The time spent gaining a new perspective can bring solutions to light that you’d have never seen if you’d kept that nose to the grindstone!
We are hopeful that future indexes run as smoothly as October's did, and if they don’t, we'll remember our own advice and take a step back before moving forward.
"We're going to spend $700,000 less per month with you than we used to" would be a fun conversation to have with AWS.
Thanks for the update. Index freshness is extremely important to us.
We understand the importance of freshness. The Big Data team is putting in hours to improve what we have and architect a new system that is updated daily and not monthly. A blog post is soon to come!
"Luckily, Rand approved the plan" -- Great when the decision makers, are quick to understand and approve such measures. +1 Rand.
For sure! Took Rand all of about three minutes to approve. Rand has been extremely supportive.
Thanks for your transparency on the issue and the reflection. Hopefully Moz has a program in place to reward team members who save the company money :)
Agreed. Rand is Mr. TAGFEE, he is incredible at rewarding all of us!
I realize that it was the time of confusion and reflection that always "generated" better direction and confidence in my life and business.
After the storm there comes the clear sky and new meaning to what you do.
Max,
I am in total agreement. We are doing our best to implement better systems and best practices. We are looking forward to fewer storms in the near future.
You were spending $800k in a month? Wow. If I've got my maths right, that's over EIGHT THOUSAND* PRO members needed to sustain it?
* Sorry but this came to mind as I typed that...
Seriously though, $700k per month is an unbelievable saving. Sounds like Matt is the SEOmoz hero. Equally though, good on you, Anthony, and the management at SEOmoz in general for taking the time to listen to your employees' suggestions and feedback. With the way things are these days, it's often the staff - not the management - who have more "hands-on" experience in things and are therefore possibly in a better place to advise and to give recommendations. It's not always the case, but I've certainly seen it happen a few times IRL. Either way, imagine if he hadn't passed it on or you hadn't taken it on-board? I bet it's scary to think about now!
So, what are you going to do with your $700k? Create a real-life Roger? Just putting it out there...
Don't want to change the vibe here, but Steve has a great idea here! You should really create Roger action figures (or even fully-operational robots) for us to give our SEO friends for Christmas...
Also, this post is great in that it describes dealing with an elephant in the room instead of letting it hang around to continue wreaking havoc. It's much better to tackle the problems when they first hit even if it means taking a step back :).
Great news, thanks for sharing with us!
Nice video. I like the Oprah version as well. You are correct we need a lot of users to support that kind of spend. However, our main goal was to ensure we didn't disappoint any of our customers with late updates regardless of cost. Thus, we tried to hedge our bets with more servers. Turns out all we really needed was for Amazon to provide really big quality servers!
Roger action figures would be pretty cool. I know my kids would love them. We might have missed the holiday season, but I will pass on the idea to Rand.
Thanks for the great insight into the inner workings. I love reading these kinds of posts, especially as we go through our own growing pains.
As I tell my kids, sharing is caring. LinkedIn has been doing a great job of sharing on https://highscalability.squarespace.com. I have reached out to them to understand some of their pain. If you want to share, we are here! Love to hear more.
"It may not always save you $700,000, but there is a chance that it might."
Sentences like this never fail at reminding me how small of a business I run...
Hey Kade,
No such thing! Many of us have tried running a small business and have not been able to pull it off. Congrats for running a successful business.
I love that technology solved a problem money was having trouble with. Sometimes it's as simple as backing up, moving forward, and learning from that mistake. As you said, the important lesson is to do the same thing next time you're at one of these headbanging moments.
+1 for technology!
This is a great article on taking the time to understand a problem and then fix it rather than just throw money at it and keep banging your head on the wall.
We did try to spend our way out of the problem first. It was simply unsustainable. We only have one index out using the new machines. Now we must play the wait-and-see game to find out whether subsequent releases are as successful.
Amazing to see the cluster effect on savings.
Awesome article and incredible outcome. Sometimes it takes courage to pause, take a step back and then propose bold, fresh ideas in the midst of business stress while striving to deliver the best service possible. Sounds like the reward in this case was well worth the risk. Congrats to you and SEOmoz for having the courage to take the risks necessary to reap the substantial rewards.
Hey Rick,
Risk vs Reward is something the Big Data Team and Carin discussed on multiple occasions. As a team they really hate having a late index due to hardware issues out of their control. They take it pretty hard.
I think we found a good cost effective balance.
Part of the challenge with keeping your nose to the grindstone is that you're busy planting trees and can't ever see what the forest looks like. I tell executives I coach that they have to learn to fly at 30,000 feet so they can see the forest and get the big picture, and after they see what their efforts have produced to this point, they can swoop back down and keep on planting the way they had or change direction.
Nice post.
Good advice! I think many times in technical leadership roles it is hard not to do what comes naturally. We love being in the weeds and knowing every detail. Luckily, we have a great staff which makes it much easier to fly at 30,000 feet.
Throwing money at a problem does not ensure it goes away. But most of the time, there is a money component to solving the overall problem. This is a great example of real-life problem solving. Would love if you would share more examples like this!
We are going to share more about what we are doing on the tech side. More posts are scheduled. Admittedly, I was afraid of boring everyone.
Not boring at all. There are many big data articles, but they never include as much insider access. Thanks Anthony.
I'd love any insight you had to share about the issues you were having on AWS. I presume you had a fleet of c1.mediums? Were the issues related to the instances themselves or the code you were running on them?
Hey Dave!
We actually were running multiple clusters of the c1.xlarge machines, and the chance of a machine failing in any one cluster was really high.
We had a few problems with file corruption when uploading to S3, but for the most part our problems were not software related - mostly hardware failures. We have run into a few problems with the software not scaling as linearly as we would like as we keep increasing the index size. The team is focused on improving our current software as much as possible while also architecting a more modern system that will have much faster update cycles.
Thanks, Carin
This article is well explained, thank you very much for the information, it is one of the most relevant that I could find
Awesome article. Learned a few things this morning.
I love these kinds of posts! Thanks for giving me a glimpse of a "major" problem / solution - it makes my world seem a lot more manageable!
Great story Anthony and happy to hear it paid off. Excellent advice with "the times when you feel like you don’t have time to step back and reassess are exactly the times when you should."
Thanks for sharing.
Thank you. With a few more runs on the new servers I think we will know if it is a success we can count on long term. Currently the index is flying along on the new hardware.
A lion takes one step back, not out of fear but to attack. I am really impressed by your ideas about business and how to plan for it. Thanks a lot.
This should go on businessinsider rather than an SEO focused blog :) Great one.
Two things stand out Max: First the message you send - take a break to allow the clutter to fall away and inspiration to fill the void. Second but not lessened, the continued MozOpenness - it appears to be the standard MO... ;) Thanks, good post
Amazon (AWS) team is really happy now :)
In the middle of an economic crisis, we should keep in mind that we never take two steps together, but only one at a time. Nice article