Update: The tool mentioned below is probably more the work of the rest of the dev staff: Jeff, Mel, Mike, and Timmy. They put in a lot of long hours while I was doing stuff like this post.
Here at the mozPlex we've been hard at work on some new tools (see below for exclusive early preview screenshots!). One of the things we try to do is incorporate some of the great advice that's already out there. At the latest SMX Advanced I heard from SEOs and search engine reps alike that we should focus on data, and for goodness' sake not just check your rank! So we've been crunching some numbers... a lot of numbers. And all that data can start to get a little confusing. We've employed a few techniques, including regression analysis, to help us make sense of our data. I actually whipped up an online regression tool to help out.
The particular issue we're trying to understand is illustrated in the image below:
You can see that we're pulling some great data together from some very authoritative sources. However, if you look at the numbers along the right-hand side, you can see that there are a lot of different scales, and if you know anything about any of these metrics, you know that the difference between 5 URL mentions and 50 means something very different than the difference between 20,005 and 20,050.
Suppose we're trying to understand the importance of the number of domain mentions as reported by Google. Likely, more domain mentions across the web means your domain has a greater ranking strength and influence. But how can we make that more precise? Is 100 mentions good? Is jumping from 1000 to 1100 a big jump? What should be 10% and what should be 90%? If you're a savvy SEO (and I actually know a few around here and abroad) you can come up with some examples:
| Domain Mentions | Value |
| --- | --- |
| 1,890,000 | 100 |
| 1,280,000 | 100 |
| 866,000 | 100 |
| 659,000 | 96 |
| 584,000 | 94 |
| 247,000 | 80 |
| 115,000 | 65 |
| 32,500 | 45 |
| 13,400 | 30 |
| 11,300 | 28 |
| 6,590 | 15 |
| 218 | 5 |
| 4 | 1 |
The idea is to come up with some equation, a model, that matches the pattern we observe -- in this case, what you might intuitively believe as a smart SEO. You can check out what my tool will suggest for this data.
You'll notice that, in addition to the specific model I recommend, I also include a couple of graphs. If you've done this before, you know how important it is to get a feel for what your data looks like and how your fitted model compares. I also include a graph of the "residuals," which are the errors in the model's estimates. For instance, if you ask for 80% at 247,000 domain mentions and the model predicts 75.55, the residual is 4.45. It's often valuable to look at the square of the residual for statistical reasons (that's the "residual sq" column in the table); squaring also emphasizes larger errors.
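If you want to reproduce this kind of fit outside the tool, here's a minimal sketch in Python (assuming numpy and scipy are available; the logarithmic form is just one candidate model, not necessarily the exact algorithm the tool uses):

```python
import numpy as np
from scipy.optimize import curve_fit

# Observations from the table above
mentions = np.array([1890000, 1280000, 866000, 659000, 584000, 247000,
                     115000, 32500, 13400, 11300, 6590, 218, 4], dtype=float)
value = np.array([100, 100, 100, 96, 94, 80, 65, 45, 30, 28, 15, 5, 1], dtype=float)

# One candidate model: value = a * ln(mentions) + b
def log_model(x, a, b):
    return a * np.log(x) + b

params, _ = curve_fit(log_model, mentions, value)
predicted = log_model(mentions, *params)

residuals = value - predicted        # e.g. 80 - 75.55 = 4.45
residual_sq = residuals ** 2         # the "residual sq" column
rmse = np.sqrt(residual_sq.mean())   # one number summarizing overall fit
```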
Intro stuff done, let's get advanced! For this particular data you'll notice (go ahead and open that "suggest" link above in a new tab) that our observations are capped at 100. This is an artifact of the 100-point scale, and it poses some mathematical difficulties for simple modeling techniques. You'll also notice that the suggested model does quite poorly in the middle range. So here are a couple of tips:
If you have a truncated scale, like our 100-point scale here, it's valuable to let your model predict outside the range and truncate its estimates later. To do this you can, for example, drop the extra truncated observations from your data (1,890,000 -> 100 and 1,280,000 -> 100). Just delete those rows from the text box and click that "fit model" button, and you'll get better results; the sketch below shows the same idea in code.
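In code terms, that amounts to fitting on the uncapped rows only and clipping predictions afterward -- a rough sketch under the same assumptions as above:

```python
import numpy as np
from scipy.optimize import curve_fit

def log_model(x, a, b):
    return a * np.log(x) + b

mentions = np.array([1890000, 1280000, 866000, 659000, 584000, 247000,
                     115000, 32500, 13400, 11300, 6590, 218, 4], dtype=float)
value = np.array([100, 100, 100, 96, 94, 80, 65, 45, 30, 28, 15, 5, 1], dtype=float)

# Fit only on observations below the 100-point cap...
uncapped = value < 100
params, _ = curve_fit(log_model, mentions[uncapped], value[uncapped])

# ...then let the model predict freely and truncate its estimates afterward
estimates = np.clip(log_model(mentions, *params), 0, 100)
```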
Also, the algorithm I'm using tries to match all the data points as if they are all equally important. If you feel that some range of your data is more important, just add more observations in that range. For instance, we might want to come up with some more examples between 30% and 65%, since this range is not well fitted by our initial models, and most of our users will probably fall into this range. For example, we might add 26,900 -> 38% and 47,000 -> 50%. With these new observations the model will emphasize this range of the data.
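If you're scripting it, duplicating rows works, and most fitters also accept explicit per-point weights -- a hedged sketch of both approaches (same made-up data and model as the sketches above):

```python
import numpy as np
from scipy.optimize import curve_fit

def log_model(x, a, b):
    return a * np.log(x) + b

mentions = np.array([659000, 584000, 247000, 115000, 32500,
                     13400, 11300, 6590, 218, 4], dtype=float)
value = np.array([96, 94, 80, 65, 45, 30, 28, 15, 5, 1], dtype=float)

# One way: add extra observations in the 30%-65% range so the fit
# emphasizes it (these two points are the made-up examples from the post)
mentions_extra = np.append(mentions, [26900, 47000])
value_extra = np.append(value, [38, 50])
params, _ = curve_fit(log_model, mentions_extra, value_extra)

# Another way: keep the data as-is and weight points directly;
# with curve_fit, smaller sigma means a point counts for more
sigma = np.where((value >= 30) & (value <= 65), 0.5, 1.0)
params, _ = curve_fit(log_model, mentions, value, sigma=sigma)
```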
You should also consider the other possible models, even if they are not "suggested." You might like the behavior of a power model over a logarithmic, for instance. Just click the links for the other models and see the graph of the model and its residuals.
We're using these techniques in-house to score our users' pages, domains, and blogs (check out the screenshot below!), but you can use these techniques to better understand the behavior of data. The next time someone says, "What will happen if we double factor X?", you can turn to your body of experience and mathematical model and tell them, "In general, the importance of factor X falls off on a logarithmic scale. Let me show you graphically..."
Here's just one example of how we're going to use all these nifty models:
So keep your eyes peeled for evidence that college stats still matters!
Playing with large amounts of data is fun; the problem is extracting information that is really useful.
When using nonlinear regression, the equations that best fit the data very rarely correspond to a meaningful model.
Having been through this exercise myself, I came to the conclusion that with some data sets a different and more modern approach is required.
I suggest you look closely at using an adaptive heuristic search algorithm. I wrote my own (which is also fun!), but there are many programs available; search for "genetic algorithm software".
- Michael
You've got a good point. Often with lots of data and complex relationships, if you really want an automated inference algorithm you might be better off with some other approach (such as Genetic Algorithms, Neural Networks, or Support Vector Machines).
However, the advantage of the models in a tool like this is that they are simple. Using these you can often get a good feel for trends in the data. More complex models can easily overfit your data or be very difficult to understand intuitively.
Playing with large amounts of data is fun; the problem is extracting information that is really useful.
w3rd!
ps. how do you do the quotes?
indent button, to the right of strikethrough
This is a terrific tool, and one I think was sorely needed. Prior to it, I'd been using this Simple Regression Utility, but it only dealt with a few different ways of creating the formula, whereas this one goes all out. Way to make the web a better place, Nick!
BTW - For SEOs wondering what to do with something like this, one application that I've done is to grab data about sites, match them up to rankings, and see how the correlation curve fits :) You'd really need a lot more data and a system that took multiple sets into consideration to be truly valuable, but it can be fun to see PR matched up to rankings, link numbers, or Alexa data.
p.s. The tool Nick's teasing above is called "Trifecta" and should be launched a week from today. It's replacing our Page Strength tool and will let you compare blogs, pages and domains separately, based on the factors that affect each.
You make a good point about needing a lot of data and taking into account multiple variables. This tool does single regression. What you're describing is multiple regression. But that requires some linear algebra and a couple of Gaussian eliminations. So I skipped it... for now ;)
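(For anyone curious, if you have a linear algebra library handy you can sidestep hand-rolled Gaussian elimination entirely -- a minimal sketch with made-up feature data, not anything our tool actually does:)

```python
import numpy as np

# Made-up observations: each row is (domain mentions, links, PR) for a page
X = np.array([[247000, 1200, 5],
              [115000,  800, 4],
              [ 32500,  150, 3],
              [ 13400,   90, 2],
              [  6590,   40, 1]], dtype=float)
y = np.array([80, 65, 45, 30, 15], dtype=float)  # target scores

# Add an intercept column, then solve the least-squares problem directly
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```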
It can get kind of complicated, but multiple regression is nice because you can see the correlation between efforts and results.
Good work on explaining what can be a mind-numbing and hard-to-understand subject!
Well, some SEOs already do maintain trended data (e.g. PR, ranks, links of different types and other metrics) for their projects and competitors... So aside from looking forward to Trifecta, this new regression tool unto itself: word.
Yup, we've been playing with that within the company I work for, and even hired some econometricians to do the cool multiple regression analyses and more. The problem always remains that for the "coolest" SERPs you want to run tests like that on, there's a LOT of hand editing going on, and so the data turns out to be less useful...
My head hurts! As the line graph points upwards though, it must be good :)
Just got done with a class for my MBA called "Advanced Statistics" that makes all of this sound really familiar! :) This was a nice analysis and I can't wait for the new tool.
Who knew that they taught us stuff in school that's like USEFUL! ;)
Thank god they only require me to take stats in my first year of the MBA program... it nearly killed me! :P Such valuable stuff, but man almighty it's hard to wrap your brain around. Good job on surviving "Advanced" Stats!!!!
I suspect you'll garner quite a few edu links with this little tool, that's for sure. Nice work.
There are plenty of other online regression tools out there. I kind of feel bad adding another one to the fray. But honestly, none of the other ones did any graphing, and most wouldn't show you a model's predictions with residuals.
1. I hope this is useful
2. If it's useful, maybe I will get a few edu links ;)
Wow, you guys really have been busy :)
Will this tool be available free or will it just be for Pro members?
Like Page Strength, it will be free to run 1-2X per day, then require a PRO account to access more.
Nick, you rock. Great post and great tool.
If you want to come and work in London at any point, let me know (sorry Rand).
You've now killed my productivity twice in one morning - between this and your email.
Have you guys checked out SEOintelligence? It's another great tool to take a look at for regression analysis and SEO tracking purposes.
Aren't a lot of these variables collinear, and as such "hard" to calculate with, since individual predictors may have weird effects on the outcome?
I wish someone would answer this question from joost.
Hi,
Do you have another concrete example, for instance putting together the metrics from a comparison of domains in Linkscape?
A step-by-step help page would be helpful.
Thanks,
You could really just grab Excel and do some regressions with the data points made available via the tools on here. A quick Excel regression tutorial like the one at https://phoenix.phys.clemson.edu/tutorials/excel/regression.html can get you rolling.
I did not understand; I'm a bit confused.
A regression analysis tool is available in Microsoft Excel as well, under Tools -> Data Analysis -> Regression. Why are you creating a new one?
When using regression analysis, you should look at the significance value as well as the standard error. And make sure not to include two factors that have a high level of correlation between themselves, like PageRank and number of links. Otherwise you may end up with an absolutely wrong conclusion.
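(A quick way to catch that kind of overlap before fitting is to check the pairwise correlation between candidate factors -- a small sketch with made-up PageRank and link-count columns:)

```python
import numpy as np

# Made-up factor columns: PageRank and number of links for some pages
pagerank = np.array([5, 4, 3, 2, 6, 7], dtype=float)
links = np.array([1200, 800, 150, 90, 2100, 5300], dtype=float)

r = np.corrcoef(pagerank, links)[0, 1]
if abs(r) > 0.8:  # rule-of-thumb threshold, not a hard rule
    print(f"Highly correlated (r={r:.2f}); consider dropping one factor")
```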
You make a great point about Excel's tools (as long as you install the Analysis ToolPak). I've done a lot of work with it.
However, I found that for the application where I'm trying to take a set of known data points and extrapolate between them, it's nice to have a single, free tool which computes several models, along with RMSE, and graphs the functions with one click.
I know this is an old thread now, but... do you know of (or are you developing) a non-linear regression tool that does high-order (6th-order) polynomials?
I would love to find some PHP app that does high-order polynomials.
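(Not PHP, but for a sense of scale, a 6th-order polynomial fit is nearly a one-liner in Python with numpy -- a sketch with made-up points; high-order polynomials chase noise easily, so treat the fit with care:)

```python
import numpy as np

# Made-up sample points, just for illustration
x = np.arange(1, 11, dtype=float)
y = np.array([2, 9, 30, 70, 140, 250, 410, 630, 920, 1290], dtype=float)

coeffs = np.polyfit(x, y, deg=6)  # least-squares 6th-order polynomial
poly = np.poly1d(coeffs)
estimate = poly(4.5)              # evaluate the fitted polynomial
```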
Sounds really interesting, Nick. I'll be curious to see how you work the results into the new tools. I've actually been playing around with some multivariate regression models for SEO lately, and it's really starting to sink in for me just how widely the impact of different variables varies across industry, competitive landscape, site size, and other situational variables.
This takes me back to 2nd year stats analysis class.
Who knew I'd ever need to use that stuff again, and here you go taking it all a step further.
Fab! Well done and THANKS!
This is great... off the charts in mind-numbingness, but in a good way. And I'm sure this is just the start of seeing the tools stepping up to the next level.
Although this stuff has always given me brain cramps, it is extremely valuable (don't tell my stats prof I said that :) ). It's invaluable having a way to graphically express your point, especially to those who may not have the same in-depth understanding of the environment.
Some may argue with me on this one... but I always found Excel to be pretty easy to use for this type of analysis as well.
Great tool!!
Hi Rand, I don't know if it's just me, but the results I have been getting from the SEO Pro Tools are very poor. For example, the backlink anchor text analysis only gives a few results (50) for a site that has thousands of links, and the same with others; I have mentioned this to you before. Please look into this issue. Thanks.
We have been looking into it and I think our solution, at least with backlink anchor text, is going to be to use some different systems for grabbing data. That's part of our October release as these things take a long time to develop if you want them scalable.
In the meantime, though, the Trifecta tool is launching next week, and to date, it's given us extremely fast, very accurate results, which we expect will make it one of the most valuable tools in the collection very quickly.
Hmm... interesting