Given this blog's readership, chances are good you will spend some time this week looking at backlinks in one of the growing number of link data tools. We know backlinks continue to be one of the most important parts of Google's ranking algorithm, if not the most important. We tend to take these link data sets at face value, though, in part because they are all we have. But when your rankings are on the line, is there a better way to determine which data set is best? How should we go about assessing link indexes like Moz, Majestic, Ahrefs, and SEMrush for quality? Historically, there have been four common approaches to this question of index quality...
- Breadth: We might choose to look at the number of linking root domains any given service reports. We know that referring domains correlates strongly with search rankings, so it makes sense to judge a link index by how many unique domains it has discovered and indexed.
- Depth: We also might choose to look at how deep the web has been crawled, looking more at the total number of URLs in the index, rather than the diversity of referring domains.
- Link Overlap: A more sophisticated approach might count the number of links an index has in common with Google Webmaster Tools.
- Freshness: Finally, we might choose to look at the freshness of the index. What percentage of links in the index are still live?
There are a number of really good studies (some newer than others) using these techniques that are worth checking out when you get a chance:
- BuiltVisible analysis of Moz, Majestic, GWT, Ahrefs and Search Metrics
- SEOBook comparison of Moz, Majestic, Ahrefs, and Ayima
- MatthewWoodward study of Ahrefs, Majestic, Moz, Raven and SEO Spyglass
- Marketing Signals analysis of Moz, Majestic, Ahrefs, and GWT
- RankAbove comparison of Moz, Majestic, Ahrefs and Link Research Tools
- StoneTemple study of Moz and Majestic
While these are all excellent at addressing the methodologies above, there is a particular limitation with all of them. They miss one of the most important metrics we need to determine the value of a link index: proportional representation to Google's link graph. So here at Angular Marketing, we decided to take a closer look.
Proportional representation to Google Search Console data
So, why is it important to determine proportional representation? Many of the most important and valued metrics we use are built on proportional models. PageRank, MozRank, CitationFlow and Ahrefs Rank are proportional in nature. The score of any one URL in the data set is relative to the other URLs in the data set. If the data set is biased, the results are biased.
A Visualization
Link graphs are biased by their crawl prioritization. Because there is no full representation of the Internet, every link graph, even Google's, is a biased sample of the web. Imagine for a second that the picture below is of the web. Each dot represents a page on the Internet, and the dots surrounded by green represent a fictitious index by Google of certain sections of the web.
Of course, Google isn't the only organization that crawls the web. Other organizations like Moz, Majestic, Ahrefs, and SEMrush have their own crawl prioritizations which result in different link indexes.
In the example above, you can see different link providers trying to index the web like Google. Link data provider 1 (purple) does a good job of building a model that is similar to Google. It isn't very big, but it is proportional. Link data provider 2 (blue) has a much larger index, and likely has more links in common with Google than link data provider 1, but it is highly disproportional. So, how would we go about measuring this proportionality? And which data set is the most proportional to Google?
Methodology
The first step is to determine a measurement of relativity for analysis. Google doesn't give us very much information about their link graph. All we have is what is in Google Search Console. The best source we can use is referring domain counts. In particular, we want to look at what we call referring domain link pairs. A referring domain link pair would be something like ask.com->mlb.com: 9,444 which means that ask.com links to mlb.com 9,444 times.
Steps
- Determine the root linking domain pairs and values to 100+ sites in Google Search Console
- Determine the same for Ahrefs, Moz, Majestic Fresh, Majestic Historic, SEMrush
- Compare the referring domain link pairs of each data set to Google, assuming a Poisson distribution (a simplified sketch of this comparison follows the list)
- Run simulations of each data set's performance against the others (e.g., Moz vs. Majestic, Ahrefs vs. SEMrush, Moz vs. SEMrush, and so on)
- Analyze the results
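To make the comparison concrete, here is a minimal Python sketch of the proportional comparison. The actual analysis assumed a Poisson distribution and ran head-to-head simulations rather than the simple absolute-difference score below, and every domain and count except the ask.com->mlb.com figure is invented for illustration.

```python
from collections import Counter

def disparity(google_pairs, index_pairs):
    """Sum of absolute differences between an index's proportional share
    of referring-domain link pairs and Google's share for the same site."""
    g_total = sum(google_pairs.values()) or 1
    i_total = sum(index_pairs.values()) or 1
    domains = set(google_pairs) | set(index_pairs)
    return sum(abs(google_pairs.get(d, 0) / g_total -
                   index_pairs.get(d, 0) / i_total)
               for d in domains)

# Hypothetical referring-domain link pairs for mlb.com (domain -> link count).
google  = Counter({"ask.com": 9444, "example-blog.com": 120, "example-news.com": 40})
index_a = Counter({"ask.com": 8000, "example-blog.com": 100, "example-news.com": 30})
index_b = Counter({"ask.com": 50000, "example-blog.com": 90})

# The index with the smaller disparity is more proportionally representative.
print(disparity(google, index_a), disparity(google, index_b))
```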
Results
When placed head-to-head, there seem to be some clear winners at first glance. In head-to-head, Moz edges out Ahrefs, but across the board, Moz and Ahrefs fare quite evenly. Moz, Ahrefs and SEMrush seem to be far better than Majestic Fresh and Majestic Historic. Is that really the case? And why?
It turns out there is an inversely proportional relationship between index size and proportional relevancy. This might seem counterintuitive: shouldn't the bigger indexes be closer to Google? Not exactly.
What does this mean?
Each organization has to create a crawl prioritization strategy. When you discover millions of links, you have to prioritize which ones you might crawl next. Google has a crawl prioritization strategy, and so do Moz, Majestic, Ahrefs, and SEMrush. There are lots of different things you might choose to prioritize...
- You might prioritize link discovery. If you want to build a very large index, you could prioritize crawling pages on sites that have historically provided new links.
- You might prioritize content uniqueness. If you want to build a search engine, you might prioritize finding pages that are unlike any you have seen before. You could choose to crawl domains that historically provide unique data and little duplicate content.
- You might prioritize content freshness. If you want to keep your search engine recent, you might prioritize crawling pages that change frequently.
- You might prioritize content value, crawling the most important URLs first based on the number of inbound links to that page. (A sketch of a blended priority queue follows this list.)
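None of these providers publishes its actual prioritization rules, so purely as an illustration, here is a sketch of a crawl frontier that blends such signals into one score. Every weight, field name, and URL below is hypothetical.

```python
import heapq

def crawl_priority(signals, weights=(0.4, 0.3, 0.2, 0.1)):
    """Blend hypothetical, pre-normalized signals (0..1) into one score:
    historical link discovery, content uniqueness, freshness, link value."""
    w_discovery, w_unique, w_fresh, w_value = weights
    return (w_discovery * signals["new_links_found"]
            + w_unique * signals["uniqueness"]
            + w_fresh * signals["change_frequency"]
            + w_value * signals["inbound_links"])

frontier = []  # max-priority queue implemented with negated scores

def schedule(url, signals):
    heapq.heappush(frontier, (-crawl_priority(signals), url))

def next_url():
    return heapq.heappop(frontier)[1] if frontier else None

schedule("https://example.com/new-page",
         {"new_links_found": 0.9, "uniqueness": 0.4,
          "change_frequency": 0.7, "inbound_links": 0.2})
print(next_url())
```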
Chances are, an organization's crawl priority will blend some of these features, but it's difficult to design one exactly like Google. Imagine for a moment that instead of crawling the web, you want to climb a tree. You have to come up with a tree climbing strategy.
- You decide to climb the longest branch you see at each intersection.
- One friend of yours decides to climb the first new branch he reaches, regardless of how long it is.
- Your other friend decides to climb the first new branch she reaches only if she sees another branch coming off of it.
Despite having different climb strategies, everyone chooses the same first branch, and everyone chooses the same second branch. There are only so many different options early on.
But as the climbers go further and further along, their choices eventually produce differing results. This is exactly the same for web crawlers like Google, Moz, Majestic, Ahrefs and SEMrush. The bigger the crawl, the more the crawl prioritization will cause disparities. This is not a deficiency; this is just the nature of the beast. However, we aren't completely lost. Once we know how index size is related to disparity, we can make some inferences about how similar a crawl priority may be to Google.
Unfortunately, we have to be careful in our conclusions. We only have a few data points with which to work, so it is very difficult to be certain regarding this part of the analysis. In particular, it seems strange that Majestic would get better relative to its index size as it grows, unless Google holds on to old data (which might be an important discovery in and of itself). Most likely, we simply can't draw conclusions at that level yet.
So what do we do?
Let's say you have a list of domains or URLs for which you would like to know their relative values. Your process might look something like this (a small sketch of the decision order follows the list)...
- Check Open Site Explorer to see if all URLs are in their index. If so, you are looking at metrics most likely to be proportional to Google's link graph.
- If any of the links do not occur in the index, move to Ahrefs and use their Ahrefs ranking if all you need is a single PageRank-like metric.
- If any of the links are missing from Ahrefs's index, or you need something related to trust, move on to Majestic Fresh.
- Finally, use Majestic Historic for (by leaps and bounds) the largest coverage available.
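That order of preference is simple enough to express as a small helper function. The sketch below only illustrates the decision order above; the coverage data is invented, and in practice you would check each provider's index through its own tools or API.

```python
def pick_index(urls, coverage):
    """Return the preferred index: the most proportional one that
    contains every URL, falling back to the broadest one."""
    preference = ["Moz", "Ahrefs", "Majestic Fresh", "Majestic Historic"]
    for index in preference:
        if set(urls) <= coverage.get(index, set()):
            return index
    return "Majestic Historic"  # broadest coverage as the final fallback

# Hypothetical coverage sets; in practice you would query each provider.
coverage = {
    "Moz": {"a.com/page"},
    "Ahrefs": {"a.com/page", "b.com/post"},
    "Majestic Fresh": {"a.com/page", "b.com/post", "c.com/article"},
}
print(pick_index(["a.com/page", "b.com/post"], coverage))  # -> Ahrefs
```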
It is important to point out that the likelihood that all the URLs you want to check are in a single index increases as the accuracy of the metric decreases. Considering the size of Majestic's data, you can't ignore them because you are less likely to get null value answers from their data than the others. If anything rings true, it is that once again it makes sense to get data from as many sources as possible. You won't get the most proportional data without Moz, the broadest data without Majestic, or everything in-between without Ahrefs.
What about SEMrush? They are making progress, but they don't publish any relative statistics that would be useful in this particular case. Maybe we can hope to see more from them soon given their already promising index!
Recommendations for the link graphing industry
All we hear about these days is big data; we almost never hear about good data. I know that the teams at Moz, Majestic, Ahrefs, SEMrush and others are interested in mimicking Google, but I would love to see some organization stand up against the allure of more data in favor of better data—data more like Google's. It could begin with testing various crawl strategies to see if they produce a result more similar to that of data shared in Google Search Console. Having the most Google-like data is certainly a crown worth winning.
Credits
Thanks to Diana Carter at Angular for assistance with data acquisition and Andrew Cron with statistical analysis. Thanks also to the representatives from Moz, Majestic, Ahrefs, and SEMrush for answering questions about their indices.
Hi Russ,
OK - you outsmarted me. I massively respect your ability to analyse data... so please be polite if my comments show a slight misunderstanding of the post... Hopefully I have some insight to add though.
Majestic doesn't use any of those four methodologies for assessing which links to follow next per se, and I do think our methodology (Trust Flow) is absolutely about quality over quantity. One link can have a huge impact under Trust Flow logic and we can then use Flow Metrics to determine which pages to crawl more often than others. (That is not our only criterion, but I have some secret source to protect of course.)
It is perhaps because of this approach that we may not pick up more "link pairs" as you describe, because we know that picking up more is not as effective as spending the same crawl resource analysing other stuff. As you suggest, it is not just a numbers game. A dumb crawler (OK, all crawlers are dumb) can easily get fixated on link pairs and start seeing 1,000, 2,000, 10,000 link pairs between two sites, but after enough to say "that's site wide" I think all the insight has been gleaned that will be gleaned.
That means Majestic generally crawls more intelligently by being able to know what is important and what is not. That is certainly not to say we don't want to also have the largest data set... we do want to... and we are not doing so badly. In the example you gave of mlb.com, the absolute link counts cited to mlb.com in Moz, Ahrefs and Majestic are:
Majestic: Links 47,869,070 from 124,299 referring domains.
Ahrefs: Links 27,000,000 from 114,000 referring domains.
Moz: Links: 92,105 (established) links from 1,167 referring domains.
So if Majestic loses on the link pairs you tested, the elephant in the room is... how come Majestic has more links? I am hopeful that we know better than most when to stop following a branch in the crawl and move on to new ground. Here I totally agree that the crawl priority is the key.
From what I understand, Moz also has a way to prioritize that is not based on quantity. I certainly admire their metrics, in spite of the size differential. I do not think we would be able to mimic their methodology even if we wanted to, in part because we do not do rank checking at all. Now that Ahrefs has started rank checking again, that may help them improve their metrics as well - but I remain fiercely proud of Trust Flow as something truly independent and unique right now.
Our objective is not to create the same index as Google's, but to produce a GOOD index that has insightful data. This is increasingly Trust Flow and Topical Trust Flow as better indicators of quality than the link counts or pairs themselves.
Dixon (Majestic).
Dixon, thanks for the response! I will do my best to answer in-line so others can follow. I think it is very important people understand the limited scope of this analysis and its very specific recommendations.
Quote: Majestic doesn't use any of those four methodologies for assessing which links to follow next per se, and I do think our methodology (Trust Flow) is absolutely about quality over quantity...
The examples I provided were simply for illustration. I imagine Majestic, Moz, Ahrefs and SEMrush have far more sophisticated crawl prioritizations than those discussed. However, I think you and I would agree that none of these indices have an identical crawl strategy to that of Google. Over time, the larger the index, the more the crawl prioritization differences will produce different results. It isn't an indictment of the methods, or the index, or really of the quality metrics - it is specific in answering the question of which relativistic metric, if available for a URL, is most similar to that produced by Google. It is important to remember that for huge swaths of URLs, Moz, Ahrefs and SEMrush will have 0 scores, which makes them undifferentiated when doing analysis, making Majestic the right tool the majority of the time, just not the first one to check!
Quote: So if Majestic loses on the link pairs you tested, the elephant in the room is... how come Majestic has more links?
Perhaps I didn't explain well enough in the article how the link pairs matter in the analysis. Let's say, for example, that I only analyzed 1 site, joe.com. According to Google Search Console, joe.com has 3 referring domains. We will call them JohnA.com, JohnB.com and JohnC.com, each with 100, 200, and 300 inbound links respectively back to joe.com (a ratio of 1->2->3).
Now, we line up Moz and Majestic against the Google Search Console data. Moz misses JohnA, but gets JohnB and JohnC because it has a smaller crawl. Moz ends up with 0 for JohnA, 100 for JohnB and 200 for JohnC (a ratio of 0->1->2). Majestic finds all 3, but they find 150 for JohnA, 250 for JohnB, 300 for JohnC, and on top of that, they found a JohnD and a JohnE with 10 and 30 links respectively (a ratio of 1.5->2.5->3->0.1->0.3). Moz's list is more proportionally representative, even though they clearly don't have as much data as Majestic. Majestic even hit one of the domain pairs right on the head (300 for JohnC), but that doesn't mean the index as a whole is proportional.
The only case where this matters, though, is in the creation of relativistic metrics. Moz's crawl size is a small fraction of Majestic's, so the majority of the time you have to use Majestic. But, if the handful of URLs you have access to are all in Moz's data set, you should rely on PA or DA.
Of course, all of this also assumes that the calculations of the relativistic metrics are themselves similar to the methods used by Google. Maybe Google has wholly abandoned the PageRank model; maybe they use something more like TrustFlow. But what does remain true is that the smaller the crawl, the more likely it is to randomly produce a more proportionally representative database.
Quote: From what I understand, Moz also has a way to prioritize that is not based on quantity...
On the contrary, Moz's performance may be primarily based on their index size. As I showed in the tree diagram, if you only make it up the tree 2 branches, you are likely to have very similar results regardless of the climb strategy you choose. It is possible that Ahrefs, Moz, Majestic and SEMrush have very, very similar crawl strategies, but that index size and random noise (like what links happened to be on the homepage of Reddit when you crawled them last) alone greatly distort the end product. Once again, this is not an indictment of Majestic at all.
Quote: but I remain fiercely proud of Trust Flow as something truly independent and unique right now.
As you should be. If you want to get a trust metric for a large set of URLs, Majestic is the only game in town. It is a great metric produced for a larger number of URLs than any other provider. I endorse it with a big subscription and API payment every month :-)
Quote: Our objective is not to create the same index as Google's. but to produce a GOOD index, that has insightful data. This is increasingly Trust Flow and Topical Trust Flow as better indicators of quality than the link counts or pairs themselves.
And this is certainly something you have done. Majestic is very good, always has been and always will be. I stand by my conclusion that every SEO should invest in all the major link indices, as they each offer many great insights.
AH! OK. So it is the overlap of absolute number of domains linking in Google's index vs the others that you are comparing. GOT IT now :) [It's the end of a hot day in England here!]
(Aside: But Google does not report all the referring domains. They limit their list to 1000 referring domains.)
There is another way to spin the debate of whether Big is a) Beautiful or b) hiding the wood for the trees - which all comes down to the quality metric. There was a study which used the pure (mathematical) PageRank algorithm but only on Wikipedia (because presumably the researcher didn't have enough Crays or Watson computers to use the whole web). They found that Carl Linnaeus (yeh, who?) was more famous than Jesus. Clearly in the case of the PageRank algo, having an index that was too small (just Wikipedia rather than the whole web) gives less accurate data than if it is larger. Both Majestic and Moz (hopefully correctly in most people's eyes) put the order the other way around.
So my point is... whilst size certainly can decrease correlation in your test, it does not have to decrease quality. If the quality metric on each individual URL crosses a quality Rubicon, then at that point, size can improve understanding. You are old enough to remember our old metric, AC rank. I think it is fair to say that AC Rank did not pass the quality Rubicon.
Quote: (Aside: But Google does not report all the referring domains. They limit their list to 1000 referring domains.)
Oops, I didn't mention this in the post. We only chose sites with fewer than 1000 referring domains in GSC so we knew it wasn't getting truncated!
Quote: They found that Carl Linnaeus (yeh who?) was more famous than Jesus.
This is absolutely true, but if Google screws up and thinks Carl Linnaeus is more important than Jesus, then I want a link from Carl Linnaeus before I want one from Jesus. #goingtohellforthatone
It is entirely possible that Majestic could build a better, more accurate link graph than Google, in relation to the web as a whole, but that is a different question than building one comparable to Google.
Quote: but if Google screws up and thinks Carl Linnaeus is more important than Jesus,
No - Google didn't screw up... Stanford University's PageRank algorithm made the unlikely conclusion when ONLY used on Wikipedia pages. It would get it right if it crawled the whole web. That's my point. The (original) PageRank maths NEEDS breadth to work.
Quote: No - Google didn't screw up... Stanford University's PageRank algorithm
I understand this. But if Moz's corpus is more proportionally representative to Google's corpus than is Majestic's, it is likely to produce more similar results. It is more likely to reach the same correct and incorrect conclusions as Google. No one has a perfect index of the web, and if you want to produce a metric that predicts how Google will judge a particular URL, then your best bet is to start with a data set as proportionally relative to Google's as possible.
"All we hear about these days is big data; we almost never hear about good data."
Amen! That is probably the biggest problem in SEO right now.
Thank you for doing this analysis, it is really interesting.
But I have a question for you:
"Compare the referring domain link pairs of each data set to Google" - am I assuming correctly that you measured the total overlap? So if an Index has a linkpair that is not in the search console that would be a demotion for that index?
In that case, any bigger index must statistically have a lower relevancy. This is inherent, as Search Console only gives us a control sample of the known links. So the link pairs are, by their very nature, limited. Any index bigger than the number of link pairs in Search Console would be demoted. This is an inherent problem with these kinds of datasets.
I have a different idea, but I do not know if it is feasible. If you could check, for every link from the index, whether the linking page (the "from" page) is in the Google Cache, it would show us that Google knows that link (or not, if it isn't cached). Of course, this would still not give us any information regarding whether and how Google counts these links.
Anyway, it is an interesting thought experiment: Is the proportional representation to Google Search Console data more important than sheer size?
Quote: In that case, any bigger index must statistically have a lower relevancy. This is inherent, as Search Console only gives us a control sample of the known links. So the link pairs are, by their very nature, limited. Any index bigger than the number of link pairs in Search Console would be demoted. This is an inherent problem with these kinds of datasets.
Thank you for your thoughtful critique. This was certainly something I considered and attempted to address.
First, I wanted to make sure that Google's sample data provided via GSC was sufficiently large to indicate only a moderate amount of sampling. We know that Google representatives have said that everything you need to do a link cleanup is available in GSC, which indicates that the sampling can't be so stark as to miss the bad links which might be causing penalties. However, I wanted to take it a step further to be careful. Using the domain data that we did have, I was able to infer that the link graph represented by GSC is roughly 800,000,000,000 URLs in size. This puts it on par with Majestic Fresh, the largest fresh link index. If we were looking at a much smaller sample from GSC, let's say the size of SEMrush or Moz, we would have more to worry about.
Second, the technique I used pitted one link index against another to determine the number of "wins". A win occurs when one index's link disparity for a site is smaller than the other's. This created a Price-is-Right style result where indices that had just 1 more link than the other could win. It also created an interesting scenario where small indices, like SEMrush, would nearly always have no links for a domain pair when the other index did not have the GSC-reported domain pair either. So, while SEMrush might have won a lot of head-to-heads, their cumulative wins were much lower because they tied so often.
Finally, when a link index found a link that GSC did not, we assumed that there was at least 1 real link and that it was a failure on Google's part. This helped remove some of the bias as well.
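Roughly speaking, the win counting can be sketched like the code below, although the real disparity measure was based on the Poisson model described in the post rather than the simple absolute difference used here, and the example site and counts are invented.

```python
def disparity(google_pairs, index_pairs):
    """Crude stand-in for the Poisson-based comparison: summed absolute
    difference in referring-domain link-pair counts for one site."""
    domains = set(google_pairs) | set(index_pairs)
    return sum(abs(google_pairs.get(d, 0) - index_pairs.get(d, 0))
               for d in domains)

def head_to_head(google_by_site, index_a_by_site, index_b_by_site):
    """Count per-site wins: the index whose pair counts sit closer to
    Google's wins that site; equal disparities count as a tie."""
    wins = {"A": 0, "B": 0, "tie": 0}
    for site, google_pairs in google_by_site.items():
        d_a = disparity(google_pairs, index_a_by_site.get(site, {}))
        d_b = disparity(google_pairs, index_b_by_site.get(site, {}))
        if d_a < d_b:
            wins["A"] += 1
        elif d_b < d_a:
            wins["B"] += 1
        else:
            wins["tie"] += 1
    return wins

google = {"joe.example": {"a.com": 10, "b.com": 5}}
print(head_to_head(google,
                   {"joe.example": {"a.com": 9, "b.com": 5}},
                   {"joe.example": {"a.com": 30}}))  # -> {'A': 1, 'B': 0, 'tie': 0}
```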
I think I did a reasonably good job of addressing this particular issue, but I won't pretend it was perfect.
As for your alternate experiment, you read my mind. I already began the process here with this report on the age of Google Search Console data and its cache status.
This issue also ties into Google's new update, the "Quality Update." In the end, all we need is to earn natural, organic links instead of buying thousands of risky links. After reading the post, what we understand is that the fashion of buying thousands of messy links is a thing of the past.
Thanks for the post.
Hi Russ,
Good point about the quantity vs. quality of link indexes. To be honest, I never thought about it that way. I assumed Google was at a point where they're capable of crawling all the important and the slightly less important parts of the web. If I check WMT for any client website, I see a lot of crawl activity and I can't imagine they miss a link on these websites. Of course Google won't be able to crawl everything, but every normal website does get crawled unless you really screw up.
Since WMT shows only a part of your backlinks, couldn't it be that link indexes are still so far behind on the number of pages that Google does crawl that investing in quantity would outweigh the need for quality at this point in time and space?
Besides that, I think a quality product would be a good fit for a specialist. I'm just afraid there will be a lot of folks that make their judgement (buying decision) based purely on the number of links an index returns.
Love to hear your view on this.
Certainly one of the assumptions I had to make was that GWT data was representative of Google's data. It is possible there is a bias in that data, although detecting such a bias would prove difficult.
Regarding crawl activity, Google expends far more resources revisiting pages than do the link indexes. Google may visit your homepage several times a day or week, while a link crawler may visit only once a month. Unless there are new links, that crawl activity from Google may go a long way toward keeping their search index fresh without impacting the link graph dramatically.
Such a great piece of link analysis, and a good look at how these crawlers index the web.
Hi Russ,
Thumbs up for all the work you’ve done! I can only imagine how much time and effort it took you to analyze data received from 5 separate tools and put together this in-depth comparison :)
And though I'm sorry not to see our backlink index, WebMeUp (https://webmeup.com), included in this comparison, our team would definitely like to make some further research based on your article and see if we could contribute to all the impressive work you’ve done. Though I do have a few questions regarding the methodology used and especially the resulting chart (just like all of the guys analyzed here - Moz, Majestic, Ahrefs and SEMrush :)), I totally understand how hard it is to put all your thoughts and statistical calculations into one single "edutaining" blog post – great work!
But to let our team see if we can use the same approach in our research, and to demonstrate its statistical validity etc, could you please share the set of initial data you used for the analysis - i.e. "the root linking domain pairs and values to 100+ sites" you took from Google Search Console? (hope this won't be a problem since these are supposed to be pure numbers that carry no commercial information)
Hi Aleh,
When I first began this analysis, I actually was using WebMeUp data, but the research was put on hold for a few months and subsequently we dropped our subscription during that time. More importantly, the primary reason proportional representation matters would be for the calculation of relativistic stats like MozRank or CitationFlow, which currently (AFAIK) WebMeUp does not do. That being said, we did include SEMRush simply because we are customers for other reasons and the data was readily available. I am interested in adding WebMeUp's data now though.
Unfortunately, I cannot share the actual domain pairs because that would reveal our client, customer, and internal properties list, as we were limited to domains for which we had access to Google Search Console. You are right, they don't carry commercial information, but they would disclose the companies that have worked with Angular over the years.
I will try and add WebMeUp to the analysis in the next few weeks.
Dear Russ,
Thanks for being interested - I’m always happy to hear from data scientists who’re eager to keep updating and expanding their research! I’ve just located your account with WebMeUp (registered under your Gmail email) and added a year of Enterprise plan to it - so that you can easily get all the data you need for the research. I’ll be thrilled to hear about the results you achieve with WebMeUp!
P.S.: Please PM me if you need API access to the system to automate the process - I’m sure we can arrange that as well.
Thanks again!
Well, Majestic Fresh surprised me... Did you use the largest type of report there?
Thanks for this comparison.
Majestic has the most aggressive crawler (which makes it an indispensable tool in my opinion, because it can help find links even before Google does), but that causes it to fall victim to the phenomenon I described above, where crawl depth exacerbates differences.
As for Majestic Historic, my guess is its size is having an undue influence over the regression model, but you can't really throw away outliers when you only have a few data points. If I log() the index size, it reduces that influence a bit, but it doesn't pull away from the regression line so much as to reconsider our conclusions.
One of the best blog posts I have read so far. You showed real depth on SEO and especially link building. Just building links to hundreds of sites does not help in SEO; you need quality as well. Thank you for writing this post and sharing.
Perfectly executed advanced SEO information! Well done sir.
Thanks!
Thank you, Russ, for the very interesting article!
I have some questions, though.
You state that Google and all other services cannot index the whole Web, so they index only a part of it, each service crawling a part different from the others, which results in a mismatch in relative authorities of pages and domains between Google and these services. Do I get it right?
If so, then I'll try to argue. True, it's impossible to index the whole Web - it is infinite, given dynamic content generated by dumb scripts. However, if we are looking at relative authority calculation algorithms, what seems more important is not the quantity (do we index all the content?) but the quality (do we index content that is significant in terms of our authority calculations?). Do you agree? Let's call such content 'important' (significantly affecting authority calculations).
If two services both manage to index a big part (say 80%) of important content (and given they both use the same authority calculation algo), then whatever mismatch in their content coverage won't cause a significant mismatch in calculated authorities.
Now, could Google cover 80% of important content? First, its crawling strategy must be based on that very authority; otherwise it's a waste of resources. Second, I'd guess that important content tends to live around 'something that has non-zero cost', and any authority algo should revolve around this concept (does it?). An additional page on a site costs 0 (that's why we can't index all the Web), but a paid domain costs real money, and so does an IP, and a subnet. I'd once again guess there are fewer than 1 billion paid domain names registered. Is it realistic that Google did cover them all (given its crawling strategy should strive to do so)? I feel it is. What do you think?
Now, if important content is all around 1 billion paid domains, it seems much more realistic that all link graphing services could cover them all. If so (would be cool to hear Aleh Barysevich and Dixon Jones on this), it would mean that any mismatch in authorities is caused by algo mismatch, rather than content coverage mismatch.
Love this discussion!
1. Yes, the biased samples of the web produced by incomplete, non-random crawls create disparities in relative authority metrics.
2. Unfortunately, the crawl determines the authority, not the other way around. While subsequent crawls may build off authority metrics from previous ones, they aren't given as initial conditions at the first crawl.
3. If 2 services came close to indexing the whole of the web, then yes, they would converge in our calculations as their samples of the web would grow more similar.
The problem seems to be this - you can't sample the web. You have to start somewhere and then make decisions on where to go from there. If everyone doesn't agree on where to start, and how to proceed from what you find, they will diverge over time until they begin to approach the full theoretical index size.
Now, you proposed an interesting point. What if we started off with all domains that are registered as an assumption of important content? While this method may produce more similar results if everyone had agreed to follow it, there would still be big disparities pretty quickly. For example, in what order do you crawl those? If it takes 3 days to crawl every domain's homepage, how many of them will have changed? Reddit? Slashdot? Every news outlet? What about random link blocks? Logged-in-now links on forum homepages? Or bigger yet, what about all the subdomains that are missed? espn.go.com? en.wikipedia.org?
So, I don't disagree with many of your thoughts, I think the current state of crawl-based-indexation still generally lends itself to my analysis that as indexes grow, they will likely deviate more from one another.
Hi Russ, I have to admit that I am pretty new to this aspect of big data analysis, but I think the idea of having search engine that crawls for things like content freshness, uniqueness and value is a great idea. Are there any resources you recommend to those of us who are trying hard to better understand Google's algorithm and how backlinks influence it? Also, in regards to Google's influence...Do you feel Google should have the influence it has right now in the market? I have spoken with a lot of people who are frustrated with Google's seeming monopoly. Although Google's talent with tackling data is impressive, many people simply feel that Google, especially the Google Search algorithm, has too much influence on the online successes and/or failures of businesses around the world. Do you believe that one company, without oversight, should be allowed to arbitrarily pick who is or isn't worth my time, your time or the time of Internet users everywhere?
Quote: Are there any resources you recommend to those of us who are trying hard to better understand Google's algorithm and how backlinks influence it?
Here is a good recent article from Moz on the relationship of backlinks and rankings. It is easiest to think of backlinks as votes, except some votes matter more than others because the voters are voting for one another too.
Quote: Also, in regards to Google's influence...Do you feel Google should have the influence it has right now in the market?
I think Google has undue influence over the search landscape right now, but I don't think they are exploiting it.
Quote: Do you believe that one company, without oversight, should be allowed to arbitrarily pick who is or isn't worth my time, your time or the time of Internet users everywhere?
I certainly think there should be oversight, especially from the FTC and FCC, but I do think it is reasonable that a search engine can produce results that they choose. As consumers, we need to do a better job of shopping around our search traffic, rather than always defaulting to Google. It is hard to blame Google for our own laziness in that regard.
I agree we should be after "good" data, not just big data - but I do see one problem with this approach. When I do a disavow, I'm always going to use Majestic Historic first. Yes, it may contain links that don't exist. Yes, it may have branches of links that even Google doesn't have - but my disavow is going to be the most complete - now and always.
In terms of SEO value and competitor analysis, these tools became useless the day the disavow tool came out because you can't know what a competitor has disavowed. Thus, the index I want IS the largest - so I can do the most complete disavow, not necessarily the "right one today."
(Edit to add: when I do a disavow I'm going to use ALL lists available to me - but if I had to choose ONE, I mean.)
Hi Matt,
This isn't a problem with my approach. I completely agree with you that there are situations in which the larger the database the better. If you go back and read my "So What Do We Do" conclusions, the example I give is "Let's say you have a list of domains or URLs for which you would like to know their relative values."
That is what this analysis is for - to show you which data set will provide you results more similar to Google's in terms of relativistic metrics (Like PA/DA, MozRank, MozTrust, AhrefsRank, CitationFlow, TrustFlow, etc). Nothing more, nothing less.
The conclusions of this study are restricted to a very particular question, as any good study should be, which is simply: which data set is most proportionally representative of GSC data. What you do with that information is a different question altogether.
Great article! Very intelligently written
Thanks! I had a lot of help from Diana Carter and Andrew Cron which allowed me to give a little more polish. Plus Trevor at Moz is always helpful with last minute editorial improvements.
Thank you very much @Russ Jones.
That's great research. What I can deduce from it is that my best bet would be a combo of Moz OSE + Ahrefs. It would also be interesting to see how CognitiveSEO performs in this analysis.
Actually, you can't ignore Majestic. You are missing out on half of the links if you only go with OSE and AHrefs. Imagine you had 100 randomly selected links from the web for which you would like to compare their relative authority. The chances that all 100 would be in Moz's index, or AHref's index would be much lower than in Majestic's. You should try Moz first, then AHrefs, but more often than not you would end up needing Majestic.
Plus, Majestic has some other awesomely useful metrics like Topic-based measures. You really need all 3.
I have used them before and I didn't like the experience. I think you can get real value only from OSE and Ahrefs. But thank you for the suggestion.
Intelligent. Yes, we need good data. The problem is, how can I measure the crawl rate of another website which already has backlinks to my website?
To my knowledge, there is no easy way to do this, but I could imagine a fairly straightforward methodology...
1. Spider the site with some sort of tool like ScreamingFrog, Microsoft IIS Toolkit, etc. to get a list of pages.
2. Using proxies, check the cache date of these pages every day/week in Google.
3. Store the cache dates and use them to determine, over time, the rate at which Google crawls their site (a small sketch of this calculation follows).
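Step 3 is just arithmetic once the cache dates are recorded. A minimal sketch, with invented dates:

```python
from datetime import date

def crawl_rate(cache_dates):
    """Average days between observed cache-date changes for one URL."""
    changes = sorted(set(cache_dates))
    if len(changes) < 2:
        return None  # not enough observations to estimate a rate
    gaps = [(b - a).days for a, b in zip(changes, changes[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical observations for one page, gathered over several weeks.
observed = [date(2015, 6, 1), date(2015, 6, 1), date(2015, 6, 8),
            date(2015, 6, 20), date(2015, 6, 29)]
print(crawl_rate(observed))  # average days between cache refreshes
```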
I'm not sure how this would be valuable to you, but it seems possible.
Yes, Russ, it's pretty straightforward. One more thing: is it advisable to set a crawl rate for a website hosted on a shared server with multiple other websites? What if the site is down while it is being crawled? Most small business websites are on shared servers with websites from different fields. I think a crawl rate schedule is the wrong option for them.
Quote: How can I measure crawl rate of another website
In the Pages tab of Majestic Site Explorer, we record the last crawl date for each page (and the response code), so you can use that - but of course, all the crawlers crawl differently.
Thank You Dixon, I will definitely use it soon
Thanks for sharing this information.
Hi, I just joined your site today and I am very happy, because your content is excellent. I am Persian, and in my language content about SEO is of very poor quality. My hope is to learn more and more SEO tactics and teach them to Persian speakers. Good luck - can anyone tell me where I can start reading, step by step?
Best regards. If you want to see my website, I have courses and content about social media and digital life.
Digimehtod
You should definitely check out our beginner's guide to SEO.
One thing people can try is Bing Webmaster Tools - they show you the links they have found, and they are likely to be more similar to Google's than any of the other link checkers
Hey folks, I'll do my best to try and answer every question here today. You can also ping me on Twitter (@rjonesx) if you like. Thanks in advance for your comments and questions.
Thanks for the good explanation about the comparison of link indexes in big data. I took big data online training in Intellipaat. I loved it all and I am still doing so. My journey for big data training with Intellipaat is fabulous.
Really enjoy this article! Data speaks loud here.
Also, I really appreciate the comments about how to use different tools together!
Great outline of these different link indexes and how they work. I always start with Moz and move to the other tools as I need more information to take action on.
The site https://searchengineland.com has some studies on big data that I'm sure you'll find interesting. I recommend it! ;)
Russ, as you know, keywords are my specialty. Having said that, I was one of 10 employees at SEMrush for 14 months, and I watched the link index go from infancy to adolescence. I'd imagine they're happy to shed more light on their process. Send a tweet to @RadioMS, US Marketing Dir. SEMrush is so very new to the link analysis game that I have to believe their attempt to scale their db in a short time plays a big part in their process.
Thanks for sharing advanced SEO information! superb sir.
Quality, theme-relevant links are important for ranking...
Your analysis was fantastic. You know a lot about backlinks, and we can see it in your post. I have a question: if I have more data, is it easier to lose track of my backlinks?
It has been a very interesting and engaging article. Thank you.