Being in beta, we're still making a lot of improvements. This time around we made some serious changes to our crawling methodology to include more quality content and to avoid junk. Having spoken with search engineers, thought leaders, and our users, and having taken a look at the data, we think we've made plenty of progress on the raw size of our index (between 30 and 40 billion pages). Now we want to focus on index quality. With that said, here are the latest index stats:
- URLs: 36,651,796,236
- Subdomains: 214,625,541
- Root Domains: 50,734,663
- Links: 409,127,041,842
The current goals for the Linkscape team are:
- Freshness: data which is three to eight weeks old or better
- Coverage: include key deep pages and influential posts that might not be referenced from other sites
- Visibility: uncover actionable data and trends across all our data, rather than sorting or filtering just 3,000 links
- Measurability: provide data which is comparable index-to-index, and track those trends
- Quality: provide data which reflects the structure of the relevant web
When we ask how many external links point to a homepage, on average we report 40% of what Y!SE does. Of course, we discount nofollows, while Y!SE (in our experience) includes them. In our latest crawl, 3% of links are nofollowed (up from 1.8% when we originally launched). This probably reflects our crawl selection rather than a broader trend. It's entirely possible that homepage links are nofollowed more often than links to deep pages (think about those comments you leave on blogs with a link back home). We also keep link counts to the sources of 301s separate from their targets for reporting purposes. Don't worry, we still pass link juice through 301s.
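If you like pseudocode, here's a rough sketch of the counting rules above. The field names and the redirect map are made up for illustration and aren't our actual pipeline:

```python
# Hypothetical sketch only: illustrates the reporting rules described above.
# The field names ('source', 'target', 'nofollow') and the redirects map are
# illustrative, not Linkscape's actual schema.
from collections import Counter

def external_link_counts(links, target_url, redirects):
    """links: dicts with 'source', 'target', 'nofollow' keys.
    redirects: maps a 301'd URL to the URL it redirects to."""
    counts = Counter()
    for link in links:
        if link["nofollow"]:
            continue  # nofollowed links are discounted entirely
        if link["target"] == target_url:
            counts["direct"] += 1   # counted against the page itself
        elif redirects.get(link["target"]) == target_url:
            counts["via_301"] += 1  # reported separately from the target,
                                    # though link juice still flows through
    return counts
```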
When we ask how many external links point to a whole site, we report, on average, 90% of what Y!SE does. This number is quite dramatic IMHO. Here we're probably seeing less of the nofollow bias. This might also reflect a crawl bias and some canonicalization differences. But again, a strong statistical correlation suggests that we are giving an accurate site-wide link profile.
This graph is quite striking to me. We pulled all the links we had for a variety of pages and went back one month later to see how many of those pages were still linking. We want to assure you that the links we're reporting reflect the current state of the web, rather than the stale web from months or even years ago. As it turns out, we have a 91% success rate. When we ask for Y!SE's 1000 links we see a 97% success rate, making us quite competitive in this regard.
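For the curious, the recheck boils down to something like this. A minimal sketch: the requests library and the crude substring test stand in for our real fetching and link extraction.

```python
# Minimal sketch of the freshness recheck: re-fetch each linking page a
# month later and see whether it still links to the target. The substring
# test is a crude stand-in for proper HTML link parsing.
import requests

def still_linking(source_url, target_url):
    try:
        resp = requests.get(source_url, timeout=10)
    except requests.RequestException:
        return False  # unreachable pages count as stale links
    return resp.ok and target_url in resp.text

def freshness_rate(link_pairs):
    """link_pairs: (source_url, target_url) tuples recorded a month ago."""
    alive = sum(still_linking(s, t) for s, t in link_pairs)
    return alive / len(link_pairs) if link_pairs else 0.0
```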
A few other stats measuring our index quality:
- Pages mentioned in DMOZ also in Linkscape: 96%
- Domains mentioned in DMOZ also in Linkscape: 99%
- Average error of mozRank against Google Toolbar PR: 0.56 (best possible is 0.25 due to round-off)
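To see why 0.25 is the floor: Toolbar PR only comes in whole numbers, so even a continuous score that tracked it perfectly would still be off by the rounding error, which averages 0.25 when the fractional parts are spread evenly. A tiny illustrative simulation:

```python
# Toolbar PR is an integer, so the best achievable average error is the
# expected rounding error |x - round(x)|, which is 0.25 for evenly spread
# fractional parts.
import random

xs = [random.uniform(0, 10) for _ in range(1_000_000)]
print(sum(abs(x - round(x)) for x in xs) / len(xs))  # ~0.25
```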
Oh yeah, I just discovered this awesome social media tool! It's called... Twitter! (tongue-in-cheek)
But seriously, that's another great way to provide feedback, get questions answered, and in general keep up-to-date with what's going on behind the scenes here at the mozPlex.
I would love the ability to compare old reports with new ones for the same sites so I can see the growth trends.
Is this something you will be considering to do in the near future?
Would anyone else like that feature?
I'd second that request. I'm really interested in seeing the growth trends as some of the campaigns I am managing progress.
Third that! I was a tiny bit saddened when I used the Linkscape tool for the first time the other day and saw that you can't compare old reports to new ones.
This is definitely something we want to address (see point 4 in our goals).
Right now the comparisons are a little tricky because we're grooming our dataset. For instance, the first update doubled the size of our index. If you got a larger link count, is that because of index growth or real link growth? For now, I'd say focus on relative differences within an index (e.g. do you have more links or fewer than your next best competitor?).
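Here's the kind of within-index comparison I mean, with made-up numbers:

```python
# Toy illustration (made-up numbers): compare within one index rather than
# across indices, so index growth affects both sites roughly equally.
def relative_link_share(your_links, competitor_links):
    total = your_links + competitor_links
    return your_links / total if total else 0.0

print(relative_link_share(1_000, 2_000))  # older, smaller index -> ~0.33
print(relative_link_share(2_100, 4_000))  # index doubled        -> ~0.34
# Raw counts doubled, but the relative position barely moved.
```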
We will NOT come out of beta (whatever that means) without some kind of solution for this on the data side. So expect to see more around this in the next few months. But keep the feedback barrage on this front coming.
Woo hoo! Congrats guys, I've been waiting for this update for a long time. Maybe I'm just blind from excitement, but I can't figure out how to delete my old Linkscape reports. I want to re-run some old URLs through this update, but I keep getting redirected to the older (saved) versions. Any direction you can offer would be helpful.
Thanks!
Thanks for the report! We forgot to roll over a sticky bit. Bit is unstuck, try those tubes again!
Damn those sticky bits... they get everywhere and are hard to clean off.
I'm interested in the averages implied by the numbers at the top of the page.
I realise that the Linkscape crawl began at the strongest parts of the internet, but these are higher than I would have guessed. Let's see what happens as more of the internet's backwaters get included in the numbers.
Working with these aggregates is so difficult because of the vast long-tail nature of the web. The average is VERY different from the median here. Our feeling is that we're already including too much of the backwaters (by some definition of "backwater").
This gets at the interplay between points 2 and 5 of our goals: Coverage vs Quality.
That number 11.2 is actually too low for most pages you'll ever encounter. But you're right: if we count many of those "backwater" pages, whose content and links we largely don't include in our index, the number does come down.
Interestingly, I think the "average number of inlinks per URL" generally increases with more crawling. This statistic is related to the ratio of crawled URLs to discovered URLs. Consider the point just after Linkscape had crawled one page: it had found about 30 unique links, so it knew about 30 URLs, and the average number of inlinks per URL was exactly 1 (30 divided by 30). This ratio increases as one crawls more and finds an increasing portion of links pointing to URLs that have already been discovered.
As an intermediate data point, when the Linkscape crawl was only in the hundreds of millions of pages, the average number of links per URL was more like 3. Now, as you noted, it is more like 11. So it would seem to be increasing as we crawl more.
However, it could be leveling off. One never gets the ratio of discovered to crawled URLs down to 1, because a lot of URLs are uncrawlable: they are non-HTML content, they are blocked by robots.txt, they are clearly spam you don't want to crawl, they point to 404 pages, etc. And the second factor, the average number of links per page, has declined somewhat as we've crawled more, which also contributes to the leveling off of the average number of links per URL. So you might be right, and this statistic could now be decreasing.
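For reference, the implied average can be read straight off the index stats at the top of the post:

```python
# Implied average inlinks per URL from the index stats at the top of the post.
links = 409_127_041_842
urls = 36_651_796_236
print(round(links / urls, 1))  # -> 11.2
```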
"[...] how many external links to a whole site, we report, on average, 90% of what Y!SE does. This number is quite dramatic IMHO. Here we're probably seeing less of the nofollow bias. This might also reflect a crawl bias, and some canonicalization differences. But again, a strong statistical correlation suggests that we are giving an accurate site-wide link profile."
Presumably you've done some qualitative analysis to look at the discrepancy and see what sort of links that 10% consists of? You'd have to look through a few sites by hand, but surely that would show immediately whether the 10% can be written off as uncrawled domains linking in, or whether it's duplicate content, or whether Y! is not reporting unique links, etc.?
Unrelated, but the geek in me is interested in the [on disk] size of your database, the time and bandwidth to create it and any details about your crawler you care to share?
We have taken a look at (some of) that 10% (our study covered millions of links). It's still a little early to say, but the issues revolve around freshness and "depth". Some of the links are from the last two or three weeks, which is just outside our resolution. Also, some of the pages hosting those links apparently have no links pointing to them at all; they are only discoverable through RSS and browsing history.
We're working on both issues, but the freshness one is a little bit trickier to address.
Let me figure out the right way to release info about our back-end architecture. For now, suffice it to say, it's BIG. This is bigger than any project I've ever worked on (and I worked on something affectionately, if a little inaccurately, called the "Petabyte Project" at Cornell).
When we spoke with large-scale DB vendors, we got the impression that we would be the largest client they had in terms of size and data turnover frequency. Hence, we built an almost entirely home-grown solution.
We're big enough that we see machine failure and data corruption routinely. The data is big enough that develop-deploy-debug cycles take days. Both Ben and I have lost more than a few nights of sleep dealing with these issues.
Fantastic update Nick.
With Google officially coming out and saying they have 1 trillion URLs, do you think that's your target goal for Linkscape's index?
(I think it's safe to approximate that G has 1 trillion+, maybe 1.2 trillion by now... who knows; with the web's reach growing, it's assumed to keep growing.)
May I ask, where are all your URLs by geography?
Is it predominantly China/Japan/India? Or is it solely the US/UK/AU?
I think some difficult goals (maybe....at least from my perspective):
- Continuously fighting the duplicate battle
- Spidering PDFs
- Spidering Flash
- Spidering through Javascript
- Finding pages that have few/no links to them (the crap people submit to SEs)
- Links that are generated based on IP (geo-targeting links)
- (Maybe) Links within iFrames
I watched this video about Google's evolution, and they talked about how they used to spider the entire web in a couple of months and now they can spider the web in a few minutes. WTF! Geez. (I can't find the specific link, but hopefully the next paragraph will help bring some weight to my statement.)
I doubt Moz has the same depth of servers as Google, but Google uses 1,000 servers to process a single query in 0.12 seconds.
Keep up the fantastic work. I'm going to try to make our own tool, but it would take at least 6-12 months before I could even enter the web-crawling game. By then, who knows what the web will look like. ;) [I make no promises on that tool, but heck, by then you'll probably devour more of the web than Google. :D And if that's the case, I might just leave it up to the experts. :D]
Great job, I look forward to seeing how Linkscape develops.
Just thought this was a good find about young Google (something to add to my original comment).
Joshua,
The 1 trillion number is a bit of a hoax. As it turns out, we've "seen" many more URLs as well. The question is: what is rank-able content? A lot of people acknowledge that the GG index doesn't have anywhere near all of those URLs.
We've seen that same video. And actually, between it and the research above, it makes us think we're much closer to the engines than any of us previously thought. If their index fits in memory on those 1,000 machines (not saying that it does), and each has, say, 16GB of RAM (keep in mind that GG uses commodity hardware), then we're talking about ~16TB of data. That's in the neighborhood (bigger or smaller, depending on how you count it) of our dataset. And we don't have a keyword inverted index or a page cache. Even at 64GB of RAM each, we're still on the same order of magnitude.
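Back-of-envelope, using the assumed machine count and RAM figures above (not known specs):

```python
# Back-of-envelope estimate from the assumptions above: ~1,000 commodity
# machines with 16 GB of RAM each.
machines = 1_000
ram_gb = 16
print(machines * ram_gb / 1_000, "TB")  # -> 16.0 TB; 64 GB each gives 64 TB
```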
All of this (especially the research above) suggests that we are doing the right things and are providing a very reasonable view of the web.
Hmm. Sounds like you've done your homework. :)
I trust your judgement.
Ultimately it comes down to if the tool is useful for your purposes.
So really it's I who have to trust your judgment :)
I hope we can live up to that standard.
Haha. You're too kind, being all 'servant-ish'. :D
I was just throwing out some ideas. But I guess you already tackled them pretty easily, if you have the whole web in your hands. :)
The web being 16 TB? That sounds pretty small. With memory doubling every year, we'll hold 16 TB in a few years. :P
I love Linkscape. I really do. I just think there might be some data missing. Many times when I use a Linkscape credit and compare it to Y!SE, it's missing ~20-30% of the links. And with Google being a bigger dog than Yahoo, it could be more.
But again, maybe it's just politics. Maybe G is trying to impress everyone. Idk.
Just one micro-perspective from a little guy. :) I trust your judgement. You're a far smarter guy than me about SEs. You've actually read some of the patents; I've read the summaries. :P
Keep up the great work.
This is actually a really good discussion to have :)
We're definitely still missing plenty. When I say we're in the neighborhood, I mean we have a big chunk of the data; we have enough data to be extremely relevant (see the numbers above). But we are still growing to be sure.
Please, keep the feedback coming!
That's all folks! :)
This site is very good and very useful.
Hey Nick,
Excellent post on the update; we (along with many other people, no doubt) have been waiting for some data refinement for a while now. It's excellent to see Linkscape evolving; even at its inception it proved to be one of the most useful tools available, despite requiring a bit of calculated guesswork.
Once the data is satisfactory (in terms of size, quality and accuracy) I would also love to see some historical comparison between reports.
Hey Nick, those are great Linkscape improvements. Well done; I love the data Linkscape returns.
You mentioned:
'Of course, we discount nofollows, while Y!SE (in our experience) includes them.'
In my experience I have also noticed Y!SE returning results for nofollow links, even for poor-quality nofollow links. I think it is great that Linkscape discounts nofollows, as they can hamper clarity when analysing inbound links, unless those theoretical 'nofollow' links actually turn out to be less 'nofollow' and more 'dofollow' than we think they are. : )
Lastly, I agree with Andrew that it would be good to have the ability to compare several reports for the same website over a set period, e.g. 6 months... otherwise I end up making local copies of Linkscape reports and even printing them out and archiving them for better benchmarking.
This is a good point. I know I've seen some good discussion around how nofollow is interpreted. We will continue to refine how we interpret it to match the "relevant web" (i.e. rankings).
I want to leave myself plenty of wiggle-room for how we treat all of these things.
Will wrote a pretty good post regarding the same (and lots of other stuff).
Very helpful tool and glad to see improvements.
I'm new, so I haven't seen or used this tool before; just playing with it now.
Feel free to shoot a PM or twitter message my way if you have any questions or need some tips.
That goes for everyone. I've seen Q+A submissions about "how to use Linkscape". No need to spend a credit.