Ever since I first saw Matt Cutts, Google's head of search quality, "investigating" sites through his super-secret application at an SES conference in NYC, calling out spammers and identifying crawl and ranking issues for curious site owners, I've wondered what's in his tool collection. What secrets can Googlers pull up on command? How much do they really know about you, your sites and the web? To answer these questions, I've put together my own speculative guide to the features, operations and data points available to Mr. Cutts and his team, along with my guess at the relative likelihood that each element is actually available on demand (high, moderate or low probability).
All the things that could be in a Googler's toolbox (in Rand's opinion):
- General Site Data
- How many pages Google estimates you have plus how many they've crawled and indexed | HIGH
- Crawl rates for your site and how often they think you update your content in various sections | HIGH
- Registration date and first-crawled date | HIGH
- Server codes your pages return (or have returned in the past) - 200s, 404s, 301s, 302s, 500s, etc. and any server errors or crawl problems Google's encountered | HIGH
- Anything you've submitted to Google's Webmaster Central (aka sitemaps) | HIGH
- Any bans, penalties or "red flags" Google has given the site | HIGH
- Domains/pages that have been 301/302'd to your site | HIGH
- An estimate of visitor traffic to your site in total (I'd think it would be easy for the search giant to use analytics data and compare that against their own search data and toolbar information to come up with a good formula for virtually any site - they almost certainly beat the pants off Alexa) | MODERATE
- How well you rank for specific terms | MODERATE
- An estimate of how much traffic Google sends to your site | LOW
- Any advertising you buy/sell through Google (AdWords & AdSense accounts) | LOW
- The top search terms that bring your site traffic | LOW
- Site Owner Information
- How many sites you (or entities sharing your name, phone number, address or other registration data) own and what number of these are active | HIGH
- Length of domain registration | HIGH
- Where your domains are hosted | MODERATE
- AdSense/AdWords accounts you run or have access to (via the Google universal login protocol) | MODERATE
- Analytics, Toolbar, Sitemaps, Web Accelerator or other Google accounts you might have/use | LOW
- IP addresses from which you've logged into a Google service | LOW
- Link Information
- Complete list of sites and pages on the web that link to your site | HIGH
- PageRank data, probably with additional components that weren't in the original formula | HIGH
- Percentage of your links that come from suspected manipulative sources (paid links, linkfarms, comment spam, ad networks, etc.) | HIGH
- Complete list of links to other sites/pages from your domain | HIGH
- Quality, trust, authority and relevance of sites you link out to | HIGH
- Where the majority of links point (home page, internal pages, a few particularly popular pages, etc.) | MODERATE
- Temporal data on links - the rate at which new links are coming to your site now and what those rates looked like in the past | MODERATE
- Variation of anchor text in links that point to your site | MODERATE
- Percentage of your links that come from blogs, wikis, forums, guestbooks or other potentially self-created sources | LOW
- Historical Data
- Historical PageRank data | HIGH
- Site ownership/registration changes in the past | HIGH
- If/when any of Google's algorithmic updates affected your site's rankings/traffic/crawl-rate (this would be a great signal to help identify why your site might have been penalized as the various updates all had particular foci) | HIGH
- How well you've ranked in Google over time | MODERATE
- Hosting/IP changes your site has made over time | MODERATE
- Historical traffic levels to your site | LOW
What do you think? Any other signals that Matt might be accessing on his laptop during site reviews?
Don't forget favorite color and personality profile. Keep those suggestions coming. ;)
We do what we can. :)
BTW - How about that traffic estimate number? I gave it a "moderate," but as I'm looking at your profile picture, it seems to nod when I ask it if you've got that one working...
Completely off topic but any chance you can make your site play nicer with co.mments? please pretty please.
I can read the full comments at other places where I comment, but here I just get "reply to so and so". Click the plus next to the orange thing to see what I mean.
https://co.mments.com/people/graywolf
Quality rater's comments
I never thought "How many sites you (or entities sharing your name, phone number, address or other registration data) own and what number of these are active" was this important to Google. Can anyone explain more about this? I know Google considers more than 200 signals when ranking a website in its index. Are these factors still valid?
- How well you rank for specific terms | MODERATE
- The top search terms that bring your site traffic | LOW
Maybe I'm misreading this but these two are inside the sitemaps console.
Yeah, but would they really help with spam detection? I'm not saying that they can't access them, I'm just trying to predict what's actually in Matt's console.
A high organic ranking but a low organic CTR would seem to be a good indicator of low quality. So have a list for the top 100 terms you rank for, match them to positions, and determine if you have a higher or lower CTR based on average CTR rates for those positions.
https://www.jimboykin.com/click-rate-for-top-1...
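The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Google is known to run: the baseline CTR figures, the 0.5 threshold and the function names `ctr_ratio` and `underperforming_terms` are all assumptions made up for the example.

```python
# Hypothetical baseline CTR by ranking position. These numbers are invented
# for illustration and are not taken from any published study.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


def ctr_ratio(position, impressions, clicks, baseline=EXPECTED_CTR):
    """Return observed CTR divided by the baseline CTR for that position.

    Returns None when there is no baseline or no impressions to compare.
    """
    expected = baseline.get(position)
    if not expected or impressions <= 0:
        return None
    return (clicks / impressions) / expected


def underperforming_terms(rows, threshold=0.5):
    """List terms whose CTR is below `threshold` times the baseline.

    `rows` is an iterable of (term, position, impressions, clicks) tuples.
    """
    flagged = []
    for term, position, impressions, clicks in rows:
        ratio = ctr_ratio(position, impressions, clicks)
        if ratio is not None and ratio < threshold:
            flagged.append(term)
    return flagged


sample = [
    ("blue widgets", 1, 1000, 290),   # close to the assumed baseline
    ("cheap widgets", 2, 1000, 30),   # far below the assumed baseline
    ("widget reviews", 9, 500, 10),   # no baseline for position 9, skipped
]
print(underperforming_terms(sample))  # ['cheap widgets']
```

A ratio well under 1.0 across many of a site's top terms would be the "ranks high, gets ignored" pattern the comment describes; a single low term on its own probably says very little.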
OK, good point - I concede that it would be valuable. Let's see what the next commenter has to say.
Rand,
Not only do they do this ....
"Temporal data on links - the rate at which new links are coming to your site now and what those rates looked like in the past"
But I believe that they also compare your link growth to your industry (competitors). This is how they know what is natural for your industry.
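One way to picture that kind of comparison is the rough sketch below. It assumes you have cumulative inbound link counts per period for a site and for a handful of competitors; the helper names `growth_rate` and `relative_link_velocity`, and the choice of the median as the industry baseline, are assumptions for illustration rather than anything Google has confirmed.

```python
from statistics import median


def growth_rate(link_counts):
    """Average new links per period, given cumulative counts per period."""
    if len(link_counts) < 2:
        return 0.0
    return (link_counts[-1] - link_counts[0]) / (len(link_counts) - 1)


def relative_link_velocity(site_counts, competitor_counts):
    """Ratio of a site's link growth to the median growth of its competitors.

    Returns None when there are no competitors or they show no growth.
    """
    rates = [growth_rate(counts) for counts in competitor_counts]
    if not rates:
        return None
    baseline = median(rates)
    if baseline <= 0:
        return None
    return growth_rate(site_counts) / baseline


# Made-up cumulative link counts over three periods.
site = [100, 150, 400]
competitors = [[200, 220, 240], [50, 60, 80], [500, 530, 550]]
print(relative_link_velocity(site, competitors))  # 7.5
```

Here the site gains links 7.5 times faster than the typical competitor. Whether that reads as a viral hit or a link-buying spree would depend on other signals.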
- Google can also learn the visitor behaviour on the SERPs (high)
- their experience with AdWords (moderate)
- how often they return from your site back to the SERPs (moderate/low)
- how often they use navigational searches to reach your website (whether general, low-competition or brand/company name) (low)
Potentially, they can analyze what others say about the site product or the company (whether good or bad), but that may not be 100% complete (low)
Overall, I think you have left out the visitor factor of the website. They do have lots of personal information about your site visitors (along with the information about you). They can tell how well your business is doing, not just traffic stuff.
How about ....
- duplicate content ratios (per site, per page): in-site duplicated content + cross-site duplicate content (+ perhaps if those other sites belong to the same link-circles or not): very high, imho
- content concentration ratio: how well the content is similar across the whole site (perhaps also taking into account the links). Is the site targeted or very broad?
- link-circle analysis: which sites are forming link-circles with the site + how large those circles are + which kinds of sites belong to the same circles
- full penalty information: I believe Google has a lot of small "penalties" that influence almost every site.
- nofollow-analysis: what is the site linking to but trying to block? (On purpose or not?)
- potentially misleading content / code: flags for potential hidden text/links (CSS triggers), flags for potential misuse of javascript (hidden text, sneaky redirects, etc)
- full cache data: see any cached page at any known time in the past (archive.org on steroids). How many times have you heard "It isn't listed and I have no idea why" only to see that they just removed 200 spammy footer links, etc ...
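For the duplicate content ratio in particular, a common textbook approach is to compare word "shingles" (overlapping n-grams) using Jaccard similarity. The sketch below assumes that technique purely for illustration; how Google actually measures duplication isn't public, and the function names and the shingle size of 3 are arbitrary choices.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplicate_ratio(page_a, page_b, size=3):
    """Jaccard similarity of two pages' shingle sets, from 0.0 to 1.0."""
    a, b = shingles(page_a, size), shingles(page_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


print(duplicate_ratio("the quick brown fox jumps",
                      "the quick brown fox sleeps"))  # 0.5
```

Run pairwise across a site's own pages, this would give an in-site ratio; run against pages on other domains, a cross-site one.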
Excellent additions - particularly on duplicate content, content focus and link relationships (reciprocal, circular, etc.). I need to go edit the post :)
I'd have to agree, I think even most if not all of your lows are moderates to high as well.... it may be more of a matter of how much is important enough to be a top level view versus a drill down.
So does this make Matt the modern version of James Bond, with some behind the scenes Q character who puts together this whizz-bang arsenal... "Now Matt, be very careful with this."
Even the data you have classed as "LOW" must be available as a drill down from the main console page.
I expect a probability score on whether the site sells links or buys links may be a recent addition.