It's ten o'clock. Do you know where your logs are?
I'm introducing this guide with a pun on a common public-service announcement that has run for years on late-night TV news broadcasts in the United States because log analysis is extremely newsworthy and important.
If your technical and on-page SEO is poor, then nothing else you do will matter. Technical SEO is the key to helping search engines crawl, parse, and index websites -- and thereby rank them appropriately -- long before any marketing work begins.
The important thing to remember: Your log files contain the only data that is 100% accurate in terms of how search engines are crawling your website. By helping Google to do its job, you will set the stage for your future SEO work and make your job easier. Log analysis is one facet of technical SEO, and correcting the problems found in your logs will help lead to higher rankings, more traffic, and more conversions and sales.
Here are just a few reasons why:
- Too many response code errors may cause Google to reduce its crawling of your website and perhaps even lower your rankings.
- You want to make sure that search engines are crawling everything, new and old, that you want to appear and rank in the SERPs (and nothing else).
- It's crucial to ensure that all URL redirections will pass along any incoming "link juice."
However, log analysis is something that is unfortunately discussed all too rarely in SEO circles. So, here, I wanted to give the Moz community an introductory guide to log analytics that I hope will help. If you have any questions, feel free to ask in the comments!
What is a log file?
Computer servers, operating systems, network devices, and computer applications automatically generate something called a log entry whenever they perform an action. In an SEO and digital marketing context, one type of action is whenever a page is requested by a visiting bot or human.
Server log entries are typically output in the standardized Common Log Format. Here is one example from Wikipedia with my accompanying explanations (a short command-line sketch follows the field breakdown):
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
- 127.0.0.1 -- The remote hostname. An IP address is shown, like in this example, whenever the DNS hostname is not available or DNSLookup is turned off.
- user-identifier -- The remote logname / RFC 1413 identity of the user. (It's not that important.)
- frank -- The user ID of the person requesting the page. Based on what I see in my Moz profile, Moz's log entries would probably show either "SamuelScott" or "392388" whenever I visit a page after having logged in.
- [10/Oct/2000:13:55:36 -0700] -- The date, time, and timezone of the action in question in strftime format.
- GET /apache_pb.gif HTTP/1.0 -- "GET" is one of the two commands (the other is "POST") that can be performed. "GET" fetches a URL while "POST" is submitting something (such as a forum comment). The second part is the URL that is being accessed, and the last part is the version of HTTP that is being used.
- 200 -- The status code of the document that was returned.
- 2326 -- The size, in bytes, of the document that was returned.
Note: A hyphen is shown in a field when that information is unavailable.
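Once you know the field positions, standard command-line tools can pull individual fields straight out of a raw log. Here is a minimal sketch that assumes a whitespace-delimited Common Log Format file named access.log (the filename is just a placeholder):

# Print the status code (ninth field) and requested URL (seventh field) of the first few entries
awk '{print $9, $7}' access.log | head

# Count how many requests each client IP address made, most active first
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head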
Every single time that you -- or the Googlebot -- visit a page on a website, a line with this information is output, recorded, and stored by the server.
Log entries are generated continuously, and anywhere from several to thousands can be created every second -- depending on the activity level of a given server, network, or application. A collection of log entries is called a log file (or, colloquially, "the log" or "the logs"), and it is displayed with the most recent log entry at the bottom. Individual log files often contain a calendar day's worth of log entries.
Accessing your log files
Different types of servers store and manage their log files differently. Here are the general guides to finding and managing log data on three of the most-popular types of servers (a quick command-line check follows the list):
- Accessing Apache log files (Linux)
- Accessing NGINX log files (Linux)
- Accessing IIS log files (Windows)
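If you have shell access, a quick sanity check is to tail the access log and watch new entries arrive. The paths below are only common defaults -- they vary by distribution and server configuration:

tail -f /var/log/apache2/access.log    # Apache on Debian/Ubuntu
tail -f /var/log/httpd/access_log      # Apache on RHEL/CentOS
tail -f /var/log/nginx/access.log      # NGINX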
What is log analysis?
Log analysis (or log analytics) is the process of going through log files to learn something from the data. Some common reasons include:
- Development and quality assurance (QA) -- Creating a program or application and checking for problematic bugs to make sure that it functions properly
- Network troubleshooting -- Responding to and fixing system errors in a network
- Customer service -- Determining what happened when a customer had a problem with a technical product
- Security issues -- Investigating incidents of hacking and other intrusions
- Compliance matters -- Gathering information in response to corporate or government policies
- Technical SEO -- This is my favorite! More on that in a bit.
Log analysis is rarely performed regularly. Usually, people go into log files only in response to something -- a bug, a hack, a subpoena, an error, or a malfunction. It's not something that anyone wants to do on an ongoing basis.
Why? Here is a screenshot of ours showing just a very small part of an original (unstructured) log file:
Ouch. If a website gets 10,000 visitors who each go to ten pages per day, then the server will create a log file every day that will consist of 100,000 log entries. No one has the time to go through all of that manually.
How to do log analysis
There are three general ways to make log analysis easier in SEO or any other context:
- Do-it-yourself in Excel
- Proprietary software such as Splunk or Sumo Logic
- The ELK Stack open-source software
Tim Resnik's Moz essay from a few years ago walks you through the process of exporting a batch of log files into Excel. This is a (relatively) quick and easy way to do simple log analysis, but the downside is that one will see only a snapshot in time and not any overall trends. To obtain the best data, it's crucial to use either proprietary tools or the ELK Stack.
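If you do go the Excel route, it helps to pre-filter the raw log down to just the bot entries before importing. A rough sketch, assuming an access log in the combined format:

# Keep only lines whose user agent mentions a major search-engine bot
grep -iE "googlebot|bingbot|yandex|baiduspider" access.log > bot-requests.log

You can then import bot-requests.log into Excel as a space-delimited file (set the text qualifier to a double quote so that the quoted request and user-agent fields stay intact).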
Splunk and Sumo Logic are proprietary log analysis tools that are primarily used by enterprise companies. The ELK Stack is a free and open-source set of three platforms (Elasticsearch, Logstash, and Kibana) that is owned by Elastic and used more often by smaller businesses. (Disclosure: We at Logz.io use the ELK Stack to monitor our own internal systems and as the basis of our own log management software.)
For those who are interested in using this process to do technical SEO analysis, monitor system or application performance, or for any other reason, our CEO, Tomer Levy, has written a guide to deploying the ELK Stack.
Technical SEO insights in log data
However you choose to access and understand your log data, there are many important technical SEO issues to address as needed. I've included screenshots of our technical SEO dashboard with our own website's data to demonstrate what to examine in your logs.
Bot crawl volume
It's important to know the number of requests made by Baidu, Bingbot, Googlebot, Yahoo, Yandex, and others over a given period of time. If, for example, you want to get found in search in Russia but Yandex is not crawling your website, that is a problem. (You'd want to consult Yandex Webmaster and see this article on Search Engine Land.)
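As a rough illustration of how you might pull these counts from a raw access.log at the command line (the filename and user-agent tokens are assumptions, and user agents can be spoofed, so a reverse-DNS check of the bots is still worthwhile):

# Count requests per major search-engine crawler by user-agent token
for bot in Googlebot bingbot Baiduspider YandexBot Slurp; do
    printf '%-12s %s\n' "$bot" "$(grep -ci "$bot" access.log)"
done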
Response code errors
Moz has a great primer on the meanings of the different status codes. I have an alert system set up that tells me about 4XX and 5XX errors immediately because those are very significant.
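A quick way to surface these errors from a raw log -- assuming, as in the Common Log Format example earlier, that the status code is the ninth whitespace-delimited field and the URL is the seventh:

# List 4XX and 5XX responses by status code and URL, most frequent first
awk '$9 ~ /^[45]/ {print $9, $7}' access.log | sort | uniq -c | sort -nr | head -20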
Temporary redirects
Temporary 302 redirects do not pass along the "link juice" of external links from the old URL to the new one. Almost all of the time, they should be changed to permanent 301 redirects.
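To find candidates for that cleanup, a sketch like this (same field assumptions as above) lists every URL that returned a 302:

# URLs answered with a temporary 302 redirect, most frequent first
awk '$9 == "302" {print $7}' access.log | sort | uniq -c | sort -nr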
Crawl budget waste
Google assigns a crawl budget to each website based on numerous factors. If your crawl budget is, say, 100 pages per day (or the equivalent amount of data), then you want to be sure that all 100 are things that you want to appear in the SERPs. No matter what you write in your robots.txt file and meta-robots tags, you might still be wasting your crawl budget on advertising landing pages, internal scripts, and more. The logs will tell you -- I've outlined two script-based examples in red above.
If you hit your crawl limit but still have new content that should be indexed to appear in search results, Google may abandon your site before finding it.
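One rough way to see where Googlebot is actually spending its requests is to rank the URLs it hits (matching on the user-agent string is only an approximation):

# Which URLs receive the most Googlebot requests?
grep -i "googlebot" access.log | awk '{print $7}' | sort | uniq -c | sort -nr | head -25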
Duplicate URL crawling
The addition of URL parameters -- typically used in tracking for marketing purposes -- often results in search engines wasting crawl budgets by crawling different URLs with the same content. To learn how to address this issue, I recommend reading the resources on Google and Search Engine Land here, here, here, and here.
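To gauge how much of that crawling is parameter-driven, a sketch like this groups Googlebot's requests for parameterized URLs by their base paths (query strings stripped):

# Parameterized URLs crawled by Googlebot, grouped by path without the query string
grep -i "googlebot" access.log | awk '$7 ~ /\?/ {sub(/\?.*/, "", $7); print $7}' | sort | uniq -c | sort -nr | head -20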
Crawl priority
Google might be ignoring (and not crawling or indexing) a crucial page or section of your website. The logs will reveal what URLs and/or directories are getting the most and least attention. If, for example, you have published an e-book that attempts to rank for targeted search queries but it sits in a directory that Google only visits once every six months, then you won't get any organic search traffic from the e-book for up to six months.
If a part of your website is not being crawled very often -- and it is updated often enough that it should be -- then you might need to check your internal-linking structure and the crawl-priority settings in your XML sitemap.
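A simple way to see that distribution of attention is to count Googlebot hits per top-level directory (again assuming the Common Log Format and an access.log file):

# Googlebot requests per top-level directory of the site
grep -i "googlebot" access.log | awk '{print $7}' | awk -F/ '{print "/" $2}' | sort | uniq -c | sort -nr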
Last crawl date
Have you uploaded something that you hope will be indexed quickly? The log files will tell you when Google has crawled it.
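For a single URL, a sketch like this prints the timestamp of Googlebot's most recent request (the path /my-new-ebook/ is purely a hypothetical example):

# Timestamp of the last Googlebot request for a specific (hypothetical) URL
grep -i "googlebot" access.log | grep " /my-new-ebook/ " | tail -1 | awk '{print $4, $5}'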
Crawl budget
One thing I personally like to check is Googlebot's real-time activity on our site, because the crawl budget that the search engine assigns to a website is a rough indicator -- a very rough one -- of how much it "likes" your site. Google ideally does not want to waste valuable crawling time on a bad website. Here, I could see that Googlebot had made 154 requests of our new startup's website over the prior twenty-four hours. Hopefully, that number will go up!
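If you want the same rough check from the command line -- assuming one log file per day and GNU grep:

# Total Googlebot requests in today's log file
grep -ci "googlebot" access.log

# Watch new Googlebot hits arrive in real time
tail -f access.log | grep -i --line-buffered "googlebot"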
As I hope you can see, log analysis is critically important in technical SEO. It's eleven o'clock -- do you know where your logs are now?
Additional resources
- Log File Analysis: The Most-Powerful Tool in Your SEO Toolkit (Tom Bennet at BrightonSEO)
- SEO Finds in Your Server Log (part two) (Tim Resnik on Moz)
- Googlebot Crawl Issue Identification Through Server Logs (David Sottimano on Moz)
- More information on the Logstash and Kibana parts of the ELK Stack (Logz.io)
What a fantastic post, Samuel.
So few SEOs have practical guides for log file analysis, and although I know there are some out there, this is about as comprehensive as any I have read. There is nothing more that I can add to this, as it is pretty much what I already do.
About to share on Twitter for you too :)
-Andy
Thanks for the nice comment -- and the Twitter share!
For any marketing tool or process, there is an almost infinite number of potential uses, so I'm happy to hear from another log analyzer that I covered them all. I was worried I might have missed a few! :)
Hello Andy and Samuel,
First of all, I totally agree that this is a great post. Whilst there's a lot of information you can gather from Google Analytics and the Webmaster Tools ("search console" sorry), sometimes it's good to get your hands dirty and get down to the real nitty-gritty. It's amazing what you can find.
One guide which I've used before is this 2009 WebmasterWorld thread on "log walking": https://www.webmasterworld.com/google_adsense/3830... The thread is old, but the practice and principle of "digital dumpster diving" still hold true.
Hi, I wrote a post about how to do log analysis, but it is in Spanish :(
You can use Google Translate; the post is here:
https://www.mecagoenlos.com/Posicionamiento/usar-lo...
And a video example :)
https://www.mecagoenlos.com/logs.avi
Excellent article.
Your article is also good, Errioxa; it's even better that it is in Spanish, so I can read it fluently.
This is one of the techniques I have not seen many SEOs use, largely because it involves getting your hands dirty. Thanks for sharing.
Greetings.
Great post! As a long-term developer, I know that log files contain everything, and they're the only source that we can trust.
Because log analysis can be a boring procedure, external tools can be used. One of them is the well-known Sawmill. There is also other analysis software on the market, but all of it is expensive.
For small companies, there is a solution too! Using *nix tools (sort, uniq, and awk/sed) can save a lot of time. A few examples:
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -fr
this will sort all user agents on their frequency
awk '{print $9}' access.log | sort | uniq -c | sort
this will show all HTTP statuses
awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort
this will shown only 404 requests
awk -F\" '{print $2}' access.log |sort|uniq -c|sort -nr |head -10
this will return top 10 URL requested
Of course combinations is almost endless. But sad news is that only *nix users (Linux, BSD, OSX and rest) can use commands native. For Windows users they must install cygwin package and then use that commands.
Great article -- I should be able to create my own system that updates me on Googlebot's on-site activities. That's a bit of data harvesting that I had not considered. I could turn it into a dashboard component of some description. It should not be that hard. Excellent -- thanks for the insight.
Great post Samuel!
It is one of the best guides to log file analysis I've read.
Thanks to your article, I have new knowledge for the SEO world.
Thank you!!!
It's amazing how much information I have gained thanks to your guidance, and I haven't even started yet. I've never fully trusted the results of the Google tools; now I have reliable data. Thank you very much for the information. Regards.
Great post... I do agree that many of us don't really focus on logs or try to figure out what is going on with the search engine bots beforehand until we get some alert.
Good work.
Hi Scott, very good post. A little note: Google seems not to consider priority and frequency in the sitemap https://www.seroundtable.com/google-priority-chang...
For logs, I created a simple Windows tool called "GrepMrx": if you want, you can try it (it's fully free) to grep big log files, filtering by regexp.
Thanks
Thanks for the reference! To be honest, I had never seen John Mueller's statement on that before. As a result, I don't know enough to have an opinion -- I never really want to give thoughts on something I don't know.
So, I'll throw it open to Mozzers! Thoughts, anyone?
I think that, for sitemaps today, priority, frequency, and the other values are all effectively determined by the navigation tree, internal links, and backlinks.
Thanks!
This post is great! Although it's very technical, it's really easy to understand. Thank you for sharing this valuable information :)
Great article. I'm keen to try the ELK stack and see if it is scalable for analyzing very large websites (with billion+ pages).
I'm also wondering about the same questions as @Oneclickhere... are you recommending blocking CSS resources? Google has made it clear that we should keep JS and CSS files crawlable.
Thanks for the comment! That's an important point -- I should have been more precise with my example.
Google DOES need to crawl JavaScript and CSS -- but other things such as infinite calendar scrolling should be blocked.
Hi Samuel.
great post! Thank you.
Are you able to find out the exact paths that Google follows, to identify issues with internal linking? Or are you able to check each page's crawling against its "noindex" status?
Are you able to "draw" a website's map/structure graph with this tool?
Are there any other SEO-related benefits from such analysis that you didn't mention in the article?
Thanks in advance :)
Br,
Roman
Roman, thank you for the comment. I really don't want to bore the community by talking about my company's product. So, feel free to e-mail me at samuel (at) logz.io about the first three questions.
However, I'm always open to questions about log analysis in general! Regarding your last question, I think I covered every SEO use of log analytics -- but I'm always open to suggestions from Mozzers in case I missed anything. :)
You say:
'"GET" is one of the two commands (the other is "POST") that can be performed.'
However, there are more than just those two "methods" in the HTTP protocol. The most important one in this case is the HEAD request. It is often used by crawlers if they only want to check the availability of a page.
You can find more information about HTTP methods at https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.h...
Yes, I should have been more precise in my description there. Thanks so much for the clarification! :)
Great article; it has given me a long list of what I now need to check.
I now need to convince the developers to give me access to the log files -- that will be a bigger challenge than analysing the data.
Yes, just getting the data can be a chore -- especially if you want to review the logs on a regular basis. It's why I recommend that one use either proprietary software or the ELK stack and set them up once with log-file feeds.
Great analysis, Samuel. I have a question here: we know that we cannot control Google's crawl delay from robots.txt, but what about the option given in GWT? Does it work from there?
This is an interesting article about SEO log analysis. Thank you so much for your clarity and the easy way you explain this process.
Thanks a lot, this is great!
Nice post!
If you are familiar with PIWIK, you may also try one of its features: 'Log Analytics'.
Crikey, this is serious stuff!
Great article, it has given me a long list of what I now need to check.
Great Post! You can find a lot of SEO related posts out there, but this one is pretty technical and talks about aspects you don't easily find elsewhere. Thanks!
I'm not in my IT department, and they tell me that our site's log files are too big to pull. I am not an expert, so is this something you think could happen?
The log files themselves might be very large indeed -- it's why the best solution is to set up automatic, continuous feeds one time into one's desired proprietary log analysis software or the ELK Stack.
That's amazing! Log analysis is very important for larger websites; Web Log Explorer is very good for log analysis.
Good article. It focuses a lot on technical analysis of the data, but it doesn't hurt to remember periodically that data analysis is an essential part of SEO. Without a good analysis of our strategy's data, it will be very difficult to get results.
These files exist typically for technical site auditing & troubleshooting, but can be extremely valuable for SEO auditing as well.
I agree -- it's why log analysis is one of the most under-discussed topics in SEO!
Hi Samuel, I haven't read the entire article yet, but the headline intrigued me, so I bookmarked it for later :) However, as I skimmed through the article, I noticed you mentioned the words "crawl budget waste" on a .css file... So, to draw a quick conclusion, one would block that .css file from being crawled, but then Google has a fit when doing a PageSpeed Insights test and says your website is blocking CSS or JS files. How do we "safely" block CSS and JS files from being crawled without annoying Google?
Thanks for the comment! I'm copying in part my response to a Mozzer with a similar question.
I should have been more precise with my example. Google DOES need to crawl JavaScript and CSS -- but other things such as infinite calendar scrolling should be blocked.
Hello, Mozzers! I'm the author of this essay -- I'd love to hear any thoughts and comments you have on the post. Feel free to post, and I'll respond as soon as I can. :)
Great article, Samuel, thank you for sharing! Just a question about wasting resources: in your opinion, is there a way to reduce but not stop crawler hits on a certain resource? I usually prefer to avoid robots.txt rules just to be compliant with John Mueller's advice...
I'd suggest looking at the crawl-priority settings in your XML sitemap. You can set different resources to be crawled daily, monthly, or at other intervals. Of course, these are only suggestions -- Google can still decide on its own to crawl at its own rates.
Hi Samuel,
Great post about using server logs for SEO. Nice and interesting to see the 'hook' for clients: too many errors and duplicate content issues waste your (daily) crawl budget and stand in the way of your website getting indexed right.
Thanks!
-Bob
I always have questions about server-side log entries.
We know the client-side logs, but we have not identified the effect of our work on the server side.
What does the server actually do after our work is done? What types of effects does our content have on the server side?
The main thing is the caching of our website's content.
Hello,
Can I do all of this from the cPanel of my host?
I have www.CatalogServicii.com hosted at www.claus.ro
THANK YOU for including Nginx links! I moved my personal site's web server to Nginx last year, and this is the first article I've not had to follow with immediate research :)
My pleasure! I wanted to include three of the most-popular types of servers for exactly that reason. :)
Great article. I never thought about analyzing the response codes of my 'crawl budget'.
thank you for giving me the most amazing thing I've ever seen
Server log files are the most amazing thing you've ever seen?