As the web gets more complex, with JavaScript framework and library front ends on websites, progressive web apps, single-page apps, JSON-LD, and so on, the surface area for things to go wrong keeps growing. When all you've got is HTML, CSS, and links, there's only so much you can mess up. However, in today's world of dynamically generated websites with universal JS interfaces, there's a lot of room for errors to creep in.
The second problem we face with much of this is that it's hard to know when something's gone wrong, or when Google's changed how they're handling something. This is only compounded when you account for situations like site migrations or redesigns, where you might suddenly archive a lot of old content, or re-map a URL structure. How do we address these challenges then?
The old way
Historically, the way you'd analyze things like this is by looking at your log files with Excel or, if you're hardcore, Log Parser. Those are great, but they require you to already know you've got an issue, or to happen to be looking at a section of logs that contains the issue you need to address. Not impossible, and we've written about doing this fairly extensively both on our blog and in our log file analysis guide.
The problem with this, though, is fairly obvious. It requires that you look, rather than making you aware that there's something to look for. With that in mind, I thought I'd spend some time investigating whether there's something that could be done to make the whole process take less time and act as an early warning system.
A helping hand
The first thing we need to do is to set our server to send log files somewhere. My standard solution to this has become using log rotation. Depending on your server, you'll use different methods to achieve this, but on Nginx it looks like this:
# time_iso8601 looks like this: 2016-08-10T14:53:00+01:00
if ($time_iso8601 ~ "^(\d{4})-(\d{2})-(\d{2})") {
    set $year $1;
    set $month $2;
    set $day $3;
}

access_log /var/log/nginx/$year-$month-$day-access.log;
This allows you to view logs for any specific date or set of dates by simply pulling the data from files relating to that period. Having set up log rotation, we can then set up a script, which we'll run at midnight using Cron, to pull the log file that relates to yesterday's data and analyze it. Should you want to, you can look several times a day, or once a week, or at whatever interval best suits your level of data volume.
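Here's a minimal sketch of what that nightly script could start with, assuming the dated filename pattern from the Nginx block above and the standard combined log format; the directory and field names are placeholders rather than anything prescribed:

#!/usr/bin/env python3
# Minimal sketch: parse yesterday's access log (combined log format) into dicts.
# The path pattern matches the dated access_log directive above; adjust as needed.
import re
from datetime import date, timedelta

LOG_DIR = "/var/log/nginx"  # assumption: same directory as the config above

# Combined format: ip - user [time] "METHOD /path HTTP/x" status bytes "referrer" "user-agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def load_entries(day=None):
    """Return a list of parsed request dicts for the given date (default: yesterday)."""
    day = day or (date.today() - timedelta(days=1))
    path = f"{LOG_DIR}/{day:%Y-%m-%d}-access.log"
    entries = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LINE_RE.match(line)
            if match:
                entry = match.groupdict()
                entry["status"] = int(entry["status"])
                entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
                entries.append(entry)
    return entries

Scheduled from cron (e.g. 0 0 * * * /usr/bin/python3 /path/to/log_report.py, with the script path being whatever you choose), the checks below can all work from the list this returns.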
The next question is: What would we want to look for? Well, once we've got the logs for the day, this is what I get my system to report on:
30* status codes
Generate a list of all pages hit by users that resulted in a redirection. If the page linking to that resource is on your own site, update the link to point at the actual endpoint. Otherwise, get in touch with whoever is linking to you and ask them to point the link where it should go.
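As a rough illustration of this check, working off the parsed entries from the sketch above (whose field names are assumptions, not anything from the post):

# Sketch: list redirected URLs, with the pages that link to them, so the links can be fixed.
from collections import Counter

def redirect_report(entries):
    """Count 30* requests per path and note where they were linked from."""
    hits = Counter()
    referrers = {}
    for e in entries:
        if 300 <= e["status"] < 400:
            hits[e["path"]] += 1
            referrers.setdefault(e["path"], set()).add(e["referrer"])
    for path, count in hits.most_common():
        print(f"{count:>6}  {path}  linked from: {', '.join(sorted(referrers[path]))}")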
404 status codes
Similar story. Any 404ing resources should be checked to make sure they're supposed to be missing. Anything that should exist can be investigated to find out why it isn't resolving, and links to anything genuinely missing can be treated the same way as the 301/302s above.
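The 404 check is the same idea, filtered to missing resources; again, a sketch against the parsed entries from earlier:

# Sketch: report 404ing paths and the referrers sending traffic to them.
from collections import Counter

def not_found_report(entries):
    misses = Counter(e["path"] for e in entries if e["status"] == 404)
    for path, count in misses.most_common():
        refs = {e["referrer"] for e in entries if e["path"] == path and e["referrer"] != "-"}
        print(f"{count:>6}  {path}  referrers: {', '.join(sorted(refs)) or 'none'}")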
50* status codes
Something bad has happened and you're not going to have a good day if you're seeing many 50* codes. Your server is dying on requests to specific resources, or possibly your entire site, depending on exactly how bad this is.
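A sketch of the 50* check, with an arbitrary threshold you'd tune to your own traffic levels:

# Sketch: flag any path returning server errors, and whether the overall volume looks alarming.
from collections import Counter

def server_error_report(entries, alert_threshold=20):  # threshold is a placeholder value
    errors = Counter(e["path"] for e in entries if 500 <= e["status"] < 600)
    for path, count in errors.most_common():
        print(f"{count:>6}  {path}")
    return sum(errors.values()) >= alert_threshold  # True means "send the alert email"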
Crawl budget
A list of every resource Google crawled, how many times each was requested, how many bytes were transferred, and how long those requests took to resolve. Compare this with your sitemap to find pages that Google won't crawl, or that it's hammering, and fix as needed.
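A sketch of that comparison, assuming you've pulled a flat set of paths out of your XML sitemap (loading the sitemap is left out here, and time-to-resolve is omitted because it needs $request_time added to the Nginx log format):

# Sketch: what Googlebot actually crawled, versus what the sitemap says should be crawled.
# (For production use you'd verify Googlebot by reverse DNS, not just the user-agent string.)
from collections import Counter

def crawl_budget_report(entries, sitemap_paths):
    """sitemap_paths: a set of paths taken from your XML sitemap."""
    crawled = Counter()
    transferred = Counter()
    for e in entries:
        if "Googlebot" in e["agent"]:
            crawled[e["path"]] += 1
            transferred[e["path"]] += e["bytes"]
    for path, count in crawled.most_common():
        print(f"{count:>6} hits  {transferred[path]:>10} bytes  {path}")
    never_crawled = sitemap_paths - set(crawled)
    print("In sitemap but not crawled today:", ", ".join(sorted(never_crawled)) or "none")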
Top/least-requested resources
Similar to the above, but detailing the most and least requested things by search engines.
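Roughly, reusing the same parsed entries, with the caveat that the bot tokens listed are just illustrative user-agent substrings:

# Sketch: most and least requested paths by (self-identified) search engine bots.
from collections import Counter

BOT_TOKENS = ("Googlebot", "bingbot", "YandexBot", "DuckDuckBot")  # illustrative list

def bot_request_extremes(entries, n=20):
    counts = Counter(
        e["path"] for e in entries if any(bot in e["agent"] for bot in BOT_TOKENS)
    )
    return counts.most_common(n), counts.most_common()[:-n - 1:-1]  # top n, bottom n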
Bad actors
Many bots looking for vulnerabilities will make requests to things like wp-admin, wp-login.php, config.php, and other similarly common probe URLs, usually resulting in a string of 404s. Any IP address that makes repeated requests to these sorts of URLs can be added automatically to an IP blacklist.
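A sketch of that check; the probe substrings and threshold are examples only, and what you feed the result into depends on your setup:

# Sketch: collect IPs that repeatedly probe for known-vulnerable paths.
from collections import Counter

PROBE_PATTERNS = ("wp-login", "wp-admin", "config.php")  # illustrative list

def suspicious_ips(entries, min_hits=5):  # placeholder threshold
    probes = Counter(
        e["ip"] for e in entries
        if any(token in e["path"] for token in PROBE_PATTERNS)
    )
    return [ip for ip, hits in probes.items() if hits >= min_hits]

The resulting list could then be fed into whatever blocking mechanism you use (an Nginx deny list, a firewall rule, fail2ban, and so on).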
Pattern-matched URL reporting
It's simple to use regex to match requested URLs against predefined patterns and report on specific areas of your site or types of pages. For example, you could report on image requests, JavaScript files being called, pagination, form submissions (by looking for POST requests), escaped fragments, query parameters, or virtually anything else. Provided it's in a URL or HTTP request, you can set it up as a segment to be reported on.
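A sketch of how those segments might be defined; the patterns here are only examples of the idea:

# Sketch: bucket requests into named segments by regex, then count hits per segment.
import re
from collections import Counter

SEGMENTS = {  # example patterns only; define whatever matters on your site
    "images": re.compile(r"\.(?:png|jpe?g|gif|webp)$", re.I),
    "javascript": re.compile(r"\.js$", re.I),
    "pagination": re.compile(r"[?&]page=\d+"),
    "escaped_fragments": re.compile(r"_escaped_fragment_"),
}

def segment_report(entries):
    counts = Counter()
    for e in entries:
        for name, pattern in SEGMENTS.items():
            if pattern.search(e["path"]):
                counts[name] += 1
        if e["method"] == "POST":  # form submissions, per the idea above
            counts["form_posts"] += 1
    return counts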
Spiky search crawl behavior
Log the number of requests made by Googlebot every day. If it increases by more than x%, that's something of interest. As a side note, with most number series, a calculation to spot extreme outliers isn't hard to create, and is probably worth your time.
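One simple outlier check, sketched here against a running history of daily Googlebot counts (how you persist that history is up to you):

# Sketch: flag today's Googlebot request count if it sits far outside the recent average.
from statistics import mean, stdev

def crawl_spike(daily_counts, todays_count, sigmas=3):
    """daily_counts: list of Googlebot request totals for previous days."""
    if len(daily_counts) < 2:
        return False
    avg, spread = mean(daily_counts), stdev(daily_counts)
    return spread > 0 and abs(todays_count - avg) > sigmas * spread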
Outputting data
Depending on the importance of any particular section, you can then set the data up to be reported in a couple of ways. Firstly, large numbers of 40* and 50* status codes or bad-actor requests are worth triggering an email for. This lets you know in a hurry if something's happening that potentially indicates a large issue, so you can get on top of whatever it is and resolve it as a matter of priority.
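Sending the alert itself can be as plain as Python's standard smtplib; the addresses and mail host below are placeholders:

# Sketch: fire a plain-text alert email when a check trips its threshold.
import smtplib
from email.message import EmailMessage

def send_alert(subject, body, host="localhost"):  # assumes a local mail relay
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "logwatch@example.com"   # placeholder
    msg["To"] = "you@example.com"          # placeholder
    msg.set_content(body)
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)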
The data as a whole can also be reported on via a dashboard. If you don't have that much data in your logs on a daily basis, you may simply want to query the files at runtime and generate the report fresh each time you view it. On the other hand, sites with a lot of traffic and thus larger log files may want to cache each day's output to a separate file, so the data doesn't have to be recomputed every time. Obviously, the approach you take depends a lot on the scale you'll be operating at and how powerful your server hardware is.
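For the cached-output route, one simple option, sketched here with a placeholder directory, is to write each day's computed summary to a dated JSON file that the dashboard reads instead of re-crunching the raw log:

# Sketch: persist a day's computed summary so the dashboard never re-parses the raw log.
import json
from datetime import date, timedelta

REPORT_DIR = "/var/log/nginx/reports"  # placeholder location

def cache_summary(summary, day=None):
    day = day or (date.today() - timedelta(days=1))
    with open(f"{REPORT_DIR}/{day:%Y-%m-%d}.json", "w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=2, default=str)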
Conclusion
Thanks to server logs and basic scripting, there's no reason you should ever have a situation where something's amiss on your site and you don't know about it. Proactive notification of technical issues is a necessity in a world where Google crawls at an ever-faster rate, meaning it could start pulling your rankings down thanks to site downtime or errors within a matter of hours.
Set up proper monitoring and make sure you're not caught short!
This article will improve our insights analysis. Thanks! Actually, we use simpler practices. For example, we have the server notify us when the domain is down and for how long. Many times that ends with us suggesting the client change provider. Sometimes it has helped improve rankings for some keywords.
Hi there, nice tips! That happens with one of my clients. I'll suggest changing the server and see if that improves the technical SEO.
Good article, Pete, and very timely.
It's not just the increased complexity with fonts, JS libraries, etc. all over the place. It's also the additional risk. Each time you rely on a third party to serve something to your visitors, you're taking on more risk. True, it's unlikely Google's going to fail to serve the fonts your page needs, or that the path to the JS libraries will fail because of a DNS problem somewhere... but these things could happen; they are not impossibilities.
Such problems won't, of course, be captured in the logs. What's your suggestion for monitoring these and/or getting some early warning?
BTW, installing something like Piwik goes a long way to collecting and analysing the log data and it saves some of the steps you describe above.
Pretty good post.
Thank you, it helped me learn SEO.
Hi Pete, beautiful article.
But to be honest, I'm not that techy a guy, per se. I do these things only if I must and can't delegate to someone else at the moment. What do you think about using a service like Pingdom, for example? Wouldn't that be useful for someone like me? :D
Thanks for an informative article that will benefit all SEO specialists; a beautiful post.
Sounds like a good read to me. We've seen huge SEO impacts from little things; ultimately it all adds up. Thanks for sharing your post.
Thank you Pete, it's a really good post and a good help for all. :)
Automation is one of my favorite subjects, and linking it with SEO is like adding another dimension to automation. I liked the pointers mentioned in the post: the different status codes to trigger notifications, and parsing URL patterns to identify probable attacks or broken links. It reads like a set of requirement specs that even a WordPress developer would want to look at. Anyway, logs are a great tool for revealing the real health of any web application, something many of us may not give due credit. For many of my cloud-hosted web apps, I began exploring Papertrail, which is a nice tool for automated log analysis.
Thanks for posting such an informed article.
Great post! You have described SEO beautifully. I think this article is useful for everyone.
Thanks for sharing this technical post. It will be very helpful for SEO professionals.
Thanks a lot Pete, it's a really well-written article and a good help for all. We can save valuable time using this automated method.
Valuable post; it will be greatly helpful for SEO professionals looking to save time. Thanks for sharing.
Splendid article, Pete! Could I ask which tools/software you use for this, especially the dashboard creation tool you prefer?
Thanks
Thanks... really it is very useful.
Great advice Pete, thanks for sharing. This is something that SEO specialists will need to ensure they're up to speed with, as it does seem to be getting ever-more technical.
Great to see a technical article like this. As technology grows to help us, with so many amazing tools out there, I think it's still important to understand the basics of how to do things manually, and that is exactly what I took from this article. So many new opportunities to develop amazing stuff online mean an increase in the variety of technologies we have to crawl and analyze as marketers.
There have been many articles lately about how React, for example, is good or bad for SEO, and with Angular they've also focused on addressing some SEO issues in the new release.
Wonderful technical contribution to the community thanks!
Thanks Pete, good post. I concur with the above comment. There are a lot of tools on the market that help you pull data and summarise it pretty well. Then, as an SEO professional, your job is just to evaluate the logs and gain relevant insights.
Excellent information, thanks for the post.
Regards
Hi Pete,
Thanks... really, it is very useful. With some technical details, these kinds of records help individuals gain a better understanding of search engine optimization and of how search engines operate.
Very interesting... How can I get these log files? Can I do it from cPanel?
Cheers,
David
Good article and very interesting
You need to have control of all this, and it is clear that problem notifications are necessary.
Great and informative post. I'm an SEO expert and I think this post is a must-read for SEO professionals.
Thanks!!
There are days when I read the Moz blog and the only thing I know is that I know nothing, and that I still have a lot to learn on this subject. Many thanks, Pete.
Every blog post makes me feel like I know nothing...
I loved the post; I'm going to start visiting this site more often.
eadesign.art.br