My September 2016 MozCon presentation and notes.
Server Log Files & Technical SEO Audits: What You Need to Know
Server log files contain the only data that is 100% accurate in terms of how Google and other search engines crawl your website. Sam will show you, using Raven Tools, what and where to check in your log files to identify problems you may need to fix to maximize your rankings and organic traffic. (Source: https://raventools.com/site-auditor/)
Slide 2: The information that all of you see in Google Analytics is false.
Server logs contain the only data that is 100% accurate in terms of how both people and search engines are accessing your website. Here’s one example why.
Slides 3-4: Google Search Console says that Googlebot crawled 662 pages on my company’s website on July 19th.
Guess what the actual server logs reveal? 546 requests from Googlebot on that day. And this is something that is minor — there are often many more disparities between what Google says and what is actually in your server logs. If you rely only on Google Analytics, you have bad data. You can take the red pill or the blue pill when deciding what data set to use.
What does all of this mean? Before I go further into server logs, I will briefly explain log data in general.
Slide 5: Every machine, device, server, and network in an IT environment continuously outputs log data, which is essentially an ongoing record of performance activity. This information is being created and stored every single second of every single day. Well, at least whenever the machine is running.
Think about the implications of the bad data in Google Analytics when you track traffic, conversions, downloads, signups, and transactions. You are creating and executing online marketing strategies based on bad data from Google.
Slide 6: And I’m not the only one who has noticed.
You see the mention of the open source ELK Stack? More on that in a minute.
Now, why is Google Analytics information wrong? Google Analytics uses client-side code to gather data while log files are direct, server-side information. Two completely different sets of data are coming from two completely different locations. If the Google Analytics code is blocked or otherwise does not execute in a given session or visit, you have no information in Google Analytics on what is occurring.
In the past, most marketers would analyze their server logs by exporting a batch to an Excel spreadsheet. But that’s wrong because a single batch of server logs shows only a moment in time.
Slide 7: To do server log analysis the proper way, you need to access and visualize your log data. In our company, we use the open-source ELK Stack of Elasticsearch, Logstash, and Kibana. Logstash is a log pipeline tool that accepts and then exports the data to Elasticsearch, which is a searchable database. Kibana then visualizes the parsed data in a dashboard like this.
Slide 8: This dashboard allows me to see everything that is happening on our website and in our servers in real time. You can do the same.
Now, your system administrators or devops engineers would likely be the ones who would set up this platform, so the last slide of this deck contains links to resources on installing and using the open-source ELK Stack for server log analysis.
Slide 9: But why do log analysis in the first place? There are many use cases, including IT operations (especially in SaaS and other online businesses that need to be running perfectly every single second), security and compliance, business intelligence, and marketing.
And that last one is why we’re all here.
Slide 10: What’s my favorite use of log data — and probably yours too? Server log analysis! Here’s what a single server log line usually looks like when Googlebot — or anyone — makes one server request of one item on a website.
- White — the IP address
- Blue — timestamp
- Green — method (GET, POST, etc.)
- Red — uniform resource identifier (URI) such as a URL or something else that is being requested
- Orange — HTTP status code
- White — size of the file returned (in this example, it is zero)
- Purple — the browser and the user-agent that is making the request (Googlebot, etc.)
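To make those fields concrete, here is a minimal Python sketch that splits one log line into the parts listed above. It assumes the widely used "combined" access log layout; your server may be configured differently, and the sample line, the `LOG_PATTERN` name, and the `parse_log_line` helper are illustrative inventions rather than anything from the slides.

```python
import re
from datetime import datetime

# Assumed layout: the common "combined" access log format.
# Adjust the pattern if your server writes a different format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["timestamp"] = datetime.strptime(
        entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    entry["status"] = int(entry["status"])
    # Some servers write "-" instead of 0 when no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# A made-up example line, for illustration only.
SAMPLE_LINE = (
    '66.249.66.1 - - [19/Jul/2016:10:15:32 +0000] '
    '"GET /blog/ HTTP/1.1" 200 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

if __name__ == "__main__":
    for field, value in parse_log_line(SAMPLE_LINE).items():
        print(f"{field}: {value}")
```

Lines that do not fit the assumed layout come back as `None`, so a real script can count and inspect them instead of silently dropping them.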
Slide 11: Every single server request of anything — a website page or image or file — outputs a server log line. For example: If your website receives 10,000 visitors a day and each person or bot views an average of ten pages, then your server will generate 100,000 log entries every single day. And all of those items together constitute a server log file. One server log file is usually created and stored every calendar day.
Slide 12: While you ship, visualize, and analyze all of your server log data, here are the things you need to check specifically in an SEO context. Let’s take these six items one by one.
Slide 13: Bot crawl volume. If the number of times that a search engine is visiting your site suddenly drops, check your robots.txt file, your XML sitemap, and your meta-robots tags. You should also visit the support pages of that search engine.
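As a rough illustration of watching crawl volume, the Python sketch below counts requests per day whose user-agent string contains a bot name and flags days where the count falls sharply. It assumes combined-format log lines; the function names and the 50% threshold are hypothetical choices. Keep in mind that a user-agent string can be spoofed, so this counts requests that claim to be the bot.

```python
import re
from collections import Counter

# Assumes combined-format lines: the day sits inside [...] and the
# user agent is the last quoted field on the line.
LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\].*"(?P<agent>[^"]*)"\s*$'
)


def daily_bot_hits(lines, bot_token="Googlebot"):
    """Count requests per day whose user agent mentions bot_token."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and bot_token.lower() in match.group("agent").lower():
            counts[match.group("day")] += 1
    return counts


def sudden_drops(counts_in_order, threshold=0.5):
    """Return days whose hits fell below threshold * the previous day's hits.

    counts_in_order is a list of (day, hits) pairs in chronological order.
    The 0.5 default is an arbitrary starting point, not a recommendation.
    """
    drops = []
    for (_, previous), (day, current) in zip(counts_in_order, counts_in_order[1:]):
        if previous > 0 and current < previous * threshold:
            drops.append(day)
    return drops
```

For example, `sudden_drops([("18/Jul/2016", 100), ("19/Jul/2016", 40)])` flags the second day, which would be the cue to check robots.txt, the XML sitemap, and the meta-robots tags.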
Slide 14: Response code errors. Every single server log entry contains a response code. Group URLs by response code to look into those problems easily and solve them in bulk.
Slide 15: For example, take temporary redirects. Every log entry with a 302 response code is a temporary redirect. Those should usually be changed to 301 (permanent) redirects.
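Grouping by response code can be sketched in a few lines of Python. This assumes combined-format log lines where the quoted request is followed by the status code; the `uris_by_status` name is a hypothetical helper, not part of any tool mentioned here.

```python
import re
from collections import defaultdict

# Assumes the quoted request ("METHOD URI PROTOCOL") is followed by the
# three-digit status code, as in the common and combined log formats.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) '
)


def uris_by_status(lines):
    """Map each HTTP status code to the set of URIs that returned it."""
    groups = defaultdict(set)
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            groups[int(match.group("status"))].add(match.group("uri"))
    return groups
```

With the result in hand, `sorted(uris_by_status(lines)[302])` lists every URI that answered with a temporary redirect, ready to be reviewed in bulk.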
Slide 16: Crawl priority. Which pages and directories of your website get the most and least attention from search engines? Does that match your business priorities? If you are a large e-commerce site that updates, say, the shoes section of your website daily, then you want Google crawling that section more often than, say, the furniture part of your site that is updated once a month.
You can influence crawl priorities in your XML sitemap and through your internal linking structure. You can move pages and directories that you want crawled more often closer to the home page, and you can have more internal links going there from the home page.
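To see which sections get the most bot attention, one option is to count bot requests by top-level directory. The Python sketch below assumes combined-format log lines and treats the first path segment as the "section"; both the function names and that definition of a section are assumptions you may need to adapt to your own URL structure.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumes combined-format lines: quoted request first, user agent as the
# last quoted field.
LINE_RE = re.compile(
    r'"(?:[A-Z]+) (?P<uri>\S+) [^"]*".*"(?P<agent>[^"]*)"\s*$'
)


def top_level_section(uri):
    """Return '/first-segment/' for a URI, or '/' for root-level pages."""
    path = urlsplit(uri).path
    parts = [part for part in path.split("/") if part]
    if not parts:
        return "/"
    # A single segment without a trailing slash is treated as a root-level file.
    if len(parts) == 1 and not path.endswith("/"):
        return "/"
    return "/" + parts[0] + "/"


def crawl_hits_by_section(lines, bot_token="Googlebot"):
    """Count bot requests per top-level section of the site."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and bot_token.lower() in match.group("agent").lower():
            counts[top_level_section(match.group("uri"))] += 1
    return counts
```

Calling `crawl_hits_by_section(lines).most_common()` gives a ranked list that can be compared against the sections your business cares about most.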
Slide 17: Last crawl date. If a recently published or updated page is not appearing in the SERPs, check in the logs when Google last visited that URL. If it has been a long time, try submitting that URL directly in Google Search Console.
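Checking the last crawl date for each URL could look like the Python sketch below. It assumes combined-format log lines with timestamps such as `19/Jul/2016:10:15:32 +0000`; the `last_crawl` helper is a hypothetical name, and matching on the user-agent string alone does not prove a request really came from Google.

```python
import re
from datetime import datetime

# Assumes combined-format lines: [timestamp] "METHOD URI PROTOCOL" ... "agent"
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:[A-Z]+) (?P<uri>\S+) [^"]*".*"(?P<agent>[^"]*)"\s*$'
)


def last_crawl(lines, bot_token="Googlebot"):
    """Map each URI to the most recent time a matching bot requested it."""
    latest = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match or bot_token.lower() not in match.group("agent").lower():
            continue
        seen_at = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        uri = match.group("uri")
        if uri not in latest or seen_at > latest[uri]:
            latest[uri] = seen_at
    return latest
```

A URL that is missing from the result, or whose date is well before its last update, is a candidate for direct submission in Google Search Console.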
Slides 18-19: Crawl budget waste. Google allocates a crawl budget to every website. If Googlebot hits that limit before crawling new or updated pages, it will leave the site without knowing about them.
The use of URL parameters often results in crawl budget waste because Google crawls the same page from multiple URLs. There are two solutions: block Google in the robots.txt file from crawling all URLs with any defined tracking parameters and use the URL Parameters tool in Google Search Console to do the same thing.
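One way to spot this kind of waste is to group the URIs a bot requested by their path and look for paths reached through several different query strings. The Python sketch below takes a plain list of crawled URIs (extracted from the logs by whatever means you prefer); the function names are hypothetical and the example parameter names are made up.

```python
from collections import Counter, defaultdict
from urllib.parse import parse_qsl, urlsplit


def duplicate_crawl_targets(uris):
    """Return paths that were crawled under more than one distinct URI."""
    by_path = defaultdict(set)
    for uri in uris:
        by_path[urlsplit(uri).path].add(uri)
    return {
        path: sorted(found)
        for path, found in by_path.items()
        if len(found) > 1
    }


def parameter_counts(uris):
    """Count how often each query parameter name appears in crawled URIs."""
    counts = Counter()
    for uri in uris:
        query = urlsplit(uri).query
        for name, _ in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Made-up URIs for illustration only.
    crawled = [
        "/shoes/",
        "/shoes/?utm_source=newsletter",
        "/shoes/?utm_source=twitter",
        "/about/",
    ]
    print(duplicate_crawl_targets(crawled))
    print(parameter_counts(crawled).most_common())
```

The parameter names that show up most often in `parameter_counts` are the first candidates to review when deciding what to block or declare.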
Slide 20: I’ve only got 15 minutes here, so this slide of my deck contains links to more information:
- Introductions to Apache, IIS, NGINX, and Windows server log analysis
- Tutorials on Elasticsearch, Logstash, and Kibana (the open source ELK Stack)
- ELK Apps for Apache, IIS, NGINX, and Windows servers
- The Complete Guide to the ELK Stack
- Log Analysis in AWS Environments with the ELK Stack
Slide 21 and Conclusion: The point to remember: If your technical SEO is bad and Google cannot crawl, parse, and index your site properly, then nothing else you do will matter. And server log analysis is a critical part of technical SEO.
There is no substitute for server log analysis in your SEO audits. Data from any other source is not complete or accurate. Frankly, it’s not enough. If you can become a master at server log analysis by looking at this data directly, then in terms of SEO, you will become “The One.”
Interested in log analysis for monitoring your IT environment (especially if you’re a marketing SaaS company!), your servers for business intelligence and marketing, or your security and compliance? Visit Logz.io to learn more about my company’s AI-powered hosted ELK log analysis platform and start a free trial.