
Clean up crawl errors automatically in Google Webmaster Tools

Site redesigns usually have consequences for your search rankings. Although the intention is to improve every key indicator, URLs often change or disappear as part of the redesign, and suddenly all those links pointing to your pages from external sites start sending users to 404 errors.

Let’s make it clear with an example: a user searches for an interesting term, gets some results, picks one and visits the page. That page contains a link to your site; the user finds it interesting, clicks on it and boom! Page not found. It was an old URL that no longer exists on your redesigned site.

Although 404 errors do not directly affect your search ranking, they do affect users who follow external links and land on an error page.

Ideally a site redesign should be accompanied by a specification of URL rewrites, so that your web server knows how to redirect users to the right page when the content still exists under a different URL. In some cases the 404 error page is exactly the right option, and a well-designed error page can put users back on track. In other cases it might be possible to ask the webmasters of the linking sites to update the reference to the new URL.
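As a sketch of what such a rewrite specification can look like, here are two mod_alias rules for Apache (the paths are made-up examples, and this assumes Apache is your web server; other servers have equivalent mechanisms):

    # Hypothetical mappings from retired URLs to their new equivalents
    Redirect 301 /old-section/widgets.html /products/widgets/
    RedirectMatch 301 ^/blog/[0-9]{4}/(.*)$ /articles/$1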

Well, if for any reason this task is not done, 404 errors start to pile up. And something must be done when users complain, or when you notice a drop in visits or conversions and start to investigate why.

The Crawl Errors report in Google Webmaster Tools shows the information that the search engine has about pages on your site that return errors. After your site redesign is done, old URLs are still in the Google index. Those whose incoming links were other old pages might eventually be dropped from the index, but those that are linked from external sites won’t ever disappear. By looking at the report you can see whether a given URL is linked from another site or from your own sitemap. If you know that none of the links Google reports for a URL are valid anymore, you can mark the URL as fixed and it will be removed from the report.

This fixing process can be long and tedious. It turns out that GWT shows at most 1000 URLs at a time in the Crawl Errors report, so if you have many URLs to fix, the process can be very slow and error prone. This is when automation comes to the rescue. GWT has an API that lets you program most of the actions that can be done manually from the tool. I am going to show you how.

Fixing crawl errors manually

This is how you could approach the task if done manually:

  1. For each URL in the Crawl Errors report, check if it actually returns 404 (see the HTTP check sketched after this list). If it doesn’t, mark it as fixed.
  2. If it does return 404, look at the Linked From tab and check where the links come from. Visit the origin URLs and check whether they still link to your page. If they don’t, or they no longer exist, mark the error as fixed.
  3. If any origin URL still links to your page, and that URL is internal, try to find a cause and a pattern. Something is wrong with your site code or platform and must be fixed urgently.
  4. If your sitemap still contains references to the wrong URLs, regenerate it urgently and submit it back to Google.
  5. It is not advisable to mark a URL as fixed if it is still linked internally and you have not fixed the link: it will show up again in the report soon.
  6. Finally, if the origin page is external, and there is an equivalent new URL on your site, create a 301 redirect (like the ones sketched above) so that users won’t notice the change. You can then mark that URL as fixed in the report and it will not show up again.
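The 404 check in step 1 is easy to automate. Here is a minimal sketch in Java using only the standard library (HttpCheck is my own name for it, not necessarily part of the program linked below); a HEAD request avoids downloading the page body:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HttpCheck {

        /** Returns true if the URL currently answers with HTTP 404. */
        public static boolean returns404(String pageUrl) {
            try {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL(pageUrl).openConnection();
                conn.setRequestMethod("HEAD");          // status code is all we need
                conn.setInstanceFollowRedirects(false); // a 301 already means "fixed"
                conn.setConnectTimeout(10000);
                conn.setReadTimeout(10000);
                return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_FOUND;
            } catch (Exception e) {
                // A network error is not a 404; leave the URL alone and retry later.
                return false;
            }
        }
    }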

Automating the task

First of all you need to create a service account and give it access to your GWT data. A service account is associated with a program that interacts with Google APIs instead of an end user. Logging in to the Google Developers Console with a user account that has permission to access your GWT data, you can create a project for your application, switch on access to the GWT API, and generate a client ID. Then you can generate a private key for the service account, which the program will use to authenticate. An email address is also created for the service account, so that you can give it edit permissions on your Webmaster Tools data. The whole process is explained in detail on the Google Developers website.

With the private key conveniently stored in a file, you can start coding the program. We will use Java here as the language of choice, but other languages are available for calling Google APIs.

Authenticating to the Google API using the service account
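A minimal sketch of this step, assuming the Google API Java client and the Webmasters API v3 library (google-api-services-webmasters); the service account email, key file name, application name and the class name GwtClient are placeholders of mine, not the author’s code:

    import java.io.File;
    import java.util.Collections;

    import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
    import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
    import com.google.api.client.http.HttpTransport;
    import com.google.api.client.json.JsonFactory;
    import com.google.api.client.json.jackson2.JacksonFactory;
    import com.google.api.services.webmasters.Webmasters;
    import com.google.api.services.webmasters.WebmastersScopes;

    public class GwtClient {

        // Placeholders: use your own service account email and key file.
        private static final String SERVICE_ACCOUNT_EMAIL =
            "my-app@my-project.iam.gserviceaccount.com";
        private static final String P12_KEY_FILE = "key.p12";

        public static Webmasters buildService() throws Exception {
            HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
            JsonFactory jsonFactory = JacksonFactory.getDefaultInstance();
            // The private key proves the program's identity; no user interaction needed.
            GoogleCredential credential = new GoogleCredential.Builder()
                .setTransport(transport)
                .setJsonFactory(jsonFactory)
                .setServiceAccountId(SERVICE_ACCOUNT_EMAIL)
                .setServiceAccountScopes(Collections.singleton(WebmastersScopes.WEBMASTERS))
                .setServiceAccountPrivateKeyFromP12File(new File(P12_KEY_FILE))
                .build();
            return new Webmasters.Builder(transport, jsonFactory, credential)
                .setApplicationName("crawl-error-zapper")
                .build();
        }
    }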


Get the list of crawl errors. The API returns the same list (capped at 1000 items) that the end-user tool shows.
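In the v3 client this is a single call; ListErrors and the site URL below are made-up names for this sketch:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;

    import com.google.api.services.webmasters.Webmasters;
    import com.google.api.services.webmasters.model.UrlCrawlErrorsSample;
    import com.google.api.services.webmasters.model.UrlCrawlErrorsSamplesListResponse;

    public class ListErrors {

        static final String SITE_URL = "http://www.example.com/"; // placeholder

        /** Returns the sample of 404 errors that GWT reports for the site. */
        public static List<UrlCrawlErrorsSample> notFoundSamples(Webmasters service)
                throws IOException {
            // "notFound" is the 404 category; "web" is the desktop crawler.
            UrlCrawlErrorsSamplesListResponse response = service.urlcrawlerrorssamples()
                .list(SITE_URL, "notFound", "web")
                .execute();
            List<UrlCrawlErrorsSample> samples = response.getUrlCrawlErrorSample();
            return samples != null ? samples
                                   : Collections.<UrlCrawlErrorsSample>emptyList();
        }
    }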

For each crawl error, check if it still returns 404. The same HTTP check sketched in the manual section works here.

If it still returns 404, get the URLs and sitemaps that link to the page.
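In the v3 client, the per-URL get call returns these details (again a sketch; Details is a made-up helper name, and pageUrl is the path as reported by the API):

    import java.io.IOException;

    import com.google.api.services.webmasters.Webmasters;
    import com.google.api.services.webmasters.model.UrlSampleDetails;

    public class Details {

        /** Fetches the "Linked from" data (pages and sitemaps) for one sample. */
        public static UrlSampleDetails linkedFrom(Webmasters service, String pageUrl)
                throws IOException {
            return service.urlcrawlerrorssamples()
                .get(ListErrors.SITE_URL, pageUrl, "notFound", "web")
                .execute()
                .getUrlDetails();
        }
    }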

If a sitemap is reported as linking to the page, fetch it (and cache it, to avoid fetching it again) and check whether it still contains the link.

If any other URL is reported as linking to the page, check whether it still does.
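Both checks can share a small cached fetcher. This sketch does a crude substring search, which is enough for well-formed sitemaps and pages with absolute links (LinkChecker is my own name; a real program may want proper XML and HTML parsing):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Scanner;

    import com.google.api.services.webmasters.model.UrlSampleDetails;

    public class LinkChecker {

        // Sitemaps and origin pages are shared by many error URLs:
        // download each at most once per run.
        private final Map<String, String> cache = new HashMap<String, String>();

        /** True if any reported sitemap or page still references fullUrl. */
        public boolean anyLinkRemains(UrlSampleDetails details, String fullUrl) {
            if (details == null) {
                return false;
            }
            List<String> sitemaps = details.getContainingSitemaps();
            if (sitemaps != null) {
                for (String sitemap : sitemaps) {
                    // Crude check: look for the URL inside a <loc> entry.
                    if (fetch(sitemap).contains("<loc>" + fullUrl + "</loc>")) {
                        return true;
                    }
                }
            }
            List<String> pages = details.getLinkedFromUrls();
            if (pages != null) {
                for (String page : pages) {
                    // Same idea for HTML pages: look for an href to the dead URL.
                    if (fetch(page).contains("href=\"" + fullUrl + "\"")) {
                        return true;
                    }
                }
            }
            return false;
        }

        private String fetch(String url) {
            String body = cache.get(url);
            if (body != null) {
                return body;
            }
            try (InputStream in = new URL(url).openStream();
                 Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                body = scanner.hasNext() ? scanner.next() : "";
            } catch (Exception e) {
                body = ""; // unreachable origin: treat it as no longer linking
            }
            cache.put(url, body);
            return body;
        }
    }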

If the page doesn’t return 404, or if it does but it is no longer linked from any page or sitemap in the report, mark it as fixed.
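Putting the pieces together, the main loop can look like this (still a sketch built from the helpers above, not the author’s program; markAsFixed is the v3 API call that removes the entry from the report):

    import com.google.api.services.webmasters.Webmasters;
    import com.google.api.services.webmasters.model.UrlCrawlErrorsSample;
    import com.google.api.services.webmasters.model.UrlSampleDetails;

    public class Zapper {

        public static void main(String[] args) throws Exception {
            Webmasters service = GwtClient.buildService();
            LinkChecker checker = new LinkChecker();
            for (UrlCrawlErrorsSample sample : ListErrors.notFoundSamples(service)) {
                String pageUrl = sample.getPageUrl(); // path relative to the site URL
                String fullUrl = ListErrors.SITE_URL + pageUrl;
                boolean gone = HttpCheck.returns404(fullUrl);
                UrlSampleDetails details =
                    gone ? Details.linkedFrom(service, pageUrl) : null;
                // Fixed when the page answers again, or when nothing in the
                // report still links to it.
                if (!gone || !checker.anyLinkRemains(details, fullUrl)) {
                    service.urlcrawlerrorssamples()
                        .markAsFixed(ListErrors.SITE_URL, pageUrl, "notFound", "web")
                        .execute();
                }
            }
        }
    }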

With a little bit of housekeeping, logging, error checking and recovery, you can easily have a working program and install it to run as an automated task once a day. Every day Google will surface up to another 1000 links in the report, and after some time you will be left with only the URLs you actually have to act upon.
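For example, a crontab entry on a Unix host could run it every morning (the jar name and paths are hypothetical):

    # Run the crawl error cleanup every day at 06:00
    0 6 * * * java -jar /opt/zapper/crawl-error-zapper.jar >> /var/log/zapper.log 2>&1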

You can find the full source code here.

The following image shows the crawl report of a real site after two months of running the program daily. It went from 60,000+ errors to just a few, and after that it was very easy to correct the remaining errors and leave the report clean.

[Image: daily reduction of 404 errors reported in Webmaster Tools]

