This post is written on the heels of numerous changes taking place on the Google front. These changes have to do with user experience (UX), content and localization, leading us, web marketing professionals, to focus mainly on optimizing aspects such as site speed and content. However, are those the best places to invest our efforts? In my assessment, there are entire areas of site optimization that don’t receive the attention they deserve, even though they hold the potential to greatly boost the site’s strength and rankings. One such area is the site’s crawlability level.
Yes, some will claim this doesn’t even fall under the category of SEO, rather under GBO (Googlebot Optimization). Notwithstanding, I view any activity aimed at improving a site’s organic rankings and incoming traffic, as falling under SEO. If you agree, ask yourselves: are you checking your site’s crawlability correctly and regularly, and are you doing everything possible to get your site to the level it needs to be?
Many have already eulogized SEO: five years ago, two years ago and even several weeks ago, when the popular ahrefs.com claimed, after examining 2 million search phrases, that on-page SEO as a whole was dead. Needless to say, I disagree with their claim, firstly because basing such a statement on 2 million search phrases alone is ludicrous, considering that Google reports over 2 trillion (!) searches per year. Assuming approximately 25% of all searches are unique, Ahrefs’ sample covers only 0.0004% of them.
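For anyone who wants to verify that percentage, here is the arithmetic as a short Python sketch. The figures are the ones quoted above, and the 25% uniqueness share is my own assumption, not a number reported by Google.

```python
# Figures quoted in the paragraph above.
searches_per_year = 2_000_000_000_000  # "over 2 trillion" searches per year
unique_share = 0.25                    # assumed share of unique searches
sample_size = 2_000_000                # search phrases examined in the study

unique_searches = searches_per_year * unique_share
sample_share_percent = sample_size / unique_searches * 100

print(f"{sample_share_percent:.4f}%")  # prints 0.0004%
```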
Long live on-page SEO, and it really works if you’re good enough to understand how it works!
On the other hand, in my experience, it is possible to witness significant improvements in rankings with every new client we work with, once we perform a comprehensive infrastructure setting process, on everything pertaining to on-page SEO. A year ago I published an article on this very topic on Search Engine Journal, explaining how to perform such an analysis – step-by-step – in an in-depth way, going beyond just correcting page titles, adding a few keywords in the content, boosting site speed and adding alt-tags to images.
What is crawling, and why is it so important?
Search engines crawl the internet regularly using crawlers (bots), whose purpose is to discover and index new web pages. The method of crawling has changed several times over the years, and has become more efficient with the exponential rise in the number of webpages. For this reason, nowadays Google and other search engines make an effort to index the pages that seem important, mainly according to their PageRank. No, I’m not talking about the score we used to see on our Google toolbar, but rather the score that Google assigns to every page. This score was supposed to be reflected in the toolbar, but it was almost never up-to-date, and Google eventually removed it entirely. In any case, the score still exists as far as Google is concerned; it’s just no longer available for us to view.
Many make the mistake of assuming Google crawls their website every week, two weeks or even month, and therefore think it’s best to make changes at roughly those intervals, to be ready for the crawlers. However, Google actually crawls websites every few seconds or minutes, though it does not index pages at that frequency (precisely because of the PageRank parameter I mentioned above). The rate of indexing will rise if Google recognizes that your site is updated at a high frequency, with high quality, relevant content. For instance, if the bot reaches a particular website, compares its present form to the most recent saved version, and sees no change time after time, Google will begin to increase the time between each crawl and index accordingly. On the other hand, in the case of highly dynamic news websites, Google may index new pages every few seconds, because every time the bot crawls the site it recognizes new content. The content renewal factor and the site’s popularity together signal to Google how frequently it should crawl the site.
Every time Google arrives at your site, bots begin to crawl, comparing previous versions of pages to new versions, and downloading new pages found. Every page downloaded requires a certain amount of bandwidth, and once the allotted bandwidth per site has been reached, the bot stops crawling and moves on to the next site. The professional name for this cap is Crawl Budget. The more accessible, fast and relevant your site, the better use will be made of your allocated Crawl Budget. Other than the fact that site speed serves as a factor in determining ranking, it should be clear to all that having a faster website, with more crawlable pages (and therefore more indexed pages) will increase the chances of more pages from your site appearing in Google search results pages for a variety of search terms. In the end, that translates into more organic traffic.
Which tools do I need?
In order to perform our inspection, we’ll need to use several different tools, most of which are routinely used in our day-to-day work:
- Google Analytics
- Google Search Console
- Bing Webmaster Tools
- Screaming Frog
- Web Developer Toolbar
We’ll start out with a fundamental step that may seem trivial to many; nevertheless, it is important: checking the robots.txt file to ensure no important parts of the site are accidentally/purposely blocked. First, naturally, verify you have a robots.txt file, and that it has been uploaded to Google Search Console. To confirm this, simply type /robots.txt right after the domain name, e.g.: http://www.cnn.com/robots.txt
If there is an up-to-date file, you should see a white screen that contains several rows of text, like this:
A second way to verify you have a robots.txt file, is to access Google Search Console and check if you input the location of the file – then check to see if Google recognizes it: Crawl>robots.txt Tester
If Google does not recognize such a file, you’ll see a message like this:
Alternatively, if there is a file and its location is updated, you’ll be able to check whether Google detects any errors in it, as well as see what the file looks like:
Before examining the file, verify you’re not blocking all bots – or Google’s bots specifically. If you find the following lines in your code, you’ll need to remedy this as soon as possible:
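A minimal sketch of this check in Python, assuming the blocking lines are the usual Disallow: / rule placed under User-agent: * or User-agent: Googlebot. It uses the standard library’s urllib.robotparser; the example.com address and the two sample files are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, url, agents=("*", "Googlebot")):
    """Return the user agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical file that blocks every bot from the whole site.
blocks_everyone = """User-agent: *
Disallow: /
"""

# Hypothetical file that blocks only Google's bot.
blocks_google_only = """User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

homepage = "https://www.example.com/"
print(blocked_agents(blocks_everyone, homepage))     # ['*', 'Googlebot']
print(blocked_agents(blocks_google_only, homepage))  # ['Googlebot']
```

An empty list means neither rule set is keeping the bots away from the URL you tested; it says nothing about individual files, which are covered next.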
If these lines were not found in the text, the next step is to confirm you’re not accidentally blocking any files that Google has an interest in seeing. It’s very important to place an emphasis on the following factors.
Full transparency and disclosure of malicious content
Blocking certain files, or specific portions of the code, can push Google into making assumptions that hurt us. Google does not have a way of knowing everything there is to know about your website, and there’s no practical way that the search engine can examine every component manually. As a result, it’s important to be cautious in the signals we’re sending, and avoid producing any artificial warning signals.
The problem: The presence of blocked files on a website can lead Google to exercise heightened caution, and even label the site malicious and dangerous – even if it has not been proven to be manipulative.
The solution: Go over the following checklist to ensure you’re not unintentionally sending Google warning signals.
In other cases, the message may look like this:
I’d even go as far as recommending you add the following lines to your robots.txt file:
Take special care not to block template files, since doing so also blocks the JS and CSS files they contain. It’s a common mistake to think blocking these files helps Google crawl through your site’s content instead of your site’s code, but that’s not how this works. WordPress site owners should, however, block their plugins library.
- Though Google can decipher (‘read’) text, the presence of images in the body is just as important. Don’t block Google from crawling your image library, since that will prevent your images from being indexed. As a result, the amount of content classified under your site will decrease. There is no definitive evidence that blocking a site’s image library damages rankings, but it should lead to a decrease in traffic, because you lose the visits originating in Google’s Image Search. Therefore, unless there’s a special reason to block media files, avoid doing so.
- Avoid blocking your RSS Feed library. This is an important source of information for Google (and for you, if you think about it), more important, even, than the site map.
After verifying that you’re not unintentionally blocking any pages on your site via your robots.txt file, the next step is to verify you’re not blocking important pages using the NOINDEX tag. To this end, you’ll need to review the code for every single page and verify that the following tag is not present: <meta name="robots" content="noindex, nofollow">
If you have a smaller website, you’ll be able to view the code using the View Source command, or by pressing Ctrl+U. Then, search for the word noindex using the Ctrl+F (find) function. Ideally, you won’t find any matches. If you do, however, that means the specific page is blocked to search engines.
If your site is on the larger side, with dozens, hundreds or even thousands of pages, searching the code manually may result in hospitalization! I kid, but really, you’ll be better off using a tool such as Screaming Frog to scan the entire site (the software is free for scanning sites under 500 pages). Once the scan is finished, go to the Directives tab, then select Noindex under Filter.
This table contains all blocked URLs. The mere fact they appear here does not necessarily point to a problem. Review all the URLs on the left and confirm that no important pages are blocked. If you do recognize that an important page is blocked, now’s your chance to remedy that.
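If you’d rather script the check for a handful of pages, the following is a rough sketch using only Python’s standard library. It looks for a robots (or googlebot) meta tag in HTML you have already downloaded. The class and function names are my own, and it does not cover directives sent through HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("robots", "googlebot"):
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


blocked_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
open_page = "<html><head><title>Hello</title></head></html>"

print("noindex" in robots_directives(blocked_page))  # True
print("noindex" in robots_directives(open_page))     # False
```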
A third way to verify there are no blocked pages on the site is using the “Nofollow” add-on. I routinely use it when examining specific pages, not necessarily while conducting a comprehensive analysis. This add-on works in the background and can always help you locate blocked pages without having to remember to check. In the event that the add-on detects a blocked page tagged noindex, nofollow or even just one of those, a righthand pop-up will appear on the bottom of the screen, drawing your attention to the problem:
Another neat little feature of this add-on is the marking of nofollow links with a broken red line. This helps us see the nature of outgoing links in a given page quickly. Below is an example of blocked outgoing links to external websites, from Forbes’ website:
If you’ve made it this far, Google should be able to crawl your website without unexpected blockings. Well done! That said, before we delve deeper into the technical aspect, it’s important to verify that Google ‘sees’ the same version of your site that users see, otherwise it can point to a problem or worse, a security breach.
To check this, sample the homepage and other central pages on the site. To begin the inspection, go to the Search Console, click on Crawl and then Fetch as Google.
When inspecting the homepage, skip page 3. When inspecting internal pages, ensure you enter only the page’s URI into the search box, and not the full URL – otherwise you’ll get an error.
This process can take anywhere from a few seconds to several minutes, depending on the site. Once completed, take a look at the Status column. It will contain one of these options:
Complete – This is the result we want, of course, and it means Google was able to successfully crawl through the entire page.
Partial – This result indicates that the site is rendered partially to Google. Some of the elements may be temporarily inaccessible, or permanently blocked via robots.txt. If you see this result, click on it to see how Google views the page.
In the above example, you can see that Google is reporting a blockage of high severity, since it stems from a blocked CSS file – which places the site at risk for a Panda penalty.
Redirected – This result indicates that the page is redirecting the crawler to another page, resulting in an inability to crawl the original page. Before you input the final version of the site URI, ensure you’re entering the address after the redirect. Another common error that leads to getting this result in the Status column, is having input the URI along with a forward slash mark (/) ahead of it, leading Google to check a website containing two forward slash marks (//) instead of one.
Not Found – The page was not found, and is likely returning a 404 error. Verify that the URI entered is correct and that there is no block present preventing Google from reaching the page.
Unreachable – Google’s bot waited for a period of time that exceeds the amount of time allotted, or it was not able to reach the page. This may point to a site speed problem, or it may stem from a rejection of the request by the server – for any number of reasons.
Blocked – Google was not able to scan the page since it is blocked, likely by the robots.txt file.
Unreachable robots.txt – Indicates that Google was not able to access your robots.txt file; it’s possible you’re blocking Google’s bot in the file.
Not authorized – Your server is blocking the bot from accessing the page’s URI, likely error 403.
DNS not found – You’re searching for an incorrect domain. Ensure that the site address that appears on Search Console is correct, with or without www., and that you typed either http: or https: accordingly.
Temporarily unreachable – Google could not reach the page temporarily, since the server took too long to respond or because you submitted a large number of requests simultaneously.
Once the inspection is done, unless every row is marked Complete, you’ll need to click on the Status box, examine the problem according to the above guide, and fix it accordingly. Your goal should be clear – strive for maximum uniformity between what you see (the right side) and what Google’s bot sees (left side). Another way to check this is to simply access search results and load the Cache version of the page. There, select the Text Only version, and then compare what you see as users, and what Google’s bot sees when scanning the page. You may suddenly discover that text you see at the top of the page as users, appears only halfway or all the way down the page to Google’s bot (this can only stem from CSS). Pay attention to these types of discrepancies to improve the way Google perceives and ranks your site.
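As a rough complement to these manual comparisons, you can request the same page with a browser User-Agent and with Googlebot’s User-Agent, then compare the visible text of the two responses. The sketch below is only an approximation: it does not render CSS or JS the way Fetch as Google does, the browser User-Agent string is an arbitrary choice, and a server may treat the real Googlebot differently based on its IP address, so an empty result proves nothing on its own.

```python
import difflib
import urllib.request
from html.parser import HTMLParser

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # arbitrary browser-like value
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page's text nodes as a list of stripped strings."""
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.chunks


def fetch(url, user_agent):
    """Download url while identifying as user_agent (performs a network request)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def text_differences(user_html, bot_html):
    """Return the text lines that only one of the two versions contains."""
    diff = difflib.ndiff(visible_text(user_html), visible_text(bot_html))
    return [line for line in diff if line.startswith(("+ ", "- "))]
```

To try it on a live page, call text_differences(fetch(url, BROWSER_UA), fetch(url, GOOGLEBOT_UA)). Lines starting with "+ " appear only in the version served to the bot, and lines starting with "- " only in the version served to users.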
Over the past several months, we’ve witnessed numerous instances of site security breaches perpetrated by spammers. Some are more sophisticated than others, but it’s important to note that WordPress sites are at greater risk of breach due to their use of open source plugins that may contain security loopholes. It’s important to verify that your site (1) has not been hacked and (2) has not become fertile ground for hackers and spammers to take advantage of for their own personal gain, thereby causing your site damage. Site security is an important field in itself, but it’s not the one at the heart of this post, so I’ll keep this section brief. Below is an example of a transparent site hacking, nicknamed so because it is invisible to the user and can sometimes be undetectable even in the site’s HTML code. It is, however, served to Google’s bot (cloaking):
To assess the severity of the problem, first check whether any of these codes can be extracted from the site’s source code. It’s possible you won’t be able to view these codes unless you view the page as Google’s bot (in Settings).
This is an example of a breach that involves implanting code that has to do with medicine and pornography. This problem necessitates swift, thorough action since typically, the hacker inserts codes in the heart of the site and not just the front end. For this reason, especially when talking about WordPress sites, the main recommendation is to remove WordPress altogether then reinstall it, and only then reconnect the site – without the plugins. Finally, turn on only necessary plugins. If Google has yet to recognize the breach, you’ve been given a chance to take care of it quickly before your rankings take a hit. If Google has already detected the problem, it’s possible you’ll receive a message along these lines:
In which case, Google will add the following warning to your site when it appears in search results:
In some cases, the hacker appoints himself as the owner of the site, giving himself full control over your Search Console. The hacker may even remove you completely from any management role by removing the site’s Analytics code, or any other code you inserted in order to identify yourselves before Google. If you see this message from Google, know that someone has taken over your website and you need to take action quickly to prevent even more damage:
Are your menus accessible to Google for scanning? Though a legitimate question, the answer may not always be clear to everyone. In the recent past, it was common to build menus with JS. Awareness of the consequences was low, and many menus were not scannable and therefore did not ‘count’, as far as Google was concerned, as internal links. Nowadays, the problem is less prevalent, with most menus being constructed in HTML or, at least, having an HTML version. To verify this, use Web Developer Toolbar for Chrome.
Select the option to view the page without JS, and click Refresh:
The menu should open with all relevant sub-categories. This is the way it appears to users using JS.
After using the Toolbar and unmarking JS, the page should look like this. As you can see, the menu looks somewhat different, but still accessible and visible.
Once we’ve verified that no parts of the site are blocked, either intentionally or unintentionally, through robots.txt or through page tags, that the site scan goes smoothly and that there are no security breaches, it’s time to check whether there are anomalies in the rate that the site is scanned. To accomplish this, we’ll need to check the number of scanned pages and the time it takes to download them. An increase in the time it takes to download the pages will inevitably lead to a decrease in the number of pages potentially scanned. First, go to Search Console: Crawl > Crawl Stats.
If you recognize anything out of the ordinary, recall any recent changes you made – server changes, redirects, new pages added, broken links or any other factor which may be slowing down the site.
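To make the relationship between download time and pages scanned concrete, here is a deliberately simplified sketch. It assumes the bot spends a fixed amount of time on the site per visit, which is my own simplification and not a documented Google rule, and the numbers are hypothetical.

```python
def pages_crawled(time_budget_seconds, avg_download_ms):
    """Pages that fit into a fixed time budget at a given average download time."""
    return int(time_budget_seconds * 1000 // avg_download_ms)


# Hypothetical 10-minute budget: tripling the download time cuts the pages to a third.
print(pages_crawled(600, 400))   # 1500
print(pages_crawled(600, 1200))  # 500
```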
At this point, you should be ready to verify that there is no spike in the number of site errors. A spike in the number of site errors may point to temporary scanning issues such as server crashes, momentary overload and more. As long as these spikes happen once every several months, you should be perfectly fine. That said, a permanent spike in site errors calls for attention and care. Go to Search Console: Crawl > Crawl Errors to check out where you are.
If you recognize a gradual increase in site errors, as the screenshot below depicts, you may have a forming problem on your hands, causing more and more pages to return a 404 error. Investigate the root of the problem and remedy it.
On the other hand, if anything unusual happened on your site which drastically impacted scanning, you should see a sharp spike in the number of site errors, which looks like this:
Such a spike is usually accompanied by a warning from Google, received in Search Console. Here are examples of such warning messages:
If you’re the recipient of such a message, you’re also likely to see a spike in server errors (code 500). You should check whether the site is still live and if the server is functioning correctly. If everything appears to be in order, perform a test with Screaming Frog; if it successfully accesses the site, that means the problem is not yet severe. If, however, it fails to access the site, you should perform a thorough investigation to find the cause of the problem, and even consider contacting your hosting company, which can conduct an even more comprehensive inspection. In any case, if the total number of errors is under 1000, you’ll be able to mark all pages on Search Console and select Mark as Fixed.
At this point, wait a day or two and see if new pages are added to the list of error pages. If the problem is contained, the pages should disappear, and you’ve already reported handling the problem to Google.
NOTE: If you notice a spike in the number of server errors, don’t forget to switch tabs and check the same for mobile errors – as these may point to completely different problems.
I can’t bring up 404 errors without mentioning the importance of defining the 404 page correctly. A 404 error is actually a nickname for the status code the server returns when a page is not found. The entire checklist above (checking different tools, etc.) is based on the assumption that the page itself is set up correctly in the system, and returns a 404 code rather than a 200 code. I’ve stumbled upon many sites with a well designed, friendly 404 page which was set up incorrectly, returning a 200 status code instead.
To check this, go to Search Console: Crawl > Crawl Errors > URL Errors > Soft 404s
A “soft 404” error is a non-existent page error that does not yield a 404 response from the server. The most common manifestation of this type of error is one you must’ve stumbled upon: clicking a link and reaching the homepage instead of what you were expecting. This typically happens because missing pages are handled with a 302 or 301 redirect to the homepage. In other cases, there’s no redirect set up, and you’ll end up on a friendly 404 page that returns a 200 status code.
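You can also test this yourself by requesting an address that certainly does not exist and looking at what comes back. The sketch below is one possible way to do it with Python’s standard library; the function names and the wording of the verdicts are my own, and treating 410 as acceptable is my assumption.

```python
import urllib.error
import urllib.request
import uuid


def classify_missing_page(status, requested_url, final_url):
    """Judge the response a site gives for a URL that should not exist."""
    if status in (404, 410):
        return "ok"
    if final_url.rstrip("/") != requested_url.rstrip("/"):
        return "soft 404: redirected"
    if status == 200:
        return "soft 404: returned 200"
    return f"unexpected status {status}"


def probe(site):
    """Request a random, non-existent path on site (performs a network request)."""
    url = site.rstrip("/") + "/" + uuid.uuid4().hex
    try:
        # urlopen follows redirects, so geturl() is the address we ended up on.
        with urllib.request.urlopen(url, timeout=10) as response:
            return classify_missing_page(response.status, url, response.geturl())
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers are raised as exceptions carrying the status code.
        return classify_missing_page(error.code, url, url)
```

Calling probe("https://www.example.com") on a correctly configured site should return "ok"; either of the "soft 404" verdicts means the missing page is answered with a redirect or a 200 status code.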
Do you have an e-commerce site? If so, this step is especially for you. Most e-commerce sites have the same problem: too many indexed pages. This can stem from a variety of causes, such as the platform the site is built on, the unintentional publication of duplicate pages, the intention to show the same product in several categories, the use of search filters or other merchandising related reasons. If your site contains only several dozen product pages, it’s safe to assume the situation has yet to get out of hand. However, if your site sells hundreds and even thousands of products, things can get messy, quickly. You may soon find your site has hundreds of thousands of indexed pages on Google, which is a problem for two reasons: (1) it may prevent Google from ‘comprehending’ your site’s purpose as you would like it to, and (2) if Google is spending valuable time on unimportant or expired pages, this diminishes your site’s strength and takes away from the resources devoted to scanning your more important pages. In order to verify this, you’ll need to perform a series of tests which are time consuming but well worth it.
To start, you’ll need access to Search Console, Bing Webmaster Tools and Analytics. Start by performing an initial check that’s meant to confirm whether or not there is a problem. Go to Search Console: Google Index > Index Status. Below is an example of a site which contains approximately 4800 pages, but has nearly 200,000 indexed pages:
Hopefully, by now you see why this discrepancy is a problem. If your inspection yields a similar result, run the same test on Bing, using Bing Webmaster Tools. As a side note, Bing has been gaining strength in the United States in recent months, bringing in higher traffic and revenue. Moreover, Bing’s organic traffic demonstrates a significantly higher conversion rate than Google’s.
Dashboard > Site Activity > Pages Indexed
In this example, Bing shows much more encouraging results, but even so, it has ten times more indexed pages than there are active pages. To filter out the pages that are truly important, we’ll need to use Analytics. Before that, I’ll quickly touch on the topic of sampling. Google Analytics provides accurate information, up to 250,000 sessions. When it comes to large, established, high-traffic sites, we often encounter a problem where the data that feeds the reports on Analytics comprises only a small percentage of the actual sessions. In extreme cases, this makes the data moot. Pay attention to the size of the sample when analyzing high traffic websites. If you want to learn more about this bias, click here.
To prevent a sampling bias, such as in our case, I recommend using a little trick that’ll enable you to view the full scope of the data. Instead of accessing the raw data through Acquisition > Campaigns > Organic Keywords > Change Primary Dimension to Landing Page, access it through Acquisition > All Traffic > Channels > Organic Search > Change Primary Dimension to Landing Page.
Take a look at the bottom right corner: this figure shows how many landing pages, in total, are responsible for the site’s traffic. In our example, that figure is 8850 pages. If you take a look at the chart, though, you’ll see Google indexes over 180,000 pages, meaning that less than 5% of the indexed pages actually get any traffic.
Now, continue to the next check, meant to find how many of these pages are responsible for considerable traffic, and not just few sessions. To do this, filter the report so that it only shows pages with more than n sessions (n being the number of sessions you consider minimal). I filtered to see only pages with more than 10 sessions, and here’s what it looks like.
Now, the picture should be much clearer. In the case of our example, there are essentially only 1384 pages with reasonable incoming traffic. This figure makes a lot more sense than the 180,000 indexed pages we found earlier. Now what? I recommend pulling a report of the pages that bring in less traffic than the minimum number of sessions you defined, and try to find common characteristics among them. If necessary, start blocking them off one by one.
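To put the three figures from this example side by side, here is the arithmetic as a short Python sketch. I’m rounding the indexed page count down to 180,000, so the real shares are slightly lower.

```python
indexed_pages = 180_000          # "over 180,000" pages indexed by Google
landing_pages = 8_850            # landing pages with any organic traffic
pages_over_10_sessions = 1_384   # landing pages with more than 10 sessions

print(f"{landing_pages / indexed_pages:.1%}")           # 4.9%
print(f"{pages_over_10_sessions / indexed_pages:.1%}")  # 0.8%
```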
There are dozens of different reasons for indexed page inflation, but here are the main three mistakes that lead to the phenomenon we just reviewed:
- Failing to block parameters: Filtering parameters such as colors, sizes and materials can inflate your indexed pages. You can block these through Search Console, or add a canonical tag to all filter pages, pointing to the category page.
- Failing to block order pages: When checkout and order finalization pages are not blocked, every new order results in newly indexed pages, increasing the site’s size with every passing day.
- Failing to block blog categories, tags and archive pages: If you manage an active site blog through WordPress, I highly recommend paying close attention to this common mistake.
In any event, before making the decision to block, redirect or implement any other solution, check Analytics for the amount of traffic these pages receive, to ensure you won’t be losing traffic. If some of the pages receive traffic you’re interested in preserving or redirecting, use a 301 redirect.
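For the filter parameters case, this is roughly how a canonical address can be derived from a filtered URL. It is a sketch under assumptions: the parameter names color, size and material and the example.com addresses are hypothetical, so substitute whatever your platform actually appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical filter parameter names; replace with the ones your site uses.
FILTER_PARAMS = {"color", "size", "material"}


def canonical_url(url):
    """Drop filter parameters (and any fragment) so filter pages share one address."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in FILTER_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Build the <link> element to place in the <head> of the filter page."""
    return f'<link rel="canonical" href="{canonical_url(url)}">'


print(canonical_tag("https://www.example.com/shoes?color=red&size=42"))
# <link rel="canonical" href="https://www.example.com/shoes">
```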
Another way to prevent and treat lack of focus on Google’s end, is to first figure out which pages are recognized by Google as relevant to your main keywords. I recommend checking this anyway, to keep tabs on your SEO efforts. Simply go to Search Console: Search Traffic > Search Analytics. Check all of the options in the top menu, and select the requested phrase under Queries (you can search for it through the Search window).
Now, click on Pages. If you see the landing page you were expecting to see, that’s great! Move on to the next keyword. If not, check what’s causing the lack of focus. It may be due to a surplus of pages, or as simple as changing titles or restructuring internal links.
Below is an example of an unwanted situation, with several pages appearing on Google search results, pulling traffic away from the wanted landing page. Since Google can rank and present pages according to its discretion, users may sometimes see alternative, non-optimized pages without the page titles and descriptions we worked hard and put considerable thought into. This causes us to lose valuable clicks:
If you operate a blog, you probably blocked off the tags and archive pages, but did you verify that your pagination is in order? This is a common problem that can lead to an array of bigger problems, from lack of focus and waste of valuable crawling resources on unimportant pages, to having significant traffic flow to these pages instead of central pages on the site which are more likely to lead to conversion. Generally speaking, Google ‘prefers’ to rank the main blog page, over other internal blog pages. However, it’s important to notify Google of the hierarchy and connection between these pages. There are many advanced settings you can use, as well as adding noindex tags on all pages starting with the second one, changing the crawling settings in Search Console, and more. Personally, I prefer to leave all of those things alone, since they can cause more damage than they can bring good. I do recommend, however, adding nofollow tags on every page starting with the second one. On the one hand, this will prevent you from wasting resources on these pages, and on the other hand, it will prevent Google from wasting valuable crawling time on them.
To check if you’re facing this problem, first access the source code of the blog page you want to check, and see if there is a rel="next" code element present. If you found such an element, go to the next page and search for rel="prev". If you found this element as well, it’s likely you have built in pagination. At this point, it’s best to check if this has any negative impact on the site. To do this, go to Analytics and take a look at the organic landing pages reports:
Acquisition > All traffic > Channels
Change the main dimension to Landing Pages.
Now, search for the word Page. If you see that there is traffic coming to these pages, it may be the case that these pages are cannibalizing traffic that should go to the homepage.
One more thing you should check is whether the tags were implemented correctly. For this, go to Search Console and load the error report. There, search for the word Page; if you see pages with high figures, far and beyond what you truly have on your site, it may be that you did not implement the code correctly on the last page. Check if you used the element rel="next" on the last page, which may be causing Google to keep indexing more and more pages in the series, though they do not exist.
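The same pagination checks can be sketched in Python for a series of pages whose HTML you have already saved. The function names and sample markup are my own, and the rules encode only what is described above: every page but the last declares rel="next", every page but the first declares rel="prev", and the last page has no rel="next".

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Records the href of rel="next" and rel="prev" links."""

    def __init__(self):
        super().__init__()
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr = dict(attrs)
        for rel in (attr.get("rel") or "").lower().split():
            if rel in ("next", "prev") and attr.get("href"):
                self.rels[rel] = attr["href"]


def pagination_links(html):
    """Return a dict such as {"prev": "/blog/", "next": "/blog/page/3/"}."""
    parser = RelLinkParser()
    parser.feed(html)
    return parser.rels


def check_series(pages):
    """pages is the HTML of page 1..n in order; returns a list of problems."""
    problems = []
    last = len(pages)
    for number, html in enumerate(pages, start=1):
        rels = pagination_links(html)
        if number < last and "next" not in rels:
            problems.append(f'page {number}: missing rel="next"')
        if number > 1 and "prev" not in rels:
            problems.append(f'page {number}: missing rel="prev"')
        if number == last and "next" in rels:
            problems.append(f'page {number}: last page should not declare rel="next"')
    return problems


page_1 = '<head><link rel="next" href="/blog/page/2/"></head>'
page_2 = '<head><link rel="prev" href="/blog/"><link rel="next" href="/blog/page/3/"></head>'
page_3 = '<head><link rel="prev" href="/blog/page/2/"><link rel="next" href="/blog/page/4/"></head>'

print(check_series([page_1, page_2, page_3]))
# ['page 3: last page should not declare rel="next"']
```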
We can’t discuss crawling and optimization without talking about the sitemap. A sitemap is one of the ways we present our most important pages to Google. The commonly accepted format is XML. The sitemap should be uploaded to the site’s root directory. Once the sitemap is updated and correctly installed, you’ll be able to see data about the number of pages you’ve submitted to Google, vs the number of pages that were crawled and the various errors found (if applicable). Below is an example of a sound sitemap, with the number of submitted pages being nearly identical to the number of pages Google has indexed.
There are a number of tools with which you can create a sitemap. The oldest and most common is this one, through which you can create different types of sitemaps for free, for sites with fewer than 500 pages. If your site contains more than 500 pages, you’ll have to splurge for the paid version. Other than this tool, you’ll be able to create free sitemaps using Screaming Frog (the same 500 page limit applies, beyond which you’ll need to purchase a license to use the software). Whichever tool you choose to use, open the software, input your site’s address in the search window, and click Start.
Once the review is done, click on Sitemaps, then select Create XML Sitemap from the drop down menu:
I usually don’t touch anything in the window that pops up, since the software does a great job. You shouldn’t worry about pages containing errors or redirects; the software will leave those out of the sitemap. Click on Next and save the file, and there you have it – your very own sitemap.
If you’re dealing with an e-commerce site, you may need a slightly different map. Large e-commerce sites tend to divide their sitemap into several sub-sitemaps, according to categories, then create one unified, general guide file called an Index, which references all the other files. This method will help you make several sitemaps available to Google, and help you identify any site problems more easily. Instead of you having to look through a 20,000-address file, Google will let you know which map contains errors, so you’ll be able to narrow down your search and save time.
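To illustrate the split, here is a rough sketch that groups a flat list of addresses into one sub-sitemap per category. It assumes the category is the first segment of the URL path, which will not be true for every site, and the file naming scheme is made up for the example. The 50,000 default reflects the per-file URL limit of the sitemap protocol.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_category(urls, max_per_file=50000):
    """Group URLs by first path segment, then chunk each group into files.

    Returns a dict mapping a (hypothetical) sitemap file name to its URLs.
    """
    groups = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Assumption: the first path segment is the category name
        category = segments[0] if segments else "root"
        groups[category].append(url)

    files = {}
    for category, members in sorted(groups.items()):
        for start in range(0, len(members), max_per_file):
            part = start // max_per_file + 1
            name = f"sitemap-{category}-{part}.xml"
            files[name] = members[start:start + max_per_file]
    return files
```

Each key in the returned dictionary then becomes one sub-sitemap file, so an error reported by Google points you to a single category rather than the whole site.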
These are the most common types of sitemaps:
- Image map: This makes your images available to Google, which will in turn make your images available to users of the Image Search tool. Moreover, it’s beneficial if you operate a Google Shopping campaign. This is the accepted format:
<?xml version="1.0" encoding="utf-8"?>
Note: The page is denoted first, then all the images pertaining to that page.
- Videos map: This type of sitemap helps Google rank your videos in search results. If this is relevant to you, you can find the format here.
- Categories map: This is where categories and sub-categories from the various menus will appear. On large e-commerce sites, this could contain hundreds of addresses.
- Products map: This will solely contain products on the site. On e-commerce websites, this will be the longest list of all, and can contain thousands of addresses.
For the latter three sitemap types, we’ll use the regular sitemap format shown below.
<?xml version="1.0" encoding="utf-8"?><!--Generated by Screaming Frog SEO Spider 5.1-->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
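If you prefer to generate the file yourself rather than export it from a tool, a complete sitemap in this format can be produced with a few lines of Python. This is only a sketch: the build_sitemap function and its input shape (page address mapped to a list of image addresses) are my own choices for illustration. Pages with images get the image entries nested under them, page first and then its images, using Google's image sitemap namespace; pages without images produce plain entries.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_sitemap(pages):
    """Build sitemap XML (bytes) from {page_url: [image_url, ...]}."""
    # Register prefixes so the output uses a default namespace and "image:"
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # The page is denoted first...
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        # ...then all the images pertaining to that page
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Write the returned bytes to a file such as sitemap.xml (opened in binary mode) and upload it to the site's root directory.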
Once you’ve finished optimizing all the sitemaps, it’s time to create the Index. This is a simple map that basically shows Google the addresses of all other sitemaps. You can find a great guide to creating an Index here.
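For those who like to script it, the Index follows the same protocol as a regular sitemap, only with sitemapindex and sitemap elements in place of urlset and url. The sketch below assumes you already know the public addresses of your sub-sitemaps; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Build sitemap index XML (bytes) listing each sub-sitemap address."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = sitemap_url
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```

Submit only the Index address to Search Console; Google discovers the individual maps from it.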
When creating the Index, it’s very important to pay attention to the following:
- Use only addresses with canonical tags that point to themselves. You can find errant pages easily in Screaming Frog under Directives > Filter: Canonicalised, which lists all addresses whose canonical tags point to different pages.
- Do not include pages with a noindex tag.
- Do not include pages with redirects, only final destination pages.
- Do not include various landing pages that serve specific paid campaigns.
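The four rules above can be expressed as a simple filter over your crawl data. This is a minimal sketch, assuming you have exported one record per page with its status code, canonical address and noindex state; the Page structure and its field names are hypothetical, not the export format of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int                      # HTTP status code seen during the crawl
    canonical: str                   # canonical tag value, "" when missing
    noindex: bool                    # True when a noindex tag is present
    paid_landing_page: bool = False  # flagged manually for paid campaigns


def sitemap_eligible(page):
    """Apply the four inclusion rules; True means the page may be listed."""
    if page.status != 200:
        return False  # redirects (3xx) and error pages stay out
    if page.noindex:
        return False
    # Strict reading of the first rule: a missing canonical is also excluded
    if page.canonical != page.url:
        return False
    if page.paid_landing_page:
        return False
    return True


def filter_for_sitemap(pages):
    """Return only the URLs that pass every rule, in their original order."""
    return [p.url for p in pages if sitemap_eligible(p)]
```

If your site does not use canonical tags everywhere, you may want to relax the canonical rule so that pages without a tag are still accepted.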
If there is a problem in the format or the general way you’ve constructed the sitemap, you should see the error in Search Console.
If the syntax is correct and there are no errors in it, but after several days Google has still indexed only a small part of the pages you submitted, there may be other problems. For instance, the list of addresses itself may be at fault (non-final addresses, old or unwanted pages, and so on).
Below is an example of such a red flag. Note that the number of indexed pages is significantly lower than the total number of pages submitted.
In the 9 steps detailed above, I attempted to provide a comprehensive run-through of the most important checks you’ll need to perform on your site, on the road to optimal crawlability. There are, of course, additional checks you can choose to do. The main takeaway is that you should not look for shortcuts in this process, since shortcuts cannot guarantee that your site is crawled properly. Improper crawling can have a significant negative effect on the results of your otherwise sound marketing efforts, even cancelling them out completely.
If you’ve followed this entire guide, you should be well on your way to excellent site health and readability. I hope this guide assisted you in understanding how to approach the different checks, as well as provided you with tools to verify (when possible) the site’s performance in terms of speed, crawlability and focus. Most of all, I hope you were able to understand the reasoning behind this rather lengthy process.
Do you know any additional tests that should be run? Enlighten us in the comments! Now get to work, and good luck.