
The Meta Noindex Tag

The final element that we’re going to talk about in the crawlability world is the noindex tag. This tag lives in the <head> code of your site’s pages and looks like this:

<meta name="robots" content="noindex">

This tag instructs search engines not to index that page, which means it will not be included in any search results. The noindex tag is similar to blocking a page via robots.txt (slightly different, since a noindexed page can still be crawled, just not indexed, while a blocked page shouldn’t even be crawled).

A noindex tag is the only way to be certain that Google won’t ever show the page in search results; however, note that the noindex tag only works if Google can crawl the page! If you have a page blocked in robots.txt, Google won’t crawl the page, and thus won’t see the noindex tag. Then if Google sees a lot of links to that page, it might decide to serve it as a search result, since it never saw your instruction not to.

You can also tell Google whether or not to crawl through any links it finds on your noindexed page. For example, you might not want Google to serve page 2+ on your paginated list of products, or blog posts, as a search result, but you definitely still want Google to crawl through the links on those paginated pages. You can choose to let Google follow links or not in your noindex tag like this:

<meta name="robots" content="noindex,follow">
<meta name="robots" content="noindex,nofollow">

By default, if you don’t say follow or nofollow, Google will follow the links.

Just like with robots.txt, there probably aren’t many pages on your site that you need to noindex. This tag is commonly used in the same places where you might block pages in robots.txt. In addition, the noindex tag is often used on certain kinds of duplicate content and on paginated pages: if you have a list of products that continues onto page 2, then page 3, it’s common to noindex everything after page 1 (because you really want your main page to rank, not a page halfway through your list).

Happily, that’s about all there is to the crawlability portion of SEO. For the majority of sites, all you really need to do is set up a good hierarchical site structure and ignore the rest (or possibly just double-check to make sure you aren’t accidentally blocking things).

The thing to remember here is that robots.txt and noindex are about blocking search engines from your site: don’t use them, and you won’t be blocking anything.

Robots.txt File

Every site has a simple text file sitting in the main directory called robots.txt. This file gives instructions that bots are supposed to obey when they’re crawling your site (Google and Bing bots obey these instructions — many private crawlers do not).

A robots.txt file

This file is used to block certain pages or directories of your site from search engines: pages that you don’t want Google to see and that you don’t want to show up in search results. Commonly site owners will block checkout pages, or anything behind a login. As you can imagine, any page that you block in robots.txt will not rank in Google. (Technically a blocked page can rank: if Google sees a lot of links to a page, it might rank it even though it’s never visited the page).
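
As an illustration, a robots.txt that blocks a checkout directory and a login page might look like this (the paths are hypothetical; yours will depend on how your site is built):

User-agent: *
Disallow: /checkout/
Disallow: /login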

You can also give specific instructions to specific bots. One move that more paranoid webmasters or their security teams like to make is to specifically allow Google and Bing but block every other kind of robot. Robots.txt can also be used as part of a honeypot: make a page, link to it, and tell bots in robots.txt not to visit it; any bot that does visit that page is a naughty bot that you can then block.
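
A rough sketch of the allow-Google-and-Bing-only setup might look like this (an empty Disallow line means that bot may crawl everything):

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /

For the honeypot variant, you would instead leave the site open and disallow just a trap page (the path here is made up), then watch your server logs for anything that requests it:

User-agent: *
Disallow: /bot-trap/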

In General, Don’t Worry About It

As a general rule of thumb, most webmasters do not need to block anything on their robots.txt file. That said, if you are blocking things, just be certain that you don’t use sweeping logic and end up blocking Google from the entire site. This happens far more often than you might think.

It’s worth checking your robots.txt file to make sure you don’t have something like Disallow: / under a wildcard user-agent (which blocks everything), but odds are that you’re fine, and that you won’t need to worry about your robots.txt ever again.
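
Spelled out, the pattern to watch for is just these two lines, which tell every well-behaved crawler, Googlebot included, to stay off the entire site:

User-agent: *
Disallow: /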

If you’re curious about how other sites set up their robots.txt file, you can just go look. After all, it’s a public file in a standard place on every site (it has to be, for the bots to find it). Just go to www.domain.com/robots.txt and you’ll see their file.

You can see Amazon’s here, for example: www.amazon.com/robots.txt

XML Sitemaps

A sitemap, or XML sitemap, is a text file listing every page of your site, designed to help search engines crawl your site. It’s published somewhere on your site (usually domain.com/sitemap.xml), and you can then submit it to Google and Bing via their webmaster tools. An XML sitemap looks like this:

Example of an XML sitemap
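
A minimal sitemap containing a single URL entry looks roughly like this (the URL and values are placeholders; only the <loc> tag is actually required for each entry):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-page/</loc>
    <lastmod>2023-01-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>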

In theory, XML sitemaps help Google find and crawl all the pages of your site. In reality, modern search engines do not need sitemaps to understand or index your site. A sitemap does not in any way influence the ranking of your site. It will not convince Google to index a page that Google decided not to index. And though a sitemap lets you suggest how often search engines should re-crawl your pages, they generally ignore those hints and figure out their crawl schedule on their own.

Basically I’m telling you that an XML sitemap is not useful for the vast majority of sites out there. About the only time an XML sitemap will do anything for you is when you have orphaned pages (pages that are not linked to from any other page). And the better solution for those pages is to make sure they’re part of the site hierarchy.

As SEOs we continue to build sitemaps mostly because clients expect us to — or because competing agencies use the lack of a sitemap as a way in to steal clients. But the fact is you almost certainly don’t need one, and having one will not help your SEO.

How to Make an XML Sitemap If You Really Want One

You can make a sitemap by manually creating it in Excel or in a text file. There are also a lot of free automatic sitemap generators out there, and they do a fine job. If you are creating a sitemap yourself, manually or programmatically, the formatting details can be found here.

Once you’ve created the sitemap, just upload the file to somewhere on your site.

Submitting Your Sitemap to Google

To submit your sitemap, just log into the Google Search Console for your site and from the left menu select Crawl > Sitemaps. Then click the big Add/Test Sitemap button in the upper right corner. Give Google the URL where you have uploaded your sitemap and click Submit Sitemap.

Submitting an XML sitemap to Google

Once you do, Google will eventually get around to checking it out, and by the next day will report to you how many pages of your sitemap it has indexed. Large sites will quickly note that Google gleefully reports to you how it’s ignoring tons of pages on your sitemap — again, a sitemap does not improve the chances that Google will index your pages.

It’s important to note here that Google is not telling you how many pages it has indexed, but instead is only telling you how many pages on your sitemap it has indexed. For example: you might have 100 pages on your sitemap, and Google tells you it’s indexed 98 out of the 100. However, Google may well have thousands of pages of your site indexed, including URLs you never even knew you had!

One Useful Thing About Sitemaps

The one nice thing about sitemaps is you can use them to get Google to tell you how much of your site it’s actually indexing.

The way to take advantage of this reporting is to split your sitemap into multiple smaller sitemaps, each covering a different selection of URLs. For an ecommerce site you might have your product pages on one sitemap, category pages on another, and list pages on a third. A large service-based site might put all the About pages on one sitemap, pages describing services on another, and blog posts on a third.
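
One way to wire this up (a sketch, with placeholder filenames) is a sitemap index file that points at each of the smaller sitemaps; Search Console will then report indexed counts for each one separately. You can also simply submit each sitemap on its own.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-categories.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-lists.xml</loc></sitemap>
</sitemapindex>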

This then gives you slightly better insight into what Google is indexing. If you find that only half your pages in one category are indexed, you can then start investigating to find out which ones are being left out and why.

It’s worth stressing, however, that this kind of process is only really worthwhile for large sites. If your site only has a few hundred pages, you are not going to have any issues with indexation.

For the smaller sites that I run, including AwesomeDice.com and WarcraftHuntersUnion.com, I didn’t even bother with sitemaps. And I’ve even run sites with millions of pages without XML sitemaps (including one where we finally created a sitemap, and sure enough, it made zero impact on our indexation, rankings, or traffic).

Submitting Your Site to Search Engines

There are a lot of services out there who offer to submit your site to hundreds or thousands of search engines for a fee. They often tout SEO benefits and promise to help kickstart your fledgling site.

These services are scams!

As we already know, there are really only two search engines that matter in the English-speaking world: Google and Bing. What these services really do is list your site in a bunch of spammy link directories.

The best you can hope for is that they just steal your money and do nothing. The worst-case scenario is they really do get you listed in a thousand directories, in which case your site may promptly be penalized for having nothing but spammy links.

In point of fact, most sites don’t need to submit themselves to search engines. After all, it is the job of search engines to discover and crawl every site out there, and these days they are very, very good at it. Once another site links to your new site, the search engine bots will eventually find the link and follow it to your site.

That said, if you absolutely don’t want to wait, you can easily submit your site yourself to the two search engines that matter.

Crosslinking Pages

Another good strategy for large sites is to crosslink pages very low in the hierarchy to each other. This is one of the reasons that large ecommerce retailers will have Related Products and Customers Also Bought links on their product pages (the main reason, of course, is that it’s good for sales).

Since Googlebot crawls in from links (rather than strictly crawling top-down), this kind of crosslinking can provide an additional path for the pages very low in a site’s hierarchy to get crawled.
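
In practice this is usually nothing fancier than a handful of plain anchor links on each product page; a hypothetical sketch:

<div class="related-products">
  <h3>Related Products</h3>
  <a href="/products/blue-widget/">Blue Widget</a>
  <a href="/products/green-widget/">Green Widget</a>
  <a href="/products/widget-polish/">Widget Polish</a>
</div>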

It is usually not necessary to crosslink pages higher in the site hierarchy: by the very nature of being high in the site hierarchy, there will be lots of paths for Google to crawl to them (and they will have lots of authority flowing to them). It’s also worth noting that smaller sites with only a few hundred pages usually do not need to worry about this kind of crosslinking at all.

Crosslinking Gone Wild!

I have seen plenty of sites that get out of control with crosslinking: every page has a giant list of dozens (or hundreds!) of links at the bottom. An inexperienced SEO figured that crosslinking would help flow authority around, and they wanted to get as much authority to as many pages as possible.

At first glance, it seems like a nice theory, but the problem is you are essentially removing the hierarchy of the site and creating a flat structure. Yes, you are getting more authority to all those low-hierarchy pages, but at the cost of lowering the authority of your most important pages.

I’ll explain how this works in detail in the PageRank Flow section. But for now just understand that you want a hierarchical site structure. Crosslinking the bottom of your hierarchy to a few other pages on the bottom can be good for large sites, but going too far will hurt your overall ranking ability.

Hierarchical Structure

Google assumes that your site will have a hierarchical structure, and its algorithm is built on that assumption. This starts with your home page at the top of the hierarchy. Then your global navigation (the links that appear on every page of your site, usually in the header and footer) should link to the next most important pages. Those should link to the next level down, and so on.
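
As a rough, hypothetical sketch, the global navigation carrying that top layer of the hierarchy is just a set of ordinary links repeated on every page, something like:

<nav class="global-nav">
  <a href="/">Home</a>
  <a href="/dice/">Dice</a>
  <a href="/accessories/">Accessories</a>
  <a href="/about/">About</a>
</nav>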

As a general rule of thumb, every page should link to the top layer of the hierarchy (the global nav) and each page should also link to the pages above and below it in the hierarchy. For larger sites this is where breadcrumbs come in useful: go to the product page of most ecommerce sites and you’ll see a list of links showing the hierarchical path you took to get there. These breadcrumbs provide another crawl path for bots, as well as flowing link authority up to more important pages.

Example of breadcrumbs on a product page
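
Under the hood, breadcrumbs are just ordinary links pointing back up the hierarchy; a hypothetical product page might include something like:

<nav class="breadcrumbs">
  <a href="/">Home</a> &gt;
  <a href="/dice/">Dice</a> &gt;
  <a href="/dice/d20-sets/">D20 Sets</a> &gt;
  Opaque Blue D20 Set
</nav>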

This hierarchical structure is very important for ranking, as we’ll discuss in Authority: On Site, but it’s also important for crawlability. Googlebot generally crawls into your site from an external link: that link may point to your home page, or it may point to something at the very bottom of your hierarchy. Googlebot will continue to crawl through the links it finds on that page and subsequent pages; however, at some point it’ll stop.

A strong hierarchical structure will make sure that regardless of where Googlebot enters your site, it’s definitely going to crawl the most important pages. If anything on your site gets skipped, you want it to be the least important ones.

As you can imagine, it’s vital that every page of your site is linked to from somewhere within that hierarchy. A page is only going to get crawled if another page somewhere on the web is linking to it.

By Definition Hierarchy is Not Flat

It’s worth stressing here that you should not go crazy and embed hundreds of links on every page, linking like crazy to every other page. This will do bad things for your authority, as discussed later in the authority sections, but it also removes the hierarchical structure of your site.

Remember, Google expects the most important pages to be linked to the most, and the least important to be linked to the least. I know that some business owners think every page of their site is the most important, but if you link to everything equally, you are creating a flat structure where no pages are important in Google’s eyes.

The Dear God Don’t List of Crawlability

The crawlability of your site is vital for improving your indexation; however, it’s also the easiest part of SEO. Most sites that have a logical structure will have no problem being crawled by Google — after all, Google was built to be able to crawl sites. The vast majority of SEOs don’t have to worry about crawlability at all.

However, there are site designers out there who manage to do some truly spectacularly bone-headed things that prevent Google from seeing the site.

The Dear God Don’t List

Before we get into the stuff that you should be doing, here are the truly horrible things that you absolutely do not want to be doing. These are the spectacular SEO fails of site design. Unfortunately I have indeed seen sites that do each of these things — any one of them will prevent Google from crawling your site, which means you won’t ever show up in Google at all.

  • Don’t design your site in Flash or within a single AJAX frame. If you do this, Google cannot see any of your content. There are technically ways to design a single-page application (SPA) that doesn’t totally destroy your SEO, but even those tend to perform worse than a real site time after time.
  • Don’t use JavaScript for your links: doing so can essentially hide your links from Google (see the sketch after this list). Not only will Google not be able to use your navigation links to find other parts of your site, but it may think that you’re trying to cloak your links and penalize you for it.
  • Don’t block all bots in your robots.txt file — doing so is explicitly telling Google not to crawl any page of your site.
  • Don’t use the noindex tag on every page of your site — doing this explicitly tells Google not to index those pages.
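
To make the JavaScript-links point concrete, here is a hypothetical before-and-after; the first version gives crawlers no real URL to follow, while the second is a normal, crawlable anchor:

<!-- Hard for crawlers: no real href to follow -->
<span onclick="window.location='/products/'">Products</span>

<!-- Crawlable: a plain anchor with a real URL -->
<a href="/products/">Products</a>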

Okay, got that out of the way. Now we can sit back and have a pleasant conversation about SEO and crawlability.