How to Control Search Engine Spiders for Improved Rankings

To get your blog or website listed near the top of the search engines' keyword rankings, you need a deeper understanding of the search engine spiders that crawl your website. After all, these spiders gather the information the search engines use to judge the relevance of your website and decide where your pages will land in the search engine results pages (SERPs). By learning how to control the direction of the spiders, you can improve your website's chances of rising in the rankings.

Gaining Control with the Help of Robots.txt

You may think that gaining control of search engine spiders is an impossible task, but it is actually quite easy: all you need to do is take advantage of a handy little tool called the robots.txt file. With the robots.txt file, you can steer the spiders toward the most important pages on your website while preventing them from wasting time on the more obscure pages, such as your About Us and Privacy Policy pages. After all, these pages will not do much to increase your search engine ranking and will not help your target market find your website, so why should the spiders spend their time on them while they are crawling your site?

Another positive aspect of using a robots.txt file is that it can keep the spiders away from duplicate pages. This is beneficial because having duplicate content may actually reduce your search engine ranking. Likewise, while you are making changes to your website or working on an area that is not fully developed yet, you can instruct the spiders to leave those pages alone until you are ready to have them crawled. The same is true if you have a blog (or a blog on your website), as a blog post created in WordPress (or on Blogger, Typepad, etc.) will show up on the main post page, on an archive page, on a category page and on a tag page. With the help of the robots.txt file, you can instruct the spiders to look only at the main post page.

With a robots.txt file, you can tell the search engine spiders which pages they should and should not crawl and index. It is important to keep in mind, however, that the robots.txt file is meant to keep search engine spiders out of certain pages. Therefore, you only need to list the pages you do not want the spiders to crawl.

Implementing the Robots.txt Tool

To successfully use the robots.txt tool, you first need to determine which pages you do not want the spiders to search. Then slowly begin making the changes to your site. By using the tool on only a couple of pages at a time, it will be easier for you to identify mistakes that you may have made during the process.

To make your changes, you will need to add the robots.txt file to the root directory of your domain or of your subdomains. Adding it to a subdirectory will not work. For example, you may add the robots.txt file at a URL such as

http://domain.com/robots.txt

or to

http://privacypolicy.domain.com/robots.txt

But adding it to a subdirectory such as

http://www.domain.com/privacypolicy/robots.txt

will not work. With just one robots.txt file within your root directory, you can manage your entire site. If you have subdomains, however, you will need a robots.txt file for each one that you need to manage. You will also need separate robots.txt files for your secure (https) and nonsecure (http) pages.
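To make the location rule concrete, here is a small Python sketch that works out which robots.txt file governs a given page. The function name `robots_txt_url` and the page paths are my own illustrations, not part of any tool; the idea is simply that the file always sits at the root of the same protocol and host as the page.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the protocol and host (including any port); swap the path for /robots.txt
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# A page in a subdirectory is still governed by the file at the root
print(robots_txt_url("http://www.domain.com/privacypolicy/index.html"))
# A subdomain, or the https version of a site, has its own file
print(robots_txt_url("https://privacypolicy.domain.com/terms.html"))
```

The first call points at http://www.domain.com/robots.txt and the second at https://privacypolicy.domain.com/robots.txt, which is why a file placed inside /privacypolicy/ is never consulted.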

Creating a Robots.txt File

Creating a robots.txt file is relatively easy, as you only need to create a plain text file named robots.txt in any text editor, such as TextPad, Notepad or Apple TextEdit. Your robots.txt file only needs to contain two lines in order to be effective. If you wanted to stop the spiders from searching the archives of the blog on your site, for example, you would add the following to your robots.txt file:

User-agent: *
Disallow: /archives/

The “User-agent” line is used to define which search engine spiders you want to have blocked. By placing the asterisk (*) here, you are instructing all search engine spiders to avoid the specified pages. You can, however, target a specific search engine spider by replacing the asterisk with one of the following user-agent names:

Google – Googlebot

Yahoo – Slurp

Microsoft – msnbot

Ask – Teoma
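If you would like to test a rule before uploading it, Python's standard library includes a robots.txt parser. The sketch below feeds it the two-line archives example and asks whether a spider may fetch a couple of URLs. The page URLs are made up for illustration, and this parser only does simple prefix matching, so treat it as a rough check rather than a copy of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /archives/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) reports whether that spider may crawl the URL
print(parser.can_fetch("Googlebot", "http://domain.com/archives/2009/05/"))  # False
print(parser.can_fetch("Googlebot", "http://domain.com/about.html"))         # True
```

Because the rule is addressed to the asterisk, swapping "Googlebot" for "Slurp" or any other name gives the same answers.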

The “Disallow” line specifies which part of the site you want the spiders to ignore. For instance, if you want the spiders to ignore the categories portion of your blog, you would replace “archives” with “category” and so on. If you wanted to instruct the spiders to ignore multiple sections, you would simply add a new “Disallow” line for each area you want to be ignored.

Just as you can name specific areas that you want the spiders to avoid, you can also open specific areas to specific spiders. For example, while you may want most spiders to avoid a specific area, you may want the MSN mediabot, Google image bot or Google AdWords bot to visit those areas. In this case, you can use the asterisk to instruct all search engines to avoid the area while adding an “Allow” line for one specific spider. If you want Google’s AdSense bot to access a folder, for instance, you would create the following rules:

User-agent: *
Disallow: /folder/

User-agent: Mediapartners-Google
Allow: /folder/

You can also use your robots.txt files to prevent dynamic URLs from being indexed by the search engine spiders. You can accomplish this with the following template:

User-agent: *
Disallow: /*&

With this rule, you are instructing the spiders to skip every URL that contains an ampersand, so that only one version of each dynamic page is indexed. For example, if you had the following dynamic URLs:

* /vintageguitars/details.php?propcode=ANCHORS&SRCH=tr

* /vintageguitars/details.php?propcode=ANCHORS&vr=1

* /vintageguitars/details.php?propcode=ANCHORS

your robots.txt instructions will tell the spiders to list only the third example because it will disallow any URLs that start with a forward slash (/) and contain the & symbol. You can use the same strategy to block any URLs containing a question mark by using the following:

User-agent: *
Disallow: /*?
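To see why these wildcard rules behave as described, here is a rough Python sketch of the matching. It assumes the common convention that an asterisk stands for any run of characters and that a dollar sign at the end of a pattern marks the end of the URL. The function `rule_matches` is my own helper, not something a search engine publishes, and real spiders may differ in the details.

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Check a URL path against a robots.txt pattern containing * and $."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then let each * match any run of characters
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        # A trailing $ means the pattern must cover the whole URL
        return re.fullmatch(regex, url_path) is not None
    # Otherwise the pattern only has to match from the first character onward
    return re.match(regex, url_path) is not None

urls = [
    "/vintageguitars/details.php?propcode=ANCHORS&SRCH=tr",
    "/vintageguitars/details.php?propcode=ANCHORS&vr=1",
    "/vintageguitars/details.php?propcode=ANCHORS",
]
for url in urls:
    print(url, "blocked" if rule_matches("/*&", url) else "allowed")
```

Only the third URL comes out as allowed under the ampersand rule. Running the same loop with the question mark pattern blocks all three, since every one of them contains a question mark.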

You can block all directories that contain a specific word in the URL as well. For example, you might create a robots.txt file such as the following:

User-agent: *
Disallow: /*gibson

With this rule, any page with a URL containing the word “gibson” will not be crawled by the spiders. Keep in mind that URL paths are matched case-sensitively, so a folder named “Gibson” would need its own rule. It is important to be extra cautious when using these directives, however, as they will cause the spiders to avoid all pages containing the word you specify. As a result, you may accidentally block pages that you do want to be indexed. If you do want to block all but one or two pages with URLs containing a specific word, you need to create a robots.txt file that specifically allows the page you still want to be indexed. In this case, your robots.txt file would look something like this:

User-agent: *
Disallow: /*gibson
Allow: /vintageguitars/gibsonandlespaul/details.html

It is also possible for you to instruct the spiders to avoid an entire folder on your website while still allowing it to access specific pages within that folder. To achieve this, you would write something like:

User-agent: *
Disallow: /category/
Allow: /category/just-this-page.html
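When an Allow line and a Disallow line both match a URL, the spider has to pick one. Google documents that the most specific rule (the one with the longest path) wins, and the Robots Exclusion Protocol standard (RFC 9309) says the same. The Python sketch below shows that idea using plain prefix matching. The function `is_allowed` is my own simplified helper, and not every spider follows this rule: some simpler parsers, including Python's built-in urllib.robotparser, just use the first line that matches, so listing the Allow line first is the safer order.

```python
def is_allowed(rules, url_path):
    """Apply (directive, path) rules with longest-match precedence.

    The rule with the longest matching path wins; on a tie, Allow wins.
    Paths are compared as plain prefixes (no wildcard handling here).
    """
    best_len, allowed = -1, True  # no matching rule means the URL is allowed
    for directive, path in rules:
        if url_path.startswith(path):
            is_allow = directive.lower() == "allow"
            if len(path) > best_len or (len(path) == best_len and is_allow):
                best_len, allowed = len(path), is_allow
    return allowed

rules = [
    ("Disallow", "/category/"),
    ("Allow", "/category/just-this-page.html"),
]
print(is_allowed(rules, "/category/just-this-page.html"))  # True
print(is_allowed(rules, "/category/other-page.html"))      # False
print(is_allowed(rules, "/about.html"))                    # True
```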

It is important to note that the search engine spiders will ignore general directives if you have one that addresses a specific spider. For example, if you create the following robots.txt:

User-agent: *
Disallow: /category/

User-agent: Googlebot
Disallow: /archives/

the Google spider will still crawl and index the category page because you listed a directive that was specific to the Googlebot, and that overrides the directive that addresses all search engine spiders. So, if you list a specific spider in your robots.txt, you need to individually list all of the things you want that spider to avoid. In our example, you would have to create the following robots.txt file to get the Googlebot to avoid the category and archives sections while telling all other spiders to avoid the category section:

User-agent: *
Disallow: /category/

User-agent: Googlebot
Disallow: /archives/
Disallow: /category/
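You can watch this override happen with Python's built-in robots.txt parser, which picks a spider's own group over the asterisk group in the same way. The sketch below parses both versions of the file and asks about a category page. The helper `parse_rules` and the page path are mine, for illustration only.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

first_try = parse_rules("""\
User-agent: *
Disallow: /category/

User-agent: Googlebot
Disallow: /archives/
""")

corrected = parse_rules("""\
User-agent: *
Disallow: /category/

User-agent: Googlebot
Disallow: /archives/
Disallow: /category/
""")

page = "/category/guitars/"
print(first_try.can_fetch("Googlebot", page))  # True: Googlebot only reads its own group
print(first_try.can_fetch("Slurp", page))      # False: other spiders use the * group
print(corrected.can_fetch("Googlebot", page))  # False: the rule is repeated for Googlebot
```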

If you want the spiders to avoid indexing certain types of files, you will need to use the dollar sign symbol ($). To instruct the spiders to avoid GIF files, for instance, you would use the following:

User-agent: *
Disallow: /*.gif$

You would use the same pattern for other types of files that you may want to be avoided by the spiders, such as .pdf$, .jpg$ or .jpeg$.

Addressing Other Search Engine Concerns

In addition to blocking certain pages from being indexed by the search engines, there are a number of other concerns you may address with the robots.txt file. For instance, if the search engine spiders are downloading your pages too quickly and causing difficulties with your server, you can add a crawl-delay directive to your file that tells the spider how many seconds it should wait between downloads. In general, it is best to set this directive low, such as somewhere between 0.5 and 1 second, and then to increase it later if necessary. This robots.txt file would look something like:

User-agent: *
Crawl-delay: 0.7

Google does not follow the crawl delay directive, however, and it generally is not necessary to add this directive to your robots.txt file.
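For a sense of how a well-behaved spider might read this directive, here is a short Python sketch using the standard library parser (the `crawl_delay` method needs Python 3.6 or later). "ExampleBot" is a made-up spider name. One caution from my own reading of the parser: the CPython versions I am aware of only accept whole numbers here, so a fractional value such as 0.7 appears to be treated as if no delay were set. Other spiders may have the same limitation, which is a reason to prefer whole seconds.

```python
from urllib.robotparser import RobotFileParser

def delay_for(robots_text, user_agent):
    """Return the crawl delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(user_agent)

whole_seconds = """\
User-agent: *
Crawl-delay: 1
"""

fractional = """\
User-agent: *
Crawl-delay: 0.7
"""

print(delay_for(whole_seconds, "ExampleBot"))  # 1
# In the parser versions I know of, the fractional value is not recognized
print(delay_for(fractional, "ExampleBot"))
```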

Another handy aspect of the robots.txt file is that it can point the spiders to your XML sitemap. To do so, add a line such as:

Sitemap: http://www.yoursitename.com/sitemap.xml

By using your robots.txt file in this way, you can submit your XML sitemap to search engines without registering with a variety of different Webmaster Tools programs. You can also store your XML sitemap anywhere you like with this tool, which can be helpful if you manage several sites and want to keep all of your XML sitemaps in one place.
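As a quick check that the sitemap line is written correctly, the sketch below reads it back with Python's standard library parser (the `site_maps` method needs Python 3.8 or later). The archives rule is included only to show that the Sitemap line sits alongside your other rules; it is not tied to any one spider.

```python
from urllib.robotparser import RobotFileParser

robots_text = """\
User-agent: *
Disallow: /archives/

Sitemap: http://www.yoursitename.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_text.splitlines())

# site_maps() returns the list of sitemap URLs found, or None if there are none
print(parser.site_maps())  # ['http://www.yoursitename.com/sitemap.xml']
```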

Finally, it is important to realize that it is still possible for a search engine to index pages that you have blocked in your robots.txt file. There are a number of reasons why this may happen. For example, if someone links to the page, the search engine can discover the URL through that link and list it without ever crawling the page itself. To close this opening, you will need to unblock the page in your robots.txt file and place a meta noindex tag on the page. Leave the page unblocked until the spiders have revisited it and dropped it from the index, because they cannot read the noindex tag on a page they are not allowed to crawl.
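The noindex tag itself is a single line in the head of the page, usually written as a meta tag named "robots" with the content "noindex". If you want to confirm that a page really carries it, a small Python sketch like the one below will do. The class and function names are my own, and the sample page is invented for the example.

```python
from html.parser import HTMLParser

class NoindexFinder(HTMLParser):
    """Remember whether a <meta name="robots"> tag containing noindex was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

def has_noindex(html: str) -> bool:
    finder = NoindexFinder()
    finder.feed(html)
    return finder.noindex

page = '<html><head><meta name="robots" content="noindex"></head><body>Draft</body></html>'
print(has_noindex(page))  # True
```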

In this post I have shown you how you can use robots.txt files to guide the spiders and support the ranking of your site in the results pages of the major search engines.

Good luck!


Eblogging Tricks