Wednesday, June 8, 2011

What is a Robots.txt file?

Robots.txt is a plain text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. You may not want certain pages of your site crawled because they would not be useful to users if they turned up in a search engine's results. Often, I get quizzical looks during my training sessions when I talk about this subject. Why would you not want search engines to index all the content of your site?

For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the print version excluded from crawling; otherwise you risk incurring a duplicate content penalty. Similarly, if you have sensitive data on your site that you do not want the world to see, you will prefer that search engines do not index those pages (keep in mind, though, that robots.txt only asks crawlers to stay away; it does not actually protect the content).
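As a rough sketch, assuming the printer-friendly copies live under a hypothetical /print/ directory, a rule like this would ask compliant crawlers to skip them:

User-agent: *
# /print/ is a made-up directory used only for illustration
Disallow: /print/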

The location of robots.txt is very important. It must be in the main directory, because otherwise user agents (search engines) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory (i.e. http://www.supersavvyme.com/robots.txt), and if they don't find it there, they simply assume that the site has no robots.txt file and index everything they find along the way. So, if you don't put robots.txt in the right place, don't be surprised if search engines index your whole site.

Here are some quick things to remember:

Location: This file, which must be named "robots.txt", is placed in the root directory of your site.

URL: The URL of your robots.txt should always be http://www.yoursite.com/robots.txt

Example:
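A minimal file along these lines might read as follows; the Sitemap address is just a placeholder for your site's actual sitemap URL:

User-agent: *
Allow: /
# placeholder address; point this at your site's real XML sitemap
Sitemap: http://www.yoursite.com/sitemap.xml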

In the example above, the webmaster has allowed all search engines to crawl the whole site and pointed them to its XML sitemap. In more detail:

"User-agent" names the search engine crawlers that the rules apply to
"Allow: /" lists the files and directories the webmaster would like all search engines to crawl (here, the entire site)
"*" denotes all compliant search engine bots
"Sitemap" tells crawlers where to find the site's XML sitemap
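If you want rules to apply to a single crawler rather than to every bot, you can name that crawler on the User-agent line instead of using the wildcard. A small sketch, using Googlebot purely as an example:

User-agent: Googlebot
Allow: /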

You can also use the Disallow directive, which indicates to search engines the files and directories to be excluded from crawling. Below is an example:
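User-agent: *
# keep compliant bots out of /images/ and out of any path that starts with /search
Disallow: /images/
Disallow: /search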

The example above means that all compliant search engine bots (denoted by the wildcard * symbol) should not crawl any content under /images/ or any URL whose path begins with /search.

If you do want to prevent search engines from crawling your pages, Google Webmaster Tools has a friendly robots.txt generator to help you create this file. Note that if your site uses subdomains and you wish to have certain pages on a particular subdomain left uncrawled, you'll have to create a separate robots.txt file for that subdomain.
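For instance, assuming a hypothetical blog subdomain, each host is governed only by the file at its own root:

http://www.yoursite.com/robots.txt    (applies to www.yoursite.com only)
http://blog.yoursite.com/robots.txt   (needed separately for the blog subdomain)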

For more information on robots.txt, I suggest the Webmaster Help Center guide on using robots.txt files.
