Creating a robots.txt file
By Sumantra Roy
Some people believe
that they should create different pages for different search
engines, each page optimized for one keyword and for one
search engine. Now, while I don't recommend that people
create different pages for different search engines, if you
do decide to create such pages, there is one issue that you
need to be aware of.
These pages, although
optimized for different search engines, often turn out to be
pretty similar to each other. The search engines now have the
ability to detect when a site has created such similar
looking pages and are penalizing or even banning such sites.
In order to prevent your site from being penalized for
spamming, you need to prevent each search engine's spider from
indexing pages which are not meant for it, i.e. you need to
prevent AltaVista from indexing pages meant for Google and
vice versa. The best way to do
that is to use a robots.txt file.
You should create a robots.txt
file using a text editor like Windows Notepad. Don't use your
word processor to create such a file.
Here is the basic syntax of
the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell
AltaVista's spider, Scooter, not to spider the file named
myfile1.html residing in the root directory of the server,
you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's spider,
called Googlebot, not to spider the files myfile2.html and
myfile3.html, you would write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put
multiple User-Agent records in the same robots.txt file,
separating each record from the next with a blank line.
Hence, to tell AltaVista not to spider the file named
myfile1.html, and to tell Google not to spider the files
myfile2.html and myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
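If you have Python available, its standard urllib.robotparser module offers a quick way to see how a file like this is read. The sketch below is only an illustration: it parses the combined file from a string instead of fetching it from a server, and the helper name allowed is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The combined file from the text; a blank line separates the two records.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
"""


def allowed(agent, path):
    """Return True if the named spider may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Scooter", "Googlebot"):
        for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
            print(agent, path, allowed(agent, path))
```

Each spider is blocked only from the files listed under its own User-Agent line; the other spider's record has no effect on it.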
If you want to prevent all
robots from spidering the file named myfile4.html, you can
use the * wildcard character in the User-Agent line, i.e. you
would write
User-Agent: *
Disallow: /myfile4.html
However, the original robots.txt standard does not allow the
wildcard character in the Disallow line, so do not rely on it there.
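A short Python sketch, using the standard urllib.robotparser module, shows the effect of the * wildcard in the User-Agent line. The helper name blocked_for and the page /index.html are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

# One record that applies to every spider.
parser = RobotFileParser()
parser.parse(["User-Agent: *", "Disallow: /myfile4.html"])


def blocked_for(agent):
    """Return True if the named spider is told not to fetch /myfile4.html."""
    return not parser.can_fetch(agent, "/myfile4.html")


if __name__ == "__main__":
    for agent in ("Scooter", "Googlebot", "Slurp"):
        print(agent, "blocked:", blocked_for(agent))
```

Whatever spider name is passed in, the single record applies to it, while pages that are not listed (such as the hypothetical /index.html) stay open.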
Once you have created the
robots.txt file, you should upload it to the root directory
of your domain. Uploading it to any sub-directory won't work
- the robots.txt file needs to be in the root directory.
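Because the file always sits in the root directory, its address can be worked out from the address of any page on the site. The Python sketch below shows one way to do that; the function name robots_url and the example.com addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/travel/tourism-in-australia-al.html"))
```

A copy placed at http://www.example.com/travel/robots.txt would never be found at this address, which is why the sub-directory upload does not work.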
I won't discuss the syntax and
structure of the robots.txt file any further - you can get
the complete specifications from http://www.robotstxt.org/wc/norobots.html
Now we come to how the
robots.txt file can be used to prevent your site from being
penalized for spamming in case you are creating different
pages for different search engines. What you need to do is to
prevent each search engine from spidering pages which are not
meant for it.
For simplicity, let's assume
that you are targeting only two keywords: "tourism in
Australia" and "travel to Australia". Also,
let's assume that you are targeting only three of the major
search engines: AltaVista, HotBot and Google.
Now, suppose you have followed
the following convention for naming the files: Each page is
named by separating the individual words of the keyword for
which the page is being optimized by hyphens. To this is
added the first two letters of the name of the search engine
for which the page is being optimized.
Hence, the files for AltaVista
are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for Google are
tourism-in-australia-go.html
travel-to-australia-go.html
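The naming convention is mechanical enough to express in a few lines of Python. This is only a sketch of the convention described above; the function name page_name is an assumption, and it expects the keyword words to be separated by spaces.

```python
def page_name(keyword, engine):
    """Join the keyword's words with hyphens and append the first two
    letters of the search engine's name, all in lower case."""
    slug = "-".join(keyword.lower().split())
    return f"{slug}-{engine[:2].lower()}.html"


if __name__ == "__main__":
    for engine in ("AltaVista", "HotBot", "Google"):
        for keyword in ("tourism in Australia", "travel to Australia"):
            print(page_name(keyword, engine))
```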
As I noted earlier,
AltaVista's spider is called Scooter and Google's spider is
called Googlebot.
A list of spiders for the
major search engines can be found at http://www.jafsoft.com/searchengines/webbots.html
Now, we know that HotBot uses Inktomi and from this list,
we find that Inktomi's spider is called Slurp. Using this
knowledge, here's what the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines
in the robots.txt file, you instruct each search engine not
to spider the files meant for the other search engines.
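With more keywords or more search engines, typing these lines by hand invites mistakes. One possible approach is to generate the file from the keyword list and the spider names, as in the Python sketch below. The names KEYWORDS, SPIDERS, page_name and build_robots_txt are assumptions made for this example, not part of any standard tool.

```python
KEYWORDS = ["tourism in Australia", "travel to Australia"]

# Search engine name -> name of the spider that crawls for it.
SPIDERS = {"AltaVista": "Scooter", "HotBot": "Slurp", "Google": "Googlebot"}


def page_name(keyword, engine):
    """File name under the convention: hyphenated keyword + 2-letter suffix."""
    slug = "-".join(keyword.lower().split())
    return f"{slug}-{engine[:2].lower()}.html"


def build_robots_txt(keywords, spiders):
    """Write one record per spider that disallows every page
    optimized for a different search engine."""
    records = []
    for engine, spider in spiders.items():
        lines = [f"User-Agent: {spider}"]
        for other_engine in spiders:
            if other_engine == engine:
                continue  # a spider may read its own pages
            for keyword in keywords:
                lines.append(f"Disallow: /{page_name(keyword, other_engine)}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(KEYWORDS, SPIDERS), end="")
```

Adding a keyword to KEYWORDS or an engine to SPIDERS then updates every record at once.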
When you have finished
creating the robots.txt file, double-check to ensure that you
have not made any errors anywhere in it. A small error can
have disastrous consequences - a search engine may spider
files which are not meant for it, in which case it can
penalize your site for spamming, or it may not spider any
files at all, in which case you won't get top rankings in
that search engine.
A useful tool to check the
syntax of your robots.txt file can be found at http://www.tardis.ed.ac.uk/~sxw/robots/check/.
While it will help you correct syntactical errors in the
robots.txt file, it won't help you correct any logical
errors, for which you will still need to go through the
robots.txt thoroughly, as mentioned above.
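Part of that logical check can be automated. The Python sketch below uses the standard urllib.robotparser module to confirm that each spider can read its own pages and nothing meant for the others. The names ROBOTS_TXT, PAGES and find_logic_errors are hypothetical, and the parser only approximates how a real spider reads the file, so treat a clean result as a sanity check and not as a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The file from the article, with a blank line between records.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
"""

# Which pages are meant for which spider.
PAGES = {
    "Scooter": ["tourism-in-australia-al.html", "travel-to-australia-al.html"],
    "Slurp": ["tourism-in-australia-ho.html", "travel-to-australia-ho.html"],
    "Googlebot": ["tourism-in-australia-go.html", "travel-to-australia-go.html"],
}


def find_logic_errors(robots_txt, pages_by_spider):
    """List (spider, page, actual) tuples wherever a spider is allowed
    a page meant for another spider, or blocked from its own page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    errors = []
    for spider in pages_by_spider:
        for owner, pages in pages_by_spider.items():
            for page in pages:
                allowed = parser.can_fetch(spider, "/" + page)
                if allowed != (owner == spider):
                    errors.append((spider, page, "allowed" if allowed else "blocked"))
    return errors


if __name__ == "__main__":
    print(find_logic_errors(ROBOTS_TXT, PAGES))
```

An empty list means every spider sees exactly the pages intended for it; any tuple in the result points to a line that is missing or misplaced.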
Article by Sumantra Roy.
Sumantra is one of the most respected search engine
positioning specialists on the Internet. To have Sumantra's
company place your site at the top of the search engines, go
to http://www.1stSearchRanking.com/t.cgi?3114
For more advice on how you can take your web site to the top
of the search engines, subscribe to his FREE newsletter by
going to http://www.1stSearchRanking.com/t.cgi?3114&newsletter.htm