Robots.txt File Guide
By David Callan
We all know search engine optimization is a tricky business.
Sometimes we rank well on one engine for a particular keyphrase
and assume that all search engines will like our pages, so we'll
rank well for that keyphrase on a number of engines.
Unfortunately, this is rarely the case. All the major search
engines differ somewhat, so what gets you ranked high on one
engine may actually help to lower your ranking on another.
It's for this reason that some people like to optimize pages for
each particular search engine. Usually these pages are only
slightly different, but this slight difference can make all the
difference when it comes to ranking high. However, because search
engine spiders crawl through sites indexing every page they can
find, they might come across your engine-specific pages and notice
that they're very similar. The spiders may then think you're
spamming and do one of two things: ban your site altogether or
severely punish you in the form of lower rankings.
So what can you do to, say, stop Google indexing pages that are
meant for Altavista? The solution is really quite simple, and I'm
surprised that more webmasters who optimize for each search engine
don't use it. It's done using a robots.txt file which resides on
your webspace.
A robots.txt file is a vital part of any webmaster's battle
against getting banned or punished by the search engines if he or
she designs different pages for different search engines.
The robots.txt file is just a simple text file, as the file
extension suggests. It's created using a simple text editor like
Notepad or Wordpad; make sure it is saved as plain text. Word
processors such as Microsoft Word add their own formatting, which
will corrupt the file.
Here's the code you need to insert into the file. The field names
User-Agent: and Disallow: are compulsory and never change, while
the values in brackets have to be changed to suit the file and the
engine you want to keep away from it.
User-Agent: (Spider Name)
Disallow: (File Name)
The User-Agent is the name of the search engine's spider and
Disallow is the path of the file that you don't want that spider
to crawl. The field names themselves are not case sensitive, so
User-Agent and user-agent both work. The file path is case
sensitive, though, and it must be written starting from the root
of your site with a leading slash (for example /page.html).
You have to start a new block of code for each engine, but if you
want to disallow multiple files you can list them one under
another. For example:
# Slurp is Inktomi's spider
User-Agent: Slurp
Disallow: /internet-marketing-gg.html
Disallow: /internet-marketing-al.html
Disallow: /advertising-secrets-gg.html
Disallow: /advertising-secrets-al.html
In the above code I have stopped Inktomi's spider from crawling
two pages optimized for Google (internet-marketing-gg.html and
advertising-secrets-gg.html) and two pages optimized for Altavista
(internet-marketing-al.html and advertising-secrets-al.html). If
Inktomi were allowed to spider these pages as well as the pages
specifically made for Inktomi, I would run the risk of being
banned or penalized, so it's always a good idea to use a
robots.txt file.
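
One way to see what a record like the Slurp one above actually
blocks is to feed it to the robots.txt parser in Python's standard
library. This is only a rough sketch: the domain
www.example.com is a placeholder, the helper name is_allowed is
made up for illustration, and the paths are written with a leading
slash because that is how the parser matches them. Real spiders
may interpret the file slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The Slurp record from the example, with paths given from the site root.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /internet-marketing-gg.html
Disallow: /internet-marketing-al.html
Disallow: /advertising-secrets-gg.html
Disallow: /advertising-secrets-al.html
"""

SITE = "http://www.example.com"  # placeholder domain, not from the article


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = SITE + "/internet-marketing-gg.html"
    for agent in ("Slurp", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Only the spider named in the record is kept away from the listed
pages; any spider without a matching record is free to crawl them.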
I mentioned earlier that the robots.txt file resides on your
webspace, but where on your webspace? In the root directory. If
you upload the file to a sub-directory it won't work.
If you want to block certain engines from files that do not reside
in your root directory, you simply include the directory in the
path and then list the file as normal. For example:
# Slurp is Inktomi's spider
User-Agent: Slurp
Disallow: /folder/internet-marketing-gg.html
Disallow: /folder/internet-marketing-al.html
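
Since spiders only look for the file in the root directory, the
robots.txt address for any page can be worked out from the page's
own address. The sketch below shows one way to do that in Python;
the function name robots_url and the example.com address are
illustrative assumptions, not part of any search engine's tooling.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Placeholder address used only for illustration.
    print(robots_url("http://www.example.com/folder/internet-marketing-gg.html"))
    # prints http://www.example.com/robots.txt
```

Whatever folder a page lives in, the rules that cover it always
come from the single file at the root of the site.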
If you want to disallow all engines from indexing a file, you
simply use the * character where the engine's name would usually
be. Beware, however, that under the original robots.txt standard
the * character does not work as a wildcard on the Disallow line.
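
If you maintain rules for several engines, it can help to generate
the file rather than type it by hand. Here is a minimal sketch,
assuming a hypothetical helper called build_robots_txt that takes a
mapping of spider names to the paths each one should stay away
from. It then uses Python's standard robots.txt parser to confirm
that a * record applies to every spider.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(rules):
    """Build robots.txt text from a dict of {spider name: [paths]}."""
    records = []
    for agent, paths in rules.items():
        lines = ["User-agent: " + agent]
        lines += ["Disallow: " + path for path in paths]
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


if __name__ == "__main__":
    # "*" in place of a spider name means the record is for all engines.
    text = build_robots_txt({"*": ["/internet-marketing-gg.html"]})
    print(text)

    parser = RobotFileParser()
    parser.parse(text.splitlines())
    url = "http://www.example.com/internet-marketing-gg.html"  # placeholder
    for agent in ("Googlebot", "Scooter", "Slurp"):
        print(agent, "allowed:", parser.can_fetch(agent, url))
```

Keep in mind that under the original standard a * record is the
default for spiders that have no record of their own, so a spider
with its own named record follows that record instead.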
Here are the spider names of a few of the big engines. Do a search
for 'search engine user agent names' on Google to find more.
Excite - ArchitextSpider
Altavista - Scooter
Lycos - Lycos_Spider_(T-Rex)
Google - Googlebot
Alltheweb - FAST-WebCrawler/
Be sure to check over the file before uploading it, as you may
have made a simple mistake which could mean your pages are indexed
by engines you don't want to index them or, even worse, that none
of your pages are indexed at all.
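
Part of that checking can be automated. The following is a rough
sketch of a checker, not an official validator: the function name
check_robots_txt is made up for illustration, and it only knows
the two fields covered in this article, so any other field is
simply reported as not covered.

```python
def check_robots_txt(text):
    """Return a list of problems found in robots.txt text (empty if none)."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # comment-only line
        if ":" not in line:
            problems.append(f"line {number}: expected 'Field: value'")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()  # field names are not case sensitive
        if field == "user-agent":
            if not value:
                problems.append(f"line {number}: User-agent has no spider name")
            in_record = True
        elif field == "disallow":
            if not in_record:
                problems.append(f"line {number}: Disallow appears before any User-agent")
            if value and not value.startswith("/"):
                problems.append(f"line {number}: path '{value}' should start with '/'")
        else:
            problems.append(f"line {number}: field '{field}' is not covered by this check")
    return problems


if __name__ == "__main__":
    sample = "User-Agent: Slurp\nDisallow: folder/internet-marketing-gg.html\n"
    for problem in check_robots_txt(sample):
        print(problem)
```

A check like this only catches formatting slips. It cannot tell
you whether you have listed the right pages for the right engine,
so read through the file yourself as well.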
A little note before I go: I have listed the User-Agent names of a
few of the big search engines, but in reality it's not worth
creating different pages for more than six or seven search
engines. It's very time consuming, and the results would be
similar to those you'd get from creating different pages for only
the top five. More is not always better.
Now you know how to make a robots.txt file to stop you from
getting banned by the search engines. Wasn't that easy? Till next
time!
Article by David Callan. David is an Internet marketing
professional and webmaster of
http://www.akamarketing.com/webmaster-forums/. Visit his
webmaster forums for the latest discussions on search engines,
website authoring and Internet marketing related issues and
topics.