Robots, Agents and Spiders - Identifying Search Engine Crawlers
If you've been surfing search engine optimization web sites,
you've no doubt come across these terms on many occasions.
Crawlers, Agents, Bots, Robots and Spiders
These five terms all describe basically the same thing, and in this
article they'll be referred to collectively as spiders or
"agents". A search engine spider is an automated software program
used to locate and collect data from web pages for inclusion in a
search engine's database and to follow links to find new pages on
the World Wide Web. The term "agent" is more commonly applied to
web browsers and mirroring software.
If you've ever examined your server logs or web site traffic
reports, you've probably come across some weird and wonderful
names for search engine spiders, including "Fluffy the Spider" and
Slurp. Depending upon the type of web traffic reports you receive,
you may find spiders listed in the "Agents" section of your
statistics.
Not all spiders are good
Who actually owns these spiders? It's good to be able to tell the
beneficial from the bad. Some agents are generated by software
such as Teleport Pro, an application that allows people to
download a full "mirror" of your site onto their hard drives for
viewing later on, or sometimes for more insidious purposes such as
plagiarism. If you have a large or image-heavy site, the practice
of web site stripping could also have a serious impact on your
bandwidth usage each month.
Banning spiders and agents
If you notice entries like Teleport Pro and WebStripper in your
traffic reports, someone's been busy attempting to download your
web site. You don't have to just sit back and let this happen. If
you are commercially hosted, you'll be able to add a couple of
lines to your robots.txt file to prevent repeat offenders from
stripping your site.
The robots.txt file gives search engine spiders and agents
direction by informing them what directories and files they are
allowed to examine and retrieve. These rules are called The Robots
Exclusion Standard.
To prevent certain agents and spiders from accessing any part of
your web site, simply enter the following lines into the
robots.txt file:
User-agent: NameOfAgent
Disallow: /
Ensure that you enter the name of the agent exactly as it appeared
in your reports/logs e.g. Teleport Pro/1.29 and that there is a
separate entry for each agent. Skip a line between entries. You
could do the same to exclude search engine spiders, but somehow I
don't think you'll really want to do this :0). The "/" in the
above example means disallow access to any directory. You can also
disallow access by spiders and agents to certain directories e.g.
User-agent: *
Disallow: /cgi-bin/
In this example the asterisk (wildcard) indicates "all". Don't use
the asterisk in the Disallow statement to indicate "all"; use the
forward slash instead.
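To see how rules like these are read, here is a minimal sketch using Python's standard urllib.robotparser module. The rule set, agent names and page paths are illustrative assumptions, not taken from a real site, and real spiders may interpret a file somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: ban one site-stripping agent entirely and keep
# every other agent out of /cgi-bin/ only.
RULES = """\
User-agent: WebStripper
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

def build_parser(rules_text):
    """Parse robots.txt text without fetching anything from the network."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser

parser = build_parser(RULES)

print(parser.can_fetch("WebStripper", "/index.html"))     # False
print(parser.can_fetch("Googlebot", "/index.html"))       # True
print(parser.can_fetch("Googlebot", "/cgi-bin/form.pl"))  # False
```

Note that this particular parser only compares the part of the visiting agent's name that comes before any "/", which is a detail of the Python library rather than a rule every spider follows.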
If you don't have a robots.txt file, create one in Notepad and
upload it to the docs directory (or the root of whichever
directory your web pages are stored in). Never use a blank
robots.txt file as some search engines may see this as an
indication that you don't want your site spidered at all! Have at
least one entry in the file.
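The advice above (a separate entry for each agent, a blank line between entries, and never an empty file) can be sketched as a small Python helper. The function name build_robots_txt and the agent names passed to it are hypothetical; substitute the names found in your own logs.

```python
def build_robots_txt(banned_agents, blocked_dirs=("/cgi-bin/",)):
    """Return robots.txt text with one entry per banned agent.

    A final "User-agent: *" entry is always written, so the file is
    never blank even when no agents or directories are supplied.
    """
    entries = []
    for agent in banned_agents:
        # "/" bars the named agent from every directory.
        entries.append(f"User-agent: {agent}\nDisallow: /")

    general = ["User-agent: *"]
    if blocked_dirs:
        general.extend(f"Disallow: {path}" for path in blocked_dirs)
    else:
        # An empty Disallow value means nothing is off limits.
        general.append("Disallow:")
    entries.append("\n".join(general))

    # Entries are separated by a single blank line.
    return "\n\n".join(entries) + "\n"

print(build_robots_txt(["Teleport Pro", "WebStripper"]))
```

The returned text can then be saved as robots.txt and uploaded in the usual way.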
Unfortunately, defining web stripper agents and spiders in your
robots.txt file won't work in all cases as some mirroring software
applications have the ability to mimic web browser identifiers;
but at least it's some protection that may save you some valuable
bandwidth.
If you're not able to create a robots.txt file, which is usually
the case if you are hosted by a free hosting service, this article
may be useful:
http://www.tamingthebeast.net/articles/robotswhoread.htm
Search engine spider identification
The following is a basic listing of search engine spider names
and their "owners". This is by no means complete, as there are
many thousands of search engines on the Internet, but it covers
the more common beneficial spiders. Look for these in your traffic
reports or search for the names through your server logs to
discover which pages they have been spidering. You'll find that
many of the entries will also have accompanying numbers or letters
e.g. Googlebot/2.1 or Slurp.so/1.0.
| Spider name | Spider owner |
| Googlebot | Google.com |
| TeomaAgent | Teoma.com |
| Zyborg | Wisenut.com |
| Gulliver | NorthernLight.com |
| Architext spider | Excite.com |
| FAST-WebCrawler | FAST (AllTheWeb.com) |
| Slurp | Inktomi.com |
| Yahoo Slurp | Yahoo Web Search |
| Ask Jeeves | AskJeeves.com |
| ia_archiver | Alexa.com |
| Scooter | AltaVista.com |
| Mercator | AltaVista.com |
| crawler@fast | FAST (AllTheWeb.com) |
| Crawler | Crawler.de |
| InfoSeek sidewinder | InfoSeek.com |
| Lycos_Spider_(T-Rex) | Lycos.com |
| Fluffy the Spider | SearchHippo.com |
| Ultraseek | InfoSeek.com |
| MantraAgent | LookSmart.com |
| Moget | Goo.jp |
| T-H-U-N-D-E-R-S-T-O-N-E | Thunderstone.com |
| MuscatFerret | Euroferret.com |
| VoilaBot | Voila.fr |
| Sleek Spider | Search-info.com |
| KIT_Fireball | FireBall.de |
| WebCrawler | Webcrawler.com |
If you have spotted any significant activity from these spiders
in your reports or logs, there's a good chance that you'll be
listed on that particular search engine. But you'll need to be
patient; some search engines take up to 6 months to refresh their
databases!
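If you have access to raw server logs, searching them for spider names can be automated. The following is a rough sketch in Python that assumes the common Apache "combined" log format, where the request is the first quoted field and the agent name is the last. The sample log lines are made up for illustration, and the short SPIDER_NAMES list is just a handful of names from the table above.

```python
import re
from collections import defaultdict

# A few names from the table above; extend the list as needed.
SPIDER_NAMES = ["Googlebot", "Slurp", "ia_archiver", "Scooter"]

# Assumption: "combined" log format, e.g.
# host - - [date] "GET /page HTTP/1.0" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"\s*$'
)

def pages_by_spider(log_lines, spider_names=SPIDER_NAMES):
    """Map each spider name to the list of paths it requested."""
    found = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for name in spider_names:
            if name.lower() in agent:
                found[name].append(match.group("path"))
    return dict(found)

# Made-up sample entries, not real traffic.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2004:10:00:00 +0000] "GET /index.htm HTTP/1.0" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.2 - - [01/Jan/2004:10:05:00 +0000] "GET /articles/ HTTP/1.0" '
    '200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.3 - - [01/Jan/2004:10:07:00 +0000] "GET /robots.txt HTTP/1.0" '
    '200 64 "-" "Scooter/3.3"',
]

print(pages_by_spider(SAMPLE_LOG))
# {'Googlebot': ['/index.htm'], 'Scooter': ['/robots.txt']}
```

Matching on part of the agent name rather than the whole string means entries with version numbers, such as Googlebot/2.1, are still counted.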
Further learning resources:
Learn more about positioning in our
SE optimization tutorials section
Studying Web Traffic and Server Logs. What is a hit? What is a
visitor? What is a page view? Traffic statistics terminology and
methods of web site traffic reporting.
A basic tutorial on the use of Meta Tags in improving search
engine rankings. A solid set of meta-tags is an important
component of any overall promotion strategy.
Michael Bloch
Taming the Beast.net
http://www.tamingthebeast.net
Tutorials, web content, tools and software
Web Marketing, eCommerce & Development solutions.
____________________________
Copyright information.... This article is free for reproduction
but must be reproduced in its entirety & this copyright statement
must be included. Visit
http://www.tamingthebeast.net to view great articles,
tutorials and tools for site owners, web developers and Internet
marketers! Subscribe for free to our popular ecommerce/web design
ezine!