Limiting what search engines can index using /robots.txt

Search engines such as Google use programs generally known as "spiders" or "robots" to continuously crawl the web, indexing content for inclusion in their search databases. While most site owners view inclusion in search engine listings in a positive light, and high search engine rankings can translate into significant revenue for commercial sites, not everyone wants every single page and file stored on their account to be publicly available through web searches.

This is where /robots.txt comes in. Most search engine robots will honor a webmaster's or site owner's wishes regarding excluded content by following the Robots Exclusion Standard, which is implemented via a small ASCII text file named robots.txt placed in the root web-accessible directory of a given domain.

When a compliant robot visits a given site, the first thing it does is check the top-level directory for the presence of a file named "robots.txt". If the file is found, the robot reads the directives telling it which content it may or may not visit and index, and in most cases honors them.
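The check a compliant robot performs can be sketched with Python's standard urllib.robotparser module. The domain "yourdomain.com" is the same placeholder used later in this article, and the two rules are illustrative:

```python
# Sketch: evaluating robots.txt rules the way a compliant robot would,
# using Python's standard urllib.robotparser module.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# A real crawler would fetch the live file with rp.set_url(...) and
# rp.read(); here we parse the rules directly to keep the example
# self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://yourdomain.com/index.html"))      # True
print(rp.can_fetch("*", "http://yourdomain.com/private/a.html"))  # False
```

The same module can be pointed at any live site to test an existing robots.txt before relying on it.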


Creating /robots.txt files

To create a /robots.txt file, simply open a plain text editor such as Windows Notepad, type or paste your directives, and save the file under the name "robots.txt". The file should then be uploaded to the /public_html directory so that its URL will be http://yourdomain.com/robots.txt
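The file can also be generated with a short script rather than by hand. A minimal sketch, assuming a couple of illustrative paths to exclude (the resulting file would still need to be uploaded to /public_html as described above):

```python
# Sketch: generating a minimal robots.txt from a list of paths to exclude.
# The excluded paths below are illustrative placeholders.
excluded = ["/cgi-bin/", "/private/"]

lines = ["User-agent: *"] + [f"Disallow: {path}" for path in excluded]
content = "\n".join(lines) + "\n"

# Write the file as plain ASCII text, as the standard expects.
with open("robots.txt", "w", encoding="ascii") as f:
    f.write(content)

print(content)
```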


The /robots.txt syntax

All valid /robots.txt files must contain at least two lines in the following format:

User-agent: [robot name, or * for all robots]
Disallow: [name of the file or directory you do not want indexed]

Unless one wishes to implement different rules for specific robots, the User-agent line should simply contain an asterisk (*), a wildcard read as "these rules apply to all robots".

Disallow lines specify the files or folders one does not wish to have indexed by search engines. Each file or folder to be excluded must be listed separately on its own line, and wildcards are not supported in Disallow directives under the original standard (although some crawlers, such as Googlebot, honor them as an extension). One can have as many or as few Disallow lines as necessary.


Example of /robots.txt files

- A simple /robots.txt file which would allow all robots to access and index all content with the exception of the contents of a directory named "private" would be as follows:

User-agent: *
Disallow: /private/

- A /robots.txt file which would exclude all robots from indexing the content of "cgi-bin", "admin" and "stuff" directories plus a page named "private.html" would be:

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /stuff/
Disallow: /private.html

- A /robots.txt file which would allow all robots to access and index all content on a given site would be:

User-agent: *
Disallow:

- A /robots.txt file which would forbid all robots from accessing and indexing any content would be:

User-agent: *
Disallow: /

- A robots.txt file which would allow Google's spider (a.k.a. Googlebot) to index all content with the exception of files stored under a folder named "private", and which would exclude all other robots from indexing any content, would read as follows:

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
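The per-robot example above can be verified with the same standard urllib.robotparser module shown earlier; "SomeOtherBot" is a hypothetical robot name used only to exercise the catch-all rule:

```python
# Sketch: how per-robot rules are applied. Googlebot matches its own
# section; any other robot falls through to the "User-agent: *" section.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Googlebot", "http://yourdomain.com/page.html"))      # True
print(rp.can_fetch("Googlebot", "http://yourdomain.com/private/x.html")) # False
print(rp.can_fetch("SomeOtherBot", "http://yourdomain.com/page.html"))   # False
```

Note that a robot reads only the first section whose User-agent line matches it, which is why the Googlebot section must carry its own Disallow rules rather than inheriting from the * section.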

- A robots.txt file which would allow all robots, with the exception of HotBot's spider (a.k.a. Inktomi Slurp), to index all content except files stored under folders named "images" and "cgi-bin", and which would exclude the HotBot spider from indexing any content, would read as follows:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Slurp
Disallow: /


More Information

For more details on /robots.txt and the Robots Exclusion Standard, please visit The Web Robots Pages at http://www.robotstxt.org
