Use a Robots.txt File to Focus Crawlers on Your Most Important Content
Robots.txt is a plain-text file in your website's root directory that tells web crawlers (also called spiders or bots) which webpages they may and may not crawl. Well-behaved search engine spiders (like those used by Google, Yahoo, and MSN) obey the directions found in robots.txt files and skip webpages that are marked as off-limits.
If your website contains any webpages you don't want indexed in search engines (such as private pages, printer-friendly duplicate pages, or test pages), upload a simple robots.txt file to direct attention away from the content you don't want search engines to notice. The Red Cross website, for example, uses a robots.txt file to request that search engines ignore test pages, forms, and administrative pages.

You can create a robots.txt file in a plain-text editor such as your computer's "Notepad" program (usually found under "Accessories"). Your file should start with the term "User-agent:" followed by the name of the "bot" (i.e., search engine crawler) that the directions apply to. If the directions apply to all bots (which is most often the case), indicate this with an asterisk (*). Next, add the term "Disallow:" followed by the path of any webpage or directory you do not want spiders to crawl. Add one "Disallow:" line per path, repeating as necessary.
For instance, your robots.txt file might specify:
User-agent: *
Disallow: /cgi-bin/
Disallow: /stats/
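To see how a rule-following crawler would interpret these directions, the short Python sketch below feeds the same three lines to the standard library's urllib.robotparser module. The bot name "ExampleBot" and the www.example.com addresses are placeholders chosen for illustration; they do not refer to a real crawler or site.

```python
from urllib.robotparser import RobotFileParser

# The same rules shown in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /stats/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the URLs are hypothetical placeholders.
stats_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/stats/report.html")
home_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html")
print("stats page allowed:", stats_ok)
print("home page allowed:", home_ok)
```

Because the rules apply to every bot (the asterisk), the page under /stats/ is reported as off-limits while the home page remains crawlable.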
If you have an XML sitemap (you should), use the robots.txt file to indicate its location. Search engine spiders will be more likely to notice your sitemap if it is listed in your robots.txt file. Simply add this line anywhere in the file, replacing "<sitemap URL>" with the full URL of your sitemap:
Sitemap: <sitemap URL>
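As a rough illustration, the Python sketch below assembles a complete robots.txt file, including the Sitemap line, and then reads the sitemap location back the way a parser would. The helper name build_robots_txt and the sitemap address are assumptions made for this example, and the site_maps() call requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths, sitemap_url):
    """Build robots.txt text that applies to all bots (hypothetical helper)."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


# The sitemap URL is a placeholder; use your own sitemap's full address.
robots_txt = build_robots_txt(
    ["/cgi-bin/", "/stats/"],
    "http://www.example.com/sitemap.xml",
)
print(robots_txt)

# Confirm that a parser can find the sitemap entry (Python 3.8+).
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
sitemaps = parser.site_maps()
print(sitemaps)
```

Writing the returned text to a file named robots.txt gives you something ready to upload.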
Once you've completed your robots.txt file, upload it to your server and save it in the root directory.
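Crawlers request the file from the top level of your site, which is why it belongs in the root directory. The small Python sketch below works out the address where the file is expected for any page on a site; the function name robots_txt_url and the example address are hypothetical. Opening the resulting address in a browser after uploading is a simple way to confirm the file is in the right place.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder address for illustration only.
location = robots_txt_url("http://www.example.com/blog/post.html?page=2")
print(location)
```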