Implementing a robots.txt file is a good example of where a little knowledge can be dangerous. Many people believe that a robots.txt file can conclusively stop search engines from indexing their pages. Not so…
So what is a robots.txt file anyway?
Robots.txt is a small text file that lives in your website’s root directory. It tells “bots” or “web crawlers” which pages or directories they may visit and which they may not. If you tell a bot it is okay to crawl a page, it most likely will. It then collects and analyzes the information on your page and feeds this data into the search engine’s algorithms. This may result in the search engine listing your page in its search results, and it helps determine your ranking within those results. You can read more about Search Engine Optimisation here: How do I get good search engine results?
How does a robots.txt file work?
What you put in your robots.txt file is based on the “Robots Exclusion Protocol” (REP), which was first introduced in 1994. The protocol uses directives to instruct bots (typically search engine robots) on how to crawl pages on your website. Below you will find some of the most common directives.
If you want to block all web crawlers ( * ) from all your content ( / ) use the following:
User-agent: *
Disallow: /
You can also block a specific bot ( Googlebot ) from a specific directory:
User-agent: Googlebot
Disallow: /blocked-directory/
Or you can block a specific bot from a specific web page:
User-agent: Googlebot
Disallow: /directory/blocked-page.html
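To see how a well-behaved bot would read the rules above, you can feed them to a parser. The sketch below uses Python’s standard urllib.robotparser module. The example.com domain, the file names and the bot name OtherBot are placeholders that do not come from this article, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The two Googlebot examples from the article, combined into one group.
GOOGLEBOT_RULES = """\
User-agent: Googlebot
Disallow: /blocked-directory/
Disallow: /directory/blocked-page.html
"""

# The "block all crawlers from all content" example.
BLOCK_ALL_RULES = """\
User-agent: *
Disallow: /
"""

BASE = "https://example.com"  # placeholder domain, not from the article

googlebot_rules = build_parser(GOOGLEBOT_RULES)
block_all = build_parser(BLOCK_ALL_RULES)

# Googlebot is asked to stay out of the directory and the single page.
print(googlebot_rules.can_fetch("Googlebot", BASE + "/blocked-directory/a.html"))       # False
print(googlebot_rules.can_fetch("Googlebot", BASE + "/directory/blocked-page.html"))    # False

# Other pages, and bots the rules do not name, are not restricted.
print(googlebot_rules.can_fetch("Googlebot", BASE + "/directory/other-page.html"))      # True
print(googlebot_rules.can_fetch("OtherBot", BASE + "/blocked-directory/a.html"))        # True

# "User-agent: *" with "Disallow: /" asks every bot to stay away from everything.
print(block_all.can_fetch("OtherBot", BASE + "/index.html"))                            # False
```

Keep in mind that this only shows what a polite bot would decide for itself. Nothing in the file enforces these answers.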
Do you recommend using robots.txt?
Not really. Having said that, you do need something, but there are better ways…
What is wrong with using robots.txt?
The obvious problem with a robots.txt file is that it gives the impression that it keeps your page or directory out of the search engine’s results. However, a search engine can find your page in many other ways and subsequently list it in its search results. For example, when an external site links directly to your blocked page, the search engine can learn about the page from that link and list its URL without ever crawling it.
Another issue that is often overlooked is that robots.txt files are publicly accessible. In other words, everyone with Internet access can look at your robots.txt file. This can be problematic because your blocked content is often blocked for a reason. Security, for one… So a robots.txt file may show potential hackers exactly where you don’t want them to look, as if that will stop them! You always need to secure your sensitive files with real security measures. See BulletProof Pro.
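To illustrate how little effort it takes to read a robots.txt file as a list of “interesting” places, here is a minimal Python sketch. The function name disallowed_paths and the sample paths are hypothetical and only serve the illustration. The sketch works on a string, so it makes no network requests.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path found in robots.txt content."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt content; these paths are made up for the example.
SAMPLE = """\
User-agent: *
Disallow: /admin/      # hypothetical path
Disallow: /backups/    # hypothetical path
"""

print(disallowed_paths(SAMPLE))  # ['/admin/', '/backups/']
```

Anyone can do the same by simply opening the file in a browser, which is why the listed paths still need real access controls.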
What is the industry’s best practice in blocking pages?
Using “meta tags” for each of your pages is the current practice throughout the industry. Simply adding a meta tag in the <HEAD> of your HTML code will instruct the web crawler to:
- either index your page or not index your page, and
- either follow all links on your page or don’t follow any links on your page.
Here are some combos that speak for themselves…
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
The INDEX and NOINDEX values tell the bots whether or not to index the page. The FOLLOW and NOFOLLOW values tell them whether or not to follow the links on that page.
Don’t think that the NOFOLLOW value is synonymous with the rel="nofollow" link attribute. The latter is set on an individual <a> link tag and applies only to that one link, whereas the meta tag value applies to every link on the page.
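The difference between the page-level meta tag and the link-level attribute can be sketched with Python’s standard html.parser module. The class name RobotsHintParser, the helper robots_hints and the sample page are all made up for this illustration, and the sketch only reports what the markup asks for. It does not predict what any particular search engine will do.

```python
from html.parser import HTMLParser


class RobotsHintParser(HTMLParser):
    """Collects page-level robots directives and link-level rel="nofollow" links."""

    def __init__(self) -> None:
        super().__init__()
        self.page_directives: set[str] = set()
        self.nofollow_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attributes.get("name", "").lower() == "robots":
            # Page-level: applies to the whole page and all of its links.
            for directive in attributes.get("content", "").split(","):
                if directive.strip():
                    self.page_directives.add(directive.strip().lower())
        elif tag == "a":
            # Link-level: applies only to this one link.
            rel_values = attributes.get("rel", "").lower().split()
            if "nofollow" in rel_values and "href" in attributes:
                self.nofollow_links.append(attributes["href"])


def robots_hints(html: str) -> RobotsHintParser:
    """Parse an HTML string and return the parser holding the collected hints."""
    parser = RobotsHintParser()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical page: the meta tag says FOLLOW, but one link opts out on its own.
SAMPLE_PAGE = """\
<html>
<head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head>
<body>
<a href="/about.html">About</a>
<a href="https://example.com/ad" rel="nofollow">Sponsored link</a>
</body>
</html>
"""

hints = robots_hints(SAMPLE_PAGE)
print(sorted(hints.page_directives))  # ['follow', 'noindex']
print(hints.nofollow_links)           # ['https://example.com/ad']
```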
Note that not all bots are “gentlemen”. For example, malicious bots used to spread malware will simply ignore your meta tags, so this is not a security measure. A bot can also only read a meta tag on a page it is allowed to crawl, so a page that relies on “NOINDEX” should not be blocked in robots.txt as well. Furthermore, using “NOFOLLOW” on a page doesn’t mean the pages it links to will never be crawled, as other pages without “NOFOLLOW” may link to those same pages. Are you following me?
To make things easier, WordPress users can simply install a plugin that looks after all these issues and more. Yoast SEO comes to mind...