Friday, 13 June 2014

How to write a Robots.txt file easily – 7 Golden Phases to Follow

The robots.txt file is a mystery for many bloggers, me included. But nothing in this world stays a mystery once you explore it properly. If you are worried about how to write a robots.txt file, don’t panic. It’s as simple as writing a blog post or editing an existing article. All you have to know is which command is used for which action. Robots (also called spiders) crawl our sites for everything they can reach: article pages, the admin panel, tags, archives and what not. They index whatever is visible and accessible to them, so it is very important to stop them from indexing everything on our website, just as we stop strangers from hanging out in our apartments.

The /robots.txt file of any site is located at www.domain-name.com/robots.txt. For example, www.seosiren.com/robots.txt. The convention behind the file is known as the Robots Exclusion Protocol. Whenever a well-behaved robot visits your website, it reads the /robots.txt page first and only then visits the other pages for indexing.
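To make the location rule concrete, here is a minimal Python sketch that works out the /robots.txt address from any page URL. The function name robots_url and the sample post URL are made up for this illustration; only the standard urllib.parse module is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap in /robots.txt, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical post URL on the example domain used in this article.
print(robots_url("http://www.seosiren.com/2014/06/some-post.html"))
# prints http://www.seosiren.com/robots.txt
```

Whatever page you start from, the file is always looked up at the root of the domain, never inside a subfolder.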

How to Write a Robots.txt File Easily: 7 Phases

Today we will see how to stop search engine spiders from crawling our site for unwanted stuff. You should know the 7 golden phases of writing a /robots.txt file, and you should go through the basic and advanced commands at least once. You won’t edit the file daily: once your commands are in place you will rarely touch it again, though you can of course change it whenever you need to. Let’s see the most important commands and phases for writing a successful /robots.txt file.

Phase 1: Differences between * and / entries

Before writing a successful /robots.txt file, you should know the basic commands and their usage. The first thing you need to know about /robots.txt is the User-agent command. Next comes the Disallow command. Both are explained below.
User-agent: *
Disallow:
Here, User-agent: * means that the section applies to all robots. * is called the wildcard, which means all. The Disallow command lists what the robots may not access. Here it is left empty, so nothing is off limits and the robots can crawl and index anywhere they want.
User-agent: *
Disallow: /
Disallow: / here means that the robots are not allowed to crawl anything. So now you see the difference? With User-agent: *, an empty Disallow lets every robot index everything, while Disallow: / tells every robot not to index anything!
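If you want to check this yourself, here is a small Python sketch that feeds both versions to the standard library's urllib.robotparser. The helper name is_allowed, the bot name AnyBot and the post URL are made up for the example, and real search engines may parse some details differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty Disallow: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # "/" blocks the whole site


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical bot name and post URL, used only for this check.
url = "http://www.seosiren.com/some-post.html"
print(is_allowed(ALLOW_ALL, "AnyBot", url))  # True
print(is_allowed(BLOCK_ALL, "AnyBot", url))  # False
```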

Phase 2: Advanced commands in Robots.txt file

Now that we have found the difference between * and /, it’s time to learn a little more about the advanced commands in the /robots.txt file. Starting with User-agent and Disallow, we will build a few commands for banning unwanted robots from accessing our site.
User-agent: *
Disallow: /cgi-bin/
The above command means that no robot is allowed to index anything in the cgi-bin folder. So if the cgi-bin folder has subfolders and pages like cgi-bin/newsite.cgi or cgi-bin/example/idontknow.cgi, they won’t be indexed or accessed by robots.
And if you want to restrict one particular robot, mention its name to stop it from indexing your site.
User-agent: Googlebot-Image
Disallow: /
In the above example, we are stopping the Google image search bot from indexing our site for images. Here, Googlebot-Image is the robot we are trying to ban from our site. With this entry in /robots.txt, Googlebot-Image shouldn’t index any file in the root directory “/” or in any of its subfolders, so it won’t index anything from your site. This bot is usually used to scan for pictures to show in Google Images search.
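Both commands from this phase can be tried out with Python's standard urllib.robotparser, as in the rough sketch below. The helper name load_rules, the bot names AnyBot and Bingbot as written here, and the photo.jpg and index.html paths are assumptions made up for the test. Python's parser is only an approximation of how a real search engine reads the file.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of downloading it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


site = "http://www.seosiren.com"

# Folder rule: everything under /cgi-bin/ is closed to all robots.
folder = load_rules("User-agent: *\nDisallow: /cgi-bin/\n")
print(folder.can_fetch("AnyBot", site + "/cgi-bin/example/idontknow.cgi"))  # False
print(folder.can_fetch("AnyBot", site + "/index.html"))                     # True

# Bot rule: only Googlebot-Image is banned, other robots are unaffected.
bot = load_rules("User-agent: Googlebot-Image\nDisallow: /\n")
print(bot.can_fetch("Googlebot-Image", site + "/images/photo.jpg"))  # False
print(bot.can_fetch("Bingbot", site + "/images/photo.jpg"))          # True
```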

Phase 3: Difference between /something/ and /something

Here we will see how to restrict the different files, folders or places whose exposure could harm your site’s health.
User-agent:  *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
The above /robots.txt commands tell robots that nothing in the cgi-bin directory is accessible to any bot. Similarly, the wp-admin, wp-content and wp-includes directories are closed to robots.
Also note a very important point about the “/” usage. If you want to mention a directory or folder of your site, its path has to start and end with “/” in the /robots.txt file. For example,
User-agent:*
Disallow: /cgi-bin/
This tells the robots that cgi-bin is a directory: only paths that begin with /cgi-bin/ are blocked. And
User-agent:*
Disallow: /cgi-bin
This tells the robots to treat cgi-bin not as a directory but as a plain path prefix. The rule then blocks every path that begins with /cgi-bin, so a file such as cgi-bin.html is blocked along with the directory itself. So avoid the mistake of missing the “/” at the beginning or end of a directory.
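The effect of the trailing “/” is easy to see with Python's standard urllib.robotparser, which compares each rule against the start of the requested path. This is only a sketch: the helper name blocked and the bot name AnyBot are made up, and it shows how Python's parser behaves, not a guarantee for every crawler.

```python
from urllib.robotparser import RobotFileParser

WITH_SLASH = "User-agent: *\nDisallow: /cgi-bin/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /cgi-bin\n"


def blocked(rules: str, path: str) -> bool:
    """True when the rules stop a robot from fetching the given path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch("AnyBot", "http://www.seosiren.com" + path)


for path in ("/cgi-bin/newsite.cgi", "/cgi-bin.html"):
    print(path, blocked(WITH_SLASH, path), blocked(WITHOUT_SLASH, path))
# /cgi-bin/newsite.cgi True True
# /cgi-bin.html False True
```

With the trailing slash only the contents of the folder are blocked. Without it, anything whose path merely starts with /cgi-bin is blocked too.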

Phase 4: How to restrict unwanted images

If you don’t want the Google bot to index a specific picture, you can restrict that too.
User-agent: Googlebot-Image
Disallow: /images/adsense.jpg
With the above command, you stop Googlebot-Image from indexing the adsense.jpg picture.
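Here is a quick Python check of that rule using the standard urllib.robotparser. The second picture, logo.jpg, is a made-up file name added only to show that other images stay open. The last line reflects how Python's parser matches bot names, which is an approximation of real crawlers.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot-Image
Disallow: /images/adsense.jpg
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "http://www.seosiren.com"
# The named picture is closed to the image bot.
print(parser.can_fetch("Googlebot-Image", site + "/images/adsense.jpg"))  # False
# Any other picture (hypothetical logo.jpg) is still open to it.
print(parser.can_fetch("Googlebot-Image", site + "/images/logo.jpg"))     # True
# The rule is scoped to Googlebot-Image, so other bots are not affected.
print(parser.can_fetch("Googlebot", site + "/images/adsense.jpg"))        # True
```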

Phase 5: How to restrict unwanted pages

Just like the above command, you can also restrict particular pages in your /robots.txt file.
User-agent: *
Disallow: /seosiren/adsense.html
Disallow: /seosiren/applications.html
Disallow: /seosiren/secret.html
The above commands tell the robots not to index or crawl the pages mentioned. Here /seosiren/ is the directory, and adsense.html, applications.html and secret.html are pages inside it. Only these three pages are restricted; any other page in /seosiren/ can still be indexed.
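To see which pages the three rules really cover, you can run them through Python's standard urllib.robotparser, as in this small sketch. The page about.html, the bot name AnyBot and the helper name crawlable are hypothetical, added only to test a path that is not listed in the rules.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /seosiren/adsense.html
Disallow: /seosiren/applications.html
Disallow: /seosiren/secret.html
"""


def crawlable(paths):
    """Map each path to True/False depending on whether robots may fetch it."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return {p: parser.can_fetch("AnyBot", "http://www.seosiren.com" + p) for p in paths}


# about.html is a made-up page that is not listed in the rules.
result = crawlable(["/seosiren/secret.html", "/seosiren/about.html", "/seosiren/"])
print(result)
# {'/seosiren/secret.html': False, '/seosiren/about.html': True, '/seosiren/': True}
```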

Phase 6: What is a perfect /robots.txt file layout?

Your /robots.txt file should look something like this:
Sitemap: http://www.seosiren.com/sitemap.xml
User-agent:  *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /recommended/
Disallow: /comments/feed/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
User-agent: Mediapartners-Google*
Allow: /
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
In the above /robots.txt file, we are keeping robots from crawling or indexing the most important directories and files.
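Before uploading a layout like this, it can help to test it. The sketch below loads the same file into Python's standard urllib.robotparser and asks a few questions. The bot name SomeBot and the paths options.php, my-post.html and photo.jpg are made up for the test, site_maps() needs Python 3.8 or newer, and real search engines may resolve overlapping rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

LAYOUT = """\
Sitemap: http://www.seosiren.com/sitemap.xml
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /recommended/
Disallow: /comments/feed/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
User-agent: Mediapartners-Google*
Allow: /
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
"""

parser = RobotFileParser()
parser.parse(LAYOUT.splitlines())
site = "http://www.seosiren.com"

print(parser.site_maps())  # ['http://www.seosiren.com/sitemap.xml']
# A generic bot falls under the "*" section.
print(parser.can_fetch("SomeBot", site + "/wp-admin/options.php"))  # False
print(parser.can_fetch("SomeBot", site + "/my-post.html"))          # True
# Googlebot-Image has its own section and may fetch uploaded pictures.
print(parser.can_fetch("Googlebot-Image", site + "/wp-content/uploads/photo.jpg"))  # True
```

One thing this makes visible: in Python's parser, a bot that has its own section follows only that section, so the Disallow lines under * do not apply to it.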

Phase 7: Knock Off!

If you are still confused about the /robots.txt file after reading this post, I would suggest you knock off the /robots.txt file from your friends’ or competitors’ websites. Hehe! That’s what you can do when things are not instantly clear. Any site’s file will end up looking much like www.seosiren.com/robots.txt, and I don’t even mind if you knock off my own /robots.txt file.

This is how we can write a robots.txt file easily. It is always better to stop the bots from indexing unwanted files and directories. Moreover, Google may start treating your site as spam if it finds the same article title or content under more than one address. So it is better to keep all that unwanted stuff from being indexed. Otherwise you may not be lucky enough to survive the Google Panda and Penguin updates.

And if you feel that your site has already been messed up by unwanted tags, archives and duplicate issues, please don’t worry. My next article will cover how to remove the unwanted and unnecessary stuff from Google and from your website. I hope you liked this article. Please ask me your questions if you are unsure about any of the commands. I’m always ready to help you out.
