The robots.txt
file is a mystery for many bloggers, just as it once was for me. But the fact is,
nothing in this world is a mystery once you explore it completely. If
you are worried about how to write a robots.txt file easily, don't
panic. It's just as simple as writing a blog post or editing an
existing article. All you have to know is which command is used for which
action. Usually the robots/spiders crawl our site for many things, be
it the article pages, our admin panel, tags, archives, what not. They
just index whatever is visible and accessible to them. So it is very
important to restrict them from indexing everything on our website,
just as we restrict strangers from hanging out in our apartments.
Any site's /robots.txt file is located at the root, i.e. www.domain-name.com/robots.txt. For example, www.seosiren.com/robots.txt.
The robots.txt file is also known as the Robots Exclusion Protocol. So
whenever a robot visits your website, it should first fetch the
/robots.txt page, and only then visit the other pages for indexing.
How to Write a Robots.txt File Easily: 7 Phases
Today
we will check out how we can restrict search engine spiders from crawling
our site for unwanted stuff. You should know the 7 golden phases to
write a /robots.txt file. And you should also learn the basic and advanced
commands at least once, because you won't edit the file daily.
Once you are done with your commands, you may never touch them again
(just saying), though you can obviously edit the file whenever you want.
Let's see the most important commands and phases
to write a successful /robots.txt file.
Phase 1: Differences between * and / entries
Before writing a successful /robots.txt file, you should know the basic
commands and their usage. The first thing you need to know about
/robots.txt is the User-agent command. Next comes the Disallow command,
which is explained below.
User-agent: *
Disallow:
Here, User-agent: * means
that the section applies to all robots. * is called the
wildcard, which usually means all. Coming to the Disallow command: here
it is left empty, which tells the robots that nothing is disallowed.
In other words, every robot may crawl and index the whole site.
User-agent: *
Disallow: /
Disallow: / here, on the other hand, means that the robots are not allowed
to crawl anything at all. So now you see the difference: with an empty
Disallow the robots may index everything, while with Disallow: / they
should index nothing!
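If you want to sanity-check these two rule sets without waiting for a real crawler, Python's standard-library urllib.robotparser applies the same logic. A small sketch (the bot name and URL are placeholders of mine):

```python
from urllib.robotparser import RobotFileParser

# Rule set 1: empty Disallow -> nothing is blocked.
allow_all = RobotFileParser()
allow_all.parse("User-agent: *\nDisallow:\n".splitlines())

# Rule set 2: Disallow: / -> everything is blocked.
block_all = RobotFileParser()
block_all.parse("User-agent: *\nDisallow: /\n".splitlines())

url = "http://www.example.com/some-post.html"
print(allow_all.can_fetch("AnyBot", url))  # True  (may crawl)
print(block_all.can_fetch("AnyBot", url))  # False (may not crawl)
```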
Phase 2: Advanced commands in the Robots.txt file
Now
that we know the difference between * and /, it's time to learn a
little more about the advanced commands in the /robots.txt file. Starting
with User-agent and Disallow, we will derive a few commands for
banning unwanted robots from accessing our site.
User-agent: *
Disallow: /cgi-bin/
The
above command means that no robot is allowed to index
anything in the cgi-bin folder. Which means, if the cgi-bin folder has
subfolders and pages like cgi-bin/newsite.cgi or
cgi-bin/example/idontknow.cgi, they won't be indexed or accessed by
robots.
And if you want to restrict a particular robot, mention that robot's name to stop it from indexing your site.
User-agent: Googlebot-Image
Disallow: /
In
the above example, we are restricting Google's image search bot from
indexing our site for images. Here, Googlebot-Image is the robot we
are trying to ban from our site. Without permission from
/robots.txt, Googlebot-Image won't index any file under the root
directory "/" or any of its subfolders. This bot is usually used to
scan for pictures to show them in Google Images search.
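You can verify that this rule bans only the named bot and leaves everyone else alone. Another small urllib.robotparser sketch (the other bot name and the URL are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse("User-agent: Googlebot-Image\nDisallow: /\n".splitlines())

url = "http://www.example.com/photos/header.jpg"
# The named bot is banned everywhere on the site...
print(rules.can_fetch("Googlebot-Image", url))  # False
# ...but a bot we did not name is unaffected.
print(rules.can_fetch("SomeOtherBot", url))     # True
```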
Phase 3: Difference between /something/ and /something
Here we will see how we can restrict different files, folders, or places whose indexing can harm your site's health.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes
The
above /robots.txt commands tell robots that nothing in the cgi-bin
directory is accessible to any bot. Similarly, the wp-admin,
wp-content, and wp-includes directories are off limits to the
robots.
Also,
you have to note a very important point about the "/" usage. If you
want to mention a directory or folder on your site, it has to
start and end with "/" in the /robots.txt file. For example,
User-agent: *
Disallow: /cgi-bin/
This will tell the robots that cgi-bin is a directory. And
User-agent: *
Disallow: /cgi-bin
This
tells the robots to treat cgi-bin as a path prefix rather than just a
directory: it blocks the /cgi-bin/ directory and also any file whose
path starts with /cgi-bin, such as /cgi-bin.html. So avoid the
mistake of missing the trailing "/" when you mean only the directory.
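Crawlers match Disallow values as path prefixes, and Python's urllib.robotparser behaves the same way, so you can see the trailing-slash difference directly (the example paths are mine):

```python
from urllib.robotparser import RobotFileParser

with_slash = RobotFileParser()
with_slash.parse("User-agent: *\nDisallow: /cgi-bin/\n".splitlines())

without_slash = RobotFileParser()
without_slash.parse("User-agent: *\nDisallow: /cgi-bin\n".splitlines())

# /cgi-bin/ blocks only paths inside the directory...
print(with_slash.can_fetch("TestBot", "http://example.com/cgi-bin/run.cgi"))  # False
print(with_slash.can_fetch("TestBot", "http://example.com/cgi-bin.html"))     # True
# ...while /cgi-bin blocks anything starting with that prefix.
print(without_slash.can_fetch("TestBot", "http://example.com/cgi-bin.html"))  # False
```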
Phase 4: How to restrict unwanted images
If you don't want the Google image bot to index a specific picture, you can restrict it too.
User-agent: Googlebot-Image
Disallow: /images/adsense.jpg
Using the above command, you restrict Googlebot-Image from indexing the adsense.jpg picture.
Phase 5: How to restrict unwanted pages
Similar to the above command, you can also restrict particular pages in your /robots.txt file.
User-agent: *
Disallow: /seosiren/adsense.html
Disallow: /seosiren/applications.html
Disallow: /seosiren/secret.html
The
above commands tell the robots not to index or crawl the above
mentioned pages. Here /seosiren/ is the directory, and adsense.html,
applications.html, and secret.html are the pages inside it. So we are
restricting those pages under /seosiren/ from being indexed.
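A quick way to confirm that only the listed pages are blocked, again with urllib.robotparser (the seosiren paths come from the example above; public.html is a made-up page for contrast):

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse("""User-agent: *
Disallow: /seosiren/adsense.html
Disallow: /seosiren/applications.html
Disallow: /seosiren/secret.html
""".splitlines())

print(rules.can_fetch("TestBot", "http://example.com/seosiren/secret.html"))  # False
print(rules.can_fetch("TestBot", "http://example.com/seosiren/public.html"))  # True
```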
Phase 6: What is a perfect /robots.txt layout file?
Your /robots.txt file should look something like this:
Sitemap: http://www.seosiren.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /recommended/
Disallow: /comments/feed/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /
In
the above /robots.txt file, we are restricting the most important
directories and files from being indexed or crawled by robots, while
explicitly allowing the Google ads, image, and mobile bots where needed.
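Before uploading a file like this, it is worth feeding it through a parser to confirm the Allow and Disallow sections do what you expect. A sketch with Python's urllib.robotparser (which understands Allow lines), using an abridged version of the layout above with a made-up bot name:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
Sitemap: http://www.seosiren.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Ordinary bots fall under the * section and are kept out of wp-admin...
print(rules.can_fetch("SomeBot", "http://www.seosiren.com/wp-admin/index.php"))  # False
# ...but may still crawl regular posts.
print(rules.can_fetch("SomeBot", "http://www.seosiren.com/my-post.html"))        # True
# Googlebot-Image matches its own section and may read the uploads folder.
print(rules.can_fetch("Googlebot-Image",
                      "http://www.seosiren.com/wp-content/uploads/pic.jpg"))     # True
```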
Phase 7: Knock Off!
If
you are not clear or still confused about the /robots.txt file after
reading this post, I would suggest you "knock off" the /robots.txt file
from your friends' or competitors' websites. Hehe! That's what you can do
when things aren't instantly clear. Any site's file is available at its
root, for example www.seosiren.com/robots.txt, and I don't even mind if you knock off my own /robots.txt file.
This is how we can write a robots.txt file easily.
And it's always better to restrict the bots from indexing unwanted files
and directories. Moreover, Google may start treating your site as
spam if it finds the same article indexed under more than one title or
URL. So it's better to block all that unwanted stuff from being indexed;
otherwise you won't be lucky enough to survive the Google Panda and
Penguin updates.
And
if you feel that your site has already been messed up with unwanted
tags, archives, and duplicate-content issues, please don't worry. My next
article will be about how to remove the unwanted and non-required stuff
from Google and your website. I hope you liked this article. Please ask
me your queries if you feel uncomfortable with any of the commands. I'm
always ready to help you out.