The Robots Exclusion Protocol (REP) is implemented through robots.txt, a plain-text file used to instruct web robots (typically search engine crawlers) how to crawl the pages of a website. You need a robots.txt file only if your site includes content that you don't want search engines to crawl, such as an admin panel. The file tells search engine bots which pages they may crawl and which they should skip.
- Create a file, name it robots.txt, and add the rules you need from the examples below.
To block all web crawlers from all content:
User-agent: *
Disallow: /
To allow all robots complete access:
User-agent: *
Disallow:
To block particular folders from web crawlers:
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /admin/
To block particular files from web crawlers:
User-agent: *
Disallow: /folder/file.php
Disallow: /folder/file2.php
Disallow: /folder2/file3.php
To block Google's crawler from a specific folder:
User-agent: Googlebot
Disallow: /myfolder/
To block Bing's crawler from a specific folder:
User-agent: Bingbot
Disallow: /myfolder/
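To see how rules like the ones above are interpreted, one option is Python's standard-library urllib.robotparser module. The following is only a minimal sketch: the helper name is_allowed and the combined RULES text are assumptions made for illustration, assembled from the examples above. Individual search engines may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set combining the examples above.
RULES = """\
User-agent: Googlebot
Disallow: /myfolder/

User-agent: *
Disallow: /admin/
Disallow: /folder/file.php
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


site = "https://www.domain-name.com"
print(is_allowed(RULES, "Googlebot", site + "/myfolder/page.html"))  # False
print(is_allowed(RULES, "Googlebot", site + "/admin/"))              # True
print(is_allowed(RULES, "Bingbot", site + "/admin/login"))           # False
print(is_allowed(RULES, "Bingbot", site + "/index.html"))            # True
```

In this sketch, a crawler that matches a specific User-agent group follows only that group, which is why Googlebot is not affected by the rules listed under "User-agent: *".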
Points to remember when creating and saving a robots.txt file:
- File location: robots.txt must be placed in the root folder of your website, e.g. www.domain-name.com/robots.txt
- The file name is case-sensitive: it must be named "robots.txt" (not Robots.txt, robots.TXT, or any other capitalization).
- The robots.txt file is publicly available: just add '/robots.txt' to the end of any root domain to see whether that website has a robots.txt file. This means that anyone can see which pages you don't want crawled, so don't use this technique to hide private user information.
- Each subdomain on a root domain uses its own separate robots.txt file.
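The location and subdomain rules above can be sketched in Python with the standard-library urllib.parse module. The function name robots_txt_url is a hypothetical helper introduced here, and blog.domain-name.com is an assumed example subdomain. The sketch only derives where a crawler would look for the file; it does not fetch it.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, and drop query and fragment:
    # the file always lives at the root of the host, never in a subfolder.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.domain-name.com/folder/file.php?id=1"))
# https://www.domain-name.com/robots.txt
print(robots_txt_url("https://blog.domain-name.com/post"))
# https://blog.domain-name.com/robots.txt
```

Because the host name is part of the result, each subdomain resolves to its own robots.txt URL.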
You can also restrict crawlers with a robots meta tag placed in the page's <head>:
<meta name="robots" content="noindex">
The content value of the meta robots tag can be:
- Noindex - Tells a search engine not to index a page.
- Index - Tells a search engine to index a page. Note that you don't need to add this meta tag; it's the default.
- Follow - Even if the page isn't indexed, the crawler should follow all the links on the page and pass link equity to the linked pages.
- Nofollow - Tells a crawler not to follow any links on a page or pass along any link equity.
- Noimageindex - Tells a crawler not to index any images on a page.
- None - Equivalent to using both the noindex and nofollow values simultaneously.
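As an illustration of how a crawler might read these values, here is a minimal Python sketch using the standard-library html.parser module. The class RobotsMetaParser and the functions robots_directives and may_index are hypothetical names chosen for this example. It assumes that several values can be combined in one comma-separated content attribute, and it expands "none" into noindex plus nofollow as described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # Values are compared case-insensitively and split on commas.
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the robots meta values found in html, with 'none' expanded."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    directives = set(parser.directives)
    if "none" in directives:
        directives.discard("none")
        directives.update({"noindex", "nofollow"})
    return directives


def may_index(html: str) -> bool:
    """Indexing is the default, so only an explicit noindex forbids it."""
    return "noindex" not in robots_directives(html)


print(may_index('<meta name="robots" content="noindex">'))   # False
print(may_index('<meta name="robots" content="nofollow">'))  # True
print(may_index("<p>No robots meta tag here</p>"))           # True
```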