A basic guide on Robots.txt with Best Practices
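A robots.txt file tells search engine crawlers which pages of a website they may or may not crawl. Most search engines, including Google, Bing, and Yahoo, obey it. A basic file contains a User-agent line that names the crawler, followed by one or more Disallow lines that list the paths it should skip.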
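As a minimal sketch, a basic file could contain just two directives. The /private/ folder is a hypothetical path chosen for illustration.

```txt
# Applies to every crawler
User-agent: *
# Ask crawlers not to crawl anything under /private/ (hypothetical folder)
Disallow: /private/
```

Every other URL on the site stays open to crawling, because anything that is not disallowed is allowed by default.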
Importance of Robots.txt File
- Block nonpublic pages: There are certain pages that you don’t want crawlers to visit, such as a staging site or a login page. Robots.txt lets you ask crawlers to stay away from them. Keep in mind that it only controls crawling: it does not stop people from opening the page, and a blocked URL can still show up in search results if other sites link to it.
- Maximize crawl budget: A search engine crawls only a limited number of pages on your site in a given period, which is known as the crawl budget. If you want all your important pages to be crawled, block the irrelevant pages in the robots.txt file. This way Google will spend its budget on the pages that matter to you.
- Prevent indexing of resources: Meta directives work well for HTML pages, but they cannot be placed in documents like PDFs and images. This is where robots.txt plays a crucial part, and you can always check the index status in Search Console.
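The three cases above might translate into rules like the following. This is only a sketch: /staging/, /login/, /search/, and /downloads/pdf/ are hypothetical paths that you would replace with the folders on your own site.

```txt
User-agent: *
# Nonpublic pages
Disallow: /staging/
Disallow: /login/
# Low-value pages that would use up crawl budget (hypothetical internal search results)
Disallow: /search/
# Resources that cannot carry a meta directive (hypothetical folder of PDF files)
Disallow: /downloads/pdf/
```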
How does the robots.txt file work?
Search engines work by crawling pages, following links to discover new ones, and indexing what they find. Before crawling a site, the crawler fetches the robots.txt file and reads its directives. From these, the crawler knows which pages it may crawl and which it should skip.
Where should you put the robots.txt file?
The robots.txt file should be placed in the root of your domain, for example https://www.example.com/robots.txt. Make sure you name it exactly “robots.txt” in lowercase: the file name is case sensitive, and crawlers will not find it otherwise.
Best Practices for Creating Robots.txt
1. Use proper syntax: It is vital to write robots.txt with the correct syntax so that bots know which pages to crawl and which to skip. You can also write different rules for different bots. Remember that the paths in Allow and Disallow rules are case sensitive. For instance:
User-agent: Googlebot
Disallow: /images/
The above directive tells the spider named Googlebot to crawl everything except the images folder. Make sure to enter the right path in the Disallow directive, because it is case sensitive: /images/ is not the same as /Images/. To address all bots instead of a single one, use * as the user agent, that is, User-agent: *.
2. Common user agents: Here is a list of the most common user agents for the most widely used search engines.
|Search engine|Field|User agent|
|---|---|---|
|Bing|Images & Video|msnbot-media|
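To show how user agents are used in practice, here is a sketch with one group of rules per crawler. Googlebot-Image is Google's image crawler and msnbot-media is the Bing media crawler from the table; the /photos/ and /drafts/ paths are hypothetical.

```txt
# Rules for Google's image crawler only
User-agent: Googlebot-Image
Disallow: /photos/

# Rules for Bing's image and video crawler only
User-agent: msnbot-media
Disallow: /photos/

# Rules for every other crawler
User-agent: *
Disallow: /drafts/
```

A crawler generally follows the group that names it most specifically and ignores the others, so in this sketch Googlebot-Image would obey only its own group and not the * group.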
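3. Using wildcards: Robots.txt does not support full regular expressions, but major search engines such as Google and Bing understand two wildcard characters, * and $.
Disallow: /*.php
Disallow: /copyrighted-images/*.jpg
In the above example, * stands for any file name, so every file that matches the pattern will be blocked. The rest of the line is still case sensitive, so the second line will not block a file such as /copyrighted-images/PICTURE.JPG from being crawled.
The $ character marks the end of a URL. With a rule such as Disallow: /*.php$, /index.php will be blocked but /index.php?p=1 will not. Hence it is important to use these expressions very carefully, otherwise many pages on the site may get blocked by mistake.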
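As a further sketch of the two wildcard characters, the rules below block URLs with query strings and PDF files. The patterns rely on * and $ as supported by Google and Bing; other crawlers may not interpret them, so treat that support as an assumption to check for the bots you care about. The example URLs in the comments are hypothetical.

```txt
User-agent: *
# Block any URL that contains a query string, e.g. /shoes?color=red
Disallow: /*?
# Block URLs that end in .pdf, e.g. /files/guide.pdf ($ anchors the end of the URL)
Disallow: /*.pdf$
```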
4. Common directives: Most sites stick to a small set of directives, such as User-agent, Disallow, Allow, Crawl-delay, and Sitemap, because they are simple and very readable.
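5. Crawl delay: Be careful when using the Crawl-delay directive. A day has 86,400 seconds, so if you set a crawl delay of ten seconds, you only allow a search engine to access 8,640 pages a day. Note that Google ignores this directive, while some other search engines such as Bing honor it.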
6. Sitemap: Referencing a sitemap in robots.txt helps search engines find and index all your webpages, although you should also submit it in Search Console to get its recommendations.
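The last three practices can be combined in one file. The sketch below assumes a hypothetical site at www.example.com with a /blog/ folder; the paths, the delay value, and the sitemap URL are placeholders, not recommendations.

```txt
User-agent: *
# Block the whole /blog/drafts/ folder ...
Disallow: /blog/drafts/
# ... but allow one page inside it (the longer, more specific rule wins)
Allow: /blog/drafts/preview.html
# Wait 10 seconds between requests, i.e. at most 8,640 pages a day;
# Googlebot ignores this directive
Crawl-delay: 10

# The sitemap location is given as a full URL
Sitemap: https://www.example.com/sitemap.xml
```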