A robots.txt file is created to give instructions to web robots (typically search engine crawlers) about how to crawl pages on a website. The robots.txt file is part of the robots exclusion protocol (REP), a set of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
Let’s go over the best practices to follow when creating a robots.txt file:
The robots.txt file must be kept in the root directory of the website (for example, at https://example.com/robots.txt); crawlers will not look for it anywhere else.
The file name is case sensitive: it must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise).
Some user agents (robots) may choose to ignore your robots.txt file.
The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see that website’s directives (if the site has a robots.txt file!). This means that anyone can see which pages you do or don’t want crawled, so don’t use the file to hide private user information.
Each subdomain on a root domain uses its own robots.txt file. This means that both blog.example.com and example.com should have separate robots.txt files (at blog.example.com/robots.txt and example.com/robots.txt).
It’s generally a best practice to indicate the location of any sitemaps associated with the domain at the bottom of the robots.txt file, as in the sample file below.
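Putting these practices together, a minimal robots.txt might look like the following sketch (the disallowed path and the sitemap URL are hypothetical placeholders):

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

The User-agent line states which crawlers the group of rules applies to (* matches all of them), and each Disallow line names a path prefix those crawlers should not fetch.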
Why is the robots.txt file needed?
robots.txt files control crawler access to particular areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site, there are some situations in which a robots.txt file can be very handy.
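To see how easy that accident is, the following two lines are all it takes to block every compliant crawler from the entire site, so treat a bare Disallow: / with care:

User-agent: *
Disallow: /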
Some common use cases include:
Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
Keeping entire sections of a website private (for instance, your engineering team’s staging site)
Keeping internal search results pages from showing up on a public SERP
Defining the location of sitemap(s)
Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once (see the sample file after this list)
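As a sketch, here’s how several of those use cases might combine in one file; all of the paths are hypothetical, and note that the Crawl-delay directive is honored by some crawlers (such as Bingbot) but ignored by Googlebot:

User-agent: *
Disallow: /search/
Disallow: /staging/
Disallow: /*.pdf$
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml

The /*.pdf$ pattern uses the wildcard syntax that major search engines such as Google and Bing support, though it is not part of the original robots.txt standard.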
Here’s what happens when we add a robots.txt file to a website: before fetching any page, a compliant crawler first requests /robots.txt. If the file exists, the crawler parses it and obeys the directives that apply to its user agent; if the request returns a 404, the crawler assumes the whole site may be crawled.
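As a final sketch of how that matching works, a crawler obeys only the most specific group that names its user agent. In the hypothetical file below, Googlebot would follow its own group (and could still crawl everything outside /private/), while all other crawlers would be blocked from the whole site:

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /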
