The purpose of a robots.txt file
Key Takeaways
- A robots.txt file tells search engines which parts of a website they can crawl.
- The file lives in a website's top-level directory and provides instructions for web crawlers to follow.
- Using a robots.txt file can benefit a site's crawl budget and prevent access to parts of a site that need fixing.
- The "noindex" tag can be used to prevent a specific page from appearing in search results because the robots.txt file does not necessarily do this.
What is a robots.txt file?
robots.txt is a plain-text file that tells search engines which parts of your website they may crawl. Search engines deploy so-called “spiders” or “bots” that travel the web, reading and indexing the pages they come across. When such a bot reaches a website, it looks for the robots.txt file to learn what it should crawl and what it should ignore.
How does the robots.txt work?
Spiders and bots travel across the web by following links from website A to B to C, and so on. When they reach a website, they consult its robots.txt file, the basis of the “Robots Exclusion Protocol”, to look up which pages they are allowed to crawl and index.
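A well-behaved crawler performs this lookup before fetching any page. As a minimal sketch of that logic, Python's standard urllib.robotparser module can download a site's robots.txt and answer whether a given user agent may fetch a given URL (the domain below is just a placeholder):

from urllib import robotparser

# Point the parser at the site's robots.txt and download it.
parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Ask whether a specific bot may fetch a specific URL.
print(parser.can_fetch("Googlebot", "https://www.example.com/some-page.html"))
print(parser.can_fetch("*", "https://www.example.com/"))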
Are all bots required to obey the information from robots.txt?
Some bots simply ignore robots.txt. These are usually malware bots, spam bots, or email scrapers. The majority of bots from established search engines such as Google and Yahoo will adhere to the instructions in the file.
Do I need a robots.txt on my website?
If you want search engines to index your entire website, and there is nothing you want to block access to, then you don’t need to bother with robots.txt at all. When a bot reaches a particular website and does not find the robots.txt file, it will simply proceed to crawl the entire website.
Where do I need to put the robots.txt file?
The robots.txt file has to be in a website’s top-level directory. You can access any website’s robots.txt by adding /robots.txt to the root domain URL, such as https://www.example.com/robots.txt. It is also very important to make sure the file is correctly named “robots.txt”. “Robots.txt” or “ROBOTS.txt” are incorrect file names, since the file name is case-sensitive.
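To confirm the file is reachable where crawlers expect it, you can simply request it. Here is a quick sketch using Python's standard urllib, again with a placeholder domain:

import urllib.request

# robots.txt must sit at the site root; crawlers ignore it anywhere else.
with urllib.request.urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))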
What are the basic instructions that robots.txt gives?
Allowing all web crawlers to access all content:
User-agent: *
Disallow:
Blocking all web crawlers from all content:
User-agent: *
Disallow: /
It is also possible to block only specific crawlers that obey the Robots Exclusion Protocol, and to disallow access to a single page or folder rather than the entire website. All the big, established search engines honor these more specific rules.
User-agent: Googlebot
Disallow: /blocked-page.html
User-agent: Bingbot
Disallow: /wp-admin/
Allow: /wp-content/uploads
In the first example, robots.txt blocks Googlebot from a single page (note that Disallow takes a path relative to the root, not a full URL). Because the rule names Googlebot alone, Googlebot may still crawl every other page, and all other bots may crawl and index any page on the website they can reach.
In the second example, Bingbot is not allowed to crawl the administrator folder of a WordPress installation, but it is allowed to crawl and index the uploaded content (such as images) in the uploads folder.
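To see how a rule-abiding crawler interprets these per-agent directives, here is a small sketch using Python's standard urllib.robotparser, feeding it the Bingbot rules above as literal lines:

from urllib import robotparser

rules = """
User-agent: Bingbot
Disallow: /wp-admin/
Allow: /wp-content/uploads
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Bingbot is blocked from the admin folder but not from uploads.
print(parser.can_fetch("Bingbot", "/wp-admin/options.php"))        # False
print(parser.can_fetch("Bingbot", "/wp-content/uploads/pic.jpg"))  # True
# A crawler with no matching rules falls back to the default: allow.
print(parser.can_fetch("Googlebot", "/wp-admin/options.php"))      # True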
What are the advantages of using robots.txt?
Each website has a specific “crawl budget”: the number of pages a search engine bot will crawl on that website. Blocking parts of a website from crawling frees up that budget for the rest of the site, which matters when a site has a very large number of pages. It is also a good way to keep bots away from parts of the website that still need to be cleaned up or otherwise fixed before they are presented to the public.
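For example, a robots.txt along these lines would keep all crawlers out of a hypothetical internal-search section and a staging area, so the crawl budget is spent on the pages that matter:

User-agent: *
Disallow: /internal-search/
Disallow: /staging/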
If we disallow a page, will it disappear from search results?
No. If a bot is not allowed to crawl a specific page, it will not do so. However, if a search engine finds links to the blocked URL on a third-party site, it can still index the URL without crawling it. This means the page might show up in search results even though it is disallowed for crawling in the robots.txt file.
If you want to keep a specific page out of search results, you need to use the “noindex” tag instead. Note that for a crawler to see the “noindex” tag at all, the page must not be blocked by robots.txt.
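The tag is a standard meta element placed in a page’s <head>; a minimal sketch:

<head>
  <!-- Tells compliant crawlers to keep this page out of their index. -->
  <meta name="robots" content="noindex">
</head>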