purpleblog

The purpose of a robots.txt file

Key Takeaways

  • A robots.txt file tells search engines which parts of a website they can crawl.
  • The file is located at the top-level directory of a website and provides instructions for web crawlers to follow.
  • Using a robots.txt file can benefit a site's crawl budget and prevent access to parts of a site that need fixing.
  • Blocking a page in robots.txt does not necessarily keep it out of search results. Use the "noindex" tag to prevent a specific page from appearing there.

What is a robots.txt file?

robots.txt is a plain text file that tells search engines which parts of your website they can crawl. Search engines deploy so-called “spiders” and “bots” that crawl the web, reading and indexing the pages they come upon. When they reach a website, they look for its robots.txt file to find out what they should crawl and what they should ignore.

How does the robots.txt work?

Spiders and bots travel across the web by following links from website A to B to C, and so on. Before crawling a site, they request its robots.txt, the file that implements the “Robots Exclusion Protocol”, and use it to look up which pages they are allowed to crawl.

Are all bots required to obey the information from robots.txt?

Following robots.txt is voluntary, so some spiders (bots) can opt to ignore it. These are usually malware bots, spamming bots or email scrapers. The majority of bots from established search engines such as Google and Yahoo will adhere to the instructions in the file.

Do I need a robots.txt on my website?

If you want search engines to crawl your entire website, and there is nothing you want to block access to, then you don’t need to bother with robots.txt at all. When a bot reaches a website and does not find a robots.txt file, it will simply proceed to crawl the entire website.

Where do I need to put the robots.txt file?

The robots.txt file has to be in a website’s top-level directory. You can access any website’s robots.txt by adding /robots.txt to the root domain URL, such as https://www.example.com/robots.txt. It is also very important to make sure the file is named exactly “robots.txt”. “Robots.txt” and “ROBOTS.txt” are incorrect file names, since the file name is case sensitive.
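
As a minimal sketch of that location rule, the Python snippet below derives the robots.txt address from any page URL on the same host. The function name `robots_txt_url` and the sample page URL are illustrative assumptions, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep scheme and host, replace the path, drop query and fragment.
    # The file name is always lowercase "robots.txt".
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post.html?page=2"))
# prints https://www.example.com/robots.txt
```

Whatever page a bot starts from, it ends up asking for the same single file at the root of the host.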

What are the basic instructions that robots.txt gives?

Allowing all web crawlers to access all content:

User-agent:*
Disallow:

Blocking all web crawlers from all content:

User-agent:*
Disallow: /
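
If you want to check how a well-behaved crawler would read these two rule sets, Python's standard library ships a parser for the protocol. The sketch below feeds both variants to `urllib.robotparser`. The bot name `ExampleBot`, the test URL and the helper `parse_rules` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The two rule sets from above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical bot name and page URL, used only for the check.
url = "https://www.example.com/any/page.html"

print(parse_rules(ALLOW_ALL).can_fetch("ExampleBot", url))  # True
print(parse_rules(BLOCK_ALL).can_fetch("ExampleBot", url))  # False
```

An empty Disallow value blocks nothing, while a single slash matches every path on the site.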

It is also possible to block only specific crawlers, as long as they obey the rules of the “Robots Exclusion Protocol”. All the big, established search engines do, so you can disallow access to a specific page, a specific folder or the entire website for each of them separately.

User-agent: Googlebot
Disallow: /blocked-page.html

User-agent: Bingbot
Disallow: /wp-admin/
Allow: /wp-content/uploads

In the first instance, robots.txt does not allow Googlebot to visit a specific page. The rule applies only to that bot and that page: Googlebot can crawl any other page, and all other bots can crawl any page on that website they can reach.

In the second instance, Bingbot is not allowed to crawl the administrator folder of a WordPress installation, but it is allowed to crawl all the uploaded content (images) in the uploads folder of that installation.
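
The two instances can be tried out with Python's built-in `urllib.robotparser`. This is only a sketch: the Googlebot rule is written as a site-relative path, which is the form the protocol expects, and the extra page names and the bot name `ExampleBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules from the two instances, with paths relative to the site root.
RULES = """\
User-agent: Googlebot
Disallow: /blocked-page.html

User-agent: Bingbot
Disallow: /wp-admin/
Allow: /wp-content/uploads
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

SITE = "https://www.example.com"

# Hypothetical pages and bots used to probe the rules.
checks = [
    ("Googlebot", "/blocked-page.html"),
    ("Googlebot", "/other-page.html"),
    ("Bingbot", "/wp-admin/options.php"),
    ("Bingbot", "/wp-content/uploads/photo.jpg"),
    ("ExampleBot", "/blocked-page.html"),
]

for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, SITE + path) else "blocked"
    print(f"{agent:10} {path:32} {verdict}")
```

Only the first and third checks come back as blocked. A bot that has no group of its own and no `*` group to fall back on, such as `ExampleBot` here, is not restricted at all.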

What are the advantages of using robots.txt?

Each website has a specific “crawl budget”, the number of web pages a search engine bot will crawl on that website. By blocking parts of a website from crawling, you can free up the budget for the rest of the website, which helps if it has a very large number of pages. It is also good to prevent bots from crawling parts of the website that still need to be cleaned up or otherwise fixed before they can be presented to the public.

If we disallow a page, will it disappear from search results?

No. If a bot is not allowed to crawl a specific page, it will not do so. However, if the search engine finds links to the blocked URL elsewhere, for example on a third-party site, it can still index that URL without crawling the page. It means the page might show up in search results even if it is disallowed for crawling in the robots.txt file.

If you want to keep a specific page from showing in search results, you need to use the “noindex” tag. For a bot to see that tag, it has to be able to crawl the page, so a page carrying “noindex” must not be blocked by robots.txt.
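
The “noindex” tag is usually written as a robots meta tag in the page's head section. As a rough sketch, the Python snippet below scans a page's HTML for that tag, which is a simplified version of what a crawler has to do before it can honour it. The class `RobotsMetaParser`, the function `has_noindex` and the sample pages are illustrative assumptions.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for item in attributes.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical pages for illustration.
DRAFT_PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body>Draft</body></html>'
PUBLIC_PAGE = "<html><head><title>Hello</title></head><body>Public</body></html>"

print(has_noindex(DRAFT_PAGE))   # True
print(has_noindex(PUBLIC_PAGE))  # False
```

The check only works on HTML that was actually downloaded, which is why a page blocked in robots.txt cannot communicate its “noindex” to the search engine.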
