Does the NoIndex directive work in Robots.txt?
TLDR: Yes, but it takes quite a while, so if you do choose to implement it, use it in conjunction with the noindex meta tag. It may be useful for deindexing wildcard pages or whole directories on large sites.
The longer version: In late 2015, seemingly out of nowhere, discussions began in the technical SEO-sphere around using a noindex directive within robots.txt to de-index content from your site.
Whilst this directive has been discussed as far back as 2008, it was still news to many seasoned SEOs. DeepCrawl declared it the best kept secret in SEO, Eric Enge performed experiments to see if it actually works, and John Mueller said to avoid using it, which left many confused as to whether it was actually worth bothering with.
So, in order to find out more and settle our own curiosity, we set up an experiment to see how long it would take to de-index content using different types of directives.
On December 4th last year, we added commands to our robots.txt file with a view to de-indexing live content on our site that was still crawlable and did not use the noindex meta tag.
We added commands to de-index specific pages, namely old staff profile pages for folk that have moved on to pastures new.
We added the following wildcard directive to remove any page from within our client case study section:
And we used the following wildcard to de-index any download, such as PDFs or DOCX files:
We wanted to test the wildcard because in this discussion on Google+ it was suggested that the “*” wildcard directive is not supported.
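For anyone unsure how wildcard rules behave, here is a rough Python sketch of the "*" and "$" pattern syntax that Google documents for robots.txt paths. It assumes the noindex directive matches paths the same way Disallow does, which is an assumption rather than anything confirmed. The rules listed are hypothetical examples, not the ones from our own file, and the function names are made up for illustration.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' stands for any run of characters and a trailing '$' anchors the
    end of the path. Everything else is treated as a literal prefix.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + (r"\Z" if anchored else ""))


def matches(rule: str, path: str) -> bool:
    """True if the rule applies to the path (matching from the start)."""
    return rule_to_regex(rule).match(path) is not None


# Hypothetical rules for illustration only (NOT our real directives).
HYPOTHETICAL_RULES = [
    "/team/jane-doe/",     # one specific page
    "/client-results/*",   # everything inside a directory
    "/*.pdf$",             # any path ending in .pdf
    "/*.docx$",            # any path ending in .docx
]


def noindexed(path: str) -> bool:
    """True if any of the hypothetical rules covers the path."""
    return any(matches(rule, path) for rule in HYPOTHETICAL_RULES)


if __name__ == "__main__":
    for path in ["/client-results/acme/", "/downloads/guide.pdf", "/blog/post/"]:
        print(path, noindexed(path))
```

Running a list of your own URLs through something like this before editing robots.txt is a cheap way to check that a wildcard catches what you expect and nothing more.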
So on 04/12/15 when these commands were added, there were:
10 staff profiles indexed
10 pages within the client results folder indexed
49 downloads indexed
Since then, we have been checking on a regular basis and for some time there was no movement. However, over the past few days, pages have started to be removed, and at time of writing we now see:
5 staff profiles indexed (-5)
4 client results pages indexed (-6)
21 downloads indexed (-28)
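To put those figures in proportion, here is a small Python sketch that turns the before and after counts above into removal rates. The counts are the ones reported in this post; the variable and function names are just illustrative.

```python
# (indexed on 04/12/15, indexed at time of writing), taken from the counts above
counts = {
    "staff profiles": (10, 5),
    "client results pages": (10, 4),
    "downloads": (49, 21),
}


def removal_rate(before: int, after: int) -> float:
    """Percentage of the originally indexed pages that have been removed."""
    return (before - after) / before * 100


for name, (before, after) in counts.items():
    removed = before - after
    print(f"{name}: {removed} of {before} removed ({removal_rate(before, after):.0f}%)")

total_before = sum(before for before, _ in counts.values())
total_after = sum(after for _, after in counts.values())
print(f"overall: {total_before - total_after} of {total_before} removed "
      f"({removal_rate(total_before, total_after):.0f}%)")
```

That works out at roughly 50% of staff profiles, 60% of client results pages and 57% of downloads removed so far, or 39 of 69 pages overall.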
This would suggest that whilst it might not be officially supported by Google, it still does work (for now). But as it’s clearly not as effective as actually inserting the robots noindex meta tag, is there any value in implementing this directive?
Eric Enge’s experiments seem to suggest that there is no correlation between a page specified in the robots.txt file being crawled and that page being de-indexed. So it would seem possible for a page to be removed purely off the robots.txt directive alone, without being crawled.
If this is the case, and as the wildcards seemingly work, then I would say it is useful to try the command in conjunction with the meta tag when trying to de-index a large directory, or URLs of a similar naming convention. This applies on a large site that has thousands of crawlable URLs, or where crawl budget limits how deep a site can be crawled on each visit, as it may be some time before such content is crawled and the meta tag picked up on.
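If you do pair the two approaches, it helps to know which pages already carry the meta tag. The following is a minimal Python sketch, using only the standard library, that checks a page's HTML for a robots noindex meta tag. The class and function names are hypothetical, and it only looks at meta tags in the HTML you pass in, so fetching the pages is left to you.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> (or "googlebot") tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without a value.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex_meta(html: str) -> bool:
    """True if the HTML contains a robots meta tag that includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<head><meta name="robots" content="noindex, follow"></head>'
    print(has_noindex_meta(sample))
```

Pages that come back False are the ones where a robots.txt rule would be doing all the work on its own.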
At the end of the day, it won’t do any harm to implement, and it may actually speed up the process of de-indexing unwanted content on your site. That is something we tend to have to do on a regular basis on new websites we work on these days, and it is a task which invariably takes a long time to complete.