The URL in our sitemap is in the format:
The problem with Googlebot is that even though that is the URL we put in the sitemap, it doesn’t use that URL to make the request – it strips it down to:
Our server sees this ‘incorrect’ URL and issues a 301 redirect to the ‘correct’ URL (the one with the index.html bit in it). But Googlebot doesn’t follow that redirect faithfully – it again requests the URL without index.html in the path, so our server issues another 301 with the correct URL, and off we go on an infinite loop. No wonder we get the error message:
URLs not followed....contained too many redirects.
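The loop described above can be sketched in a few lines. This is a hypothetical simulation (the function names and redirect limit are assumptions for illustration, not Googlebot’s actual internals): the server insists on `/index.html`, the crawler strips it before every request, and the two never agree.

```python
# Minimal sketch of the redirect loop: a server that 301s every URL
# missing "/index.html", and a crawler that strips "/index.html"
# before each request (as Googlebot does with sitemap URLs).

def server_response(path):
    """Return (status, location): 200 if the path ends in /index.html,
    otherwise a 301 pointing at the 'correct' URL with /index.html."""
    if path.endswith("/index.html"):
        return 200, path
    return 301, path.rstrip("/") + "/index.html"

def crawler_fetch(start_path, max_redirects=5):
    """Crawler that normalizes away the default filename before requesting."""
    path = start_path
    for _ in range(max_redirects):
        # Googlebot-style normalization: drop the default filename.
        if path.endswith("/index.html"):
            path = path[: -len("index.html")]
        status, location = server_response(path)
        if status == 200:
            return "OK"
        path = location  # follow the 301 ... and strip it again next pass

    return "ERROR: too many redirects"

print(crawler_fetch("/section/index.html"))
# -> ERROR: too many redirects
```

Each pass strips the filename, gets redirected back to it, and strips it again, so the redirect budget is exhausted without ever reaching a 200.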
John Mueller, Webmaster Trends Analyst at Google Zürich replies:
In this case it actually is something that we’re doing — we strip “/index.html” from URLs because that’s generally irrelevant and only makes the URL longer and look more complicated to the user. We do this when processing the URLs in your Sitemap file so if you *need* to have “/index.html” in the URLs, they generally won’t work like that. At the moment, there is no solution for using these URLs in Sitemap files if you need to have “/index.html” in them. I would generally recommend dropping the “/index.html” part, but I realize that this is sometimes not easily done.
That said, we will still crawl the website normally, so if those URLs are reachable through a normal web crawl, we’ll still find and index them normally.
Useful advice there for anyone putting together a sitemap and wondering why Google is throwing an error on URLs that require a default filename. I assume this would also apply to the other “default” page names such as index.htm, index.cgi, index.pl, index.php, index.xhtml, and index.asp, and perhaps default.html etc.
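Given Mueller’s recommendation to drop the default filename, one practical option is to normalize sitemap URLs before writing the sitemap file. Here is a minimal sketch; the list of default filenames is an assumption based on common server configurations (the ones mentioned above), not an official Google list.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of common default filenames, per the article's examples.
DEFAULT_FILES = {
    "index.html", "index.htm", "index.cgi", "index.pl",
    "index.php", "index.xhtml", "index.asp", "default.html",
}

def canonicalize(url):
    """Strip a trailing default filename so the sitemap URL matches
    the form Googlebot will actually request (with trailing slash)."""
    parts = urlsplit(url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment in DEFAULT_FILES:
        path = path[: -len(last_segment)]  # keep the trailing slash
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

print(canonicalize("https://example.com/blog/index.html"))
# -> https://example.com/blog/
```

Running every URL through a helper like this before it goes into the sitemap sidesteps the redirect loop entirely, since the server should be configured to serve the directory URL directly (200) rather than redirecting it back to the filename.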