Googlebot Strips Default Filenames From Sitemap URLs

There’s a useful thread over at Google Webmaster Groups that highlights an issue with default filenames, such as index.html, in sitemap URLs. As user edralph888 explains:

The URL in our sitemap is in the format:

http://www.domain.com/index.html?whatever=value

The problem with Googlebot is that even though that is the URL we put in the sitemap, it doesn’t use that URL to make the request – it contracts it down to:

http://www.domain.com/?whatever=value

So our server sees this ‘incorrect’ URL, issues a 301 with the ‘correct’ URL (that has the index.html bit in it), but then Googlebot doesn’t follow that URL faithfully and again tries to request the URL without index.html in the path. So our server again issues a 301 redirect, with the correct URL and here we go off on our infinite loop. So no wonder we get the error message:

URLs not followed....

contained too many redirects.
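To make the loop described above concrete, here is a minimal Python sketch. It is a simulation only: `crawler_normalize`, `server_respond` and `crawl` are hypothetical names, the stripping rule is a guess based on the behaviour reported in the thread, and the redirect limit of 5 is arbitrary rather than Googlebot's real threshold.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_FILE = "index.html"


def crawler_normalize(url):
    """Hypothetical stand-in for the crawler: drop a trailing index.html from the path."""
    parts = urlsplit(url)
    if parts.path.endswith("/" + DEFAULT_FILE):
        parts = parts._replace(path=parts.path[: -len(DEFAULT_FILE)])
    return urlunsplit(parts)


def server_respond(url):
    """Hypothetical server: 301 any directory-style URL to its index.html form."""
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        target = parts._replace(path=parts.path + DEFAULT_FILE)
        return 301, urlunsplit(target)
    return 200, None


def crawl(url, max_redirects=5):
    """Follow redirects, normalizing every URL before it is requested."""
    hops = 0
    while True:
        status, location = server_respond(crawler_normalize(url))
        if status == 200:
            return "ok", hops
        hops += 1
        if hops >= max_redirects:  # arbitrary cut-off, an assumption
            return "too many redirects", hops
        url = location


print(crawl("http://www.domain.com/index.html?whatever=value"))
print(crawl("http://www.domain.com/page.html"))
```

Because the simulated crawler strips index.html from every URL it is about to request, including the redirect target, the first URL never resolves, while the second one is fetched straight away.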

John Mueller, Webmaster Trends Analyst at Google Zürich, replies:

In this case it actually is something that we’re doing — we strip “/index.html” from URLs because that’s generally irrelevant and only makes the URL longer and look more complicated to the user. We do this when processing the URLs in your Sitemap file so if you *need* to have “/index.html” in the URLs, they generally won’t work like that. At the moment, there is no solution for using these URLs in Sitemap files if you need to have “/index.html” in them. I would generally recommend dropping the “/index.html” part, but I realize that this is sometimes not easily done.

That said, we will still crawl the website normally, so if those URLs are reachable through a normal web crawl, we’ll still find and index them normally.

Useful advice there for anyone putting together a sitemap and wondering why Google is throwing an error on URLs that require a default filename. I assume this would also apply to the other “default” page names, such as index.htm, index.cgi, index.pl, index.php, index.xhtml, index.asp and perhaps default.html.
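If you want to check a sitemap for affected entries before submitting it, something along these lines might help. It is a rough Python sketch, not a description of how Google processes sitemaps: only index.html is confirmed in the thread, the other filenames are the guesses listed above, and the function names are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Only "index.html" is confirmed by Google in the thread; the rest are assumptions.
DEFAULT_FILENAMES = {
    "index.html", "index.htm", "index.cgi", "index.pl",
    "index.php", "index.xhtml", "index.asp", "default.html",
}


def strip_default_filename(url):
    """Return the URL with a trailing default filename removed from its path."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename not in DEFAULT_FILENAMES:
        return url  # nothing to strip, leave the URL untouched
    return urlunsplit(parts._replace(path=directory + "/"))


def flag_sitemap_urls(urls):
    """List (original, stripped) pairs for URLs that would likely be rewritten."""
    flagged = []
    for url in urls:
        stripped = strip_default_filename(url)
        if stripped != url:
            flagged.append((url, stripped))
    return flagged


sitemap_urls = [
    "http://www.domain.com/index.html?whatever=value",
    "http://www.domain.com/about.html",
    "http://www.domain.com/shop/index.php",
]
for original, stripped in flag_sitemap_urls(sitemap_urls):
    print(original, "->", stripped)
```

Any URL this flags is a candidate for being listed in the sitemap in its shorter form, provided the server answers that form directly instead of redirecting back to the filename.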

Nick Wilsdon is the Head of Content and Media at iProspect UK, part of the Dentsu Aegis Network. He manages online campaigns for the UK's leading telecom, finance and FMCG brands.
