If you read my blog via an RSS reader you probably noticed a few odd goings-on earlier today. I changed a few things on the site and all of the posts going back to last year appeared as new again, even if you’d read them.
Sorry ’bout that, but there was a method to my madness, or at least a method to my fiddling.
Although it’s not entirely obvious, one of the main reasons I started running this site was to mess around with search engine optimisation and try out the theories of various experts who also run a blog but with a great deal more focus than me.
To that end, I’ve rewritten the code that generates my RSS feed, and included some inline formatting to make it easier to read. Now when you read the blog from a feed reader it should look a bit more like the website, give or take. Well, more give than take.
While some of the changes were purely cosmetic, I also changed the URLs for all my blog posts.
The new URLs are the bit that caused them to pop up as new posts in at least FeedBurner and Google Reader. The change was to remove the dates from the URL itself and replace all the underscores with hyphens.
I removed the dates because they just looked ugly compared to the WordPress style of using directories for the year / month. I don’t use WordPress, but decided that if I was going to mess with all my URLs I might as well change to nicer-looking ones while I was at it.
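For anyone wanting to do something similar, here is a minimal sketch of that kind of rewrite in Python. It is not the code that runs this site: the old slug format (a leading date such as `2009_03_14_`) and the function name `modernise_slug` are assumptions made up for illustration.

```python
import re

# Hypothetical old format: a leading date (YYYY_MM_DD_) followed by
# underscore-separated words. The pattern is an assumption, not the
# site's real URL scheme.
DATE_PREFIX = re.compile(r"^\d{4}_\d{2}_\d{2}_")


def modernise_slug(old_slug: str) -> str:
    """Drop the assumed date prefix, then swap underscores for hyphens."""
    without_date = DATE_PREFIX.sub("", old_slug)
    return without_date.replace("_", "-")


print(modernise_slug("2009_03_14_hyphens_vs_underscores"))
```

If you do this on a live site, remember that feed readers and search engines will see each changed URL as a brand new page unless the old address redirects to the new one.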
If you do some searching for “hyphen vs underscore in URLs” using your favourite search engine you’ll find a bunch of writing, with the general wisdom falling on the side of hyphens. In fact, as far back as 2005 Matt Cutts, a developer from Google, blogged about it.
So why, you might ask, did I use underscores? Well. Ummmmm, ’cause it’s what I’ve always done is the only answer I’ve got.
A bit more searching around told me that the results for at least Google are apparently different between the two methods. Using underscores caused URLs to be considered as phrases, while hyphens were more likely to result in search results for individual words in the URL.
This sounded like something worthy of some experimentation so, wearing my best white lab coat, I created some pages on a few different sites I look after which were not linked from the navigation but were listed in the XML sitemaps.
I mixed and matched the URLs with underscores and hyphens and used some misspelt words and phrases. There were a total of 48 pages, spread over 8 domains, which were all visited by Googlebot a number of times over an eight-week period.
I had a split of twelve pages with hyphens and matching content, twelve with hyphens and unmatched content, and the same split with underscores. Where the content matched I used the same misspelling of the words to get an idea of how well it worked. All six of the sites have good placement of long-tail searches for their general content and get regularly spidered.
The end result is that most of the hyphenated URL pages that did not have matching keywords in content or tags were indexed against individual words in the URL (eight out of twelve). All of the pages that had hyphenated URLs and matching keywords in the content were indexed against those words.
The pages with underscores and non-matched content didn’t fare so well. Only four out of the twelve pages got indexed against words in the URL, although nine of them were indexed against long-tail phrases from the URLs. Pages with underscores and matching content ranked lower for keywords in the URL than the hyphenated ones, although that’s not an accurate measure as they were misspelt words on pages with no back links.
So, end result: The common wisdom of using hyphens would appear to be valid and helpful if you’re running a site where long keyword rich URLs make sense, and the strength of the individual keywords might be more valuable than the phrase.
If you’re going for long-tail search results in a saturated market where single keyword rank is hard to gain, you might want to mix it up a little and try some underscores. It certainly can’t hurt to try.
One thing to note for those not familiar with why this is even an issue: spaces are not valid in the standard for URLs, although they are common in poorly or lazily designed websites. If you’re really bored you can read the original spec by Tim Berners-Lee back in 1994, or the updated version from 2005, also by Mr Berners-Lee.
The long and short of that in this context is that you can use upper and lower case letters, numbers, hyphens, underscores, full stops and tildes (‘~’). Everything else is either reserved for a specific function, or not valid and requires encoding. A space should be encoded as ‘%20’ and you can probably imagine how good that looks when trying to%20read%20things.
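If you want to check your own URLs against that list, something like the following rough sketch would do. The set of characters comes from the 2005 spec; the helper name `needs_encoding` is just something I made up for the example.

```python
import string

# The "unreserved" set from the 2005 spec (RFC 3986):
# letters, digits, hyphen, full stop, underscore and tilde.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def needs_encoding(segment: str) -> list:
    """Return the characters in a URL path segment that are outside the unreserved set."""
    return [ch for ch in segment if ch not in UNRESERVED]


print(needs_encoding("hyphens-vs_underscores"))  # nothing to flag
print(needs_encoding("trying to read"))          # flags the two spaces
```

An empty list means the segment is safe as it stands. Anything returned is either reserved for a special purpose (like `/` or `?`) or has to be percent-encoded.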
If you type a URL with a space into your browser, the browser converts it to ‘%20’ before sending it down the pipe for you. You sometimes see these encoded URLs with not just spaces but other random things in them, and they can be the cause of random behaviour for some websites and software, so it’s best to avoid odd characters in your URLs.
Apologies again if you got some duplicates in your RSS reader over the last few hours. I’ll try not to do that again, and it’ll be interesting to see if a couple of pages that were being ignored by Google with underscores get indexed now.
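You can see roughly what the browser does with Python’s standard `urllib.parse` module. This is only an illustration of percent-encoding, not a claim about how any particular browser implements it, and the sample strings are invented.

```python
from urllib.parse import quote, unquote

readable = "trying to read things"

# quote() percent-encodes anything that is not safe in a URL path;
# a space becomes %20.
encoded = quote(readable)
print(encoded)           # trying%20to%20read%20things

# unquote() reverses the process.
print(unquote(encoded))  # trying to read things

# Hyphens and underscores pass through untouched.
print(quote("hyphens-vs_underscores"))
```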
Matt Cutts Blog posting from 2005 http://www.mattcutts.com/blog/dashes-vs-underscores/
1994 spec for URLs http://www.ietf.org/rfc/rfc1738.txt
2005 update to URL Spec http://www.ietf.org/rfc/rfc3986.txt