Underscores vs Hyphens and an Apology

If you read my blog via an RSS reader you probably noticed a few odd goings-on earlier today. I changed a few things on the site and all of the posts going back to last year appeared as new again, even if you’d read them.

Sorry ’bout that, but there was a method to my madness, or at least a method to my fiddling.

Although it’s not entirely obvious, one of the main reasons I started running this site was to mess around with search engine optimisation and try out the theories of various experts who also run blogs, but with a great deal more focus than me.

To that end, I’ve rewritten the code that generates my RSS feed and included some inline formatting to make it easier to read. Now when you read the blog from a feed reader it should look a bit more like the website, give or take. Well, more give than take.

While some of the changes were purely cosmetic, I also changed the URLs for all my blog posts.

The new URLs are the bit that caused the posts to pop up as new in at least Feedburner and Google Reader. The change removed the dates from the URLs themselves and replaced all the underscores with hyphens.

I removed the dates because they just looked ugly compared to the WordPress style of using directories for the year and month. I don’t use WordPress, but I decided that if I was going to mess with all my URLs I might as well change to nicer-looking ones while I was at it.
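As a made-up example (these aren’t my actual URLs), the change was along these lines:

    Before: http://www.example.com/blog/2008_02_11_underscores_vs_hyphens
    After:  http://www.example.com/blog/underscores-vs-hyphens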

If you do some searching for “hyphen vs underscore in URLs” using your favourite search engine you’ll find a bunch of writing, with the general wisdom falling on the side of hyphens. In fact, as far back as 2005 Matt Cutts, an engineer at Google, blogged about it [1].

So why, you might ask, did I use underscores? Well. Ummmmm, “’cause it’s what I’ve always done” is the only answer I’ve got.

A bit more searching around told me that the results, for Google at least, are apparently different between the two methods: underscores cause a URL to be treated as a phrase, while hyphens are more likely to result in search results for the individual words in the URL.

This sounded like something worthy of some experimentation, so, wearing my best white lab coat, I created some pages on a few different sites I look after that were not linked from the navigation but were listed in the XML sitemaps.
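For anyone unfamiliar with them, a sitemap entry is about as simple as XML gets; the test pages only appeared there, not in any navigation (this is a made-up entry, not one of the real test URLs):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/some-mispelt-keywrds</loc>
      </url>
    </urlset>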

I mixed and matched the URLs with underscores and hyphens and used some misspelt words and phrases. There were 48 pages in total, spread over eight domains, all of which were visited by Googlebot a number of times over an eight-week period.

I had a split of twelve pages with hyphens and matching content, twelve with hyphens and unmatched content, and the same split with underscores. Where the content matched I used the same misspelling of the words to get an idea of how well it worked. All of the sites have good placement for long-tail searches on their general content and get spidered regularly.

The result was that most of the hyphenated-URL pages without matching keywords in the content or tags were indexed against individual words in the URL (eight out of twelve). All of the pages with hyphenated URLs and matching keywords in the content were indexed against those words.

The pages with underscores and non-matching content didn’t fare so well. Only four of the twelve got indexed against individual words in the URL, although nine were indexed against long-tail phrases from the URLs. Pages with underscores and matching content ranked lower for keywords in the URL than the hyphenated ones, although that’s not an accurate measure as they were misspelt words on pages with no backlinks.

So, end result: the common wisdom of using hyphens would appear to be valid and helpful if you’re running a site where long, keyword-rich URLs make sense and the strength of the individual keywords might be more valuable than the phrase.

If you’re going for long-tail search results in a saturated market where single-keyword rank is hard to gain, you might want to mix it up a little and try some underscores; it certainly can’t hurt to try.

One thing to note for those not familiar with why this is even an issue: spaces are not valid in URLs under the standard, although they are common on poorly or lazily built websites. If you’re really bored you can read the original spec by Tim Berners-Lee from back in 1994 [2], or the updated version from 2005, also co-authored by Mr Berners-Lee [3].

The long and short of it, in this context, is that you can use upper and lower case letters, numbers, hyphens, underscores, full stops and tildes (‘~’). Everything else is either reserved for a specific function, or not valid and requires encoding. A space should be encoded as ‘%20’, and you can probably imagine how well that looks when trying to%20read%20things.
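If you’re building links in PHP, for instance, rawurlencode() handles the escaping for you; a quick sketch (the strings are made up):

    <?php
    // rawurlencode() leaves letters, digits, hyphens, underscores
    // and full stops alone, and percent-encodes the rest,
    // so a space becomes %20.
    echo rawurlencode('my blog post'), "\n";      // my%20blog%20post
    echo rawurlencode('hyphens-are-fine'), "\n";  // hyphens-are-fine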

If you type a URL with a space into your browser, the browser converts it to ‘%20’ before sending it down the pipe for you. You sometimes see these encoded URLs with not just spaces but other random things in them, and they can be the cause of random behaviour for some websites and software, so it’s best to avoid odd characters in your URLs.

Apologies again if you got some duplicates in your RSS reader over the last few hours. I’ll try not to do that again, and it’ll be interesting to see whether a couple of pages that Google was ignoring with underscores get indexed now.

References:

[1] Matt Cutts’ blog post from 2005: http://www.mattcutts.com/blog/dashes-vs-underscores/
[2] The 1994 spec for URLs (RFC 1738): http://www.ietf.org/rfc/rfc1738.txt
[3] The 2005 update to the URL spec (RFC 3986): http://www.ietf.org/rfc/rfc3986.txt

Javascript Compression with Apache 2 and Debian Etch

If you’re trying to wring every last drop of performance out of a website, you probably want to compress all your content before it hits the wire. While messing about with another project I noticed that the JavaScript from this blog wasn’t being compressed.

If you just want the solution to the issue, skip to the bottom of this post, but for those interested in the finer detail, read on.

This site uses Apache 2 on Etch, and after a bit of Googling I didn’t find a direct mention of this issue, so I thought I’d slap it up here for other folks afflicted with uncompressed JavaScript.

The first step is to enable mod_deflate, which will by default compress HTML, XML, CSS and plain-text files but, due to a config quirk, won’t catch your JavaScript.

The command to enable mod_deflate is ‘a2enmod deflate’. If you’re on shared hosting without this configured you’ll have to drop your support folks an email, although I’d think most shared hosting companies would be well on top of mod_deflate, as it saves them money!
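For the record, the whole dance on Debian is just this (assuming root and the stock Apache 2 packages):

    a2enmod deflate
    /etc/init.d/apache2 reload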

The reason JavaScript is not compressed by default is that the MIME type specified for JavaScript in /etc/apache2/mods-available/deflate.conf is ‘application/x-javascript’.

Apache doesn’t know what extension is associated with this MIME type. The MIME types Apache uses are defined in /etc/apache2/mods-available/mime.conf, which includes /etc/mime.types.

/etc/mime.types, in turn, associates the .js extension with application/javascript, not application/x-javascript.
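From memory, the relevant lines look roughly like this; check the files on your own box, as the exact lists vary:

    # /etc/apache2/mods-available/deflate.conf (roughly)
    AddOutputFilterByType DEFLATE text/html text/plain text/xml application/x-javascript text/css

    # /etc/mime.types (roughly)
    application/javascript          js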

To fix this up you’ve got a few options:

  • Change /etc/mime.types to map .js to application/x-javascript, which might break other applications that include that file.
  • Change /etc/apache2/mods-available/deflate.conf to use application/javascript.
  • Add the MIME type ‘application/x-javascript’ to /etc/apache2/mods-available/mime.conf with the line ‘AddType application/x-javascript .js’, which is the option I took.

After adding the line, do a ‘/etc/init.d/apache2 reload’ and you’re in business: all .js files leaving your server will be compressed, provided the browser reports that it accepts compressed files.
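You can check it’s working with something like curl, looking for a ‘Content-Encoding: gzip’ header in the response (substitute a real .js URL from your own site):

    curl -s -D - -o /dev/null -H 'Accept-Encoding: gzip' \
        http://www.example.com/scripts/some-script.js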

You could of course pre-compress the files and serve them with .gz extensions, or control specific compression rules using a .htaccess file, but I wanted server-wide compression without needing to configure each site on the server individually.
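For the curious, the per-site .htaccess version would be something along these lines (a sketch, and it assumes your AllowOverride settings permit it):

    # hypothetical .htaccess for a single site
    <IfModule mod_deflate.c>
        AddOutputFilterByType DEFLATE application/x-javascript
    </IfModule>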

Why is it so hard to think up a good password?

I’ve been working in IT for a wee while now, a shade over 20 years even, and in all that time there has been one consistent thread of frustration nibbling away at my very sanity: trivial passwords.

I’m sure this isn’t just me going nuts here; there must be thousands of network administrators and webmasters going quietly bonkers all over the planet right at this very moment.

We slave away with intimate pride in our collective nerdiness, building robust and secure IT systems for all to behold, fussing and fettling over minute details to ensure the ever-important data is safe.

An unfortunate side effect of this creative journey is a necessary evil, the agent by which all things great in computing are undone. I’m not referring to the trivial password here, but to that which spawns it: the user.

“Do I have to put a number in my password, eight characters, really?”

I’m getting chills just typing that sentence.

A quick Google for ‘most common passwords’ reveals the painful truth: 123456 really is a very common password, as is the word itself… ‘password’.

Where am I going with all this? I use complex passwords. I love them the way tatting enthusiasts love a good yarn. (See what I did there? No? Look it up…)

I used to use an online password generator, but the owner of the website decided to put pop-up ads on the page, so that every time you refreshed it you got another ad. ARGH.

If you were looking for something particular in your random password it could take ages, what with all the popping, closing and refreshing going on. Popup ads are second only to trivial passwords in their evil nature.

Fresh from a particularly annoying bout of popping, closing and refreshing last week, I set about creating my own random password generator, which is now online for all to use.

It uses the PHP rand() function seeded from microtime(), which in lay terms means that in theory it can generate a different password every microsecond. Of course, if you are a lay person you probably don’t care, and you’re using 123456 as your password.
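The guts of it amount to something like this; a minimal sketch of the idea, with a made-up character pool and length, rather than the actual code behind the generator:

    <?php
    // A sketch of the approach, not the real generator code.
    // Seed rand() from microtime() so two requests in the same
    // second still get different sequences.
    list($usec, $sec) = explode(' ', microtime());
    srand((int) ($sec + $usec * 1000000));

    // The pool and length here are made up; the real generator
    // has its own options.
    $pool   = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!@#$%^&*';
    $length = 12;

    $password = '';
    for ($i = 0; $i < $length; $i++) {
        $password .= $pool[rand(0, strlen($pool) - 1)];
    }

    echo $password, "\n";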

That, in a nutshell, is it. Enjoy your randomly generated passwords.