As part of the race between search engines to index the new blog site you’re reading now, I’ve noticed some interesting behaviour from Google. (The race)
It’s pretty well known that Google calculates a page rank for every page it indexes, and that the page rank is a complex beast built from an aggregate score of a whole bunch of things: link popularity, keywords, content, meta tags, the phase of the moon.
Millions of words have been written about the mystery box that is Google page rank. All I can say definitively is that it exists, and that SEO ‘experts’ can only guess at exactly how it works, because the people at Google in the know keep their lips tightly sealed under non-disclosure agreements and the pain that only corporate lawyers can inflict.
My observation is about the presentation of content versus the calculation of page rank. These appear to be two separate processes, which leads me to assume that the Google monster keeps its data in at least two separate databases: one for the page rank, link popularity and URL information, and a second for the content, titles and summaries.
For that matter there is probably a third, holding the all-important keyword index with its magical mix of synonym and phonetic matching that makes Google far more useful than its competitors, or at least in my humble opinion.
I’m making this all up on the basis of the changing search results for ‘trash.co.nz’ after I put this site online and submitted the new XML sitemap to Google. See the before and after screenshots below.
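For context, the sitemap I submitted is just the standard sitemaps.org format. A minimal one looks something like this (the URL entry and date values here are illustrative, not my actual file):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page; only <loc> is required -->
  <url>
    <loc>http://trash.co.nz/</loc>
    <lastmod>2008-06-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```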
The ‘after’ shot was taken around 60 hours after the ‘before’. So what made the extra results appear for ‘trash.co.nz’ when they did not appear two and a half days earlier?
The two extra pages are linked from the forum site www.cnczone.com and get 4-5 hits a day from there. Before I put the blog online they held placeholder pages saying that I’d moved the content to another of my sites, www.ohmark.co.nz. The HTML was poorly formed: there was only one link on the page, no meta description tag, and little content of any sort.
Skip forward to now. Those links land on the new CMS and get redirected to the 404 page. The new page has properly structured HTML, a meta description tag and multiple links. So, by my reasoning, the page rank of the page increased, and the links became relevant enough to show in the results for trash.co.nz. Up until I changed the content that was not the case, as the quality of the content on the old pages was quite low.
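By way of illustration, the difference is roughly this: the old holding pages had a near-empty head, while the new CMS emits something along these lines (the title and description text below are made up, not my actual tags):

```html
<head>
  <title>Some unique, descriptive page title</title>
  <!-- The meta description is what Google typically shows as the result snippet -->
  <meta name="description" content="A valid, if slightly silly, one-line summary of the page.">
</head>
```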
So why is the new content not showing in the results? I’ve got a valid, unique title, and an equally valid, if slightly silly, meta description.
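If you want to sanity-check those two things on your own pages, a few lines of Python with the standard library’s html.parser will do it. This is just a sketch, nothing to do with what Google actually runs, and the sample HTML is a stand-in for a real page:

```python
# Pull out the two things Google appears to read first:
# the <title> text and the meta description.
from html.parser import HTMLParser

class HeadChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            d = dict(attrs)
            if (d.get("name") or "").lower() == "description":
                self.description = d.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = (self.title or "") + data

def check_page(html):
    """Return (title, meta description) for a page, or None for anything missing."""
    checker = HeadChecker()
    checker.feed(html)
    return checker.title, checker.description

sample = """<html><head>
<title>Example page</title>
<meta name="description" content="A short, slightly silly description.">
</head><body><p>content</p></body></html>"""

title, desc = check_page(sample)
print(title)  # Example page
print(desc)   # A short, slightly silly description.
```

Feeding it a page that lacks either tag returns None in that slot, which is exactly the state my old holding pages were in.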
That’s where database number two comes in. The quality and ranking of my newly improved pages was recorded by the spider on its first visit, while following the link from www.cnczone.com. That went into database number one, the page rank database we’ll call it, for want of a better term.
At some stage in the next day or so I imagine database number two will be populated by another visit from Googlebot, which will scrape the title and description and update the results. That step will then probably populate database number three with the keywords, which will in turn recursively affect the page rank via link relevancy and the phase of the moon.
Also note that there is no ‘Cached’ link under the pages. I’m assuming the second pass of Googlebot will enable this: even though it had the description and title for a link, the quality of the page was not high enough in the past to warrant caching a copy.
The takeaways from this are:
- HTML quality does matter. If you’re involved in SEO work and didn’t know that, you’ve probably chosen the wrong career.
- HTML quality affects Google’s cache. It doesn’t cache junk pages.
- Googlebot makes multiple passes to build an update to a page. In this case it got the page rank / link quality rank in first, and has not got the content yet.
Now, let’s see who wins the race to index the site fully.