SEO Blog - Internet marketing news and views  

Google Blog Search; under the hood

Written by David Harry   
Tuesday, 24 August 2010 13:44

A look at how they see your blog

Have you ever wondered how Google Blog Search works? Considered which elements might differ from, or carry over from, regular organic search? Well, a patent recently awarded to Google got me thinking about digging deeper into it all.

Now, as is always the case with patents, we must consider that this was filed in 2005 (as was the sister patent which was awarded in 2007). That means that much of this was likely already in place at that time. But it still makes for some interesting reading towards understanding how Google has gone about the world of Blog search.

The patents in question are;

Indexing and retrieval of blogs
Filed Sept. 13 2005 – Awarded July 27th 2010

Ranking of blog documents
Filed Sept. 13 2005 – Awarded March 15 2007


 

Elements they look for in a blog

The first bits we want to look at are some of the areas that Google looks at when considering blogs. These can include;

  • A Rich Site Summary (RSS) or Atom feed
  • Actual content of post
  • Title of the blog
  • Title of the post
  • Author or profile of the author
  • Blog Roll
  • Date for posts or updates (temporal relevance)
  • Geographic information relating to an author

And considering that the feed is generally the first line of contact, satisfying as many of these elements as possible would be preferable. This includes the date and author name(s). We can also see the importance of this in Google's recent moves into the world of named entities (person, place or thing) with the purchase of Metaweb.

Now it is interesting that, as part of the actual indexation process, they look for relevance not only at the posts on a blog but at the blog itself (in its entirety).

“a method may include receiving a feed; fetching a blog and one or more posts associated with the feed; extracting information from the feed, the blog, and one or more posts”
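To make the 'extracting information from the feed' step a little more concrete, here is a minimal sketch of pulling the elements listed earlier (blog title, post title, author, date, snippet) out of an RSS 2.0 feed. The sample feed, the function name `extract_feed_signals` and the returned field names are my own assumptions for illustration; the patent does not describe an implementation.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a real one would be fetched from the blog.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example SEO Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Blog search under the hood</title>
      <link>http://example.com/blog-search</link>
      <author>jane@example.com (Jane Doe)</author>
      <pubDate>Tue, 24 Aug 2010 13:44:00 GMT</pubDate>
      <description>A look at how a blog search engine might see your blog.</description>
    </item>
  </channel>
</rss>"""


def extract_feed_signals(feed_xml):
    """Collect blog-level and post-level fields from an RSS 2.0 document."""
    channel = ET.fromstring(feed_xml).find("channel")
    posts = []
    for item in channel.findall("item"):
        pub_date = item.findtext("pubDate")
        posts.append({
            "post_title": item.findtext("title"),
            "author": item.findtext("author"),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "snippet": item.findtext("description"),
        })
    return {"blog_title": channel.findtext("title"), "posts": posts}


signals = extract_feed_signals(SAMPLE_FEED)
print(signals["blog_title"], "/", signals["posts"][0]["post_title"])
```

A feed that leaves out the author or date simply yields `None` for those fields here, which is a fair picture of why filling them in matters.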

They also discuss looking at secondary sources for relevance as well.

“determine a relevance of a blog or a blog post to the search query based on information extracted from the blog or blog post and information extracted from at least one other source, and provide information relating to the blog or the blog post when the blog or the blog post is determined to be relevant to the search query.”

The interest in feeds and 'hybrid' documents (for internal assessment) seems to imply that the feed snippet becomes the first line of decision making on the relevance of a blog/post for a given query. This may mean that newer blogs, ones Google might have less information on, would be better served by a full feed or at least expanded snippets. We can also infer that there is a premium on the first few paragraphs, so make sure they are relevant to the target query spaces.

Another aspect that plays into the aforementioned 'named entities' is ensuring there is a link in the post to an author bio. This is further supported by how Google News operates, making this an important aspect for any blog.

 

Ranking the results

I do get a kick sometimes from older patents, and this one is a case in point. Instead of talking about 'ranking factors/signals' or 'weighting' they speak of 'IR Scores'. Some of the ones mentioned in the more recent patent include;

  • Number of occurrences of the search term (KW density...eeew)
  • Where the terms occur in the document (title, content etc.)
  • Characteristics of the term (font size, colour etc.)
  • Additional scoring for multi-term queries
  • Links to post
  • Links to blog (global)
  • Location (for geo-localized queries)

It does bear noting that they look at both the page and the blog in aggregate when deciding what to return for a given query. This is a bit of a divergence from the normal approach back in the day (IR on a page-by-page level). Is this where 'global' site scoring was first implemented? Curious indeed.

And that last part on location is something that bloggers who are local in topical nature should note. It seems 'local SEO' isn't just for businesses. If you have a geo-centric blog it would be wise, as with any local targeting, to have your location in your blog author profile.
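As a rough illustration of scoring the post and the blog together, here is a minimal sketch that counts term occurrences, weights them by where they occur, adds a bonus for multi-term queries and then blends in a blog-level score. Every weight and name here (`FIELD_WEIGHTS`, `BLOG_WEIGHT`, the 1.5 bonus, `ir_score`) is an assumption of mine; the patent lists the factors but not how they are combined.

```python
import re

# Assumed weights: a hit in the title counts more than a hit in the content.
FIELD_WEIGHTS = {"title": 3.0, "content": 1.0}
# Assumed share of the blog-level (global) score blended into a post's score.
BLOG_WEIGHT = 0.25
# Assumed bonus when every term of a multi-term query is found.
MULTI_TERM_BONUS = 1.5


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def field_score(query_terms, fields):
    """Sum weighted term occurrences across the fields of one document."""
    score = 0.0
    matched = set()
    for name, text in fields.items():
        tokens = tokenize(text)
        for term in query_terms:
            count = tokens.count(term)
            if count:
                matched.add(term)
                score += FIELD_WEIGHTS.get(name, 1.0) * count
    if len(query_terms) > 1 and matched == set(query_terms):
        score *= MULTI_TERM_BONUS
    return score


def ir_score(query, post, blog):
    """Score a post for a query, with a contribution from the whole blog."""
    terms = tokenize(query)
    return field_score(terms, post) + BLOG_WEIGHT * field_score(terms, blog)


post = {"title": "Blog search patents", "content": "Google blog search indexes feeds."}
blog = {"title": "SEO Blog", "content": ""}
print(ir_score("blog search", post, blog))
```

Note that this counts raw occurrences only. Links, location and the term characteristics from the list above would be further inputs, and I have left them out to keep the sketch short.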

Assessing Quality

Beyond the above, the first patent (covered by Bill here) talks about a quality assessment, or what he called the 'quality score' (not to be confused with the AdWords one). The scoring here can include;

  • Popularity – legitimate subscribers (kinda makes sense why they purchased FeedBurner huh?)
  • Implied popularity – click data from past queries. While more reliable than organic, it is still somewhat noisy due to click bias.
  • Inclusion in blog rolls – popularity in aggregate, and also inclusion in high-quality blogrolls (sort of a TrustRank for blogrolls concept).
  • Tagging – giving a stronger 'signal' as to the core concepts of a post/blog
  • References – including chat and email (one can also infer non-link citations)
  • PageRank – and of course the link profile for a blog/post.

Also in the ranking patent they talked about possible 'negative indicators' or dampening factors that can be used including;

  • Frequency – spammers will have many posts in a short period of time or at specific intervals (3hrs 32min).
  • Content – as discussed earlier, differences from feed to content (e.g. ads/affiliate links in the actual post) can signal potential spammers
  • Words/Phrases – that evaluators have deemed to be common in spam could be flagged. You know, like 'buy' and 'viagra' ;0)
  • Post length – many times spammers have posts that are similar in length. This can be cause for dampening
  • Links – if many or most of the links in a post/blog are pointed to one or few domains, this can be a sign of spam.
  • Ads – when there are a lot of ads in a post, this can also be a sign of spam.

What is important with the dampening factors, as with any, is that they tend to work cumulatively. Don't go panicking if you satisfy one of the above factors. Ok? Thanks...
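To show what 'cumulative' could look like in practice, here is a minimal sketch in which each triggered negative indicator multiplies the score by a factor below 1.0. The factor values, the indicator names and the `link_concentration` helper are assumptions for illustration only; the patent names the indicators but gives no numbers.

```python
from collections import Counter

# Hypothetical multipliers, one per negative indicator; values below 1.0 reduce the score.
DAMPENERS = {
    "burst_or_interval_posting": 0.9,
    "feed_content_mismatch": 0.8,
    "spam_phrases": 0.7,
    "uniform_post_length": 0.9,
    "concentrated_links": 0.8,
    "heavy_ads": 0.85,
}


def link_concentration(link_domains):
    """Share of outbound links that point at the single most-linked domain."""
    if not link_domains:
        return 0.0
    top_count = Counter(link_domains).most_common(1)[0][1]
    return top_count / len(link_domains)


def dampened_score(base_score, triggered):
    """Multiply the base score by the factor of every triggered indicator."""
    factor = 1.0
    for name in triggered:
        factor *= DAMPENERS[name]
    return base_score * factor


one_flag = dampened_score(10.0, ["heavy_ads"])
three_flags = dampened_score(10.0, ["heavy_ads", "spam_phrases", "concentrated_links"])
print(one_flag, three_flags)
```

With these made-up numbers a single flag costs 15% of the score, while three flags together cut it by roughly half, which is the point: one indicator alone is a nudge, several at once add up.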

 

What does it mean to you?

Well, as with any of these patents, it should all be taken with a grain of salt. And truly, much of what we've learned along the way are things we already knew. Or at least had a feeling was the case.

I think some of the spam related bits are always important to know because one does want to be able to play it safe. We certainly don't want to have dampeners (scoring reductions) on the blog because of minor infractions. I'd be sure to consider those moving forward.

The last bit of interest worth noting is that many of the same rules apply to targeting blog search as to the regular index. I would have to imagine that, if not yet, then some time soon social signals via the social graph (mixed with personalization) will also be playing a role. Also, if you haven't looked into PubSubHubbub and FeedBurner yet, get on it.

If I run into any more interesting Google Blog Search patents or research papers, I shall definitely update this post... so be sure to bookmark it.

 

Comments  

 
0 # Michael Martinez 2010-08-27 05:22
"Number of occurrences of the search term (KW density...eeew)"

Nope. NOT keyword density. KD = Number of occurrences of Keyword / Number of words in document.

I think some people labeled "Number of occurrences of the search term" as Keyword FREQUENCY but don't hold me to that.
 
 
0 # Dave 2010-08-27 14:06
@Michael - holy crap man, where ya been? And yes, yer right there, 'frequency' might be a better term since I've really seen very little on ratios in any paper/patent. I guess my point was more towards what a weak signal that can be. Fortunately this is an older filing and it is unlikely to carry much weighting at this point. In truth most of this has evolved, I just like documenting the path over time. Always interesting to see where they are thinking.
 
