SEO Blog - Internet marketing news and views  

What you need to know about phrase based optimization

Written by David Harry   
Thursday, 27 August 2009 14:24

We’ve almost got the whole collection…woo hoo!

Ok sure, there are those that called me a total SOSG for getting excited about the patent awarded to Google yesterday, but there is goodness in it for every SEO really. Granted it may be a bit geeky, but it is still something that needs more consideration in the SEO space... You see my friends, it has always been an odd thing that people in the business can go on and on about Google and LSI (total DUH…). I have seen nothing in Google research papers or patents that relates to it…

Understanding Phrase Based IR

On the other hand, we know of at least 9 patents on another semantic analysis approach… namely; Phrase Based IR


I’ve been on about it for a few years as you can see with these;

On my company site;

Here on the Trail;

The latest link in the chain

The reason I am once more mentioning it is that last time out we knew of two patent filings that had not yet surfaced – that changed yesterday. Google was awarded one of the missing elements;

Phrase-based personalization of searches in an information retrieval system
Filed July 26 2004
Assigned; August 25 2009

Here are the others;

Phrase Identification in an Information Retrieval System,
Filed on Jul. 26, 2004;
Assigned; Jan 26 2006
Phrase-Based Generation of Document Descriptions,
Filed on Jul. 26, 2004;
Assigned; Jan 26 2006
Phrase-Based Searching in an Information Retrieval System,
Filed on Jul. 26, 2004;
Assigned; Feb 09 2006
Automatic Taxonomy Generation in Search Results Using Phrases,
Filed on Jul. 26, 2004;
Assigned; Sept 16 2008
Phrase-based indexing in an information retrieval system
Filed on Jul. 26, 2004;
Assigned; Sept 30 2008


Then in 2005 came a continuation;
Multiple index based information retrieval system
Filed on Jan 25, 2005;
Assigned; May 18 2006

And in June 2006 came
Detecting spam documents in a phrase based information retrieval
Filed on June 28 2006;
Assigned; Dec 28 2006


The only other one I know of that hasn’t been assigned is;

Phrase-Based Detection of Duplicate Documents in an Information Retrieval System,
filed on Jul. 26, 2004, ?? (it is referenced in the others)



The personalization concept

From what we’ve learned about PaIR, it essentially looks at optimal rates of related phrases and concepts on seed documents (pages deemed relevant by Google raters etc…) and uses that as a benchmark for ranking subsequent documents the engine comes across. That’s the easiest version I can give you… read some of the earlier posts for more…
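To make that a bit more concrete, here is a toy sketch of the benchmark idea. It is not the method from the patents; the function names, the phrase list and the scoring formula are all my own assumptions for illustration. It counts how often a list of related phrases shows up in a few seed documents, averages those rates into a benchmark, then scores another page by how close its rates come to that benchmark.

```python
import re

def phrase_rates(text, phrases):
    """Occurrences of each phrase per 100 words of text."""
    lowered = text.lower()
    n_words = max(len(re.findall(r"\w+", lowered)), 1)
    rates = {}
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        rates[phrase] = 100.0 * len(re.findall(pattern, lowered)) / n_words
    return rates

def benchmark(seed_docs, phrases):
    """Average rate of each phrase across the seed documents."""
    per_doc = [phrase_rates(doc, phrases) for doc in seed_docs]
    return {p: sum(r[p] for r in per_doc) / len(per_doc) for p in phrases}

def closeness(doc, bench, phrases):
    """1.0 if the doc matches the benchmark rates, lower for under- or over-use."""
    rates = phrase_rates(doc, phrases)
    total = 0.0
    for p in phrases:
        high = max(bench[p], rates[p])
        total += 1.0 if high == 0 else min(bench[p], rates[p]) / high
    return total / len(phrases)

# Hypothetical phrases and seed pages, purely for illustration.
phrases = ["dog food", "puppy training"]
seeds = ["dog food and puppy training tips",
         "best dog food for puppy training"]
bench = benchmark(seeds, phrases)
```

Note that a page repeating one phrase over and over scores lower than a seed-like page here, which is the point: it is about matching a natural rate, not maximizing density.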

Now, a proposed personalization layer would use many of the signals from standard personalization, and then look deeper at the documents you’ve accessed to find optimal phrase relations. The search results would then be adjusted not merely from seed sets, but from documents that the user has expressed an (implicit) interest in previously (although explicit signals could also be used; think Google SearchWiki).


While geeky, this kind of highlights it;

“(…) one or more second related phrases that are related to the second phrase(s) of the query and that are present in the user model; weighting a plurality of scores of a corresponding plurality of the search results according to the cluster counts of the identified one or more second related phrases; ranking the plurality of the search results for presentation to the user according to their weighted scores, to provide personalized search results; and presenting the personalized search results to the user.”
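Stripped of the patent-speak, the weighting and ranking steps in that passage might look something like the sketch below. This is a simplification under my own assumptions: the user model is a plain dictionary of phrase to cluster count, the `boost` factor and the multiplication formula are invented, and it skips the step of first identifying which phrases relate to the query.

```python
def personalize(results, user_model, boost=0.1):
    """Re-rank results using cluster counts from a user model.

    results:    list of (url, base_score, phrases_in_doc) tuples
    user_model: dict mapping a related phrase to its cluster count
    boost:      hypothetical scaling factor, not from the patent
    """
    weighted = []
    for url, base_score, phrases in results:
        # Sum the cluster counts of phrases found in both doc and user model.
        count = sum(user_model.get(p, 0) for p in phrases)
        weighted.append((url, base_score * (1.0 + boost * count)))
    # Highest weighted score first.
    return sorted(weighted, key=lambda item: item[1], reverse=True)

# Hypothetical data: page "b" starts lower but matches the user's interests.
results = [("a", 1.0, {"blue moon"}),
           ("b", 0.9, {"lunar eclipse", "full moon"})]
user_model = {"lunar eclipse": 3, "full moon": 2}
ranked = personalize(results, user_model)
```

With an empty user model the original order is kept, so the personalization only kicks in where there is some history to draw on.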


Once more, understanding current personalization schemes can be important to getting the full grasp, but here’s a few potential signals mentioned in this filing;

  • A document printed by the user.
  • A document saved by the user.
  • A document stored as a favourite or link.
  • A document sent by email by the user.
  • A document maintained in a browser window for a predetermined amount of time.
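For the geeks, here is one way those access signals could feed a simple user model. Again, this is only a sketch: the signal labels, the event format and the 30 second dwell threshold are assumptions of mine, and the filing does not spell out an implementation like this.

```python
from collections import Counter

# Hypothetical labels for the access signals listed above.
ACCESS_SIGNALS = {"printed", "saved", "bookmarked", "emailed", "dwelled"}

def build_user_model(events, min_dwell_seconds=30):
    """Count, per phrase, how many accessed documents contained it.

    events: iterable of dicts with 'signal', 'phrases' and, for
            'dwelled' events, an optional 'seconds' value.
    """
    model = Counter()
    for event in events:
        if event["signal"] not in ACCESS_SIGNALS:
            continue
        # Only count a dwell if the page stayed open long enough.
        if event["signal"] == "dwelled" and event.get("seconds", 0) < min_dwell_seconds:
            continue
        model.update(set(event["phrases"]))
    return model

# Hypothetical browsing history.
events = [
    {"signal": "printed", "phrases": ["lunar eclipse", "full moon"]},
    {"signal": "dwelled", "seconds": 5, "phrases": ["blue moon"]},
    {"signal": "emailed", "phrases": ["lunar eclipse"]},
]
model = build_user_model(events)
```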


There are plenty of other potential implicit and explicit signals to be had… but you get the idea… Since a picture is worth 1000 words... here's a few thousand more on it...

Phrase based IR graphics
Phrase based IR graphics 2


Phrase based IR graphics 3

Understanding phrase based optimization

Of course you may be asking; ‘Great Davey baby… but how should I use this?’ – no worries, there are many ways. Essentially when creating content (or crafting link building programs even) you want to bear in mind that it isn’t the old days of KW stuffing and density… which is crap anyway. You want to start training yourself to be crafting not only core/secondary and stemming terms, but also collecting a list of ‘related phrases’ as well.


Here’s what you do…

  • While doing the KW research we’ll use various tools to also create a list of ‘related phrases’
  • Lay out the content program and structural hierarchy
  • Map out terms to pages
  • Give your writers not only core/secondary target terms, but related phrases as well.
  • Review and tweak pages prior to launch
  • Vary link texts when possible and remember themes/concepts as well as KWs
  • Understand the relations of concepts when actively link building
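If you like to keep this sort of thing organized, the mapping and review steps above can be sketched in a few lines. The `PageBrief` structure and the simple substring check are just my assumptions of how you might track it; a spreadsheet does the same job.

```python
from dataclasses import dataclass, field

@dataclass
class PageBrief:
    """What a writer gets for one page: target terms plus related phrases."""
    url: str
    core_terms: list
    related_phrases: list = field(default_factory=list)

def missing_phrases(brief, draft_text):
    """Return the target terms and related phrases not yet in the draft."""
    # Lowercase and collapse whitespace so line breaks don't hide a phrase.
    text = " ".join(draft_text.lower().split())
    wanted = brief.core_terms + brief.related_phrases
    return [p for p in wanted if p.lower() not in text]

# Hypothetical page and draft for a pre-launch review.
brief = PageBrief("/dog-food", ["dog food"], ["grain free", "puppy nutrition"])
todo = missing_phrases(brief, "Our dog food is grain\n free.")
```

The check is a plain substring match, so it won't catch stemmed or reworded variants; treat the output as a reminder list for the review pass, not a score.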


Essentially you’re creating a picture with words. You can find some more usable stuff in this 2007 post on relevance in SEO (though I’ve evolved the concepts now) and even use your imagination with this one on diversifying link profiles (from last summer).

There’s even a guest post on me mate Steven’s site (aka VanGogh).

We’ll get back to possible applications… later.



Why am I still on about this??


At the risk of sounding redundant or obsessed, I thought we’d go at it one more time. You see, it seems massively stunning that no one has really discussed these patents and approaches in the greater SEO community… Sure, patents are really not much to go on, but I still hear about LSI… soooo… Some have even somehow confused these patents (from Anna Patterson) with those related to Applied Semantics LSI… So it’s an oddity at best.


What is more interesting than the fact there was such a great interest from Google, is the fact that a PaIR approach has so much potential beyond mere semantic analysis. The patents also cover Spam Detection, Duplicate Content, Link valuations and Personalization. Much like the not-so-covered Personalized PageRank, these things need to find their way into the greater consciousness of the community.


It seems sensible to progress beyond talk of the unknown. Ever tried to sort all of it out and isolate any one ranking signal? Having done it a few times… it’s a heck of a challenge. Ultimately, it can be entirely futile in that we’ve not established ALL the signals which need to be isolated. We can only account for that which we know is there… if we’re not looking for potential signals and weights…well…


Do you see where this is going?



And so with only one more of these out there… ‘tis doubtful you’ll have to bear with me on this path much more. As always, thanks for letting me go on; just looking for new (old?) things to talk about…


Here’s some more coverage from over the years;

Early Search Engine Watch mention
Article by Bill Slawski
Article on Search Engine Land
Brother Mad Hat

On the boards -

WebMaster World thread
Musings on SEO Roundtable

