SEO Blog - Internet marketing news and views  

What you need to know about phrase based optimization

Written by David Harry   
Thursday, 27 August 2009 14:24

We’ve almost got the whole collection…woo hoo!

OK, sure, there are those who called me a total SOSG for getting excited about the patent awarded to Google yesterday, but there is goodness in it for every SEO, really. Granted, it may be a bit geeky, but it is still something that needs more consideration in the SEO space... You see, my friends, it has always struck me as odd that people in the business can go on and on about Google and LSI (total DUH…). I have seen nothing in Google research papers or patents that relates to it…

Understanding Phrase Based IR

On the other hand, we know of at least 9 patents on another semantic analysis approach… namely, Phrase Based IR.


I’ve been on about it for a few years as you can see with these;

On my company site;

Here on the Trail;

The latest link in the chain

The reason I am once more mentioning it is that last time out we knew of two patent filings that had not yet surfaced – that changed yesterday. Google was awarded one of the missing elements;

Phrase-based personalization of searches in an information retrieval system
Filed July 26 2004
Assigned; August 25 2009

Here are the others;

Phrase Identification in an Information Retrieval System,
Filed on Jul. 26, 2004;
Assigned; Jan 26 2006
Phrase-Based Generation of Document Descriptions,
Filed on Jul. 26, 2004;
Assigned; Jan 26 2006
Phrase-Based Searching in an Information Retrieval System,
Filed on Jul. 26, 2004;
Assigned; Feb 09 2006
Automatic Taxonomy Generation in Search Results Using Phrases,
Filed on Jul. 26, 2004;
Assigned; Sept 16 2008
Phrase-based indexing in an information retrieval system
Filed on Jul. 26, 2004;
Assigned; Sept 30 2008


Then in 2005 came a continuation;
Multiple index based information retrieval system
Filed on Jan 25, 2005;
Assigned; May 18 2006

And in June 2006 came
Detecting spam documents in a phrase based information retrieval
Filed on June 28 2006;
Assigned; Dec 28 2006


The only other one I know of that hasn’t been assigned is;

Phrase-Based Detection of Duplicate Documents in an Information Retrieval System,
filed on Jul. 26, 2004, ?? (it is referenced in the others)



The personalization concept

From what we’ve learned about Phrase Based IR (PaIR), it essentially looks at the optimal rates of related phrases and concepts on seed documents (pages deemed relevant by Google raters etc…) and uses that as a benchmark for ranking subsequent documents the engine comes across. That’s the easiest version I can give you… read some of the earlier posts for more…
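To make the seed-set idea a bit more concrete, here is a toy sketch in Python. It is a simplification for illustration only, not the method claimed in the patents: the sample documents, the phrase list and the function names (`phrase_rates`, `benchmark_score`) are all assumptions, and plain substring matching stands in for real phrase identification.

```python
from collections import Counter


def phrase_rates(seed_docs, phrases):
    """Return the fraction of seed documents that contain each phrase.

    Phrases are expected in lowercase; documents are lowercased before matching.
    """
    counts = Counter()
    for doc in seed_docs:
        text = doc.lower()
        for phrase in phrases:
            if phrase in text:
                counts[phrase] += 1
    return {phrase: counts[phrase] / len(seed_docs) for phrase in phrases}


def benchmark_score(doc, rates):
    """Sum the seed-set rates of every benchmark phrase found in the document."""
    text = doc.lower()
    return sum(rate for phrase, rate in rates.items() if phrase in text)


# Hypothetical seed documents and related phrases, invented for this example.
seed_docs = [
    "The president of the united states lives in the white house.",
    "The white house press office spoke for the president of the united states.",
    "A white house tour is popular with visitors.",
]
related_phrases = ["white house", "president of the united states", "press office"]

rates = phrase_rates(seed_docs, related_phrases)

# Candidate pages are ranked by how well they match the seed-set benchmark.
candidates = {
    "page_a": "White House briefing from the President of the United States",
    "page_b": "Our press office is open on weekdays",
}
ranked = sorted(
    candidates,
    key=lambda name: benchmark_score(candidates[name], rates),
    reverse=True,
)
print(ranked)
```

A page that carries the phrases common across the seed set scores higher than one that carries only a rare phrase, which is the benchmark idea in miniature.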

Now, a proposed personalization layer would use many of the signals from standard personalization, then look deeper at the documents you’ve accessed to find optimal phrase relations. The search results would then be adjusted based not merely on seed sets, but on documents that the user has previously expressed an (implicit) interest in (although explicit signals could also be used; think Google SearchWiki).


While geeky, this kind of highlights it;

“(…) one or more second related phrases that are related to the second phrase(s) of the query and that are present in the user model; weighting a plurality of scores of a corresponding plurality of the search results according to the cluster counts of the identified one or more second related phrases; ranking the plurality of the search results for presentation to the user according to their weighted scores, to provide personalized search results; and presenting the personalized search results to the user.”
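The weighting and ranking steps in that claim language can be sketched roughly as follows. This is a hedged illustration, not the patented implementation: the user model as a plain dict of phrase to cluster count, the `personalize` function and the `base_score * (1 + boost)` formula are all assumptions, since the quote only says scores are weighted according to cluster counts.

```python
def personalize(results, user_model, query_related):
    """Re-rank results by weighting base scores with user-model cluster counts.

    results:       list of (doc_id, base_score, phrases_in_doc) tuples
    user_model:    dict of phrase -> cluster count (hypothetical representation)
    query_related: set of phrases related to the phrases in the query
    """
    # Only related phrases that are also present in the user model carry weight.
    active = {p: user_model[p] for p in query_related if p in user_model}

    weighted = []
    for doc_id, base_score, doc_phrases in results:
        boost = sum(count for p, count in active.items() if p in doc_phrases)
        # Assumed weighting formula: scale the base score by (1 + boost).
        weighted.append((doc_id, base_score * (1 + boost)))

    # Highest weighted score first.
    return sorted(weighted, key=lambda item: item[1], reverse=True)


# Hypothetical data, invented for this example.
user_model = {"link building": 3, "anchor text": 1}
query_related = {"link building", "anchor text", "keyword density"}
results = [
    ("doc1", 0.9, {"keyword density"}),
    ("doc2", 0.5, {"link building", "anchor text"}),
    ("doc3", 0.6, {"anchor text"}),
]

personalized = personalize(results, user_model, query_related)
for doc_id, score in personalized:
    print(doc_id, score)
```

Here the document with the best unpersonalized score drops to last place, because it shares no related phrases with what this user has shown interest in.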


Once more, understanding current personalization schemes can be important to getting the full grasp, but here are a few potential signals mentioned in this filing. A document accessed by the user can be;

  • A document printed by the user.
  • A document saved by the user.
  • A document stored as a favourite or link.
  • A document sent by email by the user.
  • A document maintained in a browser window for a predetermined amount of time.
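As a rough sketch of how access signals like these might feed a user model, the Python below counts related phrases only in documents the user has shown interest in. Everything here is hypothetical: the `AccessEvent` type, the action names, the 30 second dwell threshold and the substring matching are assumptions made for illustration, not details from the filing.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AccessEvent:
    text: str                   # document content
    action: str                 # "printed", "saved", "bookmarked", "emailed" or "viewed"
    dwell_seconds: float = 0.0  # time the document stayed open in the browser


# Hypothetical stand-in for the "predetermined amount of time".
MIN_DWELL_SECONDS = 30.0
STRONG_ACTIONS = {"printed", "saved", "bookmarked", "emailed"}


def counts_as_interest(event):
    """Decide whether an access event signals implicit interest."""
    if event.action in STRONG_ACTIONS:
        return True
    return event.action == "viewed" and event.dwell_seconds >= MIN_DWELL_SECONDS


def build_user_model(events, phrases):
    """Count how many interesting documents contain each (lowercase) phrase."""
    model = Counter()
    for event in events:
        if not counts_as_interest(event):
            continue
        text = event.text.lower()
        for phrase in phrases:
            if phrase in text:
                model[phrase] += 1
    return model


# Hypothetical access history, invented for this example.
events = [
    AccessEvent("A guide to link building and anchor text", "printed"),
    AccessEvent("Link building basics", "viewed", dwell_seconds=45),
    AccessEvent("Anchor text tips", "viewed", dwell_seconds=5),  # too brief, ignored
]
user_model = build_user_model(events, ["link building", "anchor text"])
print(dict(user_model))
```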


There are plenty of other potential implicit and explicit signals to be had… but you get the idea… Since a picture is worth 1000 words... here's a few thousand more on it...

[Images: Phrase based IR graphics 1, 2 and 3]

Understanding phrase based optimization

Of course you may be asking; ‘Great Davey baby… but how should I use this?’ – no worries, there are many ways. Essentially, when creating content (or even crafting link building programs) you want to bear in mind that these aren’t the old days of KW stuffing and density… which is crap anyway. You want to start training yourself to work with not only core/secondary and stemming terms, but also a collected list of ‘related phrases’.


Here’s what you do…

  • While doing the KW research, use various tools to also create a list of ‘related phrases’
  • Lay out the content program and structural hierarchy
  • Map out terms to pages
  • Give your writers not only core/secondary target terms, but related phrases as well
  • Review and tweak pages prior to launch
  • Vary link texts when possible and remember themes/concepts as well as KWs
  • Understand the relations of concepts when actively link building
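The "map out terms to pages" and "review and tweak" steps above lend themselves to a simple pre-launch check. Here is a minimal sketch in Python, assuming a hypothetical term map (page to core terms and related phrases) and draft copy held as plain strings; the page path, phrases and function names are invented for the example.

```python
def missing_phrases(page_text, phrases):
    """Return the phrases that do not appear in the page text (case-insensitive)."""
    text = page_text.lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


def review(term_map, drafts):
    """For each mapped page, list the target phrases its draft has not used yet."""
    report = {}
    for page, terms in term_map.items():
        draft = drafts.get(page, "")
        report[page] = {
            kind: missing_phrases(draft, phrase_list)
            for kind, phrase_list in terms.items()
        }
    return report


# Hypothetical term map and draft copy, invented for this example.
term_map = {
    "/dog-training": {
        "core": ["dog training"],
        "related": ["positive reinforcement", "puppy classes", "leash"],
    },
}
drafts = {
    "/dog-training": "Our dog training uses positive reinforcement for every breed.",
}

report = review(term_map, drafts)
print(report)
```

The output is just a to-do list for the writer; it says nothing about how often a phrase should appear, so treat it as a coverage check and not a density target.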


Essentially, you’re creating a picture with words. You can find some more usable stuff in this 2007 post on relevance in SEO (though I’ve evolved the concepts since) and even use your imagination with this one on diversifying link profiles (from last summer).

There’s even a guest post on me mate Steven’s site (aka VanGogh).

We’ll get back to possible applications… later.



Why am I still on about this??


At the risk of sounding redundant or obsessed, I thought we’d go at it one more time. You see, it seems massively stunning that no one has really discussed these patents and approaches in the greater SEO community… Sure, patents are really not much to go on, but I still hear about LSI… soooo… Some have even somehow confused these patents (from Anna Patterson) with those related to Applied Semantics LSI… So it’s an oddity at best.


What is more interesting than the great interest shown by Google is the fact that a PaIR approach has so much potential beyond mere semantic analysis. The patents also cover Spam Detection, Duplicate Content, Link valuations and Personalization. Much like the not-so-covered Personalized PageRank, these things need to find their way into the greater consciousness of the community.


It seems sensible to progress beyond talk of the unknown. Ever tried to sort all of it out and isolate any one ranking signal? Having done it a few times… it’s a heck of a challenge. Ultimately, it can be entirely futile in that we’ve not established ALL the signals which need to be isolated. We can only account for that which we know is there… if we’re not looking for potential signals and weights…well…


Do you see where this is going?



And so with only one more of these out there…. ‘tis doubtful you’ll have to bear with me on this path much more. As always, thanks for letting me go on; just looking for new (old?) things to talk about…


Here’s some more coverage from over the years;

Early Search Engine Watch mention
Article by Bill Slawski;
Article on Search Engine Land
Brother Mad Hat

On the boards -

WebMaster World thread
Musings on SEO Roundtable



0 # Brett Pringle 2009-08-27 14:59
I doubt there will be much talk about phrase optimisation, certain secrets are kept inhouse :-) without sharing all the inside secrets to the rest of the community.

"Keyword" optimisation is like link building, there are still people using search engine submissions as link building :-)
0 # Dave 2009-08-27 15:13
lol... I'd not consider much of this to be outside the realm of discussion IMO. Once more, it's merely another possible layer (for semantic analysis) and what's important IS the discussion. I'd say that since these are from 2004 they've indeed been evolved or abandoned altogether. So to me what's important is;

1. Understanding that we can't test without knowing all the potential options that may be in play.

2. Discussing semantic analysis in terms beyond the mythic LSI and looking for ways to better optimize a page thus putting less stress on link spam...erm.. I mean 'building'..hehe

To me these are simply an interesting discussion - one I've been having with other SEOs for years now...

....and admittedly, I do get a bit over excited with this stuff... (don't even get me started on Personalized PageRank)

0 # Brett Pringle 2009-08-27 15:30
lol, fully agree. Find myself getting lost in hours of internal discussions regarding content, text and how SE's evaluate pages/text/sections/page segmentation etc. Definitely need to know what factors are involved when performing tests, otherwise conclusions are not complete :-)
0 # Dave 2009-08-27 15:46
Ain't it great? We do need things to play with other than links... (internal/external) and page segmentation is another very interesting one for sure...I say, the better U nail the on-site/page, the less links... sooooo... woo hoo.. Link building can often be tiresome.

0 # terry van horne 2009-08-27 15:31
David, though they just got the patent IMO, it's been in the algo for a while. Patents are protection of intellectual property. Just wanted to make that point.

As to the community... outside of the dojo... too busy chasing links and doing it the hard way to take the time to consider the complexities of geek SEO. They're too busy following the gurus who are still teaching the same techniques they were in 96. Know because I was one of 'em. ;-)
0 # Dave 2009-08-27 15:50
Yea, I did touch on that in the previous comment...this was 2004 so it's likely nothing more than a history lesson. Adopted or not, I think it can spark more interesting discussions of semantic analysis beyond what we've seen in the community to date.

It does seem funny that things like LSI (as Applied Semantics was purchased around the same time as Anna's joining Google) got more play over the years and this stuff never did... I made up a pretty acronym and

....sigh.... oh well, I still think it's an interesting area of study myself...