Phrase based WTF?
Often one finds oneself looking in the rear view mirror at topics that just won’t go away. One such shadow for me is a set of patents produced in whole, or in part, by Anna Patterson (former Googler, was to Cuil for ‘em): the now (in)famous Phrase Based IR offerings from Google. I’ve lost count of how many times I’ve discussed them or written about them over the years. They just won’t seem to go away (we’ll get back to that shortly).
The other topic that keeps coming back? Well, that’s what is best known as the SEO Magic Bullet.

There are those that seem to believe in the magic bullet. Then, there are some sane people that passed on the tooth fairy long ago. For the purpose of today’s discussion, we’re going to go back to 2007: The Magic Bullet - A chat with Bill Slawski
An email conversation with grand master Slawski that turned into a post. The gist of the gig was that we shouldn’t treat patents/papers as gospel. Absorb them. Here’s some wisdom from that post:
“The main benefit from looking at patents isn't necessarily seeing the methods that they describe, but rather being able to view the assumptions and the mindsets that they uncover. We can be so absorbed in looking at things from the perspective of marketers, and make up our own folklore and mythology (sandbox, anyone?) that having this other perspective can be really helpful.”
“Patent filings and white papers from search engineers don't necessarily provide a magic bullet, but they do provide the chance to look at information that comes directly from people working in search. To ignore those documents means not taking advantage of publicly available information that gives us a glimpse what those search engines find valuable enough to protect as intellectual property.”
“There are trade secrets that will likely never be disclosed in patent applications. And, the descriptions of processes in patent filings are only examples, and illustrations, that describe enough to protect the intellectual property behind the documents, while not disclosing enough so that they can be easily reverse engineered.” - Bill Slawski
Ok? Get the idea here or what? While it is a great exercise, learning about search engines, some perspective is required. Remember, this ain’t rocket science, it’s computer science.
Patent Pending
One thing that we need to remember is that when a patent hits the streets, that’s simply the award date. We can have a patent awarded today that was submitted back in 2004. Does that mean the search engine was waiting around and WHAM… started implementing it today? …erm… of course not. It has been in patent pending status.
This means it’s quite likely it was at least in some semblance of beta when the patent was written, and was then implemented, morphed and evolved in the years that passed. On the other hand, it may never have been used, or it may have been used and abandoned. Either way, they weren’t waiting around to start implementing the technology/methods.
Which all brings me back to my first redundant shadow; Phrase Based IR.
...Meh...
Was all one o’ me mates had to say when confronted with a recent spate of misconceptions I came across. One of the first technologies that captured my fire and forever geekified me was phrase based IR. Monsieur Slawski introduced us and it was love at first site.
What’s odd though, is that it gets mentioned/rediscovered from time to time and people start to spark it up as an explanation for the oddest things. Witness:
“I'm wondering if Google has made a change in their phrase-based indexing approach - something that the new Caffeine infrastructure makes feasible. Recently there has been more patent activity in that area.” – Google MAYDAY update - Ted via WMW
Hmmmm. Well, the patent in question was filed more than 3 years ago. We also know they had an interest way back in 2004. Obviously bringing the author, Anna, into the fold meant there was great interest. We can also note that in the later one, there were multiple authors (Anna was on the way to Cuil street?)
Would Caffeine help? Sure, if as advertised, it is an infrastructure update. But that could be said for a lot of things (Open HTMM? PLSA? See? I can guess too...sigh). That’s not the point. What happens next is we see:
“I'm still trying to get a handle on some of the odd fluctuations in site metrics attributed to what are undoubtedly bits and pieces of the Caffeine implementation. If you follow Google's patent activity, there's been some interesting recent activity in the area of phrase-based indexing.” – Dave Cosper via SEG
Awww…. Crap. See? This is how it happens. We’ve been down the LSI trail a time or two as well dontcha know. And who can forget the bounce rate fun? This is mis-reported and entirely unprovable at the end of the day. But it does get around. Ok, so it’s a few posts, albeit in authority locales, and it’s not that bad… I mean, it’s not like there’s widespread insanity over the phrase based stuff, right?

Dammit! Dammit! Dammit! DAMMIT!!!! Here we go again...
Slow down the ride! I wanna get off!
This, my weary web wanderer, is where the need to understand the magic bullet theory comes into play. These patents and papers are nothing more than insight. Even if we knew Google used it. Even if we knew there were no other signals. We’d still be lost as we don’t know the weights/thresholds/dampeners in place.
But alas, there are far more signals that we can’t account for in the mix. This makes isolation of any one signal next to impossible for mere mortals. Let us not do the chicken (little) dance, running about stating what Caffeine (an infrastructure thang big daddy) is being driven by. Nor blame the poor algo for wrecking Tom, Dick and Harry’s rankings. It’s grasping at straws. Ok? Thanks. I hope we don't have to do this again (however unlikely).
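For the geeks that like to see it in code, here is a toy sketch of the weights problem. Everything in it is made up for illustration: the three signals, the two sets of weights, the pages and the simple weighted-sum scoring are all assumptions, not anything a search engine is known to do. It only shows that two quite different hidden weightings can produce the exact same visible rankings.

```python
# Hypothetical illustration only: the signals, weights and pages are invented.
# Nothing here reflects how any real search engine scores documents.

def rank(pages, weights):
    """Order page names by a weighted sum of their signal values, best first."""
    def score(signals):
        return sum(weights[name] * value for name, value in signals.items())
    return sorted(pages, key=lambda page: score(pages[page]), reverse=True)

# Three made-up pages, each with three made-up signals in the range 0..1.
pages = {
    "tom":   {"phrases": 0.9, "links": 0.2, "freshness": 0.5},
    "dick":  {"phrases": 0.4, "links": 0.8, "freshness": 0.6},
    "harry": {"phrases": 0.1, "links": 0.5, "freshness": 0.9},
}

# Two different guesses at the hidden weights.
phrase_heavy = {"phrases": 0.7, "links": 0.2, "freshness": 0.1}
balanced     = {"phrases": 0.5, "links": 0.3, "freshness": 0.2}

print(rank(pages, phrase_heavy))  # ['tom', 'dick', 'harry']
print(rank(pages, balanced))      # ['tom', 'dick', 'harry']
```

Same order both times. Looking at the rankings alone, there is no way to tell which weighting produced them, and that is with only three signals and no thresholds or dampeners in the mix.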
Oh, along the way, I also discovered what was surely the catalyst – that Bill guy again. Whaddya know. Well, at least we know that he doesn’t believe in magic bullets. Do you?
Until next time… play safe!
More reading
Here are some other PaIR posts not mentioned here for those interested in learning more;
Blog Posts;
Related Patents:
Phrase Identification in an Information Retrieval System
Filed: Jul. 26, 2004; Assigned: Jan. 26, 2006
Phrase-Based Generation of Document Descriptions
Filed: Jul. 26, 2004; Assigned: Jan. 26, 2006
Phrase-Based Searching in an Information Retrieval System
Filed: Jul. 26, 2004; Assigned: Feb. 9, 2006
Automatic Taxonomy Generation in Search Results Using Phrases
Filed: Jul. 26, 2004; Assigned: Sept. 16, 2008
Phrase-Based Indexing in an Information Retrieval System
Filed: Jul. 26, 2004
Phrase-Based Personalization of Searches in an Information Retrieval System
Filed: Jul. 26, 2004; Assigned: Aug. 25, 2009
Happy to see the return of the magic bullet.
I agree that it's important to keep in mind that patent filings are a good way to get a glimpse into the mindsets of people at the search engines, but might not provide an accurate and actual view of what is going on with the engines themselves.
Another of those first generation phrase-based indexing patents was granted this week, on duplicate content detection in a phrase-based indexing system.
The second generation patents from Google on phrase-based indexing are pretty interesting because they focus upon how such a system could be technically implemented in a very large file system/server system. They also provide some insights into how that system could work that the first generation didn't include.
Regardless of whether or not Google is using such a phrase-based indexing system, the patent's description of the environment in which that system works, including the inverted index, the term (or phrase) posting lists, etc., provide an interesting look at the architecture of a search engine.
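As a rough picture of what an inverted index with phrase posting lists looks like, here is a toy sketch. It is not the method described in the patents: the function name `build_phrase_index`, the sample documents, and the choice to treat every one- and two-word sequence as a "phrase" are all assumptions made purely for illustration.

```python
from collections import defaultdict

def build_phrase_index(docs, max_len=2):
    """Map every word sequence of up to max_len words to a posting list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(doc_id)
    # Sorted lists of doc ids stand in for posting lists.
    return {phrase: sorted(ids) for phrase, ids in index.items()}

# Made-up documents keyed by a made-up doc id.
docs = {
    1: "phrase based indexing in search",
    2: "search engine patents",
    3: "phrase based search",
}

index = build_phrase_index(docs)
print(index["phrase based"])  # [1, 3]
print(index["search"])        # [1, 2, 3]
print(index["based search"])  # [3]
```

Looking up a phrase is then a single dictionary read that returns the documents containing it, which is the basic idea behind a posting list.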
When I write about patents, I try to present what the patent may mean as a possibility rather than a certainty, and hope that it provokes more questions than conclusions, more ideas for testing than proof that a search engine is doing one thing or another.
For instance, in my post that describes phrasification, I hope that at least some of its readers started asking themselves, what if Google were using this process now? What would that mean to the way I do keyword research? What implications does it have regarding the keyword tools I might be using?
What does it mean for the way I attempt to optimize pages for certain phrases, or what I decide to use in anchor text pointing to pages? What can I do to test that might make a difference? The questions that patents raise are more important than the answers.