Yet another patent on phrase based indexing and retrieval (YaPaIR)
Unfortunately Cuil buzz has cooled, but at least co-founder Anna Patterson is back on the radar. If only in name and memories of what used to be that is. And if that radar is for patent droolers and semantic simians... then it is a name of note.
You see, things were different back in Oh4 when she was working with Google and toiled on a computer learning model for understanding concepts and semantic relationships through phrasing. Ah yes, I remember it like it was yesterday.
That's Cuil
Oh, we were just talking about phrase based indexing and retrieval last week? You say Bill wrote about it as well? And we even managed to slip it into yesterday's rant? Wow, it just doesn't want to go away, huh?

Hot on the heels of a delayed phrase based IR patent, another has surfaced. Also filed back in 2004 with the rest of the collection, today Google was granted; Phrase-based indexing in an information retrieval system
Follow the drama
Now this little ditty is part of an exciting ongoing mini-series. The excitement is waiting for pieces to pop up every few years. That's the best part, like a treasure hunt. Here's the set as we know it;
Phrase Identification in an Information Retrieval System,
Filed on Jul. 26, 2004;
Granted; Jan 26 2006
Phrase-Based Generation of Document Descriptions,
Filed on Jul. 26, 2004;
Granted; Jan 26 2006
Phrase-Based Searching in an Information Retrieval System,
Filed on Jul. 26, 2004;
Granted; Feb 09 2006
Automatic Taxonomy Generation in Search Results Using Phrases,
Filed on Jul. 26, 2004;
Granted; Sept 16 2008
Phrase-based indexing in an information retrieval system
Filed on Jul. 26, 2004;
Granted; Sept 30 2008
Missing links?
Phrase-Based Detection of Duplicate Documents in an Information Retrieval System,
filed on Jul. 26, 2004, ?? (it is referenced in the others)
Phrase-Based Personalization of Searches in an Information Retrieval System,
filed on Jul. 26, 2004; ?? (it is referenced in the others)
Then in 2005 came a continuation;
Multiple index based information retrieval system
Filed on Jan 25, 2005;
Granted; May 18 2006
And in June 2006 came
Detecting spam documents in a phrase based information retrieval system
Filed on June 28 2006;
Granted; Dec 28 2006
So it has been a ride full of thrills and fun for the whole family. There certainly was some interest in this for a while at Google. Sadly, the passion Cuiled and they parted ways. Whatever did become of phrase based IR? Did Google use some of it for personalization and query revisions? As a minor signal even? What about Cuil? Did Anna take the same mindset over there when they built that bad boy?
Will we ever know?
How the hell should I know? It does make for some interesting reading and opens up new dimensions on how better relevancy for concepts and ideas could be found through phrases. Last time I checked, single-word searches are on the decline and 2-3 word phrases are becoming commonplace.
In simplest terms, the system would learn based on commonalities in phrasings across a given document set. Instead of a linear (Boolean) approach, page content, link text and so forth can be analyzed for topically related themes. A good example given was a query for "blue merle agility training";
"blue merle":: "Australian Shepherd," "red merle," "tricolor," "aussie";
"agility training":: "weave poles," "teeter," "tunnel," "obstacle," "border collie".
These are examples of related terms that appear in other documents in the results set. By looking for them statistically, one can find the documents with the optimal occurrences. They also discuss snippets, personalization, spam detection, duplicate content and more through the phrase based IR approach.
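To make the "learn from commonalities" idea a bit more concrete, here is a minimal sketch in Python. It is not the method from the patents. It assumes a plain co-occurrence ratio (how often two phrases share a document versus how often you would expect that by chance), and the function name `related_phrases`, the `threshold` value and the toy documents are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, phrases, threshold=1.5):
    """Map each phrase to the phrases it co-occurs with more than chance predicts.

    Hypothetical helper: `phrases` must be lowercase, matching is naive
    substring matching, and `threshold` is an arbitrary illustrative cutoff.
    """
    n = len(docs)
    doc_count = Counter()   # number of documents containing each phrase
    pair_count = Counter()  # number of documents containing both phrases of a pair
    for text in docs:
        lowered = text.lower()
        present = sorted(p for p in phrases if p in lowered)
        doc_count.update(present)
        pair_count.update(combinations(present, 2))

    related = {p: [] for p in phrases}
    for (a, b), both in pair_count.items():
        # Expected co-occurrences if the two phrases were independent.
        expected = doc_count[a] * doc_count[b] / n
        if both / expected >= threshold:
            related[a].append(b)
            related[b].append(a)
    return related


docs = [
    "Blue merle Australian Shepherd puppies, also red merle and tricolor.",
    "Our Australian Shepherd is a blue merle aussie.",
    "Agility training with weave poles, teeter and tunnel.",
    "Weave poles and tunnel drills for agility training.",
]
phrases = ["blue merle", "australian shepherd", "agility training", "weave poles", "tunnel"]
result = related_phrases(docs, phrases)
print(result["blue merle"])
```

On this toy set the dog-colour phrases cluster together and the agility phrases cluster together, with no cross-over between the two groups. A real system would need proper phrase extraction and far more data before the ratios mean anything.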
Think of it like having related phrases for a given topic and then adding phrase extensions to further max out the list. Now look at the potential web pages to be ranked and start finding the ones with the optimal related terms. You can also look at the inbound and outbound link anchor text for further scoring. Check for duplication and away you go. Sure, that's simplified; but we've been down this trail before.
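That simplified recipe can be sketched in a few lines of Python. Treat it as a toy under loud assumptions: the `phrase_score` function, the `anchor_weight` value, the two sample pages and the one-point-per-match scoring are all invented here, and none of it is taken from the patents.

```python
def phrase_score(body, inbound_anchors, query_phrases, related, anchor_weight=0.5):
    """Score a page by query phrases and related phrases found in it.

    Hypothetical scoring: 1 point per phrase found in the body text,
    `anchor_weight` points per phrase found in inbound anchor text.
    """
    body = body.lower()
    anchors = " ".join(inbound_anchors).lower()
    score = 0.0
    for q in query_phrases:
        # Check the query phrase itself plus everything related to it.
        for phrase in [q, *related.get(q, [])]:
            if phrase in body:
                score += 1.0
            if phrase in anchors:
                score += anchor_weight
    return score


# Related phrases from the example query above.
related = {
    "blue merle": ["australian shepherd", "red merle", "tricolor", "aussie"],
    "agility training": ["weave poles", "teeter", "tunnel", "obstacle", "border collie"],
}

# Made-up pages: (body text, inbound anchor texts).
pages = {
    "on_topic": (
        "Blue merle Australian Shepherd agility training: weave poles, teeter, tunnel.",
        ["aussie agility training"],
    ),
    "thin": ("Blue merle agility training. Buy now!", []),
}

query = ["blue merle", "agility training"]
ranked = sorted(
    pages,
    key=lambda name: phrase_score(*pages[name], query, related),
    reverse=True,
)
print(ranked)
```

The page that only repeats the query phrases scores lower than the one that also carries the related phrases, which is the whole point: stuffing the exact query in is not enough.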
And my spammer friends would be hard pressed to sort out what exactly the right magic mix is. And that's assuming it was a standalone system.
Can I get some PageRank with that please?
Now let's say we already have a system in place that uses a wide variety of factors in the indexing and retrieval system. This means we can take our phrase based model and slowly play with it and tune it so that it finds a home alongside the other algos in the big Happy Googley world.
Because it covers a variety of areas, from personalization and query analysis down to snippet generation and spam detection, it would likely be a handy tool.
A blast from the past
Almost seemed funny, you know, funny-smirk-giggle, not bwaaa ha ha haha funny, that Cuil (Anna's new beau) talks about not using personal info and not relying on links. Because when I read this, it almost sounded like a shot considering all that's happened;
Search engines that use a ranking algorithm that relies on the number of links that point to a given document in order to rank that document can be "bombed" by artificially creating a large number of pages with a given anchor text which then point to a desired page. As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if in fact this page has little or nothing to do with the anchor text.
Ouch, meeeooow!! You tell 'em girl. I guess the writing was on the wall, erm, patent for this young couple. And as you can see, Google got to keep all her stuff in the divorce. How Cuil is that?
For more on Phrase Based Indexing and Retrieval see; Phrase Based Optimization resources