Building a better search engine
We all know about the latest Google killer by now, right? You know, Cuil (pronounced "cool"), the relevance-based search engine that boasts more pages in its index than the mighty peeps at the Plex?
While there has been no lack of FAIL floating around after its recent launch, there are a few reasons why this Gypsy might just take a deeper look.
The phrase-based IR connection
What I found interesting was that one Anna Patterson is on board... which by itself might not mean much, but spare me a few moments to explain. She's one of the ex-Googlers who are part of the team at Cuil (along with hubby Tom). I know... all falling into place now, right? OK... a little further...
...let me quote from the management team page:
Anna was the architect of Google's large search index, TeraGoogle, that launched in early 2006. While at Google, Anna was the technical lead of one of the two Web ranking groups at Google, in charge of GoogleBase, and the manager for the core piece of Google's ad-matching technology. She joined Google in 2004 after designing, writing and selling Recall, the largest search engine in existence at the time, at 12 billion pages.
And what made any of this remarkable to me was the fact that she has done a lot of work in the phrase-based indexing and retrieval area, as her patents show.
Why that is of interest is that I covered these back in a set of posts in '06 (hat tip to Senor Slawski) and held high hopes for such systems. To me it makes for a good approach, though potentially limited without secondary signals, in my opinion at least. Now how does this play out with Cuil? Not sure, but from what they posted on the philosophy page, it seems it could be in the mix:
Cuil prefers to find all the pages with your keyword or phrase and then analyze the rest of the content on those pages. During this analysis we discover that your keywords have different meanings in different contexts. Once we've established the context of the pages, we're in a much better position to help you in your search. - the Cuil Philosophy
In simplest terms, they are looking to rank documents via relevance, not popularity. This is certainly in line with the probabilistic modelling of the phrase-based approach, so I have to believe that it sits somewhere near the core of the Cuil indexation and ranking methods. But is it enough?
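To make the phrase-based idea a little more concrete, here is a toy sketch in Python. It is only my own illustration of the general concept (phrases that co-occur more often than chance are treated as related, and a page that contains more phrases related to the query is treated as more relevant). It is not Cuil's code, nor the method from the patents; the function names, the threshold and the sample documents are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations


def contains(text, phrase):
    """Whole-word phrase match on whitespace-normalised, lower-cased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return " " + phrase.lower() + " " in padded


def cooccurrence_gain(docs, phrases):
    """Ratio of actual to expected co-occurrence for each phrase pair.

    A ratio well above 1.0 means the two phrases appear together more
    often than independent occurrence would predict.
    """
    n = len(docs)
    present = [{p for p in phrases if contains(d, p)} for d in docs]
    doc_freq = Counter(p for found in present for p in found)
    pair_freq = Counter()
    for found in present:
        for a, b in combinations(sorted(found), 2):
            pair_freq[(a, b)] += 1
    return {
        (a, b): count / (doc_freq[a] * doc_freq[b] / n)
        for (a, b), count in pair_freq.items()
    }


def related_phrases(gain, phrase, threshold=1.5):
    """Phrases whose co-occurrence gain with `phrase` meets the threshold."""
    related = set()
    for (a, b), g in gain.items():
        if g >= threshold:
            if a == phrase:
                related.add(b)
            elif b == phrase:
                related.add(a)
    return related


def rank(docs, query, related):
    """Rank documents containing `query` by how many related phrases they hold."""
    scored = []
    for index, doc in enumerate(docs):
        if contains(doc, query):
            score = sum(1 for r in related if contains(doc, r))
            scored.append((score, index))
    # Highest score first; ties keep original document order.
    return [index for score, index in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical sample data, purely for illustration.
DOCS = [
    "jaguar car dealer with engine specs",
    "jaguar big cat in the rain forest",
    "jaguar car engine review",
    "rain forest big cat habitat",
    "jaguar",
]
PHRASES = ["jaguar", "car", "engine", "big cat", "rain forest"]
```

Note that nothing here looks at links or popularity; the ordering comes entirely from what is on the page, which is the point (and, as I argue below, the limitation).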
Is it really a FAIL?
It is truly hard to say, but early reviews (at the end of the post) and some of my own tinkering around leave one wondering. Of the many problems I have with it so far, one is certainly the historical relevance of results. That is, there are many older documents (2004 and earlier) ranking in query spaces I know to be somewhat time sensitive. This is certainly one area where the old "query deserves freshness" angle plays out well for Google. I would question how on-page relevance deals with temporal issues. While topical relevance is great, historical signals are also important.
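As a rough illustration of what I mean by mixing a temporal signal into topical relevance, here is a minimal sketch. To be clear, this is not how Google's freshness handling works (I don't know the internals); the weighting, the half-life and the function name are all assumptions I picked for the example.

```python
def blended_score(relevance, age_days, freshness_weight=0.3, half_life_days=180):
    """Blend an on-page relevance score (0..1) with a freshness score.

    Freshness decays exponentially: a document loses half of its
    freshness every `half_life_days`. Both parameters are illustrative
    guesses, not values taken from any real search engine.
    """
    freshness = 0.5 ** (age_days / half_life_days)
    return (1 - freshness_weight) * relevance + freshness_weight * freshness
```

With a time-sensitive query you would turn `freshness_weight` up, and for an evergreen query you would turn it down toward zero, at which point the ranking falls back to pure on-page relevance.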
Another area I am bullish on is user behavioural metrics as ranking signals, along with personalized search methods. This is certainly NOT going to be in the Cuil bag of tricks, as part of their mantra is not collecting the related data.
While there are those who have privacy issues with such data, it hasn't stopped Google in its tracks, now has it? At the end of the day these signals are important in understanding the user's relationship with the search results and sites listed. They shouldn't be discounted. Much like links and historical signals, popularity has its place in the results, as does authority. Working from a relevance-centric approach may seem utopian, but it can't carry the day.
I also found problems with duplicate results, or more than one result from a given site, which is the situation the indented listings in our familiar format are there to handle (as do site links). There are a few problems out of the gate, to be sure.
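For what it's worth, the basic fix for crowding is not complicated. Below is a small sketch of the idea: drop exact duplicate URLs and cap how many results any one host can take in the main list, pushing the rest aside (where they could be shown as indented listings or behind a "more from this site" link). The function name and the cap of two are my own assumptions, not anything Cuil or Google has published.

```python
from urllib.parse import urlparse


def crowd_by_host(ranked_urls, max_per_host=2):
    """Split a ranked URL list into primary results and per-host overflow.

    Exact duplicate URLs are dropped. Order within each returned list
    follows the original ranking.
    """
    seen_urls = set()
    per_host = {}
    primary, overflow = [], []
    for url in ranked_urls:
        if url in seen_urls:
            continue  # exact duplicate, skip it entirely
        seen_urls.add(url)
        host = urlparse(url).netloc.lower()
        per_host[host] = per_host.get(host, 0) + 1
        if per_host[host] <= max_per_host:
            primary.append(url)
        else:
            overflow.append(url)
    return primary, overflow
```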
Somebody get me a blender
Even if it doesn't rule the space, we might just be able to take away some ideas that can improve the next generation of search.
As recently as a few weeks ago I was writing about a world beyond links, in an effort to illuminate ranking signals that we can consider beyond mere links, and the offering from Cuil seems to be light on methods beyond mere page relevance. I truly yearn for a search engine that gets the mix right. As for Cuil, we haven't even begun to look at the spam-ability of this approach either. I haven't read much from the Cuil team on this end of things, which is always a serious area of concern for any search engine. Are they using the PaIR spam detection approach? Remains to be seen...
For the moment I am not going to write them off entirely and will dig deeper to see what positives there may be, if only for the fact that I was once a phrase-based IR (PaIR) junkie.
If it is also using probabilistic learning, then it may (theoretically) get better over time. That would certainly explain holes early on with certain query spaces. I will do some more playing around with it next weekend.