SEO Blog - Internet marketing news and views  

How relevant is Cuil?

Written by David Harry   
Monday, 04 August 2008 08:21

Building a better search engine

We all know about the latest Google killer by now, right? You know, Cuil (pronounced ‘cool’)… the relevance based search engine that boasts more pages in its index than the mighty peeps at the Plex?

While there has been no lack of FAIL floating around after its recent launch, there are a few reasons why this Gypsy might just take a deeper look.

The phrase based IR connection

What I found interesting was that one Anna Patterson is on board... which in itself might not mean much, but spare me a few moments to explain. She’s one of the ‘ex-Googlers’ on the team at Cuil (along with hubby Tom). I know... all falling into place now, right? OK... a little further...

...let me quote from the management team page:

“Anna was the architect of Google’s large search index, TeraGoogle, that launched in early 2006. While at Google, Anna was the technical lead of one of the two Web ranking groups at Google, in charge of GoogleBase, and the manager for the core piece of Google’s ad-matching technology. She joined Google in 2004 after designing, writing and selling Recall—the largest search engine in existence at the time at 12 billion pages..”

Is Cuil using phrase based IR?


And what made any of this remarkable to me was the fact that she has done a lot of work in the ‘phrase based indexing and retrieval’ area… as a number of her patents show.

This is of interest because I covered these patents in a set of posts back in ‘06 (hat tip to Senor Slawski)… and held high hopes for such systems. To me it makes for a good approach, though potentially limited without secondary signals. Now how does this play out with Cuil? Not sure, but from what they posted on the philosophy page, it seems it could be in the mix:

“Cuil prefers to find all the pages with your keyword or phrase and then analyze the rest of the content on those pages. During this analysis we discover that your keywords have different meanings in different contexts. Once we’ve established the context of the pages, we’re in a much better position to help you in your search.” - the Cuil Philosophy

In the simplest terms, they are looking to rank documents by relevance, not popularity. This is certainly in line with the probabilistic modelling of the phrase based approach, so one has to believe that it is somewhere near the core of Cuil’s indexing and ranking methods. But is it enough?
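To make the ‘analyze the rest of the content’ idea a little more concrete, here is a toy sketch in Python. It is loosely modelled on the phrase based notion of comparing how often two phrases actually appear together against how often chance alone would predict. The function names, the sample documents and the 1.5 threshold are all made up for illustration; this is not Cuil’s (or Google’s) actual method.

```python
from collections import Counter
from itertools import combinations

def information_gain(docs, phrases):
    """Return {(a, b): actual / expected co-occurrence} for phrase pairs seen together."""
    n = len(docs)
    # Candidate phrases present in each document (naive substring match).
    present = [{p for p in phrases if p in doc.lower()} for doc in docs]
    single = Counter(p for s in present for p in s)
    pairs = Counter(c for s in present for c in combinations(sorted(s), 2))
    gain = {}
    for (a, b), actual in pairs.items():
        # Co-occurrences we would expect if a and b were independent.
        expected = single[a] * single[b] / n
        gain[(a, b)] = actual / expected
    return gain

def related_phrases(gain, phrase, threshold=1.5):
    """Phrases that co-occur with `phrase` at least `threshold` times more than chance."""
    return {b if a == phrase else a
            for (a, b), g in gain.items()
            if g >= threshold and phrase in (a, b)}

def score(doc, query_phrase, related):
    """1 for containing the query phrase, plus 1 per related phrase that backs it up."""
    text = doc.lower()
    if query_phrase not in text:
        return 0
    return 1 + sum(1 for p in related if p in text)

# Hypothetical mini-corpus and candidate phrase list.
docs = [
    "The jaguar is a big cat found in the rain forest",
    "Big cat conservation in the rain forest",
    "The Jaguar sports car has a new engine",
    "A sports car needs a powerful engine",
]
phrases = ["jaguar", "big cat", "rain forest", "sports car", "engine"]

gain = information_gain(docs, phrases)
related = related_phrases(gain, "big cat")
print(related)                                     # → {'rain forest'}
print([score(d, "big cat", related) for d in docs])  # → [2, 2, 0, 0]
```

Note that nothing here looks at links or clicks; the score comes entirely from what is on the page, which is exactly why the question of secondary signals matters.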


Is it really a FAIL?

It is truly hard to say, but early reviews (at the end of the post) and some of my own tinkering leave one wondering. Of the many problems I have with it so far, one is certainly the temporal relevance of results. That is, there are many older documents (2004 and older) ranking in query spaces I know to be somewhat time sensitive. This is certainly one area where the old ‘query deserves freshness’ angle plays out well for Google. I would question how on-page relevance deals with temporal issues. While topical relevance is great, historical signals are also important.
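As a rough illustration of why a purely on-page score struggles here, consider this minimal Python sketch that blends a relevance score with a freshness score for time sensitive queries. The half-life, the weighting and the `time_sensitive` flag are assumptions invented for the example; neither Google nor Cuil has published anything this simple.

```python
from datetime import date, timedelta

def freshness(published, today, half_life_days=180):
    """Exponential decay: 1.0 for a page published today, 0.5 after one half-life."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)

def blended_score(relevance, published, today, time_sensitive, weight=0.5):
    """Mix freshness into the score only when the query is flagged as time sensitive."""
    if not time_sensitive:
        return relevance
    return (1 - weight) * relevance + weight * freshness(published, today)

today = date(2008, 8, 4)
old_page = (0.9, today - timedelta(days=4 * 365))  # very relevant, four years old
new_page = (0.7, today - timedelta(days=7))        # less relevant, one week old

for label, sensitive in (("evergreen query", False), ("time sensitive query", True)):
    old = blended_score(*old_page, today, sensitive)
    new = blended_score(*new_page, today, sensitive)
    print(label, "->", "old page wins" if old > new else "new page wins")
```

With relevance alone the four year old page always comes out on top; once freshness is mixed in for a time sensitive query, the newer page overtakes it.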

Another area I am bullish on is user behavioural metrics as ranking signals, along with personalized search methods. This is certainly NOT going to be in the Cuil bag of tricks, as part of their mantra is not collecting the related data:

“Cuil analyzes Web pages and not click-throughs, we don’t need to know your search history and habits. So our privacy policy is very simple: when you search with Cuil, we do not collect any personally identifiable information, period. We have no idea who sends queries: not by name, not by IP address, and not by cookie. Your search history is your business, not ours.”

While there are those that have privacy issues with such data, it hasn’t stopped Google in its tracks now, has it? At the end of the day these signals are important in understanding the user’s relationship with the search results and the sites listed. They shouldn’t be discounted. Much like links and historical signals, popularity has its place in the results, as does authority. Working from a relevance-centric approach may seem utopian, but it can’t carry the day on its own.

I also found problems with duplicate results, and with more than one result from a given site, a case the indented listings in our familiar format (and site links) lend themselves to. There are a few problems out of the gate to be sure…
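For what it’s worth, the basic fix for crowding is not complicated. Below is a minimal Python sketch that drops exact duplicate URLs and caps the number of results per host. The function name and the limit of two per host are my own assumptions for the example, not how any particular engine does it.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def limit_per_host(ranked_urls, max_per_host=2):
    """Keep ranking order, drop exact duplicate URLs, cap results per host."""
    seen_urls = set()
    per_host = defaultdict(int)
    kept = []
    for url in ranked_urls:
        if url in seen_urls:
            continue  # exact duplicate of an earlier result
        seen_urls.add(url)
        host = urlsplit(url).netloc.lower()
        if per_host[host] < max_per_host:
            per_host[host] += 1
            kept.append(url)
    return kept

results = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/x",
    "http://a.example/3",  # third result from a.example, dropped
    "http://a.example/1",  # exact duplicate, dropped
    "http://c.example/",
]
print(limit_per_host(results))
# → ['http://a.example/1', 'http://a.example/2', 'http://b.example/x', 'http://c.example/']
```

Real engines also have to catch near-duplicate content living at different URLs, which this sketch does not attempt.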


Somebody get me a blender

Even if it doesn’t rule the space, we might just be able to take away some ideas that can improve the next generation of search…

As recently as a few weeks ago I was writing about a world beyond links, in an effort to illuminate ranking signals we can consider beyond mere links. The offering from Cuil seems to be light on methods beyond mere page relevance. I truly yearn for a search engine that gets the mix right. As for Cuil, we haven’t even begun to look at the spam-ability of this approach either. I haven’t read much from the Cuil team on this end of things, and it is always a serious area of concern for any search engine. Are they using the PaIR spam detection approach? Remains to be seen...

For the moment I am not going to write them off entirely and will dig deeper to see what positives there may be… if only for the fact that I was once a phrase based IR (PaIR) junkie.

If it is also using probabilistic learning, then it may (theoretically) get better over time. That would certainly explain holes early on with certain query spaces. I will do some more playing around with it next weekend.


Until then… more reactions:


