SEO Blog - Internet marketing news and views  

Google Caffeine; the latest Buzz

Written by David Harry   
Wednesday, 09 June 2010 13:00

Facts, Fiction and plenty of Tin Foil

Now that the cat is finally out of the bag, I've been getting some questions about my personal thoughts on the world of Caffeine. So, here we go... say it once, then we're moving on with life. For those that missed the memo; Google's Caffeine update is now LIVE.

FIRST – this is infrastructure folks, plain and simple. If there's anything important we've seen in all of the recent movements, it is that this comes with a concerted 'need for speed'. We can only guess that it is a response to the many new elements that have arrived since the last major update, Big Daddy, such as (but not limited to);

  • Social (social search and Buzz)
  • Real Time (RTS and even QDF implications)
  • Universal (video, local shopping etc.)
  • PubSubHubbub (and feed pushing)
  • Salmon (coming soon?)
  • Meta data (RDFa, Rich Snippets, semantic)
  • Google TV (Android etc.)
  • New Interface (far more options, Wonder Wheel etc.)

You get the idea. While there are no direct changes to the ranking algorithm, there is the possibility that they are looking at deeper processing of existing signals and getting the infrastructure in place to deal with future implementations.

SECOND – what we do know. Beyond the fact that Google has a need for speed, we can take what we have so far (on the record). According to Vanessa Fox, (via SEL);

“The Caffeine infrastructure provides more flexibility in the type of details that can be stored with a document.” - Vanessa Fox

...and Matt had said that,

“It’s important to realize that caffeine is only a change in our indexing architecture. What’s exciting about Caffeine though is that it allows easier annotation of the information stored with documents, and subsequently can unlock the potential of better ranking in the future with those additional signals.” - Matt Cutts

Ok, so we did establish that part, it is about crawling/indexing/retrieval.

For their part, Google has said that it “provides 50 percent fresher results for web searches than our last index, and it's the largest collection” of documents they've ever had, and that we will find things “much sooner after it is published than was possible ever before.” Uh huh. That need for speed thing again.

This is all very interesting as far as understanding the core. But is that it? No, they also talk about the various new types of data out there (as in the elements we outlined above) and then say;

“We've built Caffeine with the future in mind. Not only is it fresher, it's a robust foundation that makes it possible for us to build an even faster and comprehensive search engine that scales with the growth of information online, and delivers even more relevant search results to you." - Google Blog

This is the interesting part. At least for me. Given the growth of the web, not only in size but in types of media and segments and in the types of signals available, it isn't a huge leap of faith to conclude that this most certainly WILL affect rankings somewhere along its path.

Yes, I am going there!

As many know, there are a few things that I consistently write about here, two of which are behavioural signals and personalized search. The latter actually being inclusive of the former. To that end there has always been a problem: poor signal quality and spam-ability. We might also infer that the combination of the two, under the old technology, may have been harder to fully implement.

I can't help but wonder if somewhere down the road this latest update might be able to help on that front. Surely this need for speed and Caffeine has the potential to be used for this (among other things).

Behavioural data – as mentioned, these can be problematic as far as gleaning any real usable signals. Certainly on their own they are next to useless. But in aggregate, click data and query analysis data might be able to deliver some interesting implicit user feedback. But that requires processing power. See where I am going with this? But then we also have new signals that Matt and Co need to police. Spam would be an issue.
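To make the 'in aggregate' point a little more concrete, here is a minimal Python sketch of rolling individual clicks up into a per-query, per-result click-through rate. The log format, the URLs and the click_through_rates name are all made up for illustration; this is an assumption about what such aggregation could look like, not a description of how Google processes click data.

```python
from collections import defaultdict

# Hypothetical click log: (query, result_url, was_clicked).
# Any single row says almost nothing; the aggregate is the signal.
log = [
    ("google caffeine", "a.example/caffeine", True),
    ("google caffeine", "a.example/caffeine", True),
    ("google caffeine", "b.example/coffee", False),
    ("google caffeine", "b.example/coffee", True),
    ("google caffeine", "a.example/caffeine", False),
    ("google caffeine", "b.example/coffee", False),
]


def click_through_rates(rows):
    """Return {(query, url): clicks / impressions} over the whole log."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, url, was_clicked in rows:
        shown[(query, url)] += 1
        if was_clicked:
            clicked[(query, url)] += 1
    return {key: clicked[key] / shown[key] for key in shown}


rates = click_through_rates(log)
for (query, url), rate in sorted(rates.items()):
    print(f"{query!r} -> {url}: {rate:.2f}")
```

Even this toy version has to touch every row of the log, which is the processing power point: do the same thing across billions of queries and the infrastructure starts to matter.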

Personalization – so what if we just use behavioural signals in a personalization setting? This would surely help deal with the spam issue. I mean you can't really spam yourself now can you? Google is most certainly big on personalization of all kinds and this should just keep growing.

And so we might consider that some of this current iteration of infrastructure might be used to 'get beyond the links' as far as how they've done things in the past. Now, once more, we have to better understand the holy grail that is implicit feedback (explicit is revered, but much harder to get). Last time out, the SEO world was convinced of one of the lesser valued ones, bounce rates;

Bounce Rates not a Ranking Signal

Seems that hasn't changed since we looked at it in early '09. But that's what I found amusing back then: of the myriad behavioural signals, SEOs were fixated on that particular one, which is not really a factor of note. I do, though, have to believe some deeper personalization through a combination of query analysis, click data and other signals might be do-able. They may now have the power to actually work with it.

What's it all mean?

Well, my weary web wandering market mavens, not much really. Much like the storm that wasn't, aka MayDay (we dubbed it the 'Crap Hat' update), there isn't a lot to dwell on. To my thinking this is far more about the future than it is today. It is about enabling deeper processing (including ranking signals) and managing an ever-growing web.

Google Caffeine Update

If there is anything to take away from this, it is not to start getting worked up. Invariably there will be those that start to freak out because they believe this has messed with their rankings/traffic. While we generally don't take all that Google says as gospel, this does make sense. An infrastructure upgrade allows for better processing. It is not necessarily a ranking thing at the outset.

And that's the view from here. What's your take-away from all this?

UPDATE; my fav search geek, Bill Slawski, sent me some thoughts which were worth adding to this post;

From what I understand, caffeine is a software/hardware update to Google that allows considerably faster parallel processing. The following video is worth watching, if for no other reason than to understand the role of the "master," which manages meta data, within Google's storage and retrieval of data.

Video - Sean Quinlan, "Storage at Scale"

Follow that up with an interview with Sean Quinlan describing the 2 year project to go from the use of a single master to distributed masters, found at - GFS: Evolution on Fast-forward

The following patent is co-authored by Sean Quinlan and discusses a system with distributed masters: System and method for analyzing data records

Patents of Interest

The next two patents are related, and provide some information about how a system with distributed masters might be able to work together:

From both patents' descriptions:

"In some information retrieval systems, freshness of the results (i.e., the turnaround from when a document is updated to when the updated document is available to queries) is an important consideration. However, there are several obstacles to providing fresh results. One obstacle is the expense or overhead associated with rebuilding the document index each time the document repository is updated. For example, significant overhead is often associated with building small indexes from new and updated documents and periodically merging the small indexes with a main index, and furthermore such systems typically suffer long latencies between document updates and availability of those documents in the repository index. A second obstacle is the difficulty of continuously processing queries against the document repository while updating the repository, without incurring large overhead. One aspect of this second obstacle is the need to synchronize both the threads that execute queries and the threads that update the document repository with key data structures in the data repository. The need to synchronize the query threads and repository update threads can present a significant obstacle to efficient operation of the document repository if document updates are performed frequently, which in turn is a barrier to maintaining freshness of the document repository."

Another section that is in both patents' descriptions:

"As described above, in some embodiments, the tokenspace repository 106 may include a plurality of sectional repositories. For example, a tokenspace repository that stores webpages may have sectional repositories for the bodies of webpages, anchor texts, and URLs. Whenever a section of a document is updated but the other parts are unchanged, all of the sectional repositories are "updated"; the sectional repositories are synchronized. For example, if the body of a webpage has been updated but the anchor text and URL remains the same, then the sectional repository for webpage bodies is updated with the new content by appending the new content to the back end of the sectional repository and the old content is invalidated. The anchor text sectional repository is "updated" by appending the unchanged anchor text to the back end and invalidating the older version of the same anchor text. For the URL repository, the same URL is appended to the back end and the older version of the same URL is invalidated. Similarly, the sectional repositories are also synchronized in their treadmilling: when a document is treadmilled, the information for the sections of the document is appended to the back end in their respective repositories and their older versions are invalidated."
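The append-and-invalidate behaviour in that passage can be sketched in a few lines of Python. This is only an illustration of the idea as quoted, under stated assumptions: the class names, the three section names and the in-memory lists are all hypothetical, and nothing here is taken from the patents' actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    doc_id: str
    content: str
    valid: bool = True


@dataclass
class SectionalRepository:
    """Append-only store for one section (body, anchor text or URL)."""
    entries: list = field(default_factory=list)
    latest: dict = field(default_factory=dict)  # doc_id -> index of valid entry

    def append(self, doc_id, content):
        # Mark the older version invalid rather than rewriting it in place,
        # then add the new version at the back end.
        old = self.latest.get(doc_id)
        if old is not None:
            self.entries[old].valid = False
        self.entries.append(Entry(doc_id, content))
        self.latest[doc_id] = len(self.entries) - 1

    def current(self, doc_id):
        idx = self.latest.get(doc_id)
        return None if idx is None else self.entries[idx].content


class TokenspaceRepository:
    SECTIONS = ("body", "anchor_text", "url")  # assumed section names

    def __init__(self):
        self.sections = {name: SectionalRepository() for name in self.SECTIONS}

    def update(self, doc_id, **changed):
        # Every section gets a fresh append, changed or not, so the
        # sectional repositories stay synchronized with each other.
        for name, repo in self.sections.items():
            content = changed.get(name, repo.current(doc_id))
            repo.append(doc_id, content)


repo = TokenspaceRepository()
repo.update("doc1", body="old body", anchor_text="click here",
            url="http://example.com/")
repo.update("doc1", body="new body")  # only the body changed
```

After the second call each section holds two entries for doc1: the older one invalidated and the newer one valid, even though the anchor text and URL did not change. Queries would read only valid entries, and the invalidated ones are the storage that would be reclaimed later.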

I haven't seen any discussion on the Web of these three patents being part of caffeine, but they look the part after spending some time with Sean Quinlan's interview above.




0 # Bill Slawski 2010-06-09 20:32
Thanks for including my thoughts on caffeine in this post, Dave.

It's hard to even begin to conceive how much data Google is collecting about the Web, books, Maps, Videos, and their other repositories of data, as well as the user-behavior associated with people using those services and their browsing history.

I'm thinking back to the days when Google's web index updated every 4 or 5 weeks, and wondering if the effort to update their index in almost realtime shows a transformation of what searchers want from a search engine.

If you watched the SIGMOD keynote from Jon Kleinberg yesterday, he talked about two different kinds of informational resources that a search engine could be thought of as. One is a current events monitor, and the other is an index of information.

With near realtime indexing, and the ability to boost pages based upon things like burstiness of topics/queries, we seem to be seeing an evolution towards that monitoring of current events.

A bigger, faster search repository, that can take advantage of user-behavior data, also makes it more likely that Google will experiment with presenting each of us with our own personal index of the Web.
0 # Dave 2010-06-09 22:33
Aye sire, that's been something I've long been wondering about, more granular personalization. It would certainly go a long way to combating spam, one of the problems presented with behavioural data. I've long had the feeling, reading papers/patents as well as the various public statements, that they are committed to/interested in a deeper personalization. Some of the Personalized PageRank stuff does get into it, but it's more of a categorization of user types for personalization than a granular person-by-person approach.

We have certainly come a long way since the days of the monthly Google dance, it has been a great ride and I can't wait to see what the future holds. And they say 'SEO is Dead' oh my. As long as they (the engines) keep evolving, so will the job of those looking to ensure content is findable.

Thanks again on putting some perspective in on it. Always a pleasure to get input from those with a historical sense of it all.
0 # Bill Slawski 2010-06-10 14:00
Definitely, increasing indexing size and speed, as well as making retrieval of information from the index faster, would be something that could allow for a more granular personalization.

Though one of the reasons to focus upon broader categories in something like a personalized PageRank is often because the information available to the search engine about a particular topic or query may be sparse, when looking at implicit and explicit profiles created for someone through their web histories. The search engine may have to back off, and try to look at broader categories or groups that someone may find of interest, where it doesn't have more specific information.

Interestingly, too, part of the evolution of search engines is because of the changes in how people search, and what they search for. If people want more videos, for instance, that means that the search engines have to improve their video search, and it makes sense to include videos in web search results. In many ways, the evolution of search is a response to trying to understand what people are searching for, and want to see in search results.
0 # Patrick Pole 2010-06-14 13:57
I have to say that it is a good way to promote the product or services into maps…but the main question is that when there are more than 10 pizza providers in same location then how can people know that which one is better. Thank You;-).
0 # Diana Ratliff 2010-06-18 19:53
Was doing some research on Google Caffeine for a newspaper column and this is some of the most comprehensive data I've seen - hadn't heard of you before but I'll be back! Thanks.
0 # Sterling Mckinley 2010-07-20 17:01
I have to say, coming to your site I always know I'm going to read something informative.