Facts, Fiction and plenty of Tin Foil
Now that the cat is finally out of the bag, I've been getting some questions about my personal thoughts on the world of Caffeine. So, here we go... say it once, then we're moving on with life. For those that missed the memo: Google's Caffeine update is now LIVE.
FIRST – this is infrastructure folks, plain and simple. If there's anything important we've seen in all of the recent movements, it is that this comes with a concerted 'need for speed'. We can only guess that it is a response to the many new elements that have arrived since the last major update, Big Daddy, such as (but not limited to):
- Social (social search and Buzz)
- Real Time (RTS and even QDF implications)
- Universal (video, local shopping etc.)
- PubSubHubbub (and feed pushing)
- Salmon (coming soon?)
- Meta data (RDFa, Rich Snippets, semantic)
- Google TV (Android etc.)
- New Interface (far more options, Wonder Wheel etc.)
You get the idea. While there are no direct changes to the ranking algorithm, there is the possibility that they are looking at deeper processing of existing signals and getting the infrastructure in place to deal with future implementations.
SECOND – what we do know. Beyond the fact that Google has a need for speed, we can take what we have so far (on the record). According to Vanessa Fox (via SEL):
“The Caffeine infrastructure provides more flexibility in the type of details that can be stored with a document.” - Vanessa Fox
...and Matt had said that,
“It’s important to realize that caffeine is only a change in our indexing architecture. What’s exciting about Caffeine though is that it allows easier annotation of the information stored with documents, and subsequently can unlock the potential of better ranking in the future with those additional signals.” - Matt Cutts
Ok, so we did establish that part: it is about crawling/indexing/retrieval.
For their part, Google has said that it “provides 50 percent fresher results for web searches than our last index, and it's the largest collection” of documents they've ever had, and that we will find things “much sooner after it is published than was possible ever before.” Uh huh. That need for speed thing again.
This is all very interesting as far as understanding the core. But is that it? No, they also talk about the various new types of data out there (as in the elements we outlined above) and then say:
“We've built Caffeine with the future in mind. Not only is it fresher, it's a robust foundation that makes it possible for us to build an even faster and comprehensive search engine that scales with the growth of information online, and delivers even more relevant search results to you.” - Google Blog
This is the interesting part. At least for me. Given the growth of the web, not only in size but in types of media and segments, and in the types of signals available, it isn't a huge leap of faith to think this most certainly WILL affect rankings somewhere along its path.
Yes, I am going there!
As many know, there are a few things that I consistently write about here, two of which are behavioural signals and personalized search. The latter is actually inclusive of the former. To that end there has always been a problem: poor signal quality and spam-ability. We might also infer that the combination of the two, under the old technology, may have been harder to fully implement.
I can't help but wonder if somewhere down the road this latest update might not be able to help to this end. Surely this need for speed and Caffeine has the potential to be used for this (among other things).
Behavioural data – as mentioned, these can be problematic as far as gleaning any real usable signals. Certainly on their own they are next to useless. But in aggregate, click data and query analysis data might be able to deliver some interesting implicit user feedback. But that requires processing power. See where I am going with this? But then we also have new signals that Matt and Co need to police. Spam would be an issue.
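To make the 'in aggregate' point a bit more concrete, here is a minimal sketch of rolling raw click logs up into a per-query, per-URL click-through rate. Everything in it is an assumption for illustration only: the log format, the `aggregate_ctr` name and the `min_impressions` cut-off are hypothetical, and nothing here reflects how Google actually processes click data.

```python
from collections import defaultdict


def aggregate_ctr(log, min_impressions=3):
    """Roll raw (query, url, clicked) records up into click-through rates.

    Pairs seen fewer than `min_impressions` times are dropped, since a
    handful of clicks on their own are too noisy to mean anything.
    """
    shown = defaultdict(int)    # (query, url) -> times the result was shown
    clicked = defaultdict(int)  # (query, url) -> times it was clicked

    for query, url, was_clicked in log:
        shown[(query, url)] += 1
        if was_clicked:
            clicked[(query, url)] += 1

    return {
        key: clicked[key] / count
        for key, count in shown.items()
        if count >= min_impressions
    }
```

A single click tells you nothing, which is why the sketch throws away thinly observed pairs. Scale the same counting up to billions of log lines and you can see why processing power matters.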
Personalization – so what if we just use behavioural signals in a personalization setting? This would surely help deal with the spam issue. I mean you can't really spam yourself now can you? Google is most certainly big on personalization of all kinds and this should just keep growing.
And so we might consider that some of this current iteration of infrastructure might be used to 'get beyond the links' as far as how they've done things in the past. Now, once more, we have to better understand the holy grail that is implicit feedback (explicit is revered, but much harder to get). Last time out, the SEO world was convinced of one of the lesser valued ones, bounce rates;
Seems that hasn't changed since we looked at it in early '09. But that's what I found amusing back then: of the myriad behavioural signals, SEOs were fixated on that particular one, which is not really a factor of note. I do, though, have to believe some deeper personalization through a combination of query analysis, click data and other signals might be do-able. They may have the power to actually work with it.
What's it all mean?
Well, my weary web wandering market mavens, not much really. Much like the storm that wasn't, aka MayDay (we dubbed it the 'Crap Hat' update), there isn't a lot to dwell on. To my thinking this is far more about the future than it is about today. It is about enabling deeper processing (including ranking signals) and managing an ever-growing web.
If there is anything to take away from this, it is not to start getting worked up. Invariably there will be those that start to freak out because they believe this has messed with their rankings/traffic. While we generally don't take all that Google says as gospel, this does make sense. An infrastructure upgrade allows for better processing. It is not necessarily a ranking thing at the outset.
And that's the view from here. What's your take-away from all this?
UPDATE: my fav search geek, Bill Slawski, sent me some thoughts which were worth adding to this post:
From what I understand, Caffeine is a software/hardware update to Google
that allows considerably faster parallel processing.
The following video is worth watching, if for no other reason than to
understand the role of a master within Google's storage and retrieval of
data, so pay attention to the role of the "master," which manages meta data:
Sean Quinlan, "Storage at Scale"
Follow that up with an interview with Sean Quinlan describing the 2 year
project to go from the use of a single master to distributed masters, found at:
GFS: Evolution on Fast-forward
The following patent is co-authored by Sean Quinlan and discusses a system
with distributed masters: System and method for analyzing data records
Patents of Interest
The next two patents are related, and provide some information about how a
system with distributed masters might be able to work together:
From both patents' descriptions:
"In some information retrieval systems, freshness of the results (i.e., the
turnaround from when a document is updated to when the updated document is
available to queries) is an important consideration. However, there are
several obstacles to providing fresh results. One obstacle is the expense
or overhead associated with rebuilding the document index each time the
document repository is updated. For example, significant overhead is often
associated with building small indexes from new and updated documents and
periodically merging the small indexes with a main index, and furthermore
such systems typically suffer long latencies between document updates and
availability of those documents in the repository index. A second obstacle
is the difficulty of continuously processing queries against the document
repository while updating the repository, without incurring large overhead.
One aspect of this second obstacle is the need to synchronize both the
threads that execute queries and the threads that update the document
repository with key data structures in the data repository. The need to
synchronize the query threads and repository update threads can present a
significant obstacle to efficient operation of the document repository if
document updates are performed frequently, which in turn is a barrier to
maintaining freshness of the document repository."
Another section that is in both patents' descriptions:
"As described above, in some embodiments, the tokenspace repository 106 may
include a plurality of sectional repositories. For example, a tokenspace
repository that stores webpages may have sectional repositories for the
bodies of webpages, anchor texts, and URLs. Whenever a section of a
document is updated but the other parts are unchanged, all of the sectional
repositories are "updated"; the sectional repositories are synchronized.
For example, if the body of a webpage has been updated but the anchor text
and URL remains the same, then the sectional repository for webpage bodies
is updated with the new content by appending the new content to the back
end of the sectional repository and the old content is invalidated. The
anchor text sectional repository is "updated" by appending the unchanged
anchor text to the back end and invalidating the older version of the same
anchor text. For the URL repository, the same URL is appended to the back
end and the older version of the same URL is invalidated. Similarly, the
sectional repositories are also synchronized in their treadmilling: when a
document is treadmilled, the information for the sections of the document
is appended to the back end in their respective repositories and their
older versions are invalidated."
I haven't seen any discussion on the Web of these three patents being part
of Caffeine, but they look the part after spending some time with Sean
Quinlan's interview above.
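For those who think better in code, the append-and-invalidate idea in the second patent excerpt above can be sketched roughly as follows. This is only a toy reading of the quoted text and not Google's implementation: the class names, the three section names and the in-memory lists are all assumptions made for illustration.

```python
class SectionalRepository:
    """Append-only store for one section of a document (hypothetical)."""

    def __init__(self):
        self.entries = []  # each entry is [doc_id, content, is_valid]
        self.latest = {}   # doc_id -> index of the currently valid entry

    def append(self, doc_id, content):
        # Invalidate the older version, then append the new one at the back end.
        old_index = self.latest.get(doc_id)
        if old_index is not None:
            self.entries[old_index][2] = False
        self.entries.append([doc_id, content, True])
        self.latest[doc_id] = len(self.entries) - 1

    def get(self, doc_id):
        index = self.latest.get(doc_id)
        return None if index is None else self.entries[index][1]


class TokenspaceRepository:
    """Keeps the sectional repositories in step with one another."""

    SECTIONS = ("body", "anchor", "url")  # assumed section names

    def __init__(self):
        self.sections = {name: SectionalRepository() for name in self.SECTIONS}

    def update(self, doc_id, **changed):
        # Every section gets a fresh entry, even when its content is unchanged,
        # so all sections stay synchronized as the excerpt describes.
        for name, repo in self.sections.items():
            content = changed.get(name, repo.get(doc_id))
            repo.append(doc_id, content)


repo = TokenspaceRepository()
repo.update("doc-1", body="old text", anchor="a link", url="http://example.com/")
repo.update("doc-1", body="new text")  # only the body actually changed
```

After the second call, each section holds two entries for the document: the first is marked invalid and the second is current, with the anchor text and URL simply re-appended unchanged. Nothing is rewritten in place, which is presumably part of how queries can keep running while updates stream in.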