Looking at real time search once more
A while back I had been musing about the phenomenon that is real time search. It would seem that it is this year’s buzz phrase, much like last year’s obsession with ‘social search’. What bothered me most was that many of the existing RTS engines were little more than social mention regurgitators with Twitter at the core. To me, most of them don’t satisfy what I believe a search engine (traditionally) is.
You can see these for more;
Real time search engines; should SEOs care?
Making sense of real-time social search
Now, along the way we talked about OneRiot and their ‘PulseRank’ approach. One of the reasons I became more interested in OneRiot is that they actually do some crawling and make indexing and ranking decisions. This is more in line with what I believe a search engine to be…
Not one to be shy, I managed to get talking with the folks at OneRiot and, despite my less than enthusiastic stance on RTS, they agreed to answer some questions for me. What follows is the interview with them… I hope you enjoy!
A chat with the gang at OneRiot
Dave; How do you define real-time search? Obviously logic dictates it can’t be ‘real time’ but all of the semantics aside, how do you think of it or explain it to others?
Tobias Peggs - For us, “realtime search” means giving users the right results for right now. For example, if you’re searching for “Iran”, then the right results for right now would be the news covering the election, pictures reflecting the street scenes, or the blogs and videos that are building buzz.
This is highly differentiated from traditional search. A search for “Iran” on a traditional search engine is likely to return a Wikipedia page or an official government website. Those are dependable results – you’ll probably get very dependable information about Iran the country, such as population statistics - but those results don’t really reflect why you might be searching for Iran right now.
Studies show that around 40% of searches are from people trying to find the right answer right now. Whether they are searching for something as heavyweight as “Iran” or as entertaining as “Britney”, they are typing a query into a search box and expecting that the search engine will tell them what’s going on right now. We deliver results to meet that demand.
Now, how we deliver those results – the mechanics of our search engine – is also described as “realtime search”. Our approach is to harness the realtime activity of users of the social web to help curate our index. As users share links to interesting web pages (on Twitter or Digg or other social web services, or via our own panel), we use that action as a cue to go and index that web page. This means our search index is full of socially-relevant results for right now – webpages that other people are recommending as being relevant right now. So when you search on OneRiot we return relevant results that the social web is interested in right now, which helps us determine the right results for right now.
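To make the "a share is the cue to index" idea concrete, here is a minimal sketch of my own. It is not OneRiot's code: the class name, the `fetch` callback and the "most-shared first" ordering are all assumptions made purely for illustration.

```python
from collections import Counter


class ShareTriggeredIndex:
    """Toy index (hypothetical): a URL is only indexed once somebody shares it."""

    def __init__(self):
        self.pages = {}          # url -> page text returned by the fetch callback
        self.shares = Counter()  # url -> number of share events seen so far

    def on_share(self, url, fetch):
        """Handle one share event; `fetch` is a caller-supplied url -> text function."""
        self.shares[url] += 1
        if url not in self.pages:  # the first share is the cue to go and index
            self.pages[url] = fetch(url)

    def search(self, term):
        """Return URLs whose text contains `term`, most-shared first."""
        term = term.lower()
        hits = [u for u, text in self.pages.items() if term in text.lower()]
        return sorted(hits, key=lambda u: self.shares[u], reverse=True)
```

Nothing gets into this index unless somebody shared it, which is the point Tobias is making; a real system would obviously need a crawler, deduplication and a proper ranking function on top.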
Dave; What networks are you currently crawling/monitoring? Any plans to expand the sources?
Tobias Peggs - We crawl what users think is interesting right now. If a user thinks it’s interesting right now, by virtue of sharing that link somewhere, then we’ll index it. The services we monitor for user sharing right now include Twitter and Digg (and several others we’re unable to disclose). In addition, we have our own panel of users – like Compete.com or any of the internet measuring services - who have opted-in to share, in realtime, information about the news, videos and blogs they are spending time on. OneRiot’s user panel is a key advantage.
It enables us to deliver broad-based realtime search results by harvesting both explicit social activity on Twitter, Digg and other services in combination with implicit data from over 3 million users who have elected to join our panel. Twitter is an important signal for us, but it makes up a fraction of our data. (The importance of this was highlighted today during the DDOS ... we still had great results. Other realtime search engines that only index Twitter did not.)
Dave; That's an interesting point about the recent DDOS, many of the RTS engines that rely on Twitter likely ran dry...
You guys also have gotten on with Microsoft and IE8 – can you tell me a bit about the ‘webslices’ program and what you are doing over there?
Tobias Peggs - We are a proud member of Microsoft’s StartUp Accelerator group which is keen on working with young, smart companies with scalable technology. As such, we were a close partner throughout the release cycle of Internet Explorer 8. We were the first company to build a Web Slice (IE8 browser extension that updates automatically when new information is available) and we also offer additional realtime search services for IE. And recently Microsoft released a new version of IE8 that is bundled with OneRiot search. This is a huge validation for a company like us – it shows we can deliver quality results from the realtime web, at scale.
(Dave's Note; MS webslices is an interesting evolution of 'page segmentation', peeps...)
Dave - OneRiot seems to like to compare PulseRank to PageRank. Is there some type of algorithmic similarity here or more of a marketing angle? You are indexing links and social votes are similar to links as votes. What differences/similarities are there?
Tobias Peggs - OneRiot PulseRank is PageRank for the realtime web. They both exist to help rank results for the use case in mind with the search engine. Traditional search engines (using variations of PageRank) will give high rank to pages with a large number of inbound links which typically reflects historical dependability. This is why you get a Wikipedia page as the top result if you search for “Iran,” and also why the results rarely change (PageRank is relatively static).
PulseRank gives high ranking to pages with a large amount of social buzz right now which characteristically reflects the content that people think is relevant right now. This is why you will find the current news, videos, pictures or blogs as the top results if you search for “Iran” on OneRiot and also why the results change frequently, i.e. in realtime.
So, they are similar in that both PageRank and PulseRank are in place to provide relevancy and ranking mechanisms for information on the web. However they are fundamentally different since PageRank treats the web as a reference library for dependable information whereas OneRiot treats the web as a fluid and dynamic portal of information. In both cases, the algorithms are tuned to deliver the most relevant result to suit the use case.
Dave - Ok, so you're really looking to differentiate yourselves from traditional search... makes sense.
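OneRiot has not published the PulseRank formula, so the following is only a sketch of the contrast being described: a static, link-count style score versus a "buzz right now" score in which each share fades with age. The exponential decay, the one-hour half-life and both function names are my assumptions, and the link score is a crude stand-in, not real PageRank.

```python
import math


def static_link_score(inbound_links):
    """Crude stand-in for a link-based score: depends only on link count, not time."""
    return math.log1p(inbound_links)


def buzz_score(share_times, now, half_life_s=3600.0):
    """Sum of shares, each halved in weight for every `half_life_s` seconds of age.

    `share_times` and `now` are timestamps in seconds; the half-life is an
    assumed tuning parameter, not a documented OneRiot value.
    """
    return sum(0.5 ** ((now - t) / half_life_s) for t in share_times if t <= now)
```

With these assumptions, 100 shares from a day ago contribute far less than three shares from the last few minutes, which is why results built this way would churn constantly while a link-count score barely moves.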
One area that seemed worth doing would be some form of suggestions or related queries. For example searching ‘latent semantic analysis’ brought no results (IR geeks aren’t social?) while ‘latent semantic indexing’ did. Do you see some form of recommendation engine incorporated in the future?
Tobias Peggs - There are so many unique possibilities to utilize our realtime data and index which is why we opened our API to allow 3rd party developers to take our realtime search results to their users. In doing so, they can build distinctive browser add-ons, desktop applications, social websites and other services powered by OneRiot.
We launched the OneRiot Realtime Search API in June, and already have more than 40 partners in our program. Developers who would like to take advantage of the free API can visit our OneRiot API wiki to learn more about it here.
Dave - Another interesting area that comes up is indexation, can you talk about the current size of your index and what criteria are involved in making indexation decisions? Are items dumped after a given period of time?
Tobias Peggs - Right now we ingest approximately 20 million URLs a day from our panel, and about half that number again from social web services like Twitter and Digg. Architecturally, we’ve built a sophisticated multi-tiered index, with a sizable chunk in RAM. About 96% of our queries are answered by results from that memory – so we currently optimize for that. Items are kept for up to three months right now. That could increase with time, based on user demand.
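For the curious, a tiered index with a retention window can be sketched in a few lines. This is a toy of my own and not OneRiot's architecture: the two dictionaries standing in for the RAM and slower tiers, the "demote the stalest entry" rule and the 90-day approximation of "three months" are all assumptions.

```python
RETENTION_S = 90 * 24 * 3600  # "up to three months", approximated here as 90 days


class TieredIndex:
    """Toy two-tier index (hypothetical): small hot tier, larger cold tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = {}   # url -> last-shared timestamp; stands in for the RAM tier
        self.cold = {}  # url -> last-shared timestamp; stands in for a slower tier

    def add(self, url, ts):
        """Record a share; fresh items go to the hot tier, the stalest is demoted."""
        self.cold.pop(url, None)
        self.hot[url] = ts
        if len(self.hot) > self.hot_capacity:
            stalest = min(self.hot, key=self.hot.get)
            self.cold[stalest] = self.hot.pop(stalest)

    def expire(self, now):
        """Drop anything not shared within the retention window."""
        for tier in (self.hot, self.cold):
            for url in [u for u, ts in tier.items() if now - ts > RETENTION_S]:
                del tier[url]

    def lookup(self, url):
        """Return which tier holds the URL, or None if it is not indexed."""
        if url in self.hot:
            return "hot"
        if url in self.cold:
            return "cold"
        return None
```

If most queries are about what was shared most recently, most lookups land in the hot tier, which is presumably the effect behind the 96% figure quoted above.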
Dave - That's interesting as most of the current crop of RTS engines don't really have indexing (never mind ranking) mechanisms.
There is always a trade-off with freshness and quality in search. I know you’ve sort of poo-poo’d Google’s PageRank as a problem with getting ‘real time’ – but even they have developed approaches such as the QDF (query deserves freshness). Can a large scale engine truly be ‘real time’?
Tobias Peggs - Traditional search engines struggle to surface the hyper-fresh and socially relevant realtime results that satisfy users looking for “what’s going on right now” for their query. OneRiot is focused exclusively on solving that problem. To do that we’ve invented a new way to index the web and a new way to rank the content in that index to deliver the most socially-relevant results, right now.
One of the advantages of realtime is that the index is constantly refreshed by what people think is relevant right now. Said another way, there’s nothing in the OneRiot index that isn’t deemed socially-relevant. So our index can be much smaller than a traditional search engine. This means we can optimize for relevancy, speed and scale (where scale is QPS), with less concern about managing a truly enormous index.
Dave - Without getting into the ‘secret sauce’ too deeply, I am curious about your authority assignment. In the past I’ve looked at so-called ‘FriendRank’ type technologies from Google and Microsoft. Are you using any implicit signals from the various social sites? What makes an ‘authority’ in a world without links (aka PageRank)?
Tobias Peggs - One of the key factors within PulseRank is People Authority. PulseRank considers who shared the link on the social web. Known spammers tend to pummel their social graph with the same link many times a day.
Links shared in this manner will get a lower weight in our system. Thoughtful social web users, who share links that are heavily retweeted and dugg, are given a higher weight. User scores update in realtime in our system to account for fluctuations in their perceived authority. This is just one aspect of our authority rank, but hopefully it gives you a flavour for what we’re up to.
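Here is one way the People Authority idea could look in code. To be clear, this is my own guess at the shape of it, not OneRiot's scoring: the function name, the inputs and the formula (share of distinct links multiplied by an engagement ratio) are assumptions chosen only to show repeat-posters sinking and amplified sharers rising.

```python
def people_authority(posts, amplifications):
    """Illustrative user weight between 0 and 1 (assumed formula).

    posts: list of URLs one user shared in a period, repeats included.
    amplifications: total retweets and diggs those shares received.
    """
    if not posts:
        return 0.0
    # 1.0 when every shared link is distinct, near 0 when one link is hammered
    distinct_ratio = len(set(posts)) / len(posts)
    # approaches 1.0 when shares are amplified far more often than they are posted
    engagement = amplifications / (amplifications + len(posts))
    return distinct_ratio * engagement
```

For example, an account that posts the same link 20 times with no retweets scores 0.0, while a user who shares two different links that pick up 18 retweets and diggs between them scores 0.9. Recomputing this on every new event would give the realtime fluctuation Tobias mentions.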
Dave - Ok, since you mentioned spam, traditionally IR folks have struggled with implicit user feedback signals (spam being a major reason). What types of implicit/explicit signals do you collect and how have you addressed the potential for noise?
<Response by Ron Benson, VP of Engineering, OneRiot>
Ron Benson - When collecting implicit / explicit user feedback, there is definitely room for abuse, and we have some interesting technology to detect this. The kinds of signals we collect are:
- tweets / retweets of links on Twitter,
- which users shared the links, how many clicks each of the links received,
- how long people spend on a page once clicked from our panel of users,
- how many diggs a page received and who dugg it.
By correlating these signals, we’re able to address spam.
Dave - Sure, the mother of all search engine blight is web-spam, that's a fact. It doesn’t matter who you are, eventually with wide adoption comes search/social spam. I did some tests on popular targets such as ‘viagra’ ‘cialis’ ‘payday loans’ ‘mortgage loans’ to varying degrees of success on the spam front. What are the current approaches to dealing with it at OR?
Ron Benson - There’s a tsunami of spam heading for the social web, especially in the realtime conversations that often act as the platform for link sharing. Undoubtedly, there is tremendous value in following the stream of realtime conversation on services like Twitter, but at the same time a user can simply tweet something like: “Obama is awesome <link to porn movie>” and see the link to that porn movie show up in search results for “Obama” on any search engine that only searches for keywords in the tweets. This problem is magnified when spammers use URL shorteners that can easily mask where the person is being linked to.
At OneRiot, we take spam seriously and address it using both novel techniques and more standard methods. We employ the standard methods to deal with spam in manners similar to other search engines, including removing spammers at the domain level and machine learning to automatically detect spam at the page level. For example, we know the many domains that phish for credit cards or push “payday loans” and URLs from those domains never make it into our system. ‘Viagra’ pages are removed when passing through our machine-learned spam filters.
In addition to these traditional methods, we also have developed some novel techniques. For example, if a page is only shared on Twitter, but doesn’t appear on any of the other sources we ingest, then that page receives a lower ranking. We detect and remove “rings” from Twitter – those accounts automatically generated and then used to tweet the same URLs. We also employ our user influence metric, aka People Authority, within PulseRank that calculates the authority of a user sharing information to recognize spammers. If the person is a spammer they will have a low ranking, preventing their spam from reaching the upper echelon of our index. Another way to catch spam is through our Domain Authority factor within PulseRank, which looks at the number of links being shared from a domain (www.nytimes.com vs. www.xyz.com). Spam-spewing sites are unlikely to have a high number of inbound links.
Dave - Well you seem to have a solid approach, but I am also curious about reporting spam…is there such a function so that you can devalue a given user's authority? I didn’t really see any ‘report spam’ function on the site. Have you considered that option?
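Two of those techniques are easy to picture in code. The sketch below is my own illustration, not how OneRiot does it: I am assuming a "ring" can be approximated as several accounts that tweeted exactly the same set of URLs, and that a Twitter-only page simply has its score multiplied by a penalty factor. The function names, the threshold of three accounts and the 0.5 factor are all made up.

```python
from collections import defaultdict


def find_rings(tweets, min_accounts=3):
    """Group accounts that tweeted an identical set of URLs (assumed ring heuristic).

    tweets: iterable of (account, url) pairs.
    Returns a list of account sets, one per suspected ring.
    """
    urls_by_account = defaultdict(set)
    for account, url in tweets:
        urls_by_account[account].add(url)

    accounts_by_urls = defaultdict(set)
    for account, urls in urls_by_account.items():
        accounts_by_urls[frozenset(urls)].add(account)

    return [accts for accts in accounts_by_urls.values() if len(accts) >= min_accounts]


def source_penalty(sources, twitter_only_factor=0.5):
    """Return a rank multiplier: pages seen only on Twitter are demoted."""
    return twitter_only_factor if set(sources) == {"twitter"} else 1.0
```

Three bot accounts pushing the same single URL would be grouped together, while a real user who shared that URL alongside other links would not. A production system would need fuzzier matching (near-identical URL sets, shortener expansion, account age), none of which is attempted here.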
Ron Benson - Our users can tell us! We use a fantastic customer feedback system called Get Satisfaction and are constantly listening to our community on Twitter for improvements.
We are open to feedback and love it when users help to report spam, issues with the engine and give us general feedback. Follow us on Twitter @oneriot!
Dave - And last…but most certainly not least; where can peeps get the T-shirt and stickers? :0)
Courtney Walsh - We love our community! To join simply follow OneRiot on Twitter and become a fan of OneRiot on our Facebook page.
If you send us a DM or Inbox note with your snail-mail address, t-shirt size and any suggestions or comments you have for us about OneRiot.com, we’ll send you the goods. Oh, and extra points for fans who post pictures wearing their new tees.
Dave - Well, a big thanks to all of you for doing this!! Considering my less than stellar views on RTS, it was mighty big of you folks to spend the time to try and clear up many of the questions I and my readers had on it... I am certainly starting to at least sway somewhat on my views...
For the Trail riders, here are some ways to hook up with these fine folks;
Tobias Peggs - VP at OneRiot
Ron Benson - VP Engineering
Courtney Walsh - Director of PR
The Jury is still out on Real Time Search
While I remain uncertain as to the future of RTS engines, I would have to say that OneRiot has to be a leader in the space. After talking to them I can say that they understand the challenges and have a strong grasp of the approaches which may unlock the code. Larger adoption of the service may also help bring in a social search element which I believe would be an important aspect of any of the engines in the space…
For a bit of irony, here’s a few screens of my monitoring (in TweetDeck) for ‘real time’ search;
Kind of funny that a bunch of the services out there are the ones actually spamming the query space… sigh… fortunately OneRiot wasn’t one of them. You can be sure this isn't the last of 'real time/social search' that we'll be talking about here on the Trail - cya next time! :0)
/end session -- (be sure to grab our new and improved FEED)
So my happy trail riders… what say you? Will real time search ever break through into the world of search? Or are they destined to be buzz monitoring tools?
Sound off …..
Other RTS posts and info;
Dana Lookadoo's post on the value of RTS
SEM Synergy broadcast on Real Time Search optimization
Real time search off - TechCrunch
Twitter's real-time spam problem - Search Engine Land
Race is on for best real-time search engine - Seattle PI
Who rules real time search? - Venture Beat
Bing Keeps Its Foot On The Gas, Adds Tweets To Results - TechCrunch