Say hello to Yahoo's Personalized PageRank
Well, I can't say that I had envisioned myself talking about a PageRank patent... at least not anytime soon. The last time I looked at a PageRank filing it was from you-know-who, not Yahoo.
But it would seem that's exactly what we have on the menu today;
For your dining pleasure; User Sensitive PageRank
My original curiosity was helped no end when everyone's favourite technical search guru informed me that one of the authors (Pavel Berkhin) had also worked on A Survey on PageRank Computing (2005 - PDF) and Link Spam Detection Based on Mass Estimation (2006 - PDF).
User Segmented Personalized PageRank
At first read, it seemed as though they were looking to integrate user performance metrics as a layer onto the ever popular probabilistic/nodal modelling of PageRank. Of course I had to start over again, as I thought I was either deluded by my recent obsessions in this arena or experiencing some strange anomaly produced by lunch meats that were obviously past their prime. Starting over did me no favours, but it did shed some light;
The present invention relates to techniques for computing authority of documents on the World Wide Web and, in particular, to techniques for taking user behaviour into account when computing PageRank.
OK, so maybe I wasn't altogether off course after all. Once more, search engineers seem to be looking at how surfers interact as an indication of relevancy, or at least as some form of valuation in the scoring of search results. Or;
the demographic and behavior data relate to any of user importance, user recency, user tenure, and user time spent.
A new term that we can add to the lexicon is User Segmented Personalized PageRank or, simply, Personalized PageRank. In simplest terms, PageRank is global by nature; by limiting the subsets of data that are flagged for inclusion as a scoring signal, a form of personalization can potentially be achieved. (Formatting and emphasis below added);
For example, user segmentation is commonly used in targeted advertising. A user segment can be defined in terms of;
- a user demographic profile (e.g., age, gender, income, etc.),
- user location,
- user behavior, etc.
Any or all of equations (3)-(5) above can then be specified to reflect any such user segment in that they are constructed with reference to user data corresponding to an underlying population which, in turn, can be restricted to the relevant user segment. Moreover, as discussed above, such formulations can take into account any probabilistic distribution of user relevancy such as, for example, assigning weights to different users on the basis of an age range distribution.
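To make the segment idea a little more concrete, here is a minimal sketch in Python. Fair warning: equations (3)-(5) are not reproduced in this post, so this is NOT the patent's formula. It is a plain power-iteration PageRank where I have assumed that link weights come from click events, and that each user carries a weight (zero meaning "outside the segment"). The function name, the event format and the 0.85 damping value are all my own assumptions.

```python
from collections import defaultdict


def segment_pagerank(clicks, user_weight, damping=0.85, iterations=50):
    """Illustrative only: PageRank over click data restricted to a user segment.

    clicks      -- iterable of (user, src_page, dst_page) events (assumed format)
    user_weight -- dict of user -> weight; missing or 0 means outside the segment
    """
    # Build weighted transition counts using only the segment's behaviour.
    out = defaultdict(lambda: defaultdict(float))
    pages = set()
    for user, src, dst in clicks:
        pages.update((src, dst))
        weight = user_weight.get(user, 0.0)
        if weight > 0:
            out[src][dst] += weight

    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            total = sum(out[page].values())
            if total == 0:
                # No segment traffic leaves this page: spread its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
            else:
                for target, weight in out[page].items():
                    new_rank[target] += damping * rank[page] * weight / total
        rank = new_rank
    return rank


clicks = [
    ("ann", "home", "news"),
    ("ann", "home", "news"),
    ("bob", "home", "sports"),
    ("bob", "news", "sports"),
]
ann_segment = segment_pagerank(clicks, {"ann": 1.0})
bob_segment = segment_pagerank(clicks, {"bob": 1.0})
```

Same pages, same links, but the "ann" segment favours the news page while the "bob" segment favours sports. An age range distribution, as in the quote, would simply be a different set of values in user_weight.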
Once again we seem to be looking at demographic and behavioural metrics as part of the mix. While it isn't new territory, it is a new angle, which is always an interesting excursion.
Blocked PageRank for relevancy's sake
PageRank, generally speaking, applies to the aggregate of a data collection (larger data sets), whereas this proposed iteration would look to segment data at a more granular level. By treating user behaviour as a damping factor, various block levels can begin to emerge. If a link to a given document doesn't meet the set requirements of a given block level, it would not pass value. Subset block levels could theoretically carry further weights (damping signals), and so on. By associating user engagement data with PageRank principles, we begin to arrive at PageRank personalization.
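Here is how I picture the "doesn't meet the requirements, doesn't pass value" idea, as a rough Python sketch. To be clear, this is my reading and not something lifted from the filing: the block names, the 'visits' engagement measure, the thresholds and the function name are all hypothetical.

```python
def links_passing_value(link_engagement, block_rules):
    """Illustrative only: keep links that clear their block's engagement bar.

    link_engagement -- dict of (src, dst) -> {"block": name, "visits": count}
    block_rules     -- dict of block name -> (min_visits, block_weight)
    Returns a dict of (src, dst) -> weight for the links that pass value.
    """
    passed = {}
    for link, info in link_engagement.items():
        rule = block_rules.get(info["block"])
        if rule is None:
            continue  # link falls outside every defined block
        min_visits, block_weight = rule
        if info["visits"] >= min_visits:
            # Further per-block weighting on top of the raw engagement.
            passed[link] = block_weight * info["visits"]
    return passed


engagement = {
    ("home", "news"): {"block": "site", "visits": 40},
    ("home", "promo"): {"block": "site", "visits": 2},
    ("news", "story"): {"block": "section", "visits": 5},
}
rules = {"site": (10, 1.0), "section": (3, 0.5)}
weights = links_passing_value(engagement, rules)
```

The barely visited promo link drops out entirely, and the surviving weights could then be handed to whatever weighted PageRank computation one prefers.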
It seems intuitive that as the size of the web increases, there is an ever more pressing need to segment information to a certain degree. It also stands to reason that relevance is often a personal journey that requires subtlety and sensitivity to deliver a usable end product that doesn't overly tax the searcher. Would such a layer be a benefit? I would like to think so.
Is such a layer feasible from an infrastructure standpoint? Better ask someone who knows more than I do... the depth of the minutiae would ultimately dictate the efficiency, methinks. That is beyond this humble rambler, unfortunately.
The Web Garbage Collection Utility
One touted advantage of such a system is that it can be set, in certain circumstances, to be time sensitive (query deserves freshness, maybe?). Perhaps a given topical area tends to have a demographic that prefers newer content over older offerings? A personalized PageRank would capture temporally dependent changes in page popularity, and so it also operates as a de facto Web garbage collection utility. Or so the story goes;
Yet another advantage of the PageRank formulations of the present invention is that it is relatively straightforward to incorporate time dynamics. For example, a discount procedure such as, for example, exponential averaging, could readily be included into user behavior counts to emphasize recent events and discount old ones.
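Exponential averaging is simple enough to show in a few lines of Python. This is just the textbook smoothing recurrence, not code from the patent; the function name, the per-period visit counts and the alpha value of 0.5 are assumptions for illustration.

```python
def exponential_average(counts_by_period, alpha=0.5):
    """Illustrative only: smooth behaviour counts so recent periods dominate.

    counts_by_period -- counts ordered oldest first (assumed format)
    alpha            -- weight given to the newest period, between 0 and 1
    """
    smoothed = 0.0
    for count in counts_by_period:
        # Each step keeps (1 - alpha) of the past and adds alpha of the present.
        smoothed = alpha * count + (1.0 - alpha) * smoothed
    return smoothed


stale_page = exponential_average([100, 0, 0, 0])  # popular long ago
fresh_page = exponential_average([0, 0, 0, 100])  # popular right now
```

Both pages have 100 visits in total, but the stale one ends up with a far lower smoothed count than the fresh one, which is the garbage collection effect in miniature.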
The underlying point is that time dynamics are yet another aspect that such a behavioural enhancement could possibly deliver. Tailoring PageRank within given topical or demographic realms can offer tighter relevance in the search results (hopefully).
They also discuss how such factors can be used when setting rules for crawling;
one such application is controlling the manner in which a web crawling application crawls the Web. That is, the PageRank formulations of the present invention may be used to support decision making by a web crawler to determine whether and on which links associated with a given page to crawl.
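The filing doesn't spell out the crawler logic, so take this Python sketch as one plausible reading: score each outgoing link, skip the ones with no value, and crawl the best few. The function name, the score dictionary, the budget and the threshold are all hypothetical.

```python
import heapq


def plan_crawl(outlinks, link_score, budget, min_score=0.0):
    """Illustrative only: pick which of a page's links are worth crawling.

    outlinks   -- URLs found on the page
    link_score -- dict of URL -> user-sensitive score (assumed to exist already)
    budget     -- maximum number of links to follow from this page
    """
    scored = [(link_score.get(url, 0.0), url) for url in outlinks]
    # "Whether" to crawl: drop links at or below the minimum score.
    eligible = [(score, url) for score, url in scored if score > min_score]
    # "Which" to crawl: the highest scoring links, best first.
    return [url for score, url in heapq.nlargest(budget, eligible)]


to_crawl = plan_crawl(["a", "b", "c"], {"a": 0.1, "b": 0.7, "c": 0.0}, budget=2)
```

Here the link nobody follows ("c") is never queued, and the busiest link ("b") is fetched first.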
Or in the treatment of anchor text;
Consider an anchor-text that is known as one of the most useful features used in ranking retrieved Web search results. It is usually assembled through aggregation of different \href HTML tag text strings related to incoming links. However, since incoming links have different popularity, this text can be supplied with some weights derived according to the present invention.
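As a rough Python illustration of weighting anchor text by link popularity: each incoming link contributes its terms in proportion to a per-link weight, rather than every link counting equally. How the weights are actually derived is not shown here, so the function name, the input format and the weight values are assumptions.

```python
from collections import Counter


def weighted_anchor_terms(inlinks):
    """Illustrative only: aggregate anchor terms, weighted per incoming link.

    inlinks -- iterable of (anchor_text, link_weight) pairs (assumed format)
    """
    terms = Counter()
    for text, weight in inlinks:
        for term in text.lower().split():
            # A heavily used link counts for more than one nobody clicks.
            terms[term] += weight
    return terms


anchor_profile = weighted_anchor_terms([
    ("Yahoo PageRank patent", 0.75),
    ("pagerank", 0.25),
    ("click here", 0.0),
])
```

In this toy example "pagerank" comes out as the strongest term, while the "click here" link that no one follows adds nothing to the profile.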
There is even some factoring of device types, which I have to imagine could be a valuable metric all on its own. They don't talk a lot about data collection sources, outside of a somewhat cheeky reference to;
using any of a variety of well known mechanisms for recording a user's online behavior.
... but seem to have caught on to the Google Personalized Search train with;
user data may be collected when a user registers with, for example, a particular web site or service.
There is a new term for my lexicon, and one relating to Yahoo no less! I am certainly a fan of user performance metrics being incorporated into search. This patent is a positive step as far as I can see, though for those among us longing for a day when backlinks aren't the core of the circus, it doesn't really deliver that. In the end, this approach could lend some more relevance to the SERPs, but it has the same limitations any PageRank approach would have. An interesting paper and worth consideration.
If you missed it, Bill has a post on this paper as well; check it out.