Its all about getting along with your neighbours
A while ago the fine folks at Yahoo! had a patent granted relating to Personalized PageRank, (or User Sensitive PageRank). At the time it was not only interesting but just fun from a my PageRank is better than yours perspective. Today I ran into another Yahoo! patent that makes mention of said Personalized PageRank and another potential candidate for the lexicon; HarmonyRank. It is sort of like, how well do you play with others ;0)
System and method for characterizing a web page using multiple anchor sets of web pages - filed Oct.2006; Published April 03 2008 An improved method for characterizing a web page using multiple anchor sets of web pages.
Now as we have all learned from our search masters, have a look around right?
Ok, seems these fellows have ALL also worked on; Anchor-based Proximity Measures (PDF) 2007; which discusses both Personalized PageRank (PPR) and Harmonic Rank (HR - I prefer HarmonyRank)
...one author, Andrew Tomkins also worked on the Personalized PageRank patent. Also of interest, Amruta Joshi, (also an author on this one) did a PPT on Keyword generation for search engine advertising while at Stanford.
...at a glance...
How well do you get along with others?
The supposition is that the process of link analysis through a starting seed list of web pages and using a probabilistic expansion (random walk) can be flawed due to the fixed-depth neighborhood. This is also limited in its inability to charactarize a given web page nor subsequent groups of pages. They give the example of the well known HITS algorithm;
.. used by a search engine to generate a seed set, and then performed a fixed-depth neighborhood expansion in order to generate a larger set of pages upon which the HITS algorithm was employed.
Variations of HITS can be employed in;
- community finding,
- in finding similar pages,
- in pagerank,
- in trustrank,
- in classification of web pages.
- sophisticated expansions have been applied in the context of community discovery.
What it doesnt do is measure the character of a given web page that would give rise to understanding groups of web web pages contextually. Such charactarizations of a given page towards a group of web pages would enable a further signal in the discovery and ranking processes. In music a melody is usually singular notes while a harmony is many notes interacting. Thus Yahoo has brought us HarmonyRank ;0)
Not having dug into the history of these concepts, it is hard to get overly specific. I can say that once more, probabilistic models are the call of the day. As always for delivery and scaleability reasons the need for effeciency rules. By setting positive and negative charactaristics of a given page, trends and charactaristics of a single page and the group begin to emerge. Somewhat like a profile or theme that a group of pages have that is applied to new pages in the graph.
They also mention its potential for web spam detection which one would assume that a profiling engine might be reasonable at given other tools not as a stand alone spam detection system. As always, those skilled in the art, will be able to find other uses for the platform. If you can define positive and negative charactaristics of a web page; common to a group, then profile away my friend!
For example, applications may use the present invention for the detection of high-caliber blogs and other collections of web pages with a positive characterization.
..a search application may find similar web pages using anchor sets of web pages or a clustering application may find local segments of web pages using anchor set of web pages. For any of these applications, the present invention may advantageously provide a characterization of a web page that may be linked to a known anchor set of web pages.
Anchor your sets
As for the group of pages to which were comparing your entry to they like to call em Anhcors;
An anchor or anchor set as used herein may mean a collection of web pages with a known characterization.
These anchor sets can then be looked at to charactarize webpages that they link to or so the theory goes. New pages would satisfy positive and/or negative charactaristics to the known set and scored accordingly. Once again looking at spam detection as a possible implementation of the methodology they can do a link based analysis;
A clean set of good pages A and a set of spam pages B may
be manually identified, and these anchor sets may be used by the present invention to determine a score for remaining page in the collection of web pages.
Using multiple factors also means a grade can be given to the extent of perceived spaminess
(PSS Percieved Spamminess Scale lol). You can start to see through this example though how various factors can come into play as far as starting initial data sets that become the core group of documents. Depending on the desired usage, the methodology can serve a few purposes beyond being simply another ranking signal.
They love you they love you not
So you have a set of trusted pages and their characteristics and you also have a set of baaaaddd web pages and corresponding characteristics. You then take each new page that you come across in your journey (random crawl) and compare it to the rest of the community. Sounds reasonable, right? And what is a high quality site you ask?
- high caliber blogs
- news sites
- web magazines
- and so forth (personal fav)
Enlightening huh? Guess I wont make any seed lists. What about the bad guys, whats their MO?
- Spam sites
- Low caliber blogs
- Pornography sites
- And so forth (lame)
They surmise that a quality website would not link to a bad neighbourhood and thus relationships characterizing the quality of pages can be culled. Looks like I am dead in the water...
Therefore, web sites and web pages with links from high quality web pages or web sites will be more likely to be high quality, and web site and web pages with links from low quality web pages or web sites will be more likely to be low quality.
This brings to mind the whole issue of hackers and SEO these days. Just hack in, link out to some bad neighbourhoods and wipe their Google and Yahoo search rankings in one felled swoop. And black hatters start linking out to quality sites
Personalize PageRank(PPR) + HarmonyRank(HR) = Relevance?
Think of this like more of a true voting PageRank in the sense that negative characterizations in your link profile can be weighted against the positive to produce a quality score through association. If this process of PPR and HR will make a great search application, WFKs (Who Friggen Knows)? I am going to go a read up some more on some of the documents I came across as well as another look at that Personalized PageRank patent and see what these fellas are up to. If we start to consider link relationships in terms of negative and positive as well as through user performance metrics we might just have something worth looking into... for the moment I have to get back to work...
....for the next installment and more thoughts; stay tuned
Techno babble for the brave; some highlights
Consider for example that the anchor set may be A with a positive characterization and PPR may be the method used to propagate a probability distribution from the vertices of A to reachable vertices u of the graph.
Such a propagation method may be viewed as a web surfer beginning at anchor set A and walking forward through the graph at random.
At each step forward through the graph, the surfer may stop and flip an .alpha.-biased coin in this analogy. If the coin may come up tails, the surfer may begin again at a random entry of anchor set A; otherwise, the surfer may continue the walk to a reachable vertex.
Thus, the quality of a page may represent its overall likelihood of being visited by the surfer in this model.
Recalling that in this example the anchor set A may be the set with a positive characterization, the quality of vertex u may correspond to the "reach ability" of a web page represented by u from the known set of high-quality pages, and such a method may accordingly provide a reasonable approach to quantifying the quality of the web page represented by u.
The method of claim 5 wherein determining the quality measure of the characterization of the web page using the plurality of anchor sets of web pages comprises:
- determining a first quality measure of a first characterization of the web page using an anchor set with a positive characterization of the plurality of anchor sets of web pages;
- determining a second quality measure of a second characterization of the web page using a second anchor set with a negative characterization of the plurality of anchor sets of web pages;
- and dividing the first quality measure of the first characterization of the web page by the sum of the first quality measure of the first characterization of the web page and the second quality measure of the second characterization of the web page to generate the quality measure of the characterization of a web page.
8. The method of claim 7 wherein propagating the probability distribution following the forward walk of vertices of the graph representing the collection of web pages comprises employing a technique of personalized pagerank.
9. The method of claim 7 wherein propagating the probability distribution following the forward walk of vertices of the graph representing the collection of web pages comprises employing a technique of harmonic rank.