(Note: the following is a guest post from my good friend, Miss CJ.)
How search engineers look for nepotistic links
Paid links are a bit of a problem for search engine engineers because they can be misleading. Some links are bought to boost rank, while others are purchased for legitimate reasons, such as actually offering something interesting to a website's visitors. Not all of these links should be discounted outright; instead, different weights can be assigned so that less important links don't get a full "vote".
It's not easy to differentiate between these, but there has been a fair bit of research around it that we can look at. If search engines could discount misleading nepotistic links, their performance would improve. In the SEO community, this would be received with mixed emotions, no doubt. Google uses methods to detect keyword spamming, for example, along with other text-based methods, but its algorithm is open to link spam.
But what are search engineers doing to combat nepotistic links in modern information retrieval?
Engineering link spam detection
Brian Davison, from Rutgers (New Jersey), has done a lot of research on this topic. In "Recognizing Nepotistic Links on the Web", he agrees with others in the community in interpreting them as the "bestowal of patronage in consideration of relationship, rather than of merit" - in other words, links that are getting counted when they shouldn't be.
Some proposed methods are:
- keeping a blacklist,
- using a heuristic (rules) to discount the links,
- figuring out when results are spammy through post-processing.
When we think about these techniques, we can appreciate that they are all weak in some way. Heuristics need to be discovered and be accurate; a blacklist isn't easy to compile, and many of these links wouldn't be detected; and spotting spammy results is not as easy as it sounds, because you first need to be able to assess when results are not spammy. Brian Davison rightly points out that blacklists are unlikely to be used because engines could be accused of censorship.
Methods worth looking at
First off, machine learning is a very useful area of research because it allows us to get computers to recognize patterns in data and then classify (or do whatever else we ask) with new data. You can use "unsupervised learning", which means letting the system figure out the patterns for itself, or "supervised learning", where you give it a starting set to learn from that has been correctly annotated or processed.
Brian Davison experimented with the C4.5 algorithm (a decision tree method) and a supervised technique. All nepotistic links were labelled as such, and then the computer was given a new set of documents. This enabled the discovery of 75 features characteristic of nepotistic links.
Some of these features: the domain names were identical, the host names without domains were identical, the pages shared at least some percentage of outgoing links, the IP addresses were identical...and a number of other factors.
This experiment showed that machine learning methods like this are useful at detecting nepotistic links.
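To make the feature-extraction step concrete, here is a minimal sketch in Python of how a few of the features above (same domain, same host, shared outgoing links) might be computed for a source/target pair before being handed to a decision-tree classifier. The function and feature names are illustrative, not Davison's actual code, and the "domain" extraction is deliberately crude:

```python
from urllib.parse import urlparse

def link_features(source_url, target_url, source_outlinks, target_outlinks):
    """Extract a few boolean/numeric features for a source->target link,
    in the spirit of Davison's nepotistic-link features (names illustrative)."""
    s_host = urlparse(source_url).hostname or ""
    t_host = urlparse(target_url).hostname or ""
    # Crude "domain" = last two host labels; good enough for a sketch.
    s_domain = ".".join(s_host.split(".")[-2:])
    t_domain = ".".join(t_host.split(".")[-2:])
    # Jaccard-style overlap of the two pages' outgoing link sets.
    shared = len(set(source_outlinks) & set(target_outlinks))
    total = max(len(set(source_outlinks) | set(target_outlinks)), 1)
    return {
        "same_host": s_host == t_host,
        "same_domain": s_domain == t_domain,
        "shared_outlink_fraction": shared / total,
    }

feats = link_features(
    "http://blog.example.com/post1",
    "http://shop.example.com/page",
    ["http://a.com", "http://b.com"],
    ["http://b.com", "http://c.com"],
)
print(feats)  # same host? same domain? fraction of shared outlinks
```

A supervised learner such as C4.5 would then be trained on many such feature dictionaries, each labelled nepotistic or not.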
Language Model Disagreement
Further work has been carried out by Benczúr, Bíró, Csalogány and Uher, based in Budapest. Their paper is called "Detecting Nepotistic Links by Language Model Disagreement". They look at down-weighting hyperlinks that are irrelevant to a target page, and they do this by using language model disagreement.
The good thing is that no manual intervention is needed, so there are no blacklists for example. They "analyze inter-document relationship over the entire corpus by solving anchor text model comparison and prediction aggregation." Their method deals with link, content and anchor text spam all at the same time.
This algorithm picks out links where the language models of the two documents disagree. These are fed into a PageRank calculation to obtain the NRank value of a page, which they suggest should be subtracted from that page's PageRank score. The main measure used is the Kullback-Leibler divergence between the unigram models of the target and source pages. They still have some research to do here, including using n-grams, smoothing, and different penalty functions.
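As a rough illustration of the core measure (not the authors' implementation), the snippet below builds add-alpha-smoothed unigram models for two texts over a shared vocabulary and computes the KL divergence between them. The smoothing constant and helper names are my own assumptions; smoothing matters because a zero probability in the second model would make the divergence infinite:

```python
import math
from collections import Counter

def unigram_model(text, vocab, alpha=0.1):
    """Unigram language model with add-alpha smoothing over a shared vocab."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over words w of P(w) * log(P(w) / Q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

source = "cheap loans cheap loans apply now"
target = "our research group studies information retrieval and ranking"
vocab = set(source.lower().split()) | set(target.lower().split())
p = unigram_model(target, vocab)
q = unigram_model(source, vocab)
print(kl_divergence(p, q))  # large value: the language models disagree
```

A high divergence between the pages on either end of a link is the signal that the link may be nepotistic.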
Qi, Nie and Brian Davison (Lehigh University) recently looked at using content as well, but turned things the other way around: instead of looking for nepotistic links, they look for "qualified links", as in those qualified to recommend another source. They used similarity scores between the source and target pages.
These were:
- URL similarity,
- topic vector similarity,
- tf-idf content similarity,
- tf-idf non-anchor text similarity.
They built an algorithm called "Qualified HITS", which differs from HITS in that HITS treats all links equally. The similarity scores of each page are calculated and fed into a classifier, which either keeps the links or removes them from the graph. After that, each page neighbouring the bad links is checked. They improved results by 9%. Their further work involves using a better classifier, setting better weightings, and finding a cheaper way to compute the similarity scores.
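One of those scores, tf-idf content similarity, can be sketched in a few lines. This is a toy version (raw term frequency, a simple smoothed idf, cosine over sparse dictionaries), not the authors' exact weighting scheme:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy tf-idf: raw tf times a smoothed idf, as sparse dicts."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency per term
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vecs = tfidf_vectors([
    "search engine ranking and link analysis",
    "link analysis for search ranking",
    "cheap pills buy now",
])
print(cosine(vecs[0], vecs[1]))  # related pages: high similarity
print(cosine(vecs[0], vecs[2]))  # unrelated pages: zero overlap
```

A source/target pair with a high similarity score is more plausibly "qualified" to recommend; a near-zero score suggests the link carries no topical endorsement.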
This kind of method, using a language model, seems far more realistic, because the variables extracted from known nepotistic links alone allow for too much confusion: some non-misleading links may have one or a couple of these traits. The language model is another variable, independent of the actual links.
As we can see from the method proposed below, though, link-based methods can indeed prove effective.
Random Walks from Spam Seed Sets
"Extracting link spam using biased random walks from spam seed sets" (07) from Wu (Lehigh University) and Chellapilla (Microsoft) looked at using an "automated random walk" model to detect link farms and link exchanges. The system is given a link spam seed and the web graph, the biased random walk is used to extract members of that same seed domain or page. The idea here is to help expand the blacklist.
The random walk is biased because it only jumps to neighbourhoods around the seed set, through the use of "decay probabilities". There is of course more work to do, but they did achieve 95.12% precision in extracting large link farms and 80.46% in extracting link exchanges.
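The biasing idea can be sketched as a restart probability: at every step the walk either follows an outgoing link or jumps back to the seed set, so visits pile up in the seed's neighbourhood. This is a simplified sketch with an invented decay parameter and toy graph, not the paper's algorithm:

```python
import random

def biased_random_walk(graph, seeds, steps=10000, decay=0.3, rng_seed=0):
    """Random walk that restarts at a spam seed with probability `decay`
    each step, concentrating visits in the seed neighbourhood. Visit
    counts serve as a proxy for membership in the same link farm."""
    rng = random.Random(rng_seed)
    visits = {node: 0 for node in graph}
    current = rng.choice(list(seeds))
    for _ in range(steps):
        if rng.random() < decay or not graph[current]:
            current = rng.choice(list(seeds))  # jump back to the seed set
        else:
            current = rng.choice(graph[current])  # follow an outgoing link
        visits[current] += 1
    return visits

# Toy graph: a tight "farm" around the seed, plus an honest page that
# links out to the seed but receives no links from the farm.
graph = {
    "seed": ["farm1", "farm2"],
    "farm1": ["seed", "farm2"],
    "farm2": ["seed", "farm1"],
    "honest": ["seed"],
}
visits = biased_random_walk(graph, {"seed"})
print(visits)  # farm nodes visited heavily; "honest" is never reached
```

Nodes with high visit counts get flagged as candidate additions to the blacklist, which is exactly the expansion role the authors had in mind.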
It is also necessary to note that we're not just looking at paid links now, but at link spam in general. Paid links are a particular problem area, but by no means the only one.
Buying links: Risk vs. Reward
And so, should you participate in paid links or other link schemes? Search engines don't like it, and you should really make sure that any links being bought are legitimate (for traffic/promotion - consider making sure 'nofollow' is in place). As search engineers get better and better at detecting paid links, it is simply not worth doing in many cases.
Are they really getting better at detecting paid links? You tell me. More than a few link larcenists have gone underground or boarded the shop up altogether over the last few years; that alone may be a ringing endorsement of their abilities.
Something to consider
Tools: you can run Webgraph if you like and see how it deals with nepotistic links. Basically, it uses some of Brian Davison's variables: "pages that are situated on the same host, corresponding to links made for hypertext navigational purposes rather than semantic similarity".
Papers: here are some related research papers to learn more -
Recognizing Nepotistic Links on the Web (2000)
B. Davison

Improving web spam classification using rank-time features (AIRWeb 2007)
K. M. Svore, Q. Wu, C. J. C. Burges, A. Raman

Detecting nepotistic links by language model disagreement (AIRWeb 2006)
A. A. Benczúr, I. Bíró, K. Csalogány, M. Uher

Undue influence: eliminating the impact of link plagiarism on web search rankings
B. Wu, B. D. Davison - Proceedings of the 2006 ACM symposium on Applied computing, 2006

Detecting link spam using temporal information (ICDM 2006)
G. Shen, B. Gao, T.-Y. Liu, G. Feng, S. Song, H. Li

Extracting link spam using biased random walks from spam seed sets
B. Wu, K. Chellapilla - Proceedings of the 3rd international workshop on Adversarial IR, 2007

Detecting spam web pages through content analysis
A. Ntoulas, M. Najork, M. Manasse, D. Fetterly - international conference on WWW, 2006

Adversarial Information Retrieval on the Web (AIRWeb 2007)
C. Castillo, K. Chellapilla, B. D. Davison
About CJ: she's a seriously obsessed search geek (SOSG) who is completing her PhD in Natural Language Processing and Artificial Intelligence. But her better qualities lie in having been an SEO practitioner for more than six years now as well. At least that's my take :0) - You can catch more of her writing on her blog, Science for SEO, and be sure to get hooked up with her on Twitter.
I want to thank CJ for once more dropping in to help out with her unique perspectives on search and SEO.