(Note: the following is a guest post from my good friend Miss CJ.)
How search engineers look for nepotistic links
Paid links are a bit of a problem for search engine engineers because they can be misleading. Some links are bought in order to boost rank, while others are purchased for legitimate reasons, such as actually offering something interesting to a website visitor. Not all links should be discounted, but different weights can be assigned so that less important links do not get a full "vote".
It's not easy to differentiate between these, but there has been a fair bit of research on the problem that we can look at. If search engines could discount misleading nepotistic links, their performance would improve. In the SEO community, this would no doubt be received with mixed emotions. Google, for example, uses methods to detect keyword spamming, along with other text-based methods, but its algorithm is still open to link spam.
But what are search engineers doing to combat nepotistic links in modern information retrieval?

Engineering link spam detection
Brian Davison, from Rutgers (New Jersey), has done a lot of research on this topic. In Recognizing Nepotistic Links on the Web, he agrees with others in the community in interpreting them as bestowal of patronage in consideration of relationship, rather than of merit. In other words, these are links that get counted when they shouldn't be.
Some proposed methods are:
- Keeping a blacklist
- Using a heuristic (rules) to discount the links
- Figuring out when the results are spammy through post-processing
When we think about these techniques, we can see that they are all weak in some way. Heuristics need to be discovered and need to be accurate. A blacklist isn't easy to compile, and many of these links wouldn't be detected. Detecting spammy results is not as easy as it sounds either, because you first need to be able to assess when results are not spammy. Brian Davison rightly points out that blacklists are unlikely to be used because engines could be accused of censorship.
Methods worth looking at:
First off, machine learning is a very useful area of research because it allows us to get computers to recognize patterns in data and then to classify, or otherwise process, new data. You can use "unsupervised learning", which means that you let the system figure out the patterns for itself, or "supervised learning", where you give it a correctly annotated starting set to learn from.
Brian Davison experimented with the C4.5 algorithm (a decision tree method), which is a supervised technique. All nepotistic links in the training set were labelled as such, and the trained classifier was then given a new set of documents. This work identified 75 features characteristic of nepotistic links.
Some of these were: identical domain names, identical host names without domains, pages sharing at least some percentage of outgoing links, identical IP addresses, and a number of other factors.
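To make those features a little more concrete, here is a minimal sketch of how three of them might be computed for a single link. It is not Davison's code: the function names, the 50% overlap threshold and the "last two labels" rule for finding the domain are all assumptions made for illustration, and the IP address feature is left out because it would need a DNS lookup.

```python
from urllib.parse import urlsplit


def host(url):
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain(url):
    # Assumption: the last two labels are the registered domain.
    # This is wrong for suffixes such as .co.uk; a real system needs a suffix list.
    return ".".join(host(url).split(".")[-2:])


def host_without_domain(url):
    # Everything in front of the naive domain, e.g. "www" for www.example.com.
    return ".".join(host(url).split(".")[:-2])


def link_features(source_url, target_url, source_outlinks, target_outlinks,
                  min_shared=0.5):
    """Binary features for one link (hypothetical names and threshold)."""
    src_out, tgt_out = set(source_outlinks), set(target_outlinks)
    union = src_out | tgt_out
    # Fraction of outgoing links the two pages have in common (Jaccard overlap).
    shared = len(src_out & tgt_out) / len(union) if union else 0.0
    return {
        "same_domain": naive_domain(source_url) == naive_domain(target_url),
        "same_host_prefix": host_without_domain(source_url) == host_without_domain(target_url),
        "shared_outlinks": shared >= min_shared,
    }
```

Each link then becomes a small dictionary of true/false values, which is the kind of input a decision tree learner can work with.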
These experiments showed that machine learning methods like this are useful for detecting nepotistic links.
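For the curious, C4.5 decides which feature to split on by comparing gain ratios. The sketch below shows that one step on a made-up, hand-labelled set of four links; the feature names and labels are invented for illustration, and a full C4.5 implementation also does recursion, pruning and handling of continuous values, none of which is shown here.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of a binary feature, the split criterion used by C4.5."""
    total = len(labels)
    remainder = 0.0   # weighted entropy left after the split
    split_info = 0.0  # entropy of the split itself
    for value in (True, False):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        if not subset:
            continue
        weight = len(subset) / total
        remainder += weight * entropy(subset)
        split_info -= weight * math.log2(weight)
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


# Toy training set (invented): one feature dictionary per labelled link.
rows = [
    {"same_domain": True, "same_ip": True},
    {"same_domain": True, "same_ip": False},
    {"same_domain": False, "same_ip": True},
    {"same_domain": False, "same_ip": False},
]
labels = ["nepotistic", "nepotistic", "ok", "ok"]

# The feature with the highest gain ratio becomes the root of the tree.
best_feature = max(rows[0], key=lambda name: gain_ratio(rows, labels, name))
```

In this toy set "same_domain" separates the two classes perfectly while "same_ip" tells us nothing, so the first is chosen as the split.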
Language Model Disagreement
Further work has been carried out by Benczúr, Bíró, Csalogány and Uher from the University of Budapest. Their paper is called "Detecting Nepotistic Links by Language Model Disagreement". They look at down-weighting hyperlinks that are irrelevant to the target page, and they do this by using language model disagreement.
The good thing is that no manual intervention is needed, so there are no blacklists to maintain, for example. They "analyze inter-document relationship over the entire corpus by solving anchor text model comparison and prediction aggregation." Their method deals with link, content and anchor text spam all at the same time.
This algorithm picks out links where the language models of the two documents disagree. These links are fed into a PageRank calculation to produce an NRank value for each page, which they suggest should be subtracted from that page's PageRank score. The main measure used is the Kullback-Leibler divergence between the unigram models of the target and source pages. They still have some research to do here, which includes using n-grams, smoothing and different penalty functions.
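As a rough illustration of the measure involved, here is a small sketch of the Kullback-Leibler divergence between two unigram models. It is not the authors' implementation: the whitespace tokenizer, the add-one smoothing (needed here only to avoid dividing by zero) and the direction of the divergence, from source to target, are all assumptions.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Word probabilities over a shared vocabulary with add-alpha smoothing (an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same words."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def language_disagreement(source_text, target_text):
    """Higher values mean the two pages use more different vocabulary."""
    vocabulary = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocabulary)
    q = unigram_model(target_text, vocabulary)
    return kl_divergence(p, q)
```

A link whose source and target score far apart on a measure like this is a candidate for down-weighting, while two pages on the same topic should score close to zero.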
Qualified Links
Qi, Nie and Brian Davison (Lehigh University) recently looked at using content as well, but they approached things the other way around. Instead of looking for nepotistic links, they look for "qualified links", as in those qualified to recommend another source. They used similarity scores between the source and target page. These were:
- URL similarity
- topic vector
- tfidf content similarity
- tfidf non-anchor text similarity
They built an algorithm called "Qualified HITS", which differs from HITS because HITS treats all links equally. The similarity scores for each link's source and target pages are calculated and fed into a classifier, which either keeps the link or removes it from the graph. After that, each page neighbouring the bad links is checked. They report a 9% improvement. Their further work involves using a better classifier, setting a better weighting and finding a cheaper way to compute the similarity scores.
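One of those similarity scores, tfidf content similarity, can be sketched in a few lines. This is an illustration only and not the Qualified HITS code: the whitespace tokenizer, the particular idf formula and the function names are assumptions, and the real system combines several scores in a trained classifier rather than looking at one number.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """One sparse tf-idf vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # idf = log(N / df) + 1 is one common variant, chosen here as an assumption.
            term: (count / len(tokens)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

The source and target pages of a link would each get a vector, and a low cosine score between them is one hint that the link may not be a "qualified" recommendation.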
This kind of method, using a language model, seems to be far more realistic because the variables that are extracted from known nepotistic links allow for too much confusion. Some non-misleading links may have one or a couple of these traits. The language model is another variable, independent of the actual links.
As we can see from the method proposed below though, link based methods can indeed prove effective.
Random Walks from Spam Seed Sets
"Extracting link spam using biased random walks from spam seed sets" (2007), by Wu (Lehigh University) and Chellapilla (Microsoft), looked at using an "automated random walk" model to detect link farms and link exchanges. The system is given a link spam seed and the web graph, and the biased random walk is then used to extract the other members of the farm or exchange that the seed domain or page belongs to. The idea here is to help expand the blacklist.
The random walk is biased because it only jumps to neighbourhoods around the seed set through the use of "decay probabilities". There is of course more work to do, but they did achieve over 95.12% precision in extracting large farms and 80.46% in extracting link exchanges.
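To give a feel for the idea, here is a heavily simplified sketch. It is not the algorithm from the paper: instead of sampling walks it spreads expected visit mass outwards from the seed, and the decay value, step count, threshold and toy graph are all assumptions chosen for illustration.

```python
def biased_walk_scores(graph, seeds, decay=0.5, steps=4):
    """Expected visit mass of short walks that start at the seeds and decay each step."""
    mass = {seed: 1.0 / len(seeds) for seed in seeds}
    scores = dict(mass)
    for _ in range(steps):
        next_mass = {}
        for node, amount in mass.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue  # dead end: this part of the walk stops here
            # Only a decayed share moves on, which keeps the walk near the seeds.
            share = decay * amount / len(neighbours)
            for neighbour in neighbours:
                next_mass[neighbour] = next_mass.get(neighbour, 0.0) + share
        for node, amount in next_mass.items():
            scores[node] = scores.get(node, 0.0) + amount
        mass = next_mass
    return scores


def spam_candidates(scores, threshold=0.2):
    """Nodes visited often enough to be proposed for the blacklist."""
    return {node for node, score in scores.items() if score >= threshold}


# Toy graph (invented): a, b and c link to each other heavily; c also links out.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "hub"],
    "hub": ["news"],
    "news": ["blog"],
    "blog": [],
}
scores = biased_walk_scores(graph, ["a"])
farm = spam_candidates(scores)
```

Because the three farm pages keep passing visits back to each other, they collect far more of the mass than the pages the farm merely links out to, so only they clear the threshold.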
It is also worth noting that we're not just looking at paid links now, but at link spam in general. Paid links are a particular problem area but by no means the only one.

Buying links: Risk vs Reward
And so, should you participate in paid links or other link schemes? Search engines don't like it and you should really make sure that the links being bought are legitimate (for traffic/promotion - consider making sure the 'nofollow' is in place). As search engineers get better and better at detecting paid links, it is simply not worth doing in many cases.
Are they really getting better at detecting paid links? You tell me. More than a few link larcenists have gone underground or boarded up the shop altogether over the last few years, and that alone may be a ringing endorsement of their abilities.
Something to consider
Tools: you can run Webgraph if you like, and see how it deals with nepotistic links. Basically it uses some of Brian Davison's variables: "pages that are situated on the same host, corresponding to links made for hypertext navigational purposes rather than semantic similarity".
Papers: and here are some related research papers to learn more -
Recognizing Nepotistic Links on the Web (B.Davison) (2000)
Improving web spam classification using rank-time features (AIRWeb 2007)
KM Svore, Q Wu, CJC Burges, A Raman
Detecting nepotistic links by language model disagreement (AIRWeb 2006)
AA Benczúr, I Bíró, K Csalogány, M Uher
Undue influence: eliminating the impact of link plagiarism on web search rankings
B Wu, BD Davison - Proceedings of the 2006 ACM symposium on Applied computing, 2006
Detecting link spam using temporal information (ICDM-2006)
G Shen, B Gao, TY Liu, G Feng, S Song, H Li
Extracting link spam using biased random walks from spam seed sets
B Wu, K Chellapilla - Proceedings of the 3rd international workshop on Adversarial IR, 2007
Detecting spam web pages through content analysis
A Ntoulas, M Najork, M Manasse, D Fetterly - international conference on WWW, 2006
Adversarial Information Retrieval on the Web (AIRWeb 2007)
C Castillo, K Chellapilla, BD Davison

About CJ: she is a seriously obsessed search geek (SOSG) who is completing her PhD in Natural Language Processing and Artificial Intelligence. But her better qualities lie in having been an SEO practitioner for more than six years now as well. At least that's my take :0) - You can catch more writings on her blog, Science for SEO, and be sure to get hooked up with her on Twitter.
I want to thank CJ for once more dropping in to help out with her unique perspectives on search and SEO.
Comments
Congratz for another great post.
can the bots make a decision between Y! Directory which costs $300 and linkfarm.com which may charge $9.95/month...
It's going to take a while for me to work through all those resources but very grateful for them.
A post that will be safely banked in my bookmarks and returned to many times I bet - great work!
Nate
CJ is currently wandering down under in the OZ and I shall pass them along when she awakes...
@Nate... thanks... not sure if ya mean MY posts or CJs... either way - thanks for taking the time to comment...part of what makes blogging enjoyable!!
Nate
P.S. Some of my old sites that I used to manage have recently been flagged and lost all PR. They still rank well, though, which seems to prove that they were manual PR penalties for selling links
I'm glad you found the information useful and that it was accessible for you.
CJ
Navigating through the "spam" is not a happy experience. If the auto process of examining links can catch 95 percent it would be a huge improvement to the quality of the search results.
I hope to see some form of this implemented in the near future.
Great article CJ - I hope it catches on.
However, some major sites do openly sell links and never seem to be punished... I guess there will always be reasons other than technical ones influencing if, when and which link is to be punished.
By following relatively simple rules in regards to hosting, domain setup, ratio of external links and link neighborhoods we can still get loads of impact from paid links.
Have a look at the backlinks of the top 10 for any heavily monetized & competitive SERP... you'll see the engines seem to have hit a wall.
If paid links are set up correctly it is fundamentally impossible to tell the difference...
Search engines couldn't use this algo without understanding which nepotistic links were spam-promotion links and which were competition methods. They can, and I believe they do, understand which ones are nepotistic, but no penalty is being applied. When this global issue is solved, all sites that used this method will be penalized.
A separate problem with this is that it ignores the reality of demographic targeting. While many paid links are not bought with that kind of consideration in mind, I believe it's an emerging trend that will continue to grow as people go to buy links on a direct response model as well.
My only real comment here is that getting tough on backlinks in this manner can set you up for a competitor to create bad backlinks for you deliberately.
No they are probably not going to buy them but they might if the rewards are good enough.
The only safe way to deal with the problem links is to just ignore them, don't give the linked to party any reward for their efforts or investment and they'll quickly stop doing it.
My site has really great content, a very, very unique theme, and over 2,500 visitors a day. I check my backlinks every day, and I honestly cannot detect a single "organic" or natural link. Why aren't my visitors giving me links?
How many millions of visitors do you need each day before you start getting these mythical "organic" links? And how are you supposed to get millions of visitors if you aren't getting the "organic" links? This is very, very frustrating.