SEO Blog - Internet marketing news and views  

Hunting for paid links; a technical review

Written by David Harry   
Thursday, 05 February 2009 07:57

(Note; the following is a guest post from my good friend Miss CJ)

How search engineers look for nepotistic links

Paid links are a bit of a problem for search engine engineers because they can be misleading.  Some links are bought in order to boost rank and others are purchased for legitimate reasons, such as actually offering something interesting to a website visitor.  Not all links should be discounted, but different weights can be assigned so that less important links do not get a full "vote". 

It's not easy to differentiate between these, but there has been a fair bit of research on it that we can look at.  If search engines could discount misleading nepotistic links, their performance would improve.  In the SEO community, this would no doubt be received with mixed emotions.  Google uses methods to detect keyword spamming, for example, and uses other text-based methods, but its algorithm is still open to link spam.

But what are search engineers doing to combat nepotistic links in modern information retrieval?


Engineering link spam detection

Brian Davison, from Rutgers (New Jersey), has done a lot of research on this topic.  In “Recognizing Nepotistic Links on the Web”, he follows others in the community in interpreting nepotism as the “bestowal of patronage in consideration of relationship, rather than of merit.”  In other words, these are links that get counted when they shouldn't be. 

Some proposed methods are;

  1. Keeping a blacklist,
  2. Using a heuristic (rules) to discount the links,
  3. Figuring out when the results are spammy through post-processing. 

When we think about these techniques, we can see that they are all weak in some way.  Heuristics need to be discovered and need to be accurate; a blacklist isn't easy to compile, and many of these links wouldn't be caught by it; and detecting spammy results is not as easy as it sounds, because you first need to be able to assess when results are not spammy.  Brian Davison rightly points out that blacklists are unlikely to be used because engines could be accused of censorship.

 

Methods worth looking at;

First off, machine learning is a very useful area of research because it lets computers recognize patterns in data and then classify, or otherwise process, new data.  You can use "unsupervised learning", which means that you let the system figure out the patterns for itself, or "supervised learning", where you give it a starting set to learn from that has been correctly annotated or processed.

Brian Davison experimented with the C4.5 algorithm (a decision tree method), used as a supervised technique.  All nepotistic links were labelled as such and then the computer was given a new set of documents.  This enabled the discovery of 75 features characteristic of nepotistic links. 

Some of these were: domain names were identical, host names without domains were identical, the pages shared at least some percentage of outgoing links, IP addresses were identical, and a number of other factors.

This work showed that machine learning methods like this are useful for detecting nepotistic links.
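To make the kind of link features described above more concrete, here is a minimal sketch in Python.  It is not Davison's feature set or code: the page records (dictionaries with "url", "ip" and "out_links"), the function names, the 50% overlap threshold and the naive "last two labels" definition of a domain are all assumptions made for illustration.  Features like these would then be handed to a classifier such as a decision tree.

```python
from urllib.parse import urlsplit


def split_host(url):
    """Split a URL's host into (host prefix, domain).

    Assumption: the domain is simply the last two labels of the host name,
    which is wrong for suffixes such as .co.uk but keeps the sketch short.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[:-2]), ".".join(labels[-2:])


def shared_outlink_fraction(links_a, links_b):
    """Fraction of outgoing links the two pages have in common (Jaccard overlap)."""
    a, b = set(links_a), set(links_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_features(source, target, overlap_threshold=0.5):
    """Binary features for one link; overlap_threshold is a hypothetical cut-off."""
    src_prefix, src_domain = split_host(source["url"])
    tgt_prefix, tgt_domain = split_host(target["url"])
    overlap = shared_outlink_fraction(source["out_links"], target["out_links"])
    return {
        "same_domain": src_domain == tgt_domain,
        "same_host_prefix": src_prefix == tgt_prefix,
        "same_ip": source["ip"] == target["ip"],
        "shares_outlinks": overlap >= overlap_threshold,
    }


# Hypothetical page records, using reserved example domains and addresses.
source_page = {
    "url": "http://www.example.com/a",
    "ip": "192.0.2.1",
    "out_links": ["http://x.example.org/", "http://y.example.org/", "http://z.example.org/"],
}
target_page = {
    "url": "http://shop.example.com/b",
    "ip": "192.0.2.1",
    "out_links": ["http://x.example.org/", "http://y.example.org/"],
}
features = link_features(source_page, target_page)
```

In this made-up example the two pages sit on the same domain and IP address and share most of their outgoing links, which is the sort of pattern a trained classifier might learn to treat as nepotistic.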

 

Language Model Disagreement

Further work has been carried out by Benczúr, Bíró, Csalogány and Uher from the University of Budapest.  Their paper is called "Detecting Nepotistic Links by Language Model Disagreement".  They look at down-weighting hyperlinks that are irrelevant to a target page, and they do this by using language model disagreement.

The good thing is that no manual intervention is needed, so there are no blacklists for example.  They "analyze inter-document relationship over the entire corpus by solving anchor text model comparison and prediction aggregation."  Their method deals with link, content and anchor text spam all at the same time. 

This algorithm picks out links where the language models of the two documents disagree.  These are fed into a PageRank calculation to obtain the NRank value of a page, which they suggest should be subtracted from that page's PageRank score.  The main measure used is the Kullback-Leibler divergence between the unigram models of the target and source pages.  They still have some research to do here, which includes using n-grams, smoothing and different penalty functions.
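As a rough illustration of the Kullback-Leibler divergence between two unigram models, here is a small Python sketch.  It is not the authors' implementation: the whitespace tokenizer, the add-one smoothing, the direction of the divergence (target relative to source) and the function names are all assumptions made for this example.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary):
    """Unigram probabilities over a shared vocabulary.

    Assumption: add-one smoothing, so that no word has zero probability.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[word] for word in vocabulary) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for models over the same vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


def language_model_disagreement(source_text, target_text):
    """Higher values mean the source and target pages use more different language."""
    vocabulary = set(source_text.lower().split()) | set(target_text.lower().split())
    target_model = unigram_model(target_text, vocabulary)
    source_model = unigram_model(source_text, vocabulary)
    return kl_divergence(target_model, source_model)


# Hypothetical page texts: one related pair and one unrelated pair.
related = language_model_disagreement(
    "python tutorial and guide", "python tutorial for beginners"
)
unrelated = language_model_disagreement(
    "cheap casino poker bonus", "python tutorial for beginners"
)
```

With these toy texts the unrelated pair scores higher than the related pair.  A link whose score passed some threshold would be a candidate for down-weighting.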

 

Qualified Links

Qi, Nie and Brian Davison (Lehigh University) recently looked at using content as well, but approached things the other way around.  Instead of looking for nepotistic links, they look for "qualified links", that is, links qualified to recommend another source.  They used similarity scores between the source and target pages. 

These were;

  1. URL similarity,
  2. topic vector similarity,
  3. tfidf content similarity,
  4. tfidf non-anchor text similarity. 

They built an algorithm called "Qualified HITS", which differs from HITS in that HITS treats all links equally.  The similarity scores for each link's source and target pages are calculated and fed into a classifier, which either keeps the link or removes it from the graph.  After that, each page neighbouring the bad links is checked.  They report a 9% improvement in results.  Their further work involves using a better classifier, setting a better weighting and finding a cheaper way to compute the similarity scores. 
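A tfidf content similarity score of the kind mentioned above could be computed along these lines.  This is only a sketch, not the Qualified HITS code: the whitespace tokenizer, the smoothed idf formula and the function names are assumptions, and in the paper a score like this is one input to a classifier rather than a decision on its own.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse tfidf vector (a dict of term weights) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            # Assumption: smoothed idf, so terms found in every document keep a small weight.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page texts: a source page, a related target and an unrelated target.
pages = [
    "link spam detection on the web",
    "detecting link spam with web graphs",
    "chocolate cake recipe with frosting",
]
source_vec, related_vec, unrelated_vec = tfidf_vectors(pages)
related_score = cosine_similarity(source_vec, related_vec)
unrelated_score = cosine_similarity(source_vec, unrelated_vec)
```

Here the source page shares several terms with the related target and none with the unrelated one, so the first score is positive and the second is zero.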

This kind of method, which uses a language model, seems far more realistic, because the variables extracted from known nepotistic links leave too much room for confusion: some non-misleading links may have one or two of these traits.  The language model is another variable, independent of the actual links. 

As we can see from the method proposed below though, link based methods can indeed prove effective.

 

Random Walks from Spam Seed Sets

"Extracting link spam using biased random walks from spam seed sets" (2007), from Wu (Lehigh University) and Chellapilla (Microsoft), looks at using an "automated random walk" model to detect link farms and link exchanges.  The system is given a link spam seed and the web graph, and the biased random walk is used to extract the other members of the link farm or link exchange that the seed domain or page belongs to.  The idea here is to help expand the blacklist. 

The random walk is biased because it only jumps to neighbourhoods around the seed set through the use of "decay probabilities".  There is of course more work to do, but they did achieve over 95.12% precision in extracting large farms and 80.46% in extracting link exchanges.
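The general idea of a walk that stays close to a seed set can be sketched as a random walk with restart.  This is not the algorithm from the paper: the single fixed decay probability, the restart rule, the toy graph and every name below are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def biased_walk_visits(graph, seeds, decay=0.85, steps=20000, rng_seed=0):
    """Count node visits for a walk that keeps jumping back to the seed set.

    At each step the walk follows a random outgoing link with probability
    `decay`; otherwise (or at a dead end) it restarts at a random seed.
    Nodes visited often are closely tied to the seeds by links.
    """
    rng = random.Random(rng_seed)
    seed_list = sorted(seeds)
    visits = Counter()
    node = rng.choice(seed_list)
    for _ in range(steps):
        visits[node] += 1
        out_links = graph.get(node, [])
        if out_links and rng.random() < decay:
            node = rng.choice(out_links)
        else:
            node = rng.choice(seed_list)
    return visits


# Hypothetical web graph: three tightly interlinked spam pages and two ordinary pages.
toy_graph = {
    "spam1": ["spam2", "spam3"],
    "spam2": ["spam1", "spam3"],
    "spam3": ["spam1", "spam2", "good1"],
    "good1": ["good2"],
    "good2": [],
}
visits = biased_walk_visits(toy_graph, seeds={"spam1"})
# Pages sorted from most to least visited.
ranking = [node for node, _ in visits.most_common()]
```

In this toy graph the three interlinked spam pages collect most of the visits, while the ordinary pages they happen to link to collect far fewer, so a visit-count threshold would separate the two groups.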

It is also necessary to note that we're not just looking at paid links now, but also at link spam in general.  Paid links are a particular problem area but by no means the only one. 


 

Buying links; Risk V Reward

And so, should you participate in paid links or other link schemes? Search engines don't like it and you should really make sure that the links being bought are legitimate (for traffic/promotion - consider making sure the 'nofollow' is in place). As search engineers get better and better at detecting paid links, it is simply not worth doing in many cases.

Are they really getting better at detecting paid links? You tell me. There have been more than a few link larcenists that have gone underground or boarded the shop up altogether over the last few years – that alone may be a ringing endorsement of their abilities.

Something to consider….

 

Tools: you can run Webgraph if you like, and see how it deals with nepotistic links.  Basically it uses some of Brian Davison's variables: "pages that are situated on the same host, corresponding to links made for hypertext navigational purposes rather than semantic similarity".

 

Papers: and here are some related research papers to learn more -

Recognizing Nepotistic Links on the Web (B.Davison) (2000)

Improving web spam classification using rank-time features (AIRWeb 2007)
KM Svore, Q Wu, CJC Burges, A Raman

Detecting nepotistic links by language model disagreement (AIRWeb 2006)
AA Benczúr, I Bíró, K Csalogány, M Uher

Undue influence: eliminating the impact of link plagiarism on web search rankings
B Wu, BD Davison - Proceedings of the 2006 ACM symposium on Applied computing, 2006

Detecting link spam using temporal information (ICDM-2006)
G Shen, B Gao, TY Liu, G Feng, S Song, H Li

Extracting link spam using biased random walks from spam seed sets
B Wu, K Chellapilla - Proceedings of the 3rd international workshop on Adversarial IR, 2007

Detecting spam web pages through content analysis
A Ntoulas, M Najork, M Manasse, D Fetterly - international conference on WWW, 2006

Adversarial Information Retrieval on the Web (AIRWeb 2007)
C Castillo, K Chellapilla, BD Davison

 

A bridge between worlds

About CJ; she is a seriously obsessed search geek (SOSG) who is completing her PhD in Natural Language Processing and Artificial Intelligence. But her better qualities lie in being an SEO practitioner for more than six years now as well. At least that's my take :0) - You can catch more of her writing on her blog, Science for SEO, and be sure to get hooked up with her on Twitter.

I want to thank CJ for once more dropping in to help out with her unique perspectives on search and SEO…

 

Comments  

 
+2 # Federico 2009-02-05 14:56
Pretty interesting to keep me reading all the way. Great research papers, and a great way to learn the discarding process of paid links in search engines.
Congratz for another great post.
 
 
0 # Shark SEO 2009-02-06 05:12
Really, really interesting piece. It actually got me thinking about the difference between paid links and nepotistic links. It made me think of blogrolls, for some reason.
 
 
+2 # david 2009-02-06 07:36
but can you punish a friendly community site that just wants to promote everyone...?

can the bots make a decision between Y! Directory which costs $300 and linkfarm.com which may charge $9.95/month...
 
 
0 # Ben McKay 2009-02-06 07:39
Truly awesome piece of writing - very helpful indeed.

It's going to take a while for me to work through all those resources but very grateful for them.

A post that will be safely banked in my bookmarks and returned to many times I bet - great work!

:-)
 
 
0 # Nate at Plasticprinters 2009-02-06 11:12
Just wanted to thank you for the post. I have read quite a few of your posts off of sphinn and figured I would comment on one! My company has been thinking about purchasing links and I am glad I read your article!

Nate
 
 
0 # Dave 2009-02-06 11:20
Hey gang... thanks as always for all the kind words - ya'll ROCK!!

CJ is currently wandering down under in the OZ and I shall pass them along when she awakes...

@Nate... thanks... not sure if ya mean MY posts or CJs... either way - thanks for taking the time to comment...part of what makes blogging enjoyable!!
 
 
0 # Nate at Plasticprinters 2009-02-06 13:52
@ Dave... sorry I'm not sure, I have sphinn as my homepage and have read a lot of articles posted by the same avatar.... Anyways take care! Keep up the good work!

Nate
 
 
0 # Hugo 2009-02-06 13:55
Great post! You can count me as one of those folks that boarded up shop. I did so back in 2007, when it became apparent that the writing was on the wall. I miss cashing several $K per month for doing next to nothing...

P.S. Some of my old sites that I used to manage have recently been flagged and lost all PR. They still rank well, though, which seems to prove that they were manual PR penalties for selling links
 
 
0 # DianeV 2009-02-08 09:31
Appreciate the plain-English style of writing that makes your point loud and clear. Kudos.
 
 
0 # CJ 2009-02-08 20:17
Thanks people,

I'm glad you found the information useful and that it was accessible for you.

CJ :-)
 
 
0 # Daniel F 2009-02-09 04:16
So according to this research, if a link is purchased on a website with related content to the page its linking to, it won't be "discovered"?
 
 
0 # CJ 2009-02-09 06:27
As long as it's not considered "misleading". It will be picked up but will be weighted differently to the malicious links.
 
 
0 # John 2009-02-09 09:04
Good article. I'm also assuming that the engines can investigate purchased links when users report paid links.
 
 
0 # Nick Stamoulis 2009-02-09 11:10
I would never risk purchasing a link especially on a website that is your livelihood. Getting your website unlisted is not worth it.
 
 
0 # gaver 2009-02-09 13:44
Great article - something definitely needs to be done. Casual research (real searches for content) gets pretty frustrating on Google when nearly 20 percent of all links returned are link farms or sites made for adwords / adsense.

Navigating through the "spam" is not a happy experience. If the auto process of examining links can catch 95 percent it would be a huge improvement to the quality of the search results.

I hope to see some form of this implemented in the near future.

Great article CJ - I hope it catches on.
 
 
0 # Mart 2009-02-09 15:21
Thanks a lot. It seems to me that these automatic techniques will be quite some sharp instrument in the hands of human anti link spam operators.

However, some major sites do so openly sell links and never seem to be punished... I guess there will always be other than technical reasons having an influence, if, when and which link is to be punished.
 
 
-1 # PT 2009-02-10 04:54
Seems to me most of these filtering factors are already in place.

By following relatively simple rules in regards to hosting, domain setup, ratio of external links and link neighborhoods we can still get loads of impact from paid links.

Have a look at the backlinks of the top 10 for any heavily monetized & competitive SERP... you'll see the engines seem to have hit a wall.

If paid links are set up correctly it is fundamentally impossible to tell the difference...
 
 
-1 # kaosty 2009-02-10 10:28
well its all about detecting nepotistic links. even if someone finds a 100% working algo it makes no sense.

search engines couldn't use this algo without understanding which nepotistic links were spam-promotion links and which were competition methods. they can and i believe they do understand which ones are nepotistic but no penalty is being applied. when these global issues are solved - all sites that used this method will be penalized.
 
 
0 # Jason 2009-02-10 14:07
Very interesting article on this subject. It will be interesting to see what changes are made over the next few years to the algorithms that automate a lot of these detections and what the penalties will be. It will also be interesting to see what legitimate sites are caught in the crossfire.
 
 
0 # Gab Goldenberg - paid link evo 2009-02-22 16:31
So the big brands like Unilever will see the value of their links to affiliated products/companies dropped? And all those links to Google News integrated in Google Blogger will be dropped? I don't think so...

A separate problem with this is that it ignores the reality of demographic targeting. While many paid links are not bought with that kind of consideration in mind, I believe it's an emerging trend that will continue to grow as people go to buy links on a direct response model as well.
 
 
0 # Gab Goldenberg - paid link evo 2009-02-22 16:32
I mention demographic targeting because in that case, the link may be irrelevant in the context, but valuable to the advertiser because of who is reading, rather than what. E.g. Targeting women for weight loss products.
 
 
+1 # Brent the Debt Relief Guy 2009-03-14 08:11
Fascinating post, thanks CJ & enjoy down-under.

My only real comment here is that getting tough on backlinks in this manner can set you up for a competitor to create bad backlinks for you deliberately.

No they are probably not going to buy them but they might if the rewards are good enough.

The only safe way to deal with the problem links is to just ignore them, don't give the linked to party any reward for their efforts or investment and they'll quickly stop doing it.
 
 
0 # Gab Goldenberg 2009-03-25 20:29
Bill Slawski's interview in Search Marketing Standard shared a number of factors SEs may look at. If you think about them, a number are excellent for finding paid links.
 
 
0 # Jack Strawman 2010-07-16 18:31
Okay, so Google doesn't want us to buy links. Fine, it's not worth the risk. So we spend years writing articles, posting to blogs, adding our sig. to forum posts, and begging other webmasters in our niche for one-way links. Years later we have improved, but not nearly to the degree we hoped. What's wrong? Why aren't we also acquiring "organic" or natural links?

My site has really great content, a very, very unique theme, and over 2,500 visitors a day. I check my backlinks every day, and I honestly cannot detect a single "organic" or natural link. Why aren't my visitors giving me links?
How many millions of visitors do you need each day before you start getting these mythical "organic" links? And how are you supposed to get millions of visitors if you aren't getting the "organic" links? This is very, very frustrating.
 
