SEO Blog - Internet marketing news and views  

Host level spam detection

Written by David Harry   
Wednesday, 08 April 2009 08:24

Stay away from bad neighbourhoods!

For starters, what is web spam and what’s its function? The patents we’re looking at today describe spam as websites constructed with random or targeted content and links in order “to trick the analysis algorithms used by search engines” into ranking the pages higher than they should (bit of an oxymoron play there). The end game, of course, is to monetize said traffic with varying forms of advertising… yada yada… we know the deal. And the fly in the ointment?

“However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host.”

So what’s a search engine to do? Welcome to the world of rare AIR (Adversarial Information Retrieval). Last time out CJ was walking us through some methods of Paid Link detection, and in the past we’ve covered link spam, phrase-based and temporal spam detection methods (to name a few). This time we’re going to look at Host Level Spam Detection.


Here are the Patents –

  • System and method for identifying spam hosts using stacked graphical learning
  • Method of detecting spam hosts based on propagating prediction labels
  • Detecting spam hosts based on clustering the host graph

Note; they say the approach can be applied at the server (IP) level, the site level or even the page level. This is worth bearing in mind whenever the term ‘host’ is used.

 

Host/Site Level Spam Detection

As with a lot of learning models these days, they start off with a seed set. In this case it is a set of hosts deemed spam or non-spam by a baseline classifier. Then, in one instance, they describe a random walk that is “modified in order to obtain a weighted or skewed characterization of the host”. Essentially it would look for linking anomalies common to spam hosts, which can then be used in weighting the results.

Now, using this modified random walk (RW) means they can either follow links from a known/classified spam host or use a probabilistic model to choose other hosts that fit the profile of likely/known spam hosts. Essentially it is a given that spam hosts link to other spam hosts in a higher proportion. They describe this scoring as a ‘characterization value’. To get a better feel for Yahoo’s evolved TrustRank, see last year’s post on HarmonicRank.
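To make the modified random walk a bit more concrete, here is a minimal Python sketch of one way to skew a walk toward known spam: a PageRank-style iteration whose restarts land only on the seed hosts. The patents don’t spell out an implementation, so the function name, the damping value, the iteration count and the example hosts are all assumptions for illustration, not the method itself.

```python
def biased_walk_scores(links, seeds, damping=0.85, iterations=50):
    """Random walk with restarts that land only on seed (known spam) hosts.

    links: dict mapping host -> list of hosts it links to (hypothetical data)
    seeds: set of hosts a baseline classifier has labelled as spam
    Returns a dict host -> score; the scores sum to 1.
    """
    hosts = set(links) | {t for targets in links.values() for t in targets} | set(seeds)
    restart = {h: (1.0 / len(seeds) if h in seeds else 0.0) for h in hosts}
    scores = dict(restart)
    for _ in range(iterations):
        # Every step, a share of the walk jumps back to the seed hosts.
        nxt = {h: (1.0 - damping) * restart[h] for h in hosts}
        for h in hosts:
            targets = links.get(h, [])
            if targets:
                share = damping * scores[h] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # A host with no out-links hands its mass back to the seeds.
                for s in seeds:
                    nxt[s] += damping * scores[h] / len(seeds)
        scores = nxt
    return scores


# Hypothetical host graph: three interlinked spam hosts and two clean ones.
links = {
    "spam-a.example": ["spam-b.example", "spam-c.example"],
    "spam-b.example": ["spam-a.example", "spam-c.example"],
    "spam-c.example": ["spam-a.example"],
    "news.example": ["blog.example"],
    "blog.example": ["news.example"],
}
scores = biased_walk_scores(links, seeds={"spam-a.example"})
```

Hosts that can’t be reached from the seed (news.example here) end up with a score of zero, while spam-c.example, which was never labelled, picks up a score purely from who links to it.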

From there they look at clusters of hosts based on how each host is linked to others. Each cluster can then be analyzed to establish whether it is a spam or non-spam cluster (of websites). The hosts in the cluster can then be reclassified based on the overall scoring. If a site within the cluster does not meet a minimum threshold of the cluster then its spam score stays the same.

Essentially by classifying and clustering Spam hosts based on inter-linkage, a spamicity score can be calculated at the host and cluster level… think of it as PageRank for spam.
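As a rough illustration of the cluster step, the sketch below averages the spam scores inside each cluster and only overrides the individual hosts when the cluster as a whole is clearly spam or clearly clean. The two cutoff values, the function name and the sample data are my own assumptions; the patents talk about thresholds without giving numbers we can rely on.

```python
def reclassify_clusters(clusters, spam_score, spam_cutoff=0.75, clean_cutoff=0.25):
    """Re-score hosts using the average spam score of their cluster.

    clusters: list of lists of hosts (produced by some graph clustering step)
    spam_score: dict host -> score in [0, 1] from a baseline classifier
    The cutoffs are illustrative assumptions, not values from the patents.
    """
    result = {}
    for members in clusters:
        average = sum(spam_score[h] for h in members) / len(members)
        for h in members:
            if average >= spam_cutoff:
                result[h] = 1.0            # whole cluster treated as spam
            elif average <= clean_cutoff:
                result[h] = 0.0            # whole cluster treated as clean
            else:
                result[h] = spam_score[h]  # mixed cluster: score stays the same
    return result


# Hypothetical clusters and baseline scores.
clusters = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
baseline = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.1, "e": 0.3, "f": 0.9, "g": 0.2}
reclassified = reclassify_clusters(clusters, baseline)
```

In this toy data host "c" gets pulled up to spam by its neighbours, "e" gets the benefit of the doubt, and the mixed cluster ("f", "g") is left alone.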

Note; They do mention classifications of spam sites based on links AND content, but I didn’t find much on the content spam detection, outside of a brief section on hidden text and cloaking. This makes me believe there is another patent in this series we’ve not found.


Modifying the results set

They describe this entire system as a module that is run on the results of a given query. The search engine does its usual retrieval, the host spamicity values are obtained for the sites in the results, and devaluation or outright removal from the SERP comes into play.

Valuations and classifications are based on commonality thresholds set by known web spam sites. Obviously there could be a variance in dampening depending on how one implemented it (i.e., slight devaluation for linking to a few bad neighbourhoods).
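Here is a small Python sketch of what such a re-ranking module could look like: hosts above a removal threshold are dropped from the results and the rest have their relevance dampened in proportion to their spamicity. The threshold, the penalty factor and all of the names are assumptions on my part, since we don’t know the real implementation.

```python
def rerank(results, spamicity, drop_at=0.9, penalty=0.5):
    """Devalue or remove results based on host spamicity.

    results: list of (host, relevance) pairs from the normal retrieval step
    spamicity: dict host -> score in [0, 1]; unknown hosts count as 0
    drop_at and penalty are illustrative assumptions.
    """
    kept = []
    for host, relevance in results:
        score = spamicity.get(host, 0.0)
        if score >= drop_at:
            continue  # outright removal from the results
        kept.append((host, relevance * (1.0 - penalty * score)))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical query results and spamicity values.
results = [("spam.example", 0.95), ("shady.example", 0.90), ("clean.example", 0.80)]
spamicity = {"spam.example": 0.95, "shady.example": 0.5}
ranked = rerank(results, spamicity)
ranked_hosts = [host for host, _ in ranked]
```

With these made-up numbers spam.example disappears entirely and shady.example slips below clean.example, which is the "slight devaluation" scenario described above.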

So ultimately it;

  1. Has a seed set of known spam hosts
  2. Uses a modified random walk model to classify more hosts
  3. Clusters hosts (into spam and non-spam)
  4. Re-classifies hosts in a cluster (as needed per thresholds)
  5. Sends spamicity values to re-rank query results

This does give us some interesting insight into ways that search engines look at spam clusters and lends more credence to alternate uses for Trust/Harmonic Rank approaches. As with many learning models, the seed set is somewhat subjective, though important. One would certainly want to stay on the safe side of any thresholds.

 

Who you link to – be a good neighbour

So what can we do with all this? If you’ve never read up on TrustRank or HarmonicRank concepts, it would be worth doing. They are essentially centered on the concept that ‘good’ sites link to other ‘good’ sites and the inverse, with spam, is also true (proportionately).

This means we should be wary of areas such as;

  • Link responsibly – due diligence should be taken whenever one links out to a source. When in doubt, slap a nofollow (link condom) on it.
  • Blog comments/forum/other – slight devaluations could be had if a handful of (followed) spam links that you haven’t picked up lead out from your site. Be careful with your strategy when users can add content to your site.
  • The switch – doing outbound link audits is generally a good idea to ensure sites you’ve linked to in the past are still upstanding citizens today.
  • Hacking – more and more we’ve seen link drops from SEO crackers; this is obviously something to be aware of.
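For the outbound link audits mentioned in the list above, a first pass can be as simple as listing every external link on a page and noting which ones carry rel="nofollow". The sketch below uses only Python’s standard library; the class name, the sample HTML and the host names are hypothetical, and judging whether a linked site is actually a bad neighbour is still up to you.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class OutboundLinkAudit(HTMLParser):
    """Collect external links, split by whether they carry rel="nofollow"."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        host = urlparse(href).netloc
        if not host or host == self.own_host:
            return  # relative or internal link, not an outbound one
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


# Hypothetical page fragment.
page = (
    '<p><a href="/about">About</a> '
    '<a href="http://partner.example/">Partner</a> '
    '<a rel="nofollow" href="http://unknown.example/x">Unvetted</a></p>'
)
audit = OutboundLinkAudit("mysite.example")
audit.feed(page)
```

The `audit.followed` list is the one to review by hand, since those are the links a TrustRank-style system would count as your endorsement.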

This set of patents highlights the need to understand the difference between good and bad neighbourhoods. Would I go as far as to look at the other sites hosted on your server? Possibly, as such methods could be used at the page, site or server level. It is entirely possible that if 90% of the sites/pages were deemed spam, you may get devalued at the cluster level. But it’s hard to say without knowing the implementation… Are any of the big 3 using host level spam detection? Likely… anyone care to ring ‘the Matt Phone’…?

 
Until next time…. Stay tuned

 

Notes;

Of interest are these 3 parts mentioned as component pieces;

PageRank; PageRank is a well known link-based ranking algorithm that computes a score for each page. Various measures related to the PageRank of a page and the PageRank of its in-link neighbors were calculated to obtain a total of 11 PageRank-based features.

TrustRank; Gyöngyi et al. (Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, 2004) introduced the idea that if a page has high PageRank, but it does not have any relationship with a set of known trusted pages then it is likely to be a spam page. TrustRank is an algorithm that, starting from a subset of hand-picked trusted nodes and propagating their labels through the Web graph, estimates a TrustRank score for each page. Using TrustRank the spam mass of a page, i.e., the amount of PageRank received from a spammer, may be estimated.

Truncated PageRank; Becchetti et al. described Truncated PageRank, a variant of PageRank that diminishes the influence of a page to the PageRank score of its neighbors.
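To give a feel for the Truncated PageRank idea, here is a rough Python sketch as I understand it: score contributions are summed over link paths of increasing length, and the contributions from the shortest paths (the closest neighbours) are simply thrown away. The normalisation, the handling of hosts without out-links and the parameter names are simplifying assumptions, so treat this as an illustration rather than the algorithm from the paper.

```python
def truncated_pagerank(links, truncate=2, damping=0.85, max_steps=30):
    """Sum PageRank-style contributions from paths longer than `truncate`.

    links: dict mapping host -> list of hosts it links to (hypothetical data)
    Mass reaching a host with no out-links is dropped, and the result is
    left unnormalised; both are simplifications for this sketch.
    """
    hosts = set(links) | {t for targets in links.values() for t in targets}
    current = {h: (1.0 - damping) / len(hosts) for h in hosts}  # paths of length 0
    total = {h: 0.0 for h in hosts}
    for step in range(1, max_steps + 1):
        nxt = {h: 0.0 for h in hosts}
        for h, targets in links.items():
            for t in targets:
                nxt[t] += damping * current[h] / len(targets)
        current = nxt
        if step > truncate:
            # Only paths longer than the cutoff count toward the score.
            total = {h: total[h] + current[h] for h in hosts}
    return total


# Hypothetical link farm: three throwaway hosts all pointing at one target.
farm = {"leaf1": ["target"], "leaf2": ["target"], "leaf3": ["target"], "target": []}
with_neighbours = truncated_pagerank(farm, truncate=0)
without_neighbours = truncated_pagerank(farm, truncate=1)
```

The target scores well when its direct in-links count (`truncate=0`) and drops to nothing once they are ignored (`truncate=1`), because nothing supports those in-links from further out.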

 

Get the Data; they also mentioned much of the initial research was constructed from this data set, publicly available (and here is the 2007 version).

 

Comments  

 
+1 # alexc 2009-04-09 07:31
Checking whether you host your site on exact IP or even IP subnet that may contain heavily spammed sites is essential: we've got free tool that shows such domains ranked by number of backlinks -
www.majesticseo.com/research/neighbourhood-checker.php
- usually heavily spammed domains would come up on top as spammers go for quantity of backlinks rather than quality.
 
 
0 # AndyW 2009-04-09 17:03
Dude, you are making my head spin :side:
 
 
0 # andrew 2009-04-10 00:29
but for many people who used shared hosting accounts they have no idea who else is on the server
 
 
0 # John 2009-04-10 02:50
andrew: You can run a reverse IP domain check to find out. Earlier this week, I decided to check some of my sites out and I found one host had 1,500 domains on my IP - several of them were really shady.

So, I got a dedicated IP and I've seen significant gains in just two days. Nothing miraculous, but the SERPs are a lot more stable for my sites and some phrases I could never make rank progress on are up 20-30 spots.
 
 
0 # Rob R. 2009-04-10 04:07
A very good post ! Thanx.
So, by choosing a new host for a new site you have to look what sites they are hosting right now :dry:
 
 
-1 # Home Terrorist 2009-04-10 05:48
Some of us don't need to worry about this you see...
 
 
-1 # Nick Stamoulis 2009-04-10 11:16
Search engine spam is def. something that gets in the way and is frustrating to deal with especially when you are trying to grow a business online the right way.
 
 
+1 # Guest 2009-04-11 09:07
@Alext thanks for the tool and pointers... care to write a post here about it? I'd be happy to publish it.

John - that is some interesting evidence, some testing would be interesting. See how much it really plays into it...

Rob - well, as always patents aren't always used. This one though, is a sensible one as most search engines use some type of 'TrustRank' concepts these days. That's why I felt like highlighting this one.

Nick - thanks for dropping in, I noticed you post on spelling it S-E-O not S-P-A-M - I do friggen hate comment spammers... never ending battle.
 
 
+2 # alexc 2009-04-22 20:10
Hey, sorry lost track of this conversation - would be happy to write a short post about it with some real life examples I came across with - drop me an email if you interested.
 
 
0 # rusli zainal sang visioner 2009-07-01 03:07
really informative.. how about comment spam? it's detected too or not? i go everywhere there is a comment spam. is it will penalized by gooler?
 
