
Host level spam detection

Written by David Harry   
Wednesday, 08 April 2009 08:24

Stay away from bad neighbourhoods!

For starters, what is web spam and what’s its function? In the patents we’re looking at today, they describe spam as websites constructed with random or targeted content and links in order “to trick the analysis algorithms used by search engines” into ranking the pages higher than they should (a bit of an oxymoron there). The end game, of course, being to monetize said traffic with varying forms of advertising… yada yada… we know the deal. And the fly in the ointment?

“However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host.”

So what’s a search engine to do? Welcome to the world of rare AIR (Adversarial Information Retrieval). Last time out, CJ walked us through some methods of paid link detection, and in the past we’ve covered link spam, phrase-based and temporal spam detection methods (to name a few) – this time we’re going to look at Host Level Spam Detection.


Here are the Patents –

  • System and method for identifying spam hosts using stacked graphical learning
  • Method of detecting spam hosts based on propagating prediction labels
  • Detecting spam hosts based on clustering the host graph

Note; they say this can be applied at the server (IP) level, the site level, or even the page level. This is worth bearing in mind whenever the term ‘host’ is used.

 

Host/Site Level Spam Detection

As with a lot of learning models these days, they start off with a seed set. In this case it is a set of hosts deemed spam or non-spam by a baseline classifier. Then, in one instance, they describe a random walk that is “modified in order to obtain a weighted or skewed characterization of the host”. Essentially, it would look for linking anomalies common to spam hosts, which can then be used in weighting the results.

Now, using this modified random walk means they can either follow links from a known/classified spam host or use a probabilistic model to choose other hosts whose profiles resemble likely/known spam hosts. Essentially, it is a given that spam hosts link to other spam hosts in a higher proportion than non-spam hosts do. They describe this scoring as a ‘characterization value’. To get a better feel for Yahoo’s evolved TrustRank, see last year’s post on HarmonicRank.
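To make the idea concrete, here is a minimal sketch (in Python) of a spam-seeded biased random walk over a host-level link graph – not the patent’s exact formulation. The restart step is skewed towards the known spam seeds, and the resulting visit frequency is treated as a rough ‘characterization value’; the graph, seed set, damping factor and iteration count are all illustrative assumptions.

    def biased_walk(host_graph, seed_hosts, damping=0.85, iterations=50):
        """host_graph: dict of host -> list of hosts it links out to."""
        hosts = list(host_graph)
        # Restart distribution: all probability mass on the seed hosts.
        restart = {h: (1.0 / len(seed_hosts) if h in seed_hosts else 0.0) for h in hosts}
        score = dict(restart)
        for _ in range(iterations):
            nxt = {h: (1.0 - damping) * restart[h] for h in hosts}
            for h in hosts:
                out_links = host_graph[h]
                if not out_links:
                    continue  # dangling host; its mass simply decays
                share = damping * score[h] / len(out_links)
                for target in out_links:
                    if target in nxt:
                        nxt[target] += share
            score = nxt
        return score  # higher = more strongly 'characterized' by the seed set

    # Toy host graph (hypothetical hostnames): two spammy hosts linking to each
    # other, one of which also links out to a clean host.
    graph = {
        "spam-a.example": ["spam-b.example"],
        "spam-b.example": ["spam-a.example", "clean.example"],
        "clean.example": ["news.example"],
        "news.example": ["clean.example"],
    }
    print(biased_walk(graph, seed_hosts={"spam-a.example", "spam-b.example"}))

Run on that toy graph, the two seeded hosts keep most of the mass, the clean host only picks up the trickle it receives from spam-b, and the unaffiliated news host scores lowest.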

From there, they look at clusters of spam hosts based on how each host is linked to the others. These clusters can then be analyzed to establish whether each is a spam or non-spam cluster (of websites). The hosts in a cluster can then be reclassified based on the overall scoring. If a site within the cluster does not meet a minimum threshold for the cluster, its spam score stays the same.

Essentially, by classifying and clustering spam hosts based on inter-linkage, a spamicity score can be calculated at the host and cluster level… think of it as PageRank for spam.
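As a rough illustration of that cluster step (again, my own sketch rather than the patent’s method), assume clusters are simply connected components of the host graph and that a cluster’s spamicity is the mean of its members’ scores; both thresholds are made-up values.

    def cluster_hosts(host_graph):
        """Group hosts into clusters via connected components of the link graph."""
        neighbours = {h: set() for h in host_graph}
        for h, links in host_graph.items():
            for t in links:
                if t in neighbours:  # ignore links pointing outside the graph
                    neighbours[h].add(t)
                    neighbours[t].add(h)
        seen, clusters = set(), []
        for h in host_graph:
            if h in seen:
                continue
            stack, cluster = [h], set()
            while stack:
                cur = stack.pop()
                if cur in cluster:
                    continue
                cluster.add(cur)
                stack.extend(neighbours[cur] - cluster)
            seen |= cluster
            clusters.append(cluster)
        return clusters

    def reclassify(spamicity, clusters, cluster_threshold=0.5, host_threshold=0.2):
        """Pull a host's score up to its cluster's average when the cluster looks
        spammy and the host itself meets a minimum threshold; otherwise the
        host's own score stays the same."""
        updated = dict(spamicity)
        for cluster in clusters:
            avg = sum(spamicity[h] for h in cluster) / len(cluster)
            if avg >= cluster_threshold:
                for h in cluster:
                    if spamicity[h] >= host_threshold:
                        updated[h] = max(updated[h], avg)
        return updated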

Note; they do mention classifying spam sites based on links AND content, but I didn’t find much on content spam detection beyond a brief section on hidden text and cloaking. This makes me believe there is another patent in this series we’ve not found.


Modifying the results set

They describe this entire system as a module that is run on the results of a given query. The search engine does its usual retrieval, the host spamicity values are obtained for the returned sites, and then devaluation or outright removal from the SERP comes into play.

Valuations and classifications are based on commonality thresholds set by known web spam sites. Obviously, there could be variance in the dampening depending on how one implemented it (i.e., a slight devaluation for linking to a few bad neighbourhoods).
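Here is a minimal sketch of how such a re-ranking module might apply host spamicity to a result set; the thresholds and the dampening curve are illustrative assumptions on my part, not values from the patents.

    def rerank(results, spamicity, remove_at=0.9, dampen_at=0.5):
        """results: list of (host, relevance_score) pairs from normal retrieval."""
        reranked = []
        for host, score in results:
            s = spamicity.get(host, 0.0)
            if s >= remove_at:
                continue                  # outright removal from the SERP
            if s >= dampen_at:
                score *= (1.0 - s)        # devalue in proportion to spamicity
            reranked.append((host, score))
        return sorted(reranked, key=lambda pair: pair[1], reverse=True)

    results = [("clean.example", 0.80), ("spam-a.example", 0.90), ("grey.example", 0.70)]
    print(rerank(results, {"spam-a.example": 0.95, "grey.example": 0.60}))

In this toy run, the obvious spam host is dropped entirely, the borderline host is pushed down the page, and the clean host keeps its original position.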

So, ultimately, it;

  1. Has a seed set of known spam hosts
  2. Uses a modified random walk model to classify more hosts
  3. Clusters hosts (into spam and non-spam)
  4. Re-classifies hosts within a cluster (as needed, per thresholds)
  5. Sends spamicity values to re-rank query results

This does give us some interesting insight into the ways search engines look at spam clusters, and lends more credence to alternate uses for Trust/Harmonic Rank approaches. As with many learning models, the seed set is somewhat subjective, though important. One would certainly want to stay on the safe side of any thresholds.

 

Who you link to – be a good neighbour

So what can we do with all this? If you’ve never read up on TrustRank or HarmonicRank concepts, it would be worth doing. They are essentially centered on the concept that ‘good’ sites link to other ‘good’ sites and the inverse, with spam, is also true (proportionately).

This means we should be wary of areas such as;

  • Link responsibly – due diligence should be taken whenever one links out to a source. When in doubt, slap a nofollow (link condom) on it.
  • Blog comments/forums/other – slight devaluations could result from a handful of (followed) spam links leading from your site that you haven’t picked up on. Be careful with your strategy when users can add content to your site.
  • The switch – doing outbound link audits is generally a good idea, to ensure the sites you’ve linked to in the past are still upstanding citizens today.
  • Hacking – more and more we’ve seen link drops from SEO crackers; this is obviously something to be aware of.

This set of patents highlights the need to understand the difference between good and bad neighbourhoods. Would I go as far as looking at the other sites hosted on your server? Possibly, as such methods could be used at the page, site or server level. It is entirely possible that if 90% of the sites/pages were deemed spam, you might get devalued at the cluster level. But it’s hard to say without knowing the implementation… Are any of the big 3 using host level spam detection? Likely… anyone care to ring ‘the Matt Phone’…?

 
Until next time…. Stay tuned

 

Notes;

Of interest are these 3 parts mentioned as component pieces;

PageRank; PageRank is a well known link-based ranking algorithm that computes a score for each page. Various measures related to the PageRank of a page and the PageRank of its in-link neighbors were calculated to obtain a total of 11 PageRank-based features.

TrustRank; Gyöngyi et al. (Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, 2004) introduced the idea that if a page has high PageRank, but it does not have any relationship with a set of known trusted pages, then it is likely to be a spam page. TrustRank is an algorithm that, starting from a subset of hand-picked trusted nodes and propagating their labels through the Web graph, estimates a TrustRank score for each page. Using TrustRank, the spam mass of a page, i.e., the amount of PageRank received from a spammer, may be estimated.

Truncated PageRank; Becchetti et al. described Truncated PageRank, a variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors.
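Purely as a back-of-the-envelope illustration of the TrustRank/spam-mass idea above, the biased_walk routine from the sketch earlier in the post can be reused with trusted hosts as the seed set instead of spam hosts; the ratio used here for spam mass is my own simplification, not the paper’s exact formula.

    def estimate_spam_mass(host_graph, trusted_seeds):
        # Ordinary PageRank: restart uniformly over every host.
        pagerank = biased_walk(host_graph, seed_hosts=set(host_graph))
        # TrustRank-style score: restart only at the hand-picked trusted hosts.
        trust = biased_walk(host_graph, seed_hosts=trusted_seeds)
        # Rough spam mass: the share of a host's score not backed by trust.
        return {
            host: max(0.0, 1.0 - trust[host] / pagerank[host])
            for host in host_graph
            if pagerank[host] > 0
        }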

 

Get the Data; they also mentioned that much of the initial research was constructed from this data set, which is publicly available (and here is the 2007 version).

 
