Stay away from bad neighbourhoods!
For starters, what is web spam and what is its function? In the patents we're looking at today, they describe spam as websites constructed with random or targeted content and links in order to trick the analysis algorithms used by search engines into ranking the pages higher than they should (a bit of an oxymoron play there). The end game, of course, being to monetize said traffic with varying forms of advertising.
We know the deal. And the fly in the ointment?
However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host.
So what's a search engine to do? Welcome to the world of rare AIR (Adversarial Information Retrieval). Last time out CJ was walking us through some methods of paid link detection, and in the past we've covered link spam, phrase-based and temporal spam detection methods (to name a few). This time we're going to look at host-level spam detection.
Here are the patents:
- System and method for identifying spam hosts using stacked graphical learning
- Method of detecting spam hosts based on propagating prediction labels
- Detecting spam hosts based on clustering the host graph
Note: they say this can be applied at the server (IP) level, the site level, or even the page level. This is worth bearing in mind whenever the term host is used.
Host/Site Level Spam Detection
As with a lot of learning models these days, they start off with a seed set. In this case it is a set of hosts deemed spam or non-spam by a baseline classifier. Then, in one instance, they describe a random walk that is modified in order to obtain a weighted or skewed characterization of the host. Essentially it looks for linking anomalies common to spam hosts, which can be used in weighting the results.
Now, using this modified random walk means they can either follow links from a known/classified spam host or use a probabilistic model to choose other likely profiles of likely/known spam hosts. Essentially it is a given that spam hosts link to other spam hosts in a higher proportion. They describe this scoring as a characterization value. To get a better feel for Yahoo's evolved TrustRank, see last year's post on HarmonicRank.
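To make the idea concrete, here is a minimal sketch of a spam-biased random walk. The function name, the bias parameter, and the thresholds are all my own assumptions for illustration; the patents don't spell out the exact walk, only that it is skewed toward known spam hosts to produce a characterization value.

```python
import random

def spam_characterization(graph, spam_seeds, walks=2000, steps=10, bias=0.8, seed=42):
    """Hypothetical sketch: start biased random walks from known spam hosts.
    With probability `bias`, prefer out-links to other known spam hosts;
    hosts visited more often accumulate a higher characterization value."""
    rng = random.Random(seed)
    visits = {h: 0 for h in graph}
    for _ in range(walks):
        node = rng.choice(spam_seeds)
        for _ in range(steps):
            out = graph.get(node, [])
            if not out:
                break  # dead end: restart with a fresh walk
            spammy = [h for h in out if h in spam_seeds]
            if spammy and rng.random() < bias:
                node = rng.choice(spammy)  # follow a link into known spam
            else:
                node = rng.choice(out)     # otherwise pick any out-link
            visits[node] += 1
    total = sum(visits.values()) or 1
    return {h: v / total for h, v in visits.items()}
```

The intuition matches the patents' premise: because spam hosts interlink disproportionately, a walk seeded and biased this way concentrates its visits on the spam neighbourhood.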
From there they look at clusters of spam hosts based on how each host is linked to others. These clusters can then be analyzed to establish whether each is a spam or non-spam cluster (of websites). The hosts in the cluster can then be reclassified based on the overall scoring. If a host within the cluster does not meet a minimum threshold, its spam score stays the same.
Essentially, by classifying and clustering spam hosts based on inter-linkage, a spamicity score can be calculated at both the host and cluster level. Think of it as PageRank for spam.
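A rough sketch of the cluster-and-reclassify step, assuming a simple connected-components clustering as a stand-in for whatever graph clustering the patent actually uses; the thresholds and function names are illustrative, not from the patents.

```python
def cluster_hosts(links):
    """Group hosts into clusters via undirected connected components of the
    host link graph (a simple stand-in for the patent's clustering)."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        clusters.append(comp)
    return clusters

def reclassify(clusters, spamicity, cluster_threshold=0.5, min_host_score=0.2):
    """If a cluster's mean spamicity crosses the threshold, pull member hosts
    up to that mean -- unless a host is below the minimum threshold, in which
    case its own score stays the same (per the patent's description)."""
    out = dict(spamicity)
    for comp in clusters:
        mean = sum(spamicity[h] for h in comp) / len(comp)
        if mean >= cluster_threshold:
            for h in comp:
                if spamicity[h] >= min_host_score:
                    out[h] = max(out[h], mean)
    return out
```

Note the escape hatch: a host sitting in a spammy cluster but scoring below the minimum threshold keeps its own score, which is the behaviour the patents describe.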
Note: they do mention classifications of spam sites based on links AND content, but I didn't find much on the content spam detection, outside of a brief section on hidden text and cloaking. This makes me believe there is another patent in this series we've not found.
Modifying the results set
They describe this entire system as a module that is run on the results of a given query. The search engine does its usual retrieval, the host spamicity values are obtained for the sites, and devaluation or outright removal from the SERP comes into play.
Valuations and classifications are based on commonality thresholds set by known web spam sites. Obviously there could be variance in dampening depending on how one implemented it (i.e., a slight devaluation for linking to a few bad neighbourhoods).
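A toy sketch of what that post-retrieval module might look like. The thresholds and the proportional-dampening formula are my assumptions; the patents only say devaluation or outright removal happens based on spamicity.

```python
def filter_results(ranked_hosts, spamicity, drop_above=0.8, dampen_above=0.4):
    """Hypothetical post-retrieval module: hosts over `drop_above` are removed
    from the SERP; hosts over `dampen_above` are pushed down by scaling their
    retrieval score. Thresholds are illustrative, not from the patents."""
    out = []
    for host, score in ranked_hosts:
        s = spamicity.get(host, 0.0)
        if s >= drop_above:
            continue                 # outright removal from the SERP
        if s >= dampen_above:
            score *= (1.0 - s)       # devaluation proportional to spamicity
        out.append((host, score))
    out.sort(key=lambda hs: hs[1], reverse=True)
    return out
```

Usage: feed it the retrieval-ranked `(host, score)` pairs and the spamicity map; borderline hosts slide down the rankings while blatant ones vanish.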
So it ultimately:
- Has a seed set of known spam hosts
- Uses a modified random walk model to classify more hosts
- Clusters hosts (into spam and non-spam)
- Re-classifies hosts in a cluster (as needed, per thresholds)
- Sends spamicity values to re-rank query results
This does give us some interesting insight into the ways that search engines look at spam clusters, and lends more credence to alternate uses for Trust/Harmonic Rank approaches. As with many learning models, the seed set is somewhat subjective, though important. One would certainly want to stay on the safe side of any thresholds.
Who you link to: be a good neighbour
So what can we do with all this? If you've never read up on TrustRank or HarmonicRank concepts, it would be worth doing. They are essentially centered on the concept that good sites link to other good sites and the inverse, with spam, is also true (proportionately).
This means we should be wary of areas such as:
- Link responsibly: due diligence should be taken whenever one links out to a source. When in doubt, slap a nofollow (link condom) on it.
- Blog comments/forums/other user content: slight devaluations could be had if you have a handful of (followed) spam links leading from your site that you may not have picked up. Be careful with your strategy when users can add content to your site.
- The switch: doing outbound link audits is generally a good idea to ensure sites you've linked to in the past are still upstanding citizens today.
- Hacking: more and more we've seen link drops from SEO crackers; this is obviously something to be aware of.
This set of patents highlights the need to understand the difference between good and bad neighbourhoods. Would I go as far as to look at the other sites hosted on your server? Possibly, as such methods could be used on a page, site or server level. It is entirely possible that if 90% of the sites/pages were deemed spam, you may get devalued at the cluster level. But it's hard to say without knowing the implementation.
Are any of the big 3 using host-level spam detection? Likely.
Anyone care to ring the Matt Phone?
Until next time. Stay tuned.
Of interest are these 3 parts mentioned as component pieces:
PageRank: PageRank is a well-known link-based ranking algorithm that computes a score for each page. Various measures related to the PageRank of a page and the PageRank of its in-link neighbors were calculated to obtain a total of 11 PageRank-based features.
TrustRank: Gyöngyi et al. (Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, 2004) introduced the idea that if a page has high PageRank but does not have any relationship with a set of known trusted pages, then it is likely to be a spam page. TrustRank is an algorithm that, starting from a subset of hand-picked trusted nodes and propagating their labels through the Web graph, estimates a TrustRank score for each page. Using TrustRank, the spam mass of a page, i.e., the amount of PageRank received from a spammer, may be estimated.
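The label-propagation idea behind TrustRank can be sketched as a PageRank whose restart vector is concentrated on the hand-picked trusted seeds. This is a simplified power-iteration version, assuming a plain dict-of-out-links graph; it is not the paper's exact formulation (dangling-node mass is simply allowed to drain here).

```python
def trustrank(graph, trusted_seeds, damping=0.85, iters=50):
    """Sketch of TrustRank: start all trust mass at hand-picked trusted
    nodes and propagate it through out-links, with damped restarts back
    to the seed set (i.e., a PageRank biased toward trusted pages)."""
    nodes = list(graph)
    seed_mass = {n: (1.0 / len(trusted_seeds) if n in trusted_seeds else 0.0)
                 for n in nodes}
    trust = dict(seed_mass)
    for _ in range(iters):
        # restart: (1 - damping) of the mass teleports back to trusted seeds
        nxt = {n: (1 - damping) * seed_mass[n] for n in nodes}
        for n, out in graph.items():
            if out:
                share = damping * trust[n] / len(out)
                for m in out:
                    nxt[m] += share  # trust flows along out-links
        trust = nxt
    return trust
```

Pages reachable from the seeds accumulate trust; an interlinked spam pocket with no in-links from the trusted region ends up with none, which is exactly the signal the spam-mass estimate builds on.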
Truncated PageRank: Becchetti et al. described Truncated PageRank, a variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors.
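A sketch of the truncation idea: compute the damped walk distributions level by level, but only start accumulating score after the first few link levels, so a page can't inflate its score purely through its close neighbours (the classic link-farm pattern). Parameter names and the step cap are my own choices.

```python
def truncated_pagerank(graph, truncation=2, damping=0.85, max_len=20):
    """Sketch of Becchetti et al.'s Truncated PageRank: like PageRank, but
    contributions from the first `truncation` link levels are discarded,
    reducing the influence of a page's immediate neighbourhood."""
    nodes = list(graph)
    n = len(nodes)
    dist = {v: 1.0 / n for v in nodes}      # uniform starting distribution
    score = {v: 0.0 for v in nodes}
    weight = 1 - damping
    for step in range(1, max_len + 1):
        nxt = {v: 0.0 for v in nodes}
        for v, out in graph.items():
            if out:
                share = dist[v] / len(out)
                for m in out:
                    nxt[m] += share          # one more hop along out-links
        dist = nxt
        weight *= damping                    # damped weight for paths of this length
        if step > truncation:                # only count paths longer than T
            for v in nodes:
                score[v] += weight * dist[v]
    return score
```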
Get the data: they also mentioned that much of the initial research was constructed from this data set, which is publicly available (and here is the 2007 version).