Microsoft on link spam; using temporal tracking
A recent Microsoft search patent describes a system that detects spam websites by looking at how the link information on a given page or set of pages changes over time. We recently covered some potential ways of going about this, with some analysis of Google, in Historical ranking factors for link builders; be sure to give that a read as well if you're in the mood to saunter down some related journeys. This time we have:
Detecting web spam from changes to links of websites
Filed December 14, 2006 : Published June 19, 2008
As we know from the last excursion, link activity over time can unlock potential spam signals for search engines to use. This can be done by looking at a variety of features of the associated link information. As with many analyses of this kind, the system uses a probabilistic model to judge what is and is not considered to be a spammy link profile.
This is not limited to inbound links; it applies to outbound links as well (because a link profile is more than just inbounds, right?). The main problems that web spam creates for a search engine are the obvious lack of meaningful search results and the bandwidth/spidering resources spent crawling and indexing spammy sites. So catching it means some yummy SERPs and is good for the bottom line as well!
Link spam temporal footprints
"Spamming" in general refers to a deliberate action taken to unjustifiably increase the popularity or importance of a web page or web site. In the case of link spamming, a spammer can manipulate links to unjustifiably increase the importance of a web page. For example, a spammer may increase a web page's hub score by adding out links to the spammer's web page.
Some examples of tactics link spammers may use also include:
- create a copy of an existing link directory to quickly create a very large out link structure.
- a spammer may provide a web page of useful information with hidden links to spam web pages.
- many web sites, such as blogs and web directories, allow visitors to post links. Spammers can post links to their spam web pages to directly or indirectly increase the importance of the spam web pages.
- a group of spammers may set up a link exchange mechanism in which their web sites point to each other to increase the importance of the web pages of the spammers' web sites.
By looking at the link profiles of spam sites, the search engine can create a template of their linking activity to enable further algorithmic seek-and-destroy adaptations.
As with many probabilistic systems, a set of training documents/websites can be used to train valuations of a spammy link profile. These can come from sites that received a manual review and were deemed to be spam websites. These become the base set used for teaching the algorithm(s) what to look for when crawling.
The Feature Set
To start training the system on what to look for, and to have something to build probability models upon, the patent describes a few features that can be used, added to or combined to identify spammy link profiles.
The direct features of a web site may include the rates at which links are added or removed, such as:
- in link growth rate; number of new inbound links between snapshots
- out link growth rate; new outbound links from the target site
- in link death rate; decrease in the number of inbound links to the website
- out link death rate; outbound links no longer present
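As a rough illustration of the four direct features listed above, here is a minimal sketch that compares two crawl snapshots of a site's links. The function name, the use of URL sets and the normalisation by the size of the earlier snapshot are all assumptions made for this example; the patent summary here only says the rates are measured between snapshots.

```python
def direct_features(prev_in, curr_in, prev_out, curr_out):
    """Link growth and death rates between two snapshots of one site.

    Each argument is a set of URLs. Dividing by the size of the earlier
    snapshot is an assumption of this sketch, not something the patent
    summary specifies.
    """
    def growth(prev, curr):
        # links present now that were absent in the earlier snapshot
        return len(curr - prev) / max(len(prev), 1)

    def death(prev, curr):
        # links present earlier that have since disappeared
        return len(prev - curr) / max(len(prev), 1)

    return {
        "in_growth": growth(prev_in, curr_in),
        "out_growth": growth(prev_out, curr_out),
        "in_death": death(prev_in, curr_in),
        "out_death": death(prev_out, curr_out),
    }


# Made-up snapshots for a single site.
prev_in = {"a.example", "b.example"}
curr_in = {"a.example", "c.example", "d.example", "e.example"}
prev_out = {"x.example", "y.example"}
curr_out = {"x.example"}

features = direct_features(prev_in, curr_in, prev_out, curr_out)
print(features)
```

In this toy case three inbound links appeared against a base of two, so the in link growth rate comes out well above the death rates, which is the kind of sudden jump a temporal model could pick up on.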
The neighbour features would measure the same parameters (the direct features) for the sites linking in as well as the sites that are linked out to.
The correlation features of a web site compare the variances between the direct and neighbour features.
The clustering feature of a web site looks at the rate of change between direct features and neighbour features.
The combine feature of a web site may include various combinations of the direct features, neighbour features, correlation features, and clustering features.
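The descriptions above don't come with formulas, so the following is only a sketch of how the pieces might be stitched into one feature vector. It assumes a neighbour feature is the average of a direct feature over a site's neighbours, and it uses the simple gap between a site's own value and that average as a crude stand-in for the correlation idea. All names and numbers are hypothetical.

```python
from statistics import mean


def neighbour_feature(site, neighbours, direct, name):
    """Average one direct feature over the sites linked to or from `site`."""
    values = [direct[n][name] for n in neighbours.get(site, []) if n in direct]
    return mean(values) if values else 0.0


def combined_vector(site, neighbours, direct, names):
    """Concatenate own value, neighbour average and their gap per feature.

    The gap is an assumed placeholder for the correlation features; the
    patent summary does not say how those are actually computed.
    """
    vector = []
    for name in names:
        own = direct[site][name]
        nb = neighbour_feature(site, neighbours, direct, name)
        vector.extend([own, nb, own - nb])
    return vector


# Made-up direct features for a target site and two of its neighbours.
direct = {
    "target.example": {"in_growth": 3.0},
    "n1.example": {"in_growth": 1.0},
    "n2.example": {"in_growth": 2.0},
}
neighbours = {"target.example": ["n1.example", "n2.example"]}

vector = combined_vector("target.example", neighbours, direct, ["in_growth"])
print(vector)
```

A vector like this, one per site, is the sort of input a classifier could be trained on once some sites have been labelled by hand.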
The main point being that once one has manually identified a variety of spam websites, a link profile emerges for each that can be added to the training set. From there it becomes potentially easier to identify (and graph) other potential spam documents. This is much like using Google Hacks to track down web spam footprints, but here link profile footprints are the weapon of choice. Training the system depends on which data variables best typify a spam profile.
Once a web site has been classified as a spam site, its rankings would be dropped and its crawl schedule terminated; pretty much as one would imagine.
I hear peeps screaming at their monitors about the chance of false positives and potential for unjust horrors; fear not. For starters there is plenty of room for combinations of factors and one would leave low thresholds in any learning model to begin with.
Also, such a system would likely merely flag the site or have another spam bot drop by for a closer look-see. From duplicate/autoGen content checkers to IP delivery/header spies, others would follow. I can't see a purely link-based spam detection being fail-safe enough to at least not have a few other sniffers pay a visit.
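To make that guess concrete, here is a tiny sketch of a flag-then-verify flow. This is purely my speculation put into code, not anything the patent describes: the threshold value, the check names and the action labels are all hypothetical.

```python
def triage(link_spam_score, secondary_checks, flag_threshold=0.8):
    """Decide what to do with a site given its link-profile spam score.

    In this sketch the link score alone never condemns a site; it only
    decides whether the (hypothetical) secondary checks get consulted.
    """
    if link_spam_score < flag_threshold:
        return "no_action"
    if any(secondary_checks.values()):
        return "classify_as_spam"
    return "keep_watching"


print(triage(0.9, {"duplicate_content": False, "cloaking": True}))
print(triage(0.9, {"duplicate_content": False, "cloaking": False}))
print(triage(0.3, {"duplicate_content": True, "cloaking": True}))
```

The design choice here is simply that a suspicious link profile buys a site extra scrutiny rather than an automatic penalty, which is how false positives could be kept in check.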
There wasn't anything stunningly new here and, if I had to guess, it is more about defining their own computational assertions than it is about breaking new ground.
Once again, for more link related temporal goodiness read; A link builders guide to historical ranking factors.
Until next time; stay tuned.
**Note: On the computation side of things there was much talk of the training mechanisms; I'd check out some of the stuff on Sequential Minimal Optimization over at MS.
"The spam detection system may alternatively use an adaptive boosting technique to train the classifier. Adaptive boosting is an iterative process that runs multiple tests on a collection of training data. Adaptive boosting transforms a weak learning algorithm (an algorithm that performs at a level only slightly better than chance) into a strong learning algorithm (an algorithm that displays a low error rate). The weak learning algorithm is run on different subsets of the training data.
The algorithm concentrates more and more on those examples in which its predecessors tended to show mistakes. The algorithm corrects the errors made by earlier weak learners. The algorithm is adaptive because it adjusts to the error rates of its predecessors. Adaptive boosting combines rough and moderately inaccurate rules of thumb to create a high-performance algorithm. Adaptive boosting combines the results of each separately run test into a single, very accurate classifier."
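For the curious, the adaptive boosting idea in that quote can be sketched in a few lines. This is the textbook AdaBoost recipe with one-feature threshold rules ("stumps") as the weak learners, and it re-weights the training examples each round rather than literally running on different subsets. It is not the patent's implementation; the two features (in link growth, out link growth), the toy values and the labels (+1 spam, -1 not spam) are made up for illustration.

```python
import math


def train_stump(X, y, w):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] >= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def stump_predict(stump, row):
    _, f, t, sign = stump
    return sign if row[f] >= t else -sign


def adaboost(X, y, rounds=3):
    """Train a weighted ensemble of stumps on labels in {+1, -1}."""
    n = len(X)
    w = [1.0 / n] * n  # start with every example equally important
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)  # this weak learner's say
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified examples, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy rows: [in link growth, out link growth]; +1 = spam, -1 = not spam.
X = [[0.1, 0.1], [0.2, 0.9], [0.9, 0.2], [0.3, 0.2], [0.8, 0.8]]
y = [-1, 1, 1, -1, 1]

ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])
```

No single threshold rule separates this toy data, so one round on its own gets a site wrong; after the re-weighting pushes later rounds toward the examples earlier rules missed, the combined vote of three rules fits all five rows.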