In case you missed the memo
For the last while, covering some web spam detection methods has been on my mind. Whether you walk on the dark side or prance where daisies bloom, knowing how to stay in a search engine’s good graces is always a wise path to chart. To that end, Yahoo put out a patent that deals with link manipulation via reciprocal links. If you didn’t get the memo about recips being dead - at least excessive reciprocal linking - then do come along for the ride. (Seems Bill and I have the same Friday night reading, as he covered this link spam patent Saturday.)
The patent is;
Identifying excessively reciprocal links among web entities – Yahoo – Filed in July 2007 and assigned Jan 8 2009 - Converse; Timothy M.; (Sunnyvale, CA) ; Garg; Priyank Shankar; (San Jose, CA) ; Tsioutsiouliklis; Konstantinos; (San Jose, CA)
Of interest right away is that they mention this being related to;
Link-based spam detection – Yahoo – Filed August 2005 and assigned May 2006 - Barkhin; Pavel; (Sunnyvale, CA) ; Gyongyi; Zoltan Istvan; (Stanford, CA) ; Pedersen; Jan; (Los Altos Hills, CA)
So what? Are reciprocal links now synonymous with spam? Well, not entirely, but the folks at Yahoo aren’t thrilled about them. As you may imagine, few in the IR world are thrilled with link spam, and they spend a fair amount of time looking for ways to combat it. Some good reading can be found at the Annual Workshop on Adversarial IR.
What seems to be the problem?
While there are many ways of describing that itch they just can’t scratch, I think it’s summed up well with;
“Web page authors are often aware of the criteria that a search engine will use to rank and sort references to web pages.” And we muck about with said signals “in order to artificially inflate the rankings of references to their web pages within lists of search results”.
Sound familiar? It should you nasty little web authors, Santa ain’t bringing you no PageRank for Xmas this year. Sadly the genie is out of the bottle and sailed on that last ship leaving the port. Regardless of your level of intention, link building is a necessary part of modern internet marketing. Since I couldn’t afford Stanford, here we are and this is the side of the fence I landed on.
But wait there’s hope… because those bad search manipulators are using;
“Spurious references to web pages which are not useful for users and are meant to boost search rankings sometimes push poorer results above web pages that users have previously found interesting or valuable for legitimate reasons.”
Well damn, I’m merely promoting legitimate content to outrank those other nasty spammers… so that’s cool, right? But I digress… let’s move along.
Identifying reciprocal links
The main goal of the patent is to automatically identify (overdone) reciprocal links that may be orchestrated to inflate one’s rankings (Viagra for PageRank). While they offer the standard machine learning tactics to accomplish the automation, it is still based on initial human input to establish a training set. To me this is often the Achilles heel, as new tactics need to be identified by people before they can be fed into the system.
That being said, they do cover more than a few angles to get the job done. For starters they use the classic definition of a reciprocal link: a page you link out to also links back to you. If you have a page with 20 links on it… and the 20 pages it points to all link back, well… it’s likely you’re engaging in reciprocal linking.
In this first approach the number of recips on a page (or domain or sub-domain) is used to establish the likelihood that the page is using reciprocal links to inflate its rankings. If a page has 10 links on it… and each of those pages links back, it is the highest possible signal. By defining a ratio and a set of thresholds, pages with a high level of reciprocation can be identified.
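To make the ratio idea concrete, here is a minimal sketch. The graph layout, the function names and the 0.8 threshold are all my own assumptions for illustration; the patent does not publish its actual values.

```python
def reciprocal_ratio(page, outlinks):
    """Share of a page's outbound links whose targets link back.

    `outlinks` maps each page to the set of pages it links to.
    """
    targets = outlinks.get(page, set())
    if not targets:
        return 0.0
    reciprocated = {t for t in targets if page in outlinks.get(t, set())}
    return len(reciprocated) / len(targets)


# Hypothetical threshold, picked only for the example.
RATIO_THRESHOLD = 0.8


def is_suspicious(page, outlinks, threshold=RATIO_THRESHOLD):
    """True when the page's reciprocal ratio meets the threshold."""
    return reciprocal_ratio(page, outlinks) >= threshold


# Toy link graph: every page a.example/links points to links back.
graph = {
    "a.example/links": {"b.example", "c.example"},
    "b.example": {"a.example/links"},
    "c.example": {"a.example/links"},
    "d.example": {"a.example/links", "b.example"},
}

print(reciprocal_ratio("a.example/links", graph))
print(is_suspicious("d.example", graph))
```

A page where every outbound link is returned scores 1.0, which is the "highest possible signal" case; a page nobody links back to scores 0.0.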
At this point links can be devalued, rankings dropped or the page merely flagged for inspection (one would assume by other spam bots). Even pages with a lesser percentage of traditional recips could be flagged if there are enough recips on the page. Now if you say that isn’t a biggie, then you’d be right… so we move along.
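The two ways a page can get flagged (a high percentage of recips, or simply a large pile of them) might look something like this. Both thresholds and the reason labels are hypothetical, chosen only to show the two-test shape.

```python
def flag_page(total_links, reciprocal_links,
              ratio_threshold=0.8, count_threshold=50):
    """Return a reason string if the page should be flagged, else None.

    Both thresholds are made-up example values.
    """
    if total_links <= 0:
        return None
    ratio = reciprocal_links / total_links
    if ratio >= ratio_threshold:
        return "high_ratio"
    # A low percentage can still be flagged if the raw count is large.
    if reciprocal_links >= count_threshold:
        return "high_count"
    return None


print(flag_page(10, 10))
print(flag_page(1000, 60))
print(flag_page(1000, 5))
```

The second call is the interesting one: only 6% of the links are recips, but there are enough of them that the page still gets a red flag.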
Other types of reciprocal manipulations
Domain and Sub-domains – another method mentioned is looking at the big picture, that is, over all pages of a domain or sub-domain. You might get sneaky (or be a friendly blogger) and try masking recips by deep linking to each other’s websites or using sub-domains to get them in under the radar of the system described above. Not necessarily gonna’ work; these can also be analyzed and potentially marked for further inspection. The grouping could also be done by IP address, autonomous system, top level domain, logical site and so on.
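Here is a rough sketch of why deep linking doesn’t hide much: once page-level links are rolled up to the host level, the recips show up again. Grouping by hostname is my own simplification; rolling sub-domains up into one registrable domain would need something like a public suffix list, which this example skips. The URLs are invented.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Hostname part of a URL, or an empty string if there is none."""
    return urlsplit(url).hostname or ""


def host_graph(page_links):
    """Collapse (source_url, target_url) pairs into host-level edges."""
    edges = set()
    for src, dst in page_links:
        s, d = host_of(src), host_of(dst)
        if s and d and s != d:
            edges.add((s, d))
    return edges


def reciprocal_host_pairs(page_links):
    """Pairs of hosts that link to each other, whatever the pages involved."""
    edges = host_graph(page_links)
    return {tuple(sorted((s, d))) for s, d in edges if (d, s) in edges}


# No single page here is linked back by the page it points to.
links = [
    ("http://blog.a.example/post-1", "http://b.example/deep/page"),
    ("http://b.example/other/page", "http://blog.a.example/post-7"),
    ("http://blog.a.example/post-2", "http://c.example/"),
]

print(reciprocal_host_pairs(links))
```

At the page level nothing is reciprocal, but at the host level blog.a.example and b.example are plainly trading links.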
3 way and multilevel links – yeah, you were thinking it, weren’t you? So were they, and they term these ‘multi-level reciprocal links’. Essentially they are looking beyond simple site to site linking and watching for patterns among sites.
“ node 502 contains a two-level reciprocal link with node 506. Node 506 is an inlink to node 508, which is in turn an inlink to node 502. Node 506 is also an outlink of node 504, which is in turn an outlink of node 502. Similarly, nodes 504 and 508 contain two-level reciprocal links with each other.”
Point being, that by setting various thresholds and levels of inclusion, they can identify potential patterns for closer inspection.
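The node example quoted above can be checked with a few lines of code. This is only my reading of it: two nodes share a two-level reciprocal link when each can reach the other in exactly two hops. The function names and the `level` parameter are assumptions, not the patent’s terminology.

```python
def reachable_in(graph, start, steps):
    """Nodes reachable from `start` by following exactly `steps` links."""
    frontier = {start}
    for _ in range(steps):
        frontier = {nxt for node in frontier for nxt in graph.get(node, ())}
    return frontier


def has_multilevel_reciprocal(graph, a, b, level=2):
    """True if a reaches b and b reaches a in exactly `level` hops."""
    return (b in reachable_in(graph, a, level)
            and a in reachable_in(graph, b, level))


# The quoted example: 502 -> 504 -> 506 -> 508 -> 502
graph = {502: [504], 504: [506], 506: [508], 508: [502]}

print(has_multilevel_reciprocal(graph, 502, 506))
print(has_multilevel_reciprocal(graph, 504, 508))
print(has_multilevel_reciprocal(graph, 502, 504))
```

Raising `level` widens the net to longer link rings, which is presumably where the "levels of inclusion" thresholds come into play.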
Suspicious Clusters – ok, let’s say you want to take 5 pages with a limited number of links (in content let’s say) and then interlink them (much like the cross domain diagram above) in a planned manner to elude detection. Each page would get a benefit of 4 added links. This could go on and on and seems sneaky and waaay cool. Sorry Charlie… we got ya covered there as well. Please head straight to jail, don’t pass GO.
By analyzing the commonalities, patterns and ratios… they can detect anomalies worth closer inspection, ultimately leading to devaluation or lost rankings.
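One simple way to picture the cluster check is to measure how tightly a group of pages is interlinked. The density measure below is my own stand-in for illustration, not the patent’s formula, and it assumes the candidate group has already been picked by some other means.

```python
from itertools import combinations


def mutual_link_density(pages, outlinks):
    """Fraction of page pairs in `pages` that link to each other both ways."""
    pairs = list(combinations(list(pages), 2))
    if not pairs:
        return 0.0
    mutual = sum(
        1 for a, b in pairs
        if b in outlinks.get(a, set()) and a in outlinks.get(b, set())
    )
    return mutual / len(pairs)


# Five pages, each linking to the other four (the scenario in the text).
ring = ["p1", "p2", "p3", "p4", "p5"]
outlinks = {p: {q for q in ring if q != p} for p in ring}
outlinks["p6"] = {"p1"}  # an outsider with a one-way link in

print(mutual_link_density(ring, outlinks))
print(mutual_link_density(ring + ["p6"], outlinks))
```

The five-page ring comes out fully interlinked, while adding the outsider dilutes the score, so a threshold on this kind of number is one way a planned cluster could stand out from ordinary linking.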
How to deal with you nasty spammers
Now obviously as one reads along you can see how there is plenty of room for misinterpretation and the chance for false-positives. One such instance would be a group of pages/domains that are interlinked because they deal with related subject matter and have a valid reason to be linking to each other. Another instance could be a set of company websites that for obvious reasons are linking to each other… another legitimate instance.
This is why, for the most part, pages are marked as suspicious for further investigation. Or as common SEO terminology would have it, the proverbial red flag. They suggest human review, other automated mechanisms (machine learning/artificial intelligence), or a combination of both for instances of deeper analysis.
Such analysis could include sending over other spam-bots to check the page/domain for signs of other known manipulation tactics (link spam, keyword injections, cloaking, etc.). This secondary data can help develop a profile that signals the page is truly web-spam.
Now, in instances where there is a high degree of confidence in the manipulation, the page may be removed from the SERPs altogether. It is this automated penalty that you would certainly want to avoid, and it makes this journey worth the price of admission (I did mention it’s FREE to read the Trail, right?). Another implementation may devalue the links based on the ratio and perceived confidence in the analysis. Also not something one would want happening.
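As a rough illustration of the two outcomes (outright removal versus devalued links), consider the sketch below. The weighting formula and the 0.95 cut-off are entirely my own invention for the example; the patent only says the devaluation may depend on the ratio and on the confidence in the analysis.

```python
def link_weight(reciprocal_ratio, confidence, removal_confidence=0.95):
    """Multiplier for a page's link value, or None to drop the page.

    Hypothetical formula: the penalty grows with both the reciprocal
    ratio and the confidence that the linking is manipulative.
    """
    if not (0.0 <= reciprocal_ratio <= 1.0 and 0.0 <= confidence <= 1.0):
        raise ValueError("ratio and confidence must be between 0 and 1")
    if confidence >= removal_confidence:
        return None  # high confidence: remove from results altogether
    return 1.0 - reciprocal_ratio * confidence


print(link_weight(1.0, 0.99))
print(link_weight(0.5, 0.5))
print(link_weight(0.0, 0.5))
```

A page with no recips keeps its full link value, a middling case loses part of it, and a near-certain case is dropped.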
So what does this all mean to you?
They do mention a sort of ‘white list’ of trusted domains that are excluded from analysis, not unlike the TrustRank and more recent HarmonicRank concepts (from Yahoo). So one thing you should be doing (if you weren’t already) is being more vigilant about whom you link to and building general authority for your domain.
What else can you do? Simple, just don’t bother with reciprocal linking schemes in your SEO programs. There are plenty of ways to get valuable links these days (especially since the social web arrived) and recips really were dead long ago (U did get the memo right?).
It is always important to pay attention to your link profiles and to know how search engines treat link spam and, more specifically, reciprocal links. It can help ensure that you don’t cause yourself undue grief, if only by mistake or circumstance.
But seriously… skip the reciprocal links and build a strong link profile without them, it’s the best advice.
Til next time… happy linking!!
NOTE: As I mentioned it seems Bill and I have the same Friday night reading so be sure to check out his post on Link Spam for a TON of other resources and reading on this topic (tnx Bill, saved me having to list resources… weeee… )
A funny reciprocal links image - Matt Cutts
Link Building’s Cult Of Reciprocity - Search Engine Land
Google Adds "Excessive" To Link Exchange Guidelines - Search Engine Roundtable
Linking Schemes - Google Webmaster Help