How search engines consider link relevance
A good friend of mine sent me an interesting question a while back: "Dear Dave, can you give me some pointers on how search engines determine link intent?"
Well sheeeeit... it's hard to say what constitutes intent when it comes to various aspects of linking and search engines. It really isn't that straightforward, and understanding 'query intent' tends to be the most researched area for search peeps.
And as far as patents/papers go, I doubt there is really much out there specifically on intent beyond some papers such as Recognizing Nepotistic Links on the Web or Detecting Nepotistic Links by Language Model Disagreement. There is more in CJ's post on detecting paid links; there are a whack of papers at the end of that one.
Ultimately though...
It's all about the spam
Most such valuations (intent) would be in the link spam world, as this is where links are evaluated. From that point on, what actually counts as intent is defined by the engines themselves. What I mean is that an algorithm can merely look for elements common to those manipulating the index via links; it is up to the engineers to decide what to do with pages/sites above a given threshold.
To get some ideas we can look at this Google patent, Document scoring based on link based criteria, which has stuff such as:
Authority entities: one way to bypass the need to assess value/intent and the potential for link spam is to simply trust authority domains.
It may be possible for search engine (125) to make exceptions for documents that are determined to be authoritative in some respect, such as government documents, web directories (e.g., Yahoo), and documents that have shown a relatively steady and high rank over time. For example, if an unusual spike in the number or rate of increase of links to an authoritative document occurs, then search engine 125 may consider such a document not to be spam and, thus, allow a relatively high or even no threshold for (growth of) its rank (over time).
Temporal factors: essentially, link velocity and decay anomalies can be used as a flag for closer inspection, in this case the intent being to artificially inflate a link profile.
A typical, "legitimate" document attracts back links slowly. A large spike in the quantity of back links may signal a topical phenomenon (e.g., the CDC web site may develop many links quickly after an outbreak, such as SARS), or signal attempts to spam a search engine.
From: Historical ranking factors, or more specifically, Spam detection using temporal factors.
You can also look at this Microsoft patent on temporal link spam detection: Do link spammers leave footprints
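To make the link velocity idea a bit more concrete, here is a minimal Python sketch. It is not taken from either patent; the function name, the four-week window and the 3x ratio are all assumptions picked for illustration. It simply flags any week whose count of new backlinks sits far above the trailing median.

```python
from statistics import median


def flag_link_spikes(weekly_new_links, window=4, ratio=3.0):
    """Return the indexes of weeks whose new-backlink count is more than
    `ratio` times the median of the previous `window` weeks.

    `window` and `ratio` are made-up defaults, not values from any patent.
    """
    flagged = []
    for i in range(window, len(weekly_new_links)):
        baseline = median(weekly_new_links[i - window:i])
        # max(baseline, 1) keeps tiny counts on brand-new pages from tripping the flag
        if weekly_new_links[i] > ratio * max(baseline, 1):
            flagged.append(i)
    return flagged


print(flag_link_spikes([5, 6, 4, 5, 6, 40, 7]))
```

With those weekly counts, only week 5 (the jump to 40) gets flagged. A flag like this only says "look closer", since a hot topic produces exactly the same shape as a bought batch of links.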
Anchor text anomalies: a document that has a non-natural rate of growth often has spikes of new backlinks with similar/identical link text associated with it. Documents that show such spikes over time can have the links capped or otherwise devalued.
One reason for such spikiness may be the addition of a large number of identical anchors from many documents. Another possibility may be the addition of deliberately different anchors from a lot of documents.
Also from: the Link builders guide to historical ranking factors.
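As a rough illustration of the identical-anchor case, the sketch below measures how concentrated a batch of new backlinks is on a single anchor text. The function name and the sample data are hypothetical, and it does nothing about the "deliberately different anchors" case, which would need something smarter.

```python
from collections import Counter


def anchor_concentration(anchors):
    """Return (most common anchor text, its share of all anchors).

    Anchors are lowercased and stripped so trivial variations count together.
    """
    normalized = [a.strip().lower() for a in anchors if a.strip()]
    if not normalized:
        return None, 0.0
    text, count = Counter(normalized).most_common(1)[0]
    return text, count / len(normalized)


new_anchors = ["Cheap Widgets"] * 8 + ["Acme", "click here"]
print(anchor_concentration(new_anchors))
```

Here 8 of the 10 new links carry the same text, so the share comes back as 0.8. What share counts as "non-natural" would be a threshold the engineers choose.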
Document ranking: search engines may also look at historical ranking levels. This additional signal can be used to detect link intent when combined with other factors:
(...) search engine 125 may monitor the ranks of documents over time to detect sudden spikes in the ranks of the documents. A spike may indicate either a topical phenomenon (e.g., a hot topic) or an attempt to spam search engine 125 by, for example, trading or purchasing links. Search engine 125 may take measures to prevent spam attempts by, for example, employing hysteresis to allow a rank to grow at a certain rate. In another implementation, the rank for a given document may be allowed a certain maximum threshold of growth over a predefined window of time.
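The "allow a rank to grow at a certain rate" idea in that excerpt could look something like the sketch below. The patent does not spell out how the cap is computed, so the per-period percentage, the function name and the numbers are assumptions for illustration only.

```python
def cap_rank_growth(raw_scores, max_growth=0.10):
    """Limit how fast a score may rise from one period to the next.

    raw_scores: the uncapped score for each period, oldest first.
    max_growth: assumed maximum fractional increase per period (0.10 = 10%).
    Drops pass through untouched; only increases are capped.
    """
    if not raw_scores:
        return []
    capped = [raw_scores[0]]
    for raw in raw_scores[1:]:
        ceiling = capped[-1] * (1 + max_growth)
        capped.append(min(raw, ceiling))
    return capped


print(cap_rank_growth([100, 105, 300, 300]))
```

A raw jump from 105 to 300 is only credited at roughly 115.5 in the next period, then roughly 127 in the one after. A page that earned its links legitimately still gets there, just more slowly, which is the hysteresis effect the excerpt describes.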
They also discuss doorway domains and name server freshness, which could be used to assess the legitimacy of the inbound link and, ultimately, intent (in concurrence with other signals).
And that's just a few ways they can look for anomalies. What is important is that there is no singular approach to assessing the intent of a given link per se. It is more about looking at median scoring of a variety of factors, which can either trigger an algorithmic devaluation or raise a flag for closer (human) inspection. From there, obviously, human judgement becomes involved (see the Google quality rater document that came out a while back).
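To picture how several signals and a couple of thresholds might fit together, here is a small sketch. The signal names, the 0 to 1 scale and both cut-offs are invented for the example; nothing here is a documented search engine setting.

```python
from statistics import median


def triage_link_profile(signal_scores, review_at=0.5, devalue_at=0.8):
    """Combine per-signal suspicion scores (assumed to be in 0..1) and
    decide what happens next.

    Uses the median so a single noisy signal cannot decide the outcome alone.
    """
    score = median(signal_scores.values())
    if score >= devalue_at:
        return "devalue"        # algorithmic devaluation
    if score >= review_at:
        return "human_review"   # raise a flag for a person to look at
    return "ok"


profile = {"velocity": 0.9, "anchor_text": 0.85, "reciprocity": 0.2}
print(triage_link_profile(profile))
```

For that profile the median is 0.85, so the result is "devalue". Lower the anchor text score to 0.6 and the same profile would land in "human_review" instead.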
Another interesting Google patent is: Method for detecting link spam in hyperlinked databases
What else?
Some random thoughts...
Document relevance: this can also be used to assess the intent of links in a profile, as those above a given threshold can often mean there is hanky panky afoot. This speaks to intent as it can show the intent to manipulate.
Excessive reciprocals: another area is recips, or even 2-3-4 way links. If they establish that the expressed intent of such link building approaches is in play, this can also trigger a closer look. By defining a ratio and a set of thresholds, pages with a high level of reciprocation can be identified. Some recent stuff I covered from Yahoo sheds light on that end, which is based on the patent: Identifying excessively reciprocal links among web entities
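The "ratio and set of thresholds" part can be sketched in a few lines. This is only a guess at the general shape, not the method in the Yahoo patent; the function names, the 0.6 threshold and the example hosts are all hypothetical.

```python
def reciprocity_ratio(outbound_hosts, inbound_hosts):
    """Share of the hosts a site links out to that also link back to it."""
    outbound, inbound = set(outbound_hosts), set(inbound_hosts)
    if not outbound:
        return 0.0
    return len(outbound & inbound) / len(outbound)


def is_excessively_reciprocal(outbound_hosts, inbound_hosts, threshold=0.6):
    """True when the reciprocity ratio reaches the (assumed) threshold."""
    return reciprocity_ratio(outbound_hosts, inbound_hosts) >= threshold


links_out = ["a.example", "b.example", "c.example", "d.example"]
links_in = ["a.example", "b.example", "c.example", "z.example"]
print(reciprocity_ratio(links_out, links_in))
print(is_excessively_reciprocal(links_out, links_in))
```

Three of the four hosts linked to also link back, so the ratio is 0.75 and the site trips the 0.6 threshold. Catching 3-way and 4-way arrangements would need a walk over the link graph rather than a simple set intersection.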
TrustRank/Harmonic Rank: we can infer that these concepts can also be used to identify web spam and thus linking intent.
(... to) demote those hits whose effective mass renders them likely to be artificially boosted by link-based spam. The determination of the effective mass for a given web document relies on a combination of techniques that in part assess the discrepancy between the link-based popularity (e.g., PageRank) and the trustworthiness (e.g., TrustRank) of a given web document.
From: Link-based spam detection. More on Harmonic Rank in: Yahoo's Harmonic Rank
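The excerpt talks about a discrepancy between popularity and trust but gives no formula, so the sketch below is only an assumed stand-in to show the idea. It supposes both scores have already been put on the same scale, and the function name is made up.

```python
def relative_discrepancy(popularity, trust):
    """Fraction of a page's popularity score that is NOT backed by trust.

    Assumes `popularity` (PageRank-like) and `trust` (TrustRank-like) are
    non-negative numbers on the same scale. Returns a value in 0..1.
    """
    if popularity <= 0:
        return 0.0
    return max(0.0, (popularity - trust) / popularity)


print(relative_discrepancy(8.0, 2.0))   # popular, little trust
print(relative_discrepancy(8.0, 7.5))   # popular and trusted
```

The first page comes out at 0.75, meaning most of its popularity has no trust behind it, which is the sort of page that would be a candidate for demotion. The second scores well under 0.1 and would be left alone.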
Host level spam detection: this is another area somewhat related to Trust/Harmonic Rank type approaches. Once more, there is an inherent intent trying to be established, this time based on where a site/page lives, as well as touching on TrustRank type concepts.
Page segmentation: in this instance the links can be assessed by location. Let's say there are some links in the sidebar/footer that have the text "Advertisers" or "Sponsors" etc. This wouldn't be too difficult for an algorithm to detect and report back as potential manipulation, the intent being to game the engines with paid link spam. This method can also be used in context with other approaches mentioned already.
For more see: Page segmentation and link building, and the SEO implications of page segmentation
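Just to show how little it takes to spot the sidebar/footer "Sponsors" pattern, here is a toy sketch using Python's standard html.parser module. Treating footer and aside tags as the page segments, and the two-word label list, are simplifying assumptions; real page segmentation works on layout and is far more involved.

```python
from html.parser import HTMLParser

SUSPECT_LABELS = ("sponsor", "advertiser")   # hypothetical word list
BOILERPLATE_TAGS = {"footer", "aside"}       # assumed stand-ins for footer/sidebar


class SegmentLinkScanner(HTMLParser):
    """Collect links that sit in a footer/aside block containing a suspect label."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a footer/aside block
        self.text = []      # text collected inside the current block
        self.links = []     # hrefs collected inside the current block
        self.flagged = []   # hrefs from blocks that carry a suspect label

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1
        elif tag == "a" and self.depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self.depth:
            self.text.append(data.lower())

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Block finished: flag its links only if a label was seen in it
                block_text = " ".join(self.text)
                if any(label in block_text for label in SUSPECT_LABELS):
                    self.flagged.extend(self.links)
                self.text, self.links = [], []


def flag_segment_links(html):
    scanner = SegmentLinkScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.flagged


page = (
    '<main><a href="https://example.org/story">a story</a></main>'
    '<footer><h4>Our Sponsors</h4>'
    '<a href="https://example.com/widgets">cheap widgets</a></footer>'
)
print(flag_segment_links(page))
```

Only the footer link under the "Our Sponsors" heading is reported; the editorial link in the main content is ignored. On its own that is just a flag, which is why it works better alongside the other signals.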
And that's about what I have on the technical front... and if there ain't enough here, you could always have a look at the AIRWeb proceedings. Here are some papers from the 2009 edition:
The human element
Once more, assessing intent is often not as much an algorithmic activity as a human one (as humans set the training documents and thresholds). For starters, humans set the thresholds for link spam, and secondly, humans review complaints from the public. I can say that the recent case I was on about over at Greenpeace most certainly did have some human interactions that changed the approach taken.
As you may notice, the original campaign was altered after the post was written, so there was a human element on my part as well as the engines/authors. Intent often comes down to what I call "plausible deniability". This means you can't state that links are in any way a payment or trade for anything of value. Sure, it can be implied, but one can't overtly state it... ya know?
At the end of the day, intent will be more of a subjective assessment, made algorithmically, via human assessment, or a combination of both. By programming thresholds, the search engines can assign intent or send it up the line for human evaluation. There is no real 'intent algorithm', so to speak, that I know of, merely ways of identifying link spam, which is the embodiment of malicious intent on the part of the webmaster/optimizer.
I know these may not be clear-cut statements of how search engines look at link intent, but they really aren't that concerned with intent outside of the spam dept. By utilizing some of the methods mentioned above, in accordance with subjective judgements (TOS and guidelines), they assign intent accordingly.
For all those that were ever curious... I hope that helped :0)