A Google guide to spam detection
We all know your friend and mine, Matt Cutts, over at the Google web spam team,
but how exactly do they go about detecting and curtailing web spam? While there is a wide variety of tools at their disposal, historical factors are easily one of the more powerful methods.
In the last offering in our series on historical ranking factors, we're going to look at some of the ways they are used to detect spam. Now, this isn't an education for spammers so much as a way to give everyone some knowledge, in hopes that earnest efforts aren't needlessly flagged.
(The following items are based on Google historical ranking patents from the last few years.)
Obviously one of the more important areas for consideration is link velocity. When ranking documents, a search engine may deem a web page that is one day old with 10 backlinks to be more valuable than a 10-year-old document that has 100 backlinks. Conversely though, a more recent document with an abnormal rate of link growth may be considered an attempt to spam the search engine.
A typical legitimate document generally attracts links slowly, while a large spike on the link graph can certainly attract attention as a spammy attempt to manipulate the rankings. Sure, it can also denote a topical anomaly (breaking news, etc.), but nonetheless, further inspection by other bots may be the order of the day upon discovery of such spikes.
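To make the idea concrete, here is a minimal sketch of what a link velocity spike test could look like. This is purely illustrative; the function name, the window, and the multiplier are my own assumptions, not anything from the patents.

```python
from statistics import mean

def flag_link_spike(daily_new_links, window=7, multiplier=5.0):
    """Flag a document whose recent link velocity far exceeds its baseline.

    daily_new_links: counts of new backlinks per day, oldest first.
    Returns True when the mean of the last `window` days is more than
    `multiplier` times the historical mean (a crude spike test).
    """
    if len(daily_new_links) <= window:
        return False  # not enough history to judge
    baseline = mean(daily_new_links[:-window]) or 0.1  # avoid divide-by-zero
    recent = mean(daily_new_links[-window:])
    return recent > multiplier * baseline

# A steady profile versus a sudden burst of links:
steady = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 3]
spiky  = [2, 3, 2, 4, 3, 2, 3, 150, 200, 180, 160, 170, 190, 210]
```

A real system would of course account for the topical anomalies mentioned above (breaking news) before treating a spike as spam; this only shows the shape of the signal.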
What does Google consider link spam? Well, that can include:
- exchanging links
- purchasing links
- gaining links from documents without editorial discretion on making links.
Examples of documents that give links without editorial discretion include guest books, blog comments, forums, referrer logs, and free-for-all pages that let anyone add a link to a document. Now let's look at link texts.
Unique Words, Bigrams and Phrases in Anchor Text
A search engine may monitor the link graph of a document over time, along with associated behaviors, to flag it for spam. In simplest terms, a natural profile reflects the independent decisions of many webmasters/writers, and the associated link texts will vary. Oftentimes, with a synthetically created profile, the graph is relatively spiky; you know, multiple targeted link texts ;0)
These spikes are often the addition of many links with identical link texts, as well as deliberately differentiated link texts (i.e., 3-4 closely related terms). They can watch a given document (or domain) for such anomalies over time and score them accordingly, or mark them for further inspection by other spam bots.
They may ultimately discount the links to a given page should this tactic be deemed to be in place.
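One simple way to picture the "natural versus synthetic" distinction is to measure how concentrated a page's anchor texts are. The sketch below is my own illustrative heuristic, not Google's actual scoring:

```python
from collections import Counter

def anchor_text_concentration(anchors):
    """Return the share of backlinks held by the single most common
    anchor text. Values near 1.0 suggest a synthetic, targeted profile;
    a natural profile spreads across many independently chosen phrases.
    (Illustrative heuristic only.)
    """
    counts = Counter(a.strip().lower() for a in anchors)
    return max(counts.values()) / len(anchors)

# Many independent writers pick varied phrasing:
natural = ["acme co", "Acme's homepage", "click here", "this site",
           "acme review", "great widgets", "acme"]
# A targeted campaign repeats the money phrase:
synthetic = ["cheap widgets"] * 8 + ["buy cheap widgets", "cheap widgets online"]
```

Here `anchor_text_concentration(synthetic)` comes out at 0.8 (8 of 10 links share one phrase), while the natural profile stays well under that, which is exactly the spiky-versus-varied pattern described above.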
It is important to understand that Google has a huge amount of data in your market and can easily see documents that are beyond a given threshold. This is one more reason to observe the link profiles and associated link texts of top-ranked documents in a query space.
Another way in which search engines such as Google detect spam is via query analysis. This simply means looking at the number of related or non-related queries that a given document shows up for. Most web pages will be fairly topically focused and only rank for a few queries (and semantically related searches).
If the average document in a results set satisfies, let's say, 10 query types and there is a document in the set that appears for 20+, there is every reason to believe there is an anomaly worth investigating further.
This obviously brings to mind cloaking and other devious tactics which would explain such situations. Thus a document can be flagged for closer inspection from other spam-bots.
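As a rough sketch of that query analysis, one could flag documents ranking for far more query types than their peers in the set. Again, the function, the factor, and the data are hypothetical illustrations:

```python
def query_count_outliers(query_counts, factor=2.0):
    """Given {doc_id: number of query types ranked for}, return the doc
    ids that rank for more than `factor` times the set average, i.e.
    candidates for closer inspection (possible cloaking, etc.).
    Hypothetical sketch only."""
    avg = sum(query_counts.values()) / len(query_counts)
    return [doc for doc, n in query_counts.items() if n > factor * avg]

# Three topically focused pages and one suspicious outlier:
counts = {"doc_a": 9, "doc_b": 11, "doc_c": 10, "doc_d": 34}
```

With these numbers the average is 16, so only `doc_d` exceeds twice the average and gets flagged for the closer inspection described above.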
Beyond query analysis Google may also look at ranking histories to establish possible attempts to spam/manipulate the search engines. Once again by looking at the number of queries a document ranks for over time they can see given patterns and anomalies. A document that jumps rankings over many queries produces signals which the Google web spam team may be interested in.
It is assumed that spikes in rankings, and the topical nature therein, can be attributed to manipulation via link spam efforts. As such, Google may:
- only allow a rank to grow at a certain rate, or
- allow the rank for a given document a certain maximum threshold of growth over a predefined window of time.
This can be dependent on a few factors, including online chatter surrounding the document in locales such as news articles, forums, and so forth, on the theory that spam documents do not get discussed in such areas.
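The "maximum threshold of growth" idea can be modeled in a few lines. This is my own toy formulation; the patents describe the concept, not any actual formula:

```python
def capped_rank(previous_rank, computed_rank, max_growth=0.2):
    """Limit how quickly a document's rank score may rise per update
    window. `max_growth` is the largest allowed fractional increase;
    drops pass through unchanged. (Hypothetical model of the
    'maximum threshold of growth' idea.)"""
    ceiling = previous_rank * (1 + max_growth)
    return min(computed_rank, ceiling)
```

So a document whose computed score doubles overnight would only be credited the capped gain, while the rest of its growth is deferred or discarded; this matches the intuition that legitimate ranks grow gradually.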
Google does discuss exceptions for trusted sources:
It may be possible for search engine to make exceptions for documents that are determined to be authoritative in some respect, such as government documents, web directories (e.g., Yahoo), and documents that have shown a relatively steady and high rank over time.
Domain related information
Many times, spammers will create large portfolios of domains in order to create doorway domains or cross-linking schemes. The idea is to create as much traffic as possible in the short term with throw-away domains, or to otherwise manipulate the rankings.
Information relating to the length of the domain registration and/or the named registrant can be used as a signal of potential spam attempts. Google seems to place value on longer registration periods and known entities.
This isn't to say that your page will rank more poorly with a one-year registration and private registrant information, but it can be used in a larger overall analysis of a suspected spam document. It is more of a potential spam signal than an actual ranking one.
DNS records can also be utilized. For instance, a search engine may monitor:
- whether physically correct address information exists over a period of time,
- whether contact information for the domain changes relatively often,
- whether there is a relatively high number of changes between different name servers and hosting companies, etc.
In what could be considered the "lie down with dogs, get up with fleas" approach, a list of known-bad contact information, name servers, and/or IP addresses may be identified, stored, and used in predicting the legitimacy of a domain and the documents associated with it.
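Pulling the domain-related signals above together, a composite risk score might look something like this. Every weight and threshold here is invented for illustration; the patents describe the signals, not the math:

```python
def domain_risk_score(registration_years, registrant_known,
                      contact_changes_per_year, nameserver_changes_per_year,
                      on_known_bad_list):
    """Combine the domain-related signals discussed above into a single
    score in [0, 1]. Weights and thresholds are purely illustrative."""
    score = 0.0
    if registration_years < 2:
        score += 0.15  # short registrations suit throw-away domains
    if not registrant_known:
        score += 0.10  # anonymous/private registrant information
    if contact_changes_per_year > 2:
        score += 0.20  # frequently changing contact details
    if nameserver_changes_per_year > 2:
        score += 0.20  # hopping between name servers / hosts
    if on_known_bad_list:
        score += 0.35  # contact/NS/IP seen in known spam operations
    return round(score, 2)
```

The point of such a composite is the one the article makes: no single signal condemns a domain (a one-year private registration alone is harmless), but several together raise the odds of further inspection.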
Google also noted that name servers themselves can give historical signals for spam detection:
A good name server may have a mix of different domains from different registrars and have a history of hosting those domains, while a bad name server might host mainly pornography or doorway domains, domains with commercial words (a common indicator of spam), or primarily bulk domains from a single registrar, or might be brand new. The newness of a name server might not automatically be a negative factor in determining the legitimacy of the associated domain, but in combination with other factors, such as ones described herein, it could be.
Now that you know
What is important, as I mentioned off the top, is that this is not a guide on how to spam better. The goal is to minimize any factors that could potentially flag you for further inspection or otherwise devalue the efforts made in your SEO.
There are many in the industry who simply look at the process as more links, more links and, of course, more links. When one learns the many aspects that can affect the rankings, it becomes much easier to rank with fewer backlinks than the competition.
Historical ranking factors are a part of the discipline that is not talked about as much as it should be. Learn them, understand them, and apply them, and your SEO will be strong.
... until next time; stay tuned
Past installments include:
Link builder's guide to historical ranking factors
Understanding historical ranking factors for content creation/management plans
Do link spammers leave footprints?