SEO Blog - Internet marketing news and views  

Are the search engines spying on SEOs?

Written by David Harry   
Monday, 10 August 2009 09:37

Finding link spam via search marketing Forums

OK, sure… the title is a bit egregious, but in many ways so is the patent that came out from the folks at Microsoft last week. As I was perusing my feed reader on Thursday I noticed a patent that DEFINITELY caught my attention (and I even laughed a bunch too...):

Forum Mining for Suspicious Link Spam Sites Detection - Microsoft - Filed: Feb. 06, 2008 – Assigned: Aug. 06, 2009

Enemy sent by Gates

 

Context-Aware Query Classification

Written by Terry Van Horne   
Tuesday, 04 August 2009 08:42

Modeling Google's Florida Update and Universal Search

(the following is a guest post from Terry Van Horne)

A number of papers have come out of the SIGIR conference and for me one of the most interesting is the paper from the Microsoft research team using Conditional Random Fields (CRF) and contextual analysis (pdf) to determine query classification. The paper discusses some interesting uses for click data and query "labeling". I found this paper especially intriguing in that, as I read through it, I kept thinking I was reading a Google paper about Universal Search.

Once I do an overview of the paper I'll write a bit about what I believe is similar and in use at Google. I will be basing my conjecture... er... thesis on the belief that the Florida update added a taxonomy to the algo for the purpose of classifying queries, and further, that the current Universal Search algo is just the next step in refining the query classifications, done in a fashion similar to what the Microsoft paper describes. These are observations as indicated by the current blended/Universal Search SERPs.

Universal Search

I'm going to first go through the highlights in the paper and try to give you the elevator version. I'm not as interested in formulas or IR geek stuff because, well, it doesn't really help me do what I do. The thing to keep in mind about papers is that they are research: they indicate problems that someone is trying to solve and the methods that can be used to solve them. As a webmaster I want to be part of that solution by aiding the search engines when I'm building a site. Note aiding, not manipulating... there's a difference.


The Microsoft team chose some interesting features of CRF, and there is no use in my explaining them when the authors summed it up pretty well...

To answer these questions, we propose to use the Conditional Random Field model (CRF for short) to help incorporate the search context information. We have several motivations for using this model. First, CRF is a sequential learning model which is particularly suitable for capturing the context information of queries. Second, the CRF model does not need any prior knowledge for the type of conditional distribution.

Finally, compared with Hidden Markov Models, the CRF model is more flexible to incorporate richer features, such as the knowledge of an external Web directory.

...the advantages of using CRF

"When we use the CRF to model a search context, one of the most important parts is to choose the effective feature functions. In this section, we introduce the features used for building a CRF model of the search context for QC. In general, the features can be divided into two categories. The features that do not rely on the context information are called local features, and those that are dependent on context information are called contextual features."

The three local features are the query, pseudo feedback and implicit feedback. I'll try to describe the problem that the research is trying to solve in layman's terms.

 

[Image: Taxonomies - http://www.huomah.com/images/stories/pagePics6/terry3.jpg]

 

The query

The query is the starting point for refining the association between a query term and a classification. The reason the query needs the contextual element was summed up well in the paper:

The available training data are usually with limited size and could not cover a sufficient set of query terms that are useful for reflecting the association between queries and category labels.

Long story short, there are not enough query terms associated with category labels to be useful, so the query has to be refined by adding pseudo and implicit feedback.

 

 

Pseudo feedback

Pseudo feedback is basically using a taxonomy from the web, for example ODP or Yahoo!, where the directory category results are mapped to a category in a target taxonomy and a confidence score is calculated. I will admit the math behind the confidence score took me a bit to understand, so here is an excerpt that, IMO, explains the confidence score pretty well:

Finally, we calculate a general label confidence score: GConf(ct, qt) = M(ct, qt) / M, where M(ct, qt) means the number of returned related search results of qt whose category labels are ct after mapping. Intuitively, the GConf score reflects the confidence that qt is labeled as ct gained from pseudo feedback; the larger the score, the higher the confidence.

In a nutshell, a confidence score is attached to the query term based on how many search results matched the category term from the outside/target taxonomy. IMO, this does not have to be a web directory; in fact, for some queries an offline taxonomy might be a better choice, for example the occupational/professional categories used by human resource organizations.
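To make that concrete, here is a back-of-the-napkin version of the GConf calculation in Python. The query and the mapped directory labels are invented; I'm only illustrating the ratio from the excerpt above, with M taken as the total number of related results returned for the query.

```python
# Back-of-the-napkin version of the GConf score from the excerpt above.
# The query and the mapped directory labels are invented; M is taken as the
# total number of related results returned for the query.
from collections import Counter

def gconf(mapped_labels):
    """GConf(ct, qt) = M(ct, qt) / M for every category ct seen among the
    directory results returned for query qt (after mapping to the target taxonomy)."""
    m = len(mapped_labels)
    return {ct: count / m for ct, count in Counter(mapped_labels).items()}

# Pretend the directory returned 10 results for the query "jaguar" and their
# categories mapped into the target taxonomy like this:
mapped = ['Autos'] * 7 + ['Animals'] * 3
print(gconf(mapped))  # {'Autos': 0.7, 'Animals': 0.3}
```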

 

 

Implicit (user) Feedback:

An interesting aspect of this research was the use of logs from the web directory to refine the confidence score with user data. Logs reduce the chance of spam and make this useful for a "compiled confidence score". Where user data like this gets tricky is that some people will inflate clicks once it becomes common knowledge among marketers.

Direct Hit tried to use click data many years back. Let's say there were/are lots of rumors as to who may have been gaming it with scripting. That said, for the purposes of smoothing the category terms this seems a reasonable solution. For those with the low forehead, this stuff is basically compiled, so clicking on links in ODP is not going to affect your Google rankings! There, I just saved some forum mod a few hours of time.

One debate within the SEO community is the use of bounce rate data in rankings. Take the scenario above, where "historical click data" is used to smooth the category terms. I can see how "seasonal" bounce rates in particular could appear to affect the rankings of a group of similar sites, but I would say it is a myth that individual site bounce rates are the cause. IMO, in the scenario outlined above you would see similar sites drop at the same time, as they would be under the same category term and affected equally.

IMO, Google Universal Search has this characteristic. To do it the way it is talked about in the community would be, well... very expensive in computing terms. Just an opinion and not worth debating.

The third element of the local features is to build a Vector Space Model. Here's a quote from the paper on its purpose. This is extremely boring stuff but obviously interesting to any IR professional:

(...) we build a Vector Space Model (VSM) [25] for each category from its document collection and make the cosine similarity between the term vector of ct and the term vector of ut as CConf(ct, ut). The snippets of the web pages are used for generating the term vectors....If a user does not click on any URL for qt, or qt is the current query to be classified, this score cannot be calculated.

In layman's terms, they do some fancy math and build a model that has one flaw: if the user didn't click on any result for the query, or the query is the current one being classified, then this confidence score cannot be calculated.
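For the curious, here is a rough sketch of that vector-space step using scikit-learn. The category "document collections" and the clicked snippet are invented, and I'm only showing the cosine-similarity part of CConf, not how the paper actually assembles term vectors from real snippets.

```python
# Rough sketch of the vector-space step with scikit-learn. The category
# "document collections" and the clicked snippet are invented; only the
# cosine-similarity part of CConf is shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

category_docs = {
    'Autos':   'car dealer price engine specs sedan coupe horsepower',
    'Animals': 'big cat habitat species wildlife jungle predator prey',
}
clicked_snippet = 'official dealer site with prices and engine specs'

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(category_docs.values()) + [clicked_snippet])
cat_vectors, ut_vector = matrix[:-1], matrix[-1]

# CConf(ct, ut): cosine similarity between each category's term vector and the
# term vector built from the clicked result's snippet. If nothing was clicked,
# there is no snippet and the score simply cannot be computed (the flaw above).
for ct, score in zip(category_docs, cosine_similarity(cat_vectors, ut_vector)):
    print(ct, round(float(score[0]), 3))
```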

 

The last part of this experiment is to implement the contextual features. Below are quotes I felt best summed up the contextual features.

Since there is no existing approach for query classification that takes into account the context information, we design a naive context-aware approach as the second baseline to further evaluate the modeling power of CRF in this problem.

Yada... Yada... no one has used the context of the search to smooth the query classification.

To use the context information, we consider some features that can reflect the association between adjacent category labels.

and further...

To reduce the bias of training data, besides considering the feature of direct association between adjacent labels, we also consider the structure of the taxonomy. Intuitively, the association between two sibling categories is stronger than that of two non-sibling categories.

and finally...

...the bridging classifier introduced by Shen et al. in [27]. The idea of this approach is training a classifier on an intermediate taxonomy and then bridging the queries and the target taxonomy in the online step of QC. Experiments in [27] show this approach outperforms the winning approach in KDD Cup'05.

Let's take the query classifications we compiled, compare them to the search query, and refine the query classification with each new search query. To some degree, real-time personalization.
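Just to illustrate the idea (and nothing more), here is a toy version of "refine with each new query": blend the current query's category scores with what the earlier queries in the session suggested. The decay constant and the scores are made up; the paper does this properly with the CRF's contextual features, not with a simple blend like this.

```python
# Toy illustration only: blend the current query's category scores with what
# the earlier queries in the session suggested. The decay constant and the
# scores are made up; the paper does this with CRF contextual features.
def refine_with_context(prev_scores, current_scores, decay=0.5):
    categories = set(prev_scores) | set(current_scores)
    return {c: decay * prev_scores.get(c, 0.0)
               + (1 - decay) * current_scores.get(c, 0.0)
            for c in categories}

session_scores = [
    {'Autos': 0.7, 'Animals': 0.3},   # "jaguar"
    {'Autos': 0.9, 'Animals': 0.1},   # "jaguar dealer"
]
context = session_scores[0]
for scores in session_scores[1:]:
    context = refine_with_context(context, scores)
print(context)  # e.g. {'Autos': 0.8, 'Animals': 0.2}: the ambiguous reading fades
```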

 

The Microsoft team implemented a method that was surprising to me: they used three "human labelers" to test whether search context affected the ability to classify queries.

A query's final label is voted by the three labelers. Since each query is associated with context information (except for the beginning queries of sessions) and real user clicks which can help determine the meaning or intent of the query, the consistency among the labelers is quite high. For more than 90% queries, the three labelers give the same labels. This is very different from the general query classification problem.
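For what it's worth, the "voted by the three labelers" part is just a majority vote over each query's three labels; something like this (the labels are made up):

```python
# The "voted by the three labelers" step is just a majority vote; labels invented.
from collections import Counter

def final_label(labeler_votes):
    """Pick the category chosen by the most labelers for a single query."""
    return Counter(labeler_votes).most_common(1)[0][0]

print(final_label(['Autos', 'Autos', 'Shopping']))  # 'Autos'
```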

 

Outcome: This stuff works!

 

One of the benefits of having some interest in the real "science of SEO" or IR is not only that it fills the time I'd spend watching the idiot box; it also gives me insight into what I should be doing as an SEO. If Universal Search isn't screaming "I like video, pictures, news, local locations and product feeds" then there's a seat waiting for you at the "home for washed up SEOs".

In fact, text relevancy has less value than at any time I can remember. The number of blue slots on the page for text/link manipulators is limited and has decreased by over 40% since I semi-retired in 2004. That in turn means links are less valuable and real marketing is where the game is being played. Mike Grehan has been alluding to this for over a year.

Product/shopping, news/blogs/articles, local search/maps, video and pictures are a lot more about managing content development and marketing strategy than about manipulating search engine algos and chasing links. IMO, if you gotta chase/beg links... you likely don't deserve them. My bias is out there for you all to know and chide me, so... let the fun begin.

So, first let me warn you: this is pure conjecture. I'm not an engineer at Google; at the very most, if I'm right, it was part luck and part watching Google and SE SERPs for many years. The paper I attempted to walk you through is, IMO, a model of the Florida update and possibly of how Universal/blended search evolved.

User Sessions

 

Evolution of Universal Search

To some degree there are strong ties between the ODP (owned by AOL) and Google. It has long been discussed in some circles that the Florida update added a taxonomy or topic/classification to queries on Google. I saw it in anomalies, which are, when you find them, like having a red flag on the results. I knew Google was updating the index whenever these showed up in one of my "legacy SERPs".

A legacy SERP is one I have watched forever; I know every site in it, and when new ones show up they get closely reviewed, because that is always a clue as to what is working content-wise or what the "in vogue spam method" is. To be truthful, sometimes it's a fine line between the two.

Query classification, or QC, is an interesting concept in that doing it algorithmically is difficult. Kind of like hidden text: it seems easy enough to stop, but at what cost? Too many false positives.

Query classification makes it so much harder to obtain top positions just by link text, because all the keywords on the page are mapped to a classification and a classification to a query... hence, it becomes pretty tough for pages that don't have the term on the page to place. Sure it happens, but are they terms that matter? It also takes a massive amount of links to do it. Feasible, yes; plausible... sure, if you've got shit for brains and are bored stiff!

[Image: Classifications - http://www.huomah.com/images/stories/pagePics6/terry4.jpg]

 

The Human Touch

The human labeling in the Microsoft research was one piece of the experiment that I didn't see the point of. Then I realized it was there to smooth the web/target taxonomy query classifications.

The methodology was also interesting in that the labelers would act like a person engaged in searching. The research used old Excite data. If memory serves me right, Excite used to have data available for research on their site.

Google was on a hiring frenzy for a while and there were questions about what they were doing... labeling? Was Google employing labelers? The other day I sat in on a seminar with my old amigos Marshall D Simmonds, Greg Boser and Todd Malicoat. They came to sort of the same conclusion as I did: thousands of people were QC labeling Google SERPs. Where the Google method differs is that I think they also wanted a variety of media that can't really be ranked algorithmically, so the labelers were also slotting media.

Refining the QC provided by the web taxonomy is, IMO, what Google Universal/Blended/Personalized Search is all about. When I look at the Google SERP, there are not ten links; there are ten slots, sometimes more. As a marketer and SEO I now have to consider media type and brand message when I am considering a marketing strategy for an audience, keyword term or promotion I want to position in the results. SEO is as much about PR and branding as it is about links and on-page optimizing. That will be reflected in the "socialization of SERPs" by social media.

Other resources:

Query type classification for web document retrieval
A unified and discriminative model for query refinement
Hourly analysis of a very large topically categorized web query log
Automatic Web Query Classification Using Labeled and Unlabeled Training Data
Improving Automatic Query Classification via Semi-supervised Learning
Context-Aware Query Suggestion by Mining Click-Through and Session Data
Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs

 

About the author: Terry is an old-school SEO geek who works out of International Website Builders and is the founder of SEOPros.org. You can also hook up with him on Twitter. I'd like to REALLY send a huge thanks out to Terry for not only producing a search-geek-worthy post, but also for being a smart and open-minded guy. We didn't meet under the best of circumstances but our love of search marketing pulled us together in the end. I look forward to many years of great chats together!

 

Understanding linking intent; the spam connection

Written by David Harry   
Wednesday, 29 July 2009 08:27

How search engines consider link relevance

A good friend of mine sent me an interesting question a while back, “Dear Dave, can you give me some pointers on how search engines determine link intent?”

Well sheeeeit… it's hard to say what constitutes 'intent' when it comes to various aspects of linking and search engines. It really isn't that straightforward, and understanding 'query intent' tends to be the most researched area for search peeps.

And as far as patents/papers go, I doubt there is really much out there specifically on intent beyond some papers such as Recognizing Nepotistic Links on the Web or Detecting Nepotistic Links by Language Model Disagreement, and there is more in CJ's post on detecting paid links – there is a whack of papers at the end of that one.

Ultimately though…

 

Looking under the Hood; SEO Geeks Resource Library

Written by David Harry   
Tuesday, 26 May 2009 10:45

Over the last while I have been collecting a wide variety of search engine patents and information retrieval research papers. It's all part of the ongoing obsession... Over the last few years I have heard, more than a few times, from folks looking for search-related research/patents; it became time to start a reference library. Time permitting, I am looking to add more and more resources for those looking for some 'SEO Higher Learning' -

Why does it matter?

It is an important part of being in the world of SEO - in truth, 2/3rds of the acronym. To me, there is an evolution to consider, as SEO is not dead as some believe. When one starts to look at potential factors affecting search rankings, understanding search engines better is an obvious point of hypothesis genesis. The foundation of your search optimization concepts, programs and analysis is predicated on your level of knowledge - IMHO as always...

For those that are interested... I have put out an ongoing collection;

SEO Geeks Resource Library


 

Happy hunting fellow research hounds!

I used to have many of these documents local and had always wanted to wrap them up in one location - now it has begun. Time permitting, there will be more documents going up each week. If you can't wait, be sure to get the SEO Geek's Newsletter and/or look through past issues for fresh patent/research paper alerts.

 

 

Host level spam detection

Written by David Harry   
Wednesday, 08 April 2009 08:24

Stay away from bad neighbourhoods!

For starters, what is web spam and what's its function? In the patents we're looking at today, they describe spam as websites constructed with random or targeted content and links in order to "trick the analysis algorithms used by search engines" into ranking the pages higher than they should (bit of an oxymoron play there). The end game of course being to monetize said traffic with varying forms of advertising… yada yada… we know the deal - And the fly in the ointment?

“However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host.”

So what's a search engine to do? Welcome to the world of rare AIR (Adversarial Information Retrieval). Last time out CJ was walking us through some methods of paid link detection, and in the past we've covered link spam, phrase based and temporal spam detection methods (to name a few) – this time we're going to look at Host Level Spam Detection.


 