SEO Blog - Internet marketing news and views  

Latent Semantic Indexing and Google; One more time

Written by David Harry   
Monday, 20 July 2009 08:29

A ranting we will go, A ranting we will go… silly stuff SEOs seem not to know… a ranting we shall go…

Hiya Kids… over the last while the spectre of LSI (latent semantic indexing) and SEO has reared its ugly head once more…

It began with a piece my good friend Virginia wrote called; SEO strategy for semantic search (a good post), which was based, in part, on a post from SEO Phill (and comments on the BC blog). Later on, another couple of pals (Andy Beard and CJ) were musing on Twitter about it…

All of this had me wondering what the state of affairs with the LSI-Google train was of late… a look at the Twitter-Hose shows that this topic is alive and well…

[Image: the Twitter hose on LSI]

For the record, we’ve already covered the LSI madness here on the Trail back in ’07 with; Stay off the LSI bandwagon - but it can’t hurt to look at it again, oui? You see, it is endlessly frustrating to see those peddling so-called ‘LSI Programs’, which are nothing short of pure friggen’ garbage (don’t even get me started with ‘referential integrity’ – sigh)… and so, a ranting we must go.

In researching this point I was (somewhat) shocked at the number of even (cough cough) A-List SEO bloggers that have written about LSI over the years… did anyone do their homework? Or just regurgitate what another ‘expert’ had surmised? Dunno… odd...


What many SEOs don’t (want to?) understand about Google and LSI

The question remains, why do we still keep hearing about the magic bullet that is LSI? I’d have to imagine much of it is ignorance with a side helping of snake-oil.

“Google bought/developed technology that meant their computers could make intelligent decisions on whether a piece of content was good or not. This technology is called Latent Semantic Indexing (LSI).” - (from some SEO snake oil site)

I found the above gem in something a mate sent me… they can’t even decide if it was ‘bought/developed’ never mind how it played into Google’s evolution.

You see, the whole thing started when Google purchased Applied Semantics back in 2003 – strangely, for their ad matching technology, NOT necessarily for an IR approach. Google hoped it would, “(…) make online advertising more useful to users, publishers, and advertisers alike.”

They spoke of their interest in, “Applied Semantics' AdSense product that enables web publishers to understand the key themes on web pages to deliver highly relevant and targeted advertisements.”

Did you catch that? Some odd program called ‘AdSense’ – hmmmm… sound familiar? This (purchase of Applied Semantics) is by no means evidence of Google’s use of LSI/A in the regular index, but ’twas the beginning of the bullshit that ensued… grumble mumble…

Moreover, Google also picked up Anna Patterson and her ‘phrase based IR’ methods around the same time (also a semantic analysis approach) but no one in the SEO world picked up on that (guess it didn’t have a catchy acronym hehe …. jackasses). It is this type of subjective blindness that leaves me unsure if I should laugh, scream or cry…

Yes, Google is VERY interested in semantic analysis, most search engines are… but the whole limited LSI view is actually more suited to PPC peeps, not necessarily the SEOs. It was (originally) for AdSense after all. Nonetheless… would-be SEO snake oil is still rampant…

[Image: LSI - SEO Scams]


Getting past LSI

Let us get rid of the whole ‘LSI’ thang ok? Let’s work it back to the parent group; LSA, (latent semantic analysis). Now, we certainly can’t dismiss this approach as most search engines do employ various types of semantic analysis… this part I have no problem with… But is it really simple LSA? I’d also doubt that.
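To make the ‘LSA’ part concrete, here is a minimal sketch of the textbook idea: build a term-document matrix, keep only the top few singular directions, and compare terms in that reduced space. Everything in it is an assumption for illustration (the four-document toy corpus, the function names, and the use of plain power iteration in place of a proper SVD routine); it says nothing about what Google actually runs.

```python
import math

# Hypothetical toy corpus: two car documents, two smoothie documents.
DOCS = [
    "car engine repair",
    "automobile engine repair",
    "banana fruit smoothie",
    "fruit smoothie recipe",
]

def term_doc_matrix(docs):
    """Return (sorted vocabulary, rows) where rows[i][j] counts term i in doc j."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = [[d.split().count(t) for d in docs] for t in vocab]
    return vocab, rows

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_right_singular_vectors(a, k, iters=200):
    """Approximate the top-k right singular vectors of `a` by power iteration
    on A^T A, orthogonalising against the vectors already found."""
    n = len(a[0])
    # Gram matrix G = A^T A (documents x documents).
    g = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    basis = []
    for c in range(k):
        # Deterministic start vector, shifted for each new component.
        v = [1.0 + ((i + c) % n) for i in range(n)]
        for _ in range(iters):
            v = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            for b in basis:  # Gram-Schmidt: strip components already captured
                proj = sum(x * y for x, y in zip(v, b))
                v = [x - proj * y for x, y in zip(v, b)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        basis.append(v)
    return basis

def latent_term_vectors(a, basis):
    """Project every term row onto the latent basis (the rows of A V_k)."""
    return [[sum(x * y for x, y in zip(row, b)) for b in basis] for row in a]

vocab, a = term_doc_matrix(DOCS)
latent = latent_term_vectors(a, top_right_singular_vectors(a, k=2))
idx = {t: i for i, t in enumerate(vocab)}

def raw_sim(t1, t2):
    """Similarity of two terms in the original term-document space."""
    return cosine(a[idx[t1]], a[idx[t2]])

def latent_sim(t1, t2):
    """Similarity of two terms in the reduced 2-dimensional latent space."""
    return cosine(latent[idx[t1]], latent[idx[t2]])

print("raw    car/automobile:", round(raw_sim("car", "automobile"), 3))
print("latent car/automobile:", round(latent_sim("car", "automobile"), 3))
print("latent car/banana    :", round(abs(latent_sim("car", "banana")), 3))
```

On this toy corpus ‘car’ and ‘automobile’ never appear in the same document, so their raw similarity is zero, yet in the two-dimensional latent space they land practically on top of each other because both keep company with ‘engine’ and ‘repair’. That is the whole trick of plain LSA: co-occurrence patterns squeezed into fewer dimensions, nothing more magical than that.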

I am more inclined to think along the lines of PLSA (as well as LDA and HTMM) as the engineers over there did seem to have a fancy for it in ’07, as noted in this Google Research post on HTMM (Hidden Topic Markov Models). You see, these are more evolved versions of semantic analysis. Most in the IR world I’ve talked to about this agree that simple LSA approaches are limited and aren’t likely being employed at Google.

[Image: Google research blog on PLSA]

Which all begs the question; are SEO types really just ignorant schmucks that seek to glorify themselves or cash in on this? Why no talk (in SEO circles, IR peeps do fine) about these other technologies? It doesn't take a rocket-scientist (maybe a comp scientist) to venture a guess that it is simply folks trying to scoop a few $$$$... or straight up ignorance...


Expanded Snippets do not prove LSI

Another area I’d seen mentioned by those stating Google is using LSI is the recent updates to expanded snippets. That part, while possibly using some form of semantic analysis, is part of the Orion algorithm from Ori Allon (more on that here).

Of course, when Google picked up Ori, they also purchased the related patents and those were removed from the AU patent database, so what types of approaches are involved, I’m not entirely sure. It may be a semantic analysis approach, it may not be. But it is NOT using LSI…ok?

Great, glad we’re past that also… whew…


Noticing a connect here?

I hear you asking, “Ok Mr. Smart Ass, what is Google doing?” – I have no friggen’ definitive answer, ok ya mook?!?

It is likely that a variety of different signals/approaches are being used to understand concepts via various semantic analysis (and NLP) methods. I am not that keen on using the oft used catch phrase ‘LSA/I’ as it tends to cloud SEOs’ abilities to think laterally and subjectively.

Understanding how search engines are using semantic analysis (and other methods) to define concepts is something that is not well discussed in the industry, but should be. A large share of queries each day are ambiguous and new (according to Google), so they are constantly toying with signals such as semantic analysis (which works hand in hand with query analysis) to better understand what the user is looking for and the context therein.

At the end of the trail, Applied Semantics was (primarily) an ad matching technology using LSI; we can’t assume it was added to the reg search processes (remember, the phrase based approach, for the reg index, was the same year). It’s time for some perspective lest we get another spate of SEOs peddling ‘Google LSI Compliant’ services once more (circa 2005-06)… and making us all look like jackasses – m’kay?

You have been warned; the ignorant, short sighted or snake oil peddlers professing LSI and Google, will get the smack from this author!! – unless yer talking about AdSense…


We shall return to regular programming shortly; thanks for your time on this - see LSI snake oil? You know where to send 'em :0)


More reading;

Phrase based indexing and retrieval methods
Incorporating data based upon user query sessions
Probabilistic latent semantic analysis
Latent Dirichlet allocation 
Hidden Topic Markov Models

And of course if you’re looking to learn more about NLP and other semantic analysis approaches try these blogs;

Science for SEO
Cog Blog
Thought Process
Natural Language Processing Blog
Ontology News
The Lousy Linguist

*Note; It should be understood that I do believe in understanding various semantic approaches to building content that Google will eat up… it is the fascination with the term ‘LSI’ that irks me (and resulting ‘systems’ – sigh)


