Latent Semantic Indexing and Google; One more time

Written by David Harry   
Monday, 20 July 2009 08:29

A ranting we will go, A ranting we will go… silly stuff SEOs seem not to know… a ranting we shall go…

Andy and CJ on LSI

Hiya Kids… over the last while the spectre of LSI (latent semantic indexing) and SEO has raised its ugly head once more…

It began with a piece my good friend Virginia wrote called; SEO strategy for semantic search - (a good post) which was based, in part, on a post from SEO Phill (and comments on the BC blog). Later on, another couple of pals (Andy Beard and CJ) were musing on Twitter about it…

All of this had me wondering how the state of affairs with the LSI-Google train was of late… a look at the Twitter-Hose shows that this topic is alive and well…

the Twitter hose on LSI

For the record, we’ve already covered the LSI madness here on the Trail back in ’07 with; Stay off the LSI bandwagon - but it can’t hurt to look at it again, oui? You see, it is endlessly frustrating to see those peddling so-called ‘LSI Programs’, which are nothing short of pure friggen’ garbage (don’t even get me started with ‘referential integrity’ – sigh)… and so, a ranting we must go.

In researching this point I was (somewhat) shocked at the number of even (cough cough) A-List SEO bloggers that have written about LSI over the years… did anyone do their homework? Or just regurgitate what another ‘expert’ had surmised? Dunno… odd...


What many SEOs don’t (want to?) understand about Google and LSI

The question remains, why do we still keep hearing about the magic bullet that is LSI? I’d have to imagine much of it is ignorance with a side helping of snake-oil.

“Google bought/developed technology that meant their computers could make intelligent decisions on whether a piece of content was good or not. This technology is called Latent Semantic Indexing (LSI).” - (from some SEO snake oil site)

I found the above gem in something a mate sent me… they can’t even decide if it was ‘bought/developed’ never mind how it played into Google’s evolution.

You see, the whole thing started when Google purchased Applied Semantics back in 2003 – strangely, for their ad matching technology, NOT necessarily for an IR approach. Google hoped it would, “(…) make online advertising more useful to users, publishers, and advertisers alike.”

They spoke of their interest in, “Applied Semantics' AdSense product that enables web publishers to understand the key themes on web pages to deliver highly relevant and targeted advertisements.”

Did you catch that? Some odd program called ‘AdSense’ – hmmmm… sound familiar? This (purchase of Applied Semantics) is by no means evidence of Google’s use of LSI/A in the regular index, but ’twas the beginning of the bullshit that ensued… grumble mumble…

Moreover, Google also picked up Anna Patterson and her ‘phrase based IR’ methods around the same time (also a semantic analysis approach) but no one in the SEO world picked up on that (guess it didn’t have a catchy acronym hehe …. jackasses). It is this type of subjective blindness that leaves me unsure if I should laugh, scream or cry…

Yes, Google is VERY interested in semantic analysis, most search engines are… but the whole limited LSI view is actually more suited to PPC peeps, not necessarily the SEOs. It was (originally) for AdSense after all. Nonetheless… would-be SEO snake oil is still rampant…

LSI - SEO Scams


Getting past LSI

Let us get rid of the whole ‘LSI’ thang ok? Let’s work it back to the parent group; LSA, (latent semantic analysis). Now, we certainly can’t dismiss this approach as most search engines do employ various types of semantic analysis… this part I have no problem with… But is it really simple LSA? I’d also doubt that.
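For anyone wondering what plain LSA actually amounts to, here is a minimal sketch in Python (standard library only). It is an illustration under stated assumptions, not a claim about what Google or any other engine runs: the four toy documents, the function names (term_doc_matrix, top_eigenpairs, lsa_doc_vectors, cosine) and the choice of k=2 are all made up for this example, and simple power iteration stands in for a proper SVD routine.

```python
import math
import random
from collections import Counter

# Toy corpus (hypothetical): two car-repair docs, two fruit docs.
DOCS = [
    "car engine repair",
    "automobile engine repair",
    "banana fruit smoothie",
    "fruit banana recipe",
]

def term_doc_matrix(docs):
    """Return (vocab, rows) where rows[t][d] is the count of term t in doc d."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for c in counts] for t in vocab]

def top_eigenpairs(sym, k, iters=200, seed=0):
    """Top-k eigenpairs of a small symmetric matrix: power iteration + deflation."""
    n = len(sym)
    mat = [row[:] for row in sym]
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next.
        for i in range(n):
            for j in range(n):
                mat[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_doc_vectors(docs, k=2):
    """Project each document onto the top-k latent dimensions."""
    _, rows = term_doc_matrix(docs)
    n = len(docs)
    # Gram matrix A^T A; its eigenvectors are the right singular vectors of A.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    # Document coordinates: singular value times eigenvector entry.
    return [[math.sqrt(max(lam, 0.0)) * v[d] for lam, v in pairs] for d in range(n)]

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

_, rows = term_doc_matrix(DOCS)
raw = [list(col) for col in zip(*rows)]  # one raw count vector per document
vecs = lsa_doc_vectors(DOCS, k=2)

print("raw    car vs automobile doc:", round(cosine(raw[0], raw[1]), 3))
print("latent car vs automobile doc:", round(cosine(vecs[0], vecs[1]), 3))
print("latent car vs banana doc:    ", round(cosine(vecs[0], vecs[2]), 3))
```

In raw term counts the two car-repair documents only partly overlap (one says 'car', the other 'automobile'); once projected onto two latent dimensions they should point in essentially the same direction, while the fruit documents land somewhere else entirely. That is about all there is to simple LSA: a matrix factorisation on term counts, not a magic 'is this content good?' detector.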

I am more inclined to think along the lines of PLSA (as well as LDA and HTMM) as the engineers over there did seem to have a fancy for it in 07; as noted in this Google Research post on HTMM (Hidden Topic Markov Models). You see, these are more evolved versions of semantic analysis. Most in the IR world I’ve talked to about this agree that simple LSA approaches are limited and aren’t likely being employed at Google.

Google research blog on PLSA

Which all begs the question; are SEO types really just ignorant schmucks that seek to glorify themselves or cash in on this? Why no talk (in SEO circles, IR peeps do fine) about these other technologies? It doesn't take a rocket-scientist (maybe a comp scientist) to venture a guess it is simply folks trying to scoop a few $$$$... or straight up ignorance...


Expanded Snippets do not prove LSI

Another area I’d seen mentioned by those stating Google is using LSI is the recent updates to expanded snippets. That part, while possibly using some form of semantic analysis, is part of the Orion algorithm from Ori Allon (more on that here).

Of course, when Google picked up Ori, they also purchased the related patents and those were removed from the AU patent database, so what types of approaches are involved, I’m not entirely sure. It may be a semantic analysis approach, it may not be. But it is NOT using LSI…ok?

Great, glad we’re past that also… whew…


Noticing a connect here?

DUH Magazine - Will SOE Survive?

I hear you asking, “Ok Mr. Smart Ass, what is Google doing?” – I have no friggen’ definitive answer, ok ya mook?!?

It is likely that a variety of different signals/approaches are being used to understand concepts via various semantic analysis (and NLP) methods. I am not that keen on using the oft-used catch phrase ‘LSA/I’ as it tends to cloud SEOs’ abilities to think laterally and subjectively.

Understanding how search engines are using semantic analysis (and other methods) to define concepts is something not well discussed in the industry, but should be. A large share of queries each day are ambiguous and new (according to Google), so they are constantly toying with signals such as semantic analysis (which works hand in hand with query analysis) to better understand what the user is looking for and the context therein.

At the end of the trail, Applied Semantics was (primarily) an ad matching technology using LSI; we can’t assume it was added to the reg search processes (remember, the phrase based approach, for the reg index, was the same year). It’s time for some perspective lest we get another spate of SEOs peddling ‘Google LSI Compliant’ services once more (circa 2005-06)… and making us all look like jackasses – m’kay?

You have been warned; the ignorant, short sighted or snake oil peddlers professing LSI and Google, will get the smack from this author!! – unless yer talking about AdSense…


We shall return to regular programming shortly; thanks for your time on this - see LSI snake oil? You know where to send 'em :0)


More reading;

Phrase based indexing and retrieval methods
Incorporating data based upon user query sessions
Probabilistic latent semantic analysis
Latent Dirichlet allocation 
Hidden Topic Markov Models

And of course if you’re looking to learn more about NLP and other semantic analysis approaches try these blogs;

Science for SEO
Cog Blog
Thought Process
Natural Language Processing Blog
Ontology News
The Lousy Linguist

*Note; It should be understood that I do believe in understanding various semantic approaches to building content that Google will eat up… it is the fascination with the term ‘LSI’ that irks me (and resulting ‘systems’ – sigh)




0 # CJ 2009-07-20 08:52
And I'll a smack too :-)

Wonderful post Dave, and it all needed saying!
+1 # Dave 2009-07-20 09:24
Hiya hun... thanks for playing along this weekend, there was some REAL funny shit we came across, good times all-around!!

It seems like we need to say this every flippin' year as the LSI/Google crap still is rampant out there. I hadn't really checked the state of affairs in a while, but after I did... the ranting came naturally (grrrrrrrrrrr)... we need to form a posse on this shit...

0 # Claire Hawley 2009-07-20 17:19
Great post... I echo your sentiment on these supposed "expert" companies peddling software tools and whitepapers that will solve all your seo problems. Link exchange, submit your site to directories, get followers, blah blah. Best philosophy is to stick to the basics, make good content and focus on what users need to know to fully understand your topic.... regardless of what the algorithm includes or does not include, you'll come out on top.
0 # Dave 2009-07-20 18:55
Hi there Claire, thanks for stopping in! It is certainly always important to remember the marketing mix and try to offer things of value - that goes a long way towards what search engines like as well. The easiest sites to work on SEO for are the quality ones... quality products, services etc... Even content... links come naturally and one can focus more on on-site and technical issues...

As for the snake-oil and confusion in the SEO world on this one, LSI, it is my hope that we can get past such limited concepts, leave IR to the IR peeps, and work with what we do know... Will it happen? I doubt it, this has been going on since 03... no signs of letting up.

0 # Gabriella 2009-07-20 19:39
Okay I have to tell you I have been waiting to see this post. I am glad I have been watching. So it's true this is a case for dimwits and silly twitts. :whistle: lol Once again Dave a ranting you will go and yes, it's spot on!
0 # Dave 2009-07-20 20:00
Hey hey Gabriella, how are we today? Yes, the twitter pop-quiz on LSI and Google was underwhelming... hehe... How are consumers going to know what to trust in SEO when many SEOs don't understand some of these concepts (and related history)??

I hope the post was worth the wait, it's always cathartic to 'go a ranting' and get these things off me little 'ol chest... sigh... should sleep well this evening!
-1 # CJ 2009-07-21 07:43
Yes indeed:
+1 # Dave 2009-07-21 08:01
..ah yes, nice find...they offer training in;

Latent Semantic Indexing

Latent Semantic Indexing
• An overview of LSI
• How to write copy for LSI search engines
Running an SEO campaign
• Structured approaches to running SEO activity
• Everything you need to know about landing pages, and how to test them

Suhweet... nice to see it's not just the schmucks peddling it... hehe...
+1 # Cory 2009-07-21 21:52
Good post - thought I'd lob this toward the more reading section:
0 # Dave 2009-07-21 22:18
Thanks Cory, you are correct, no discussion of LSI and Google...or SEO for that matter, can be had without talking about Edel G. I had actually linked to a few of his posts in my last rant on this in 2007, (linked to early in this post). As I had linked in the other, I figured to go a diff direction with the resources in this one.

It's interesting that he was on about the post Leslie did as that was the 'referential integrity' that I ever so briefly touched on in this post :unsure:

Thanks for taking the time tho.... Dr. G was always close in my mind when writing this... I've only gone on about it twice - he's on a mission...
+2 # SEO Phill 2009-07-22 02:47
You're quite right about the snake oil. I'm currently doing a bit of a study on how much the view of SEO has changed (or not) since inception. We all know that people STILL harp on about keyword densities in some dark corners of the SEO world.

I think one reason for the continued use of the term LSI is that people don't understand the subject matter at all - it's a terrifically detailed and difficult area - one that's been around for a long time (I believe one of the first attempted uses was in library categorisation systems).

It's true that, given the index Google has (and this is a supposition based on their Backrub paper at Stanford), LSI is POSSIBLE without changing much. What's often not noted is that it would be so horrifically slow and unwieldy as to vastly limit its use.

Semantics yes, LSI no.
0 # infinity 2009-07-22 14:08

I didn't know that this myth is still around. And so widespread. Bandwagon picking up speed?

LSI - Latent Semantic Indexing - just sounds so fancy, there is "latent" and "semantic" and "indexing". oh wow.

Abusing a fancy vocabulary is the sure-fire way to writing fantastic bullshit. It might even impress some customers: Google LSI Compliant? aha. That isn't even creative - it seems that the other guys are all selling the same stuff. Maybe the latest Must-Have-SEO-Hooha since keyword density and XML-sitemaps for well-linked 10-page-sites.

Abusing vocabularies, like in this case terminology from information retrieval, has a long tradition in esoteric writing and postmodern philosophy. Dazzling the readers with bullshit - could work with SEO as well. I'm waiting for Quantum SEOlogy.

p.s. I really liked the Assholes Inc. :lol: LOL

p.s. II: sorry for posting this again, it seems that half of my comment got filtered - single quotes in BBCode are not allowed?
0 # Dave 2009-07-24 07:37
@Phill - ah yes, good 'ol KW density, that one is soo sad I haven't even written about it... but truly another funny one. I heard of a KW Density Google filter the other day...WTF?

You've hit the nail on the head though, I would be ever so happy if people talked about semantic analysis instead of wrapping programs up in the term LSI - that's where the snake oil comes in...sigh...

@Infinity, np, shall nuke the dupes. It really is sad and I've been doing research/buzz monitoring this week and it is certainly alive and well (hadn't looked around for some in a few years). I do really like "Quantum SEOlogy" - hehe... that's a classic. We should write a post and see how many peeps fall for it... heehee

Thanks on Assholes Inc... have been meaning to follow that up with a site.... maybe soon..
0 # Stancja Wroclaw 2009-07-27 08:15
I'm only waiting till some angry employee makes public all the internal policies and algorithms of google. I wonder what they'll do then.
0 # Sasa von der Linkaufbau-Agentu 2009-07-28 20:55
Agree. I wrote about this a while ago. The article is in German, so I won't link. But with all the hubbub about PageRank dying, what people completely missed is that even Matt Cutts said that they don't use LSI for ranking purposes because it hadn't played out that well. I quoted this in my article: Is What
0 # Sasa von der Linkaufbau-Agentu 2009-07-28 20:57
Ok, your blog swallowed part of my comment. Here is the link
0 # Nimesh 2009-07-29 02:22
Nice Post
Informative One
Thanks for great stuff
0 # Adam J. Humphreys 2010-02-25 11:20
Hi David,

Great post as always. Your in-depth analysis of various algorithms sheds quite a bit of light on a controversial subject that nobody has absolute answers on.

One thing I'd like to point out: regardless of what search professionals are calling it, there are now some features of Google that definitely give evidence of latent indexing or at the very least attention to it (put whatever euphemism you'd like on it).

For example, on the Google SERP we're now seeing synonyms highlighted, especially for less competitive words. I believe over time, as their database grows stronger with content, so will their web intelligence to see the parallels, with more relevant, pertinent keywords/content highlighted much like we're currently seeing in newer personalized search.

My other confidence that there is some shape or form of latent indexing is the Google Webmaster Tools interface:

"Below are the most common keywords Google found when crawling your site. These should reflect the subject matter of your site."

It is for the above two defined reasons that I believe it to be important to at the very least ensure that content is relevant/on topic with a specific amount of analytics. Bruce Clay's book seems to recommend a ratio equal to or slightly less than that of competitors' sites for specific keywords, as doing more than that could lead to a drop in ranking.

I can certainly tell you I don't go around talking about LSI/LSA to clients because relevancy and expert keywords for the clients far outrank sitting on a page measuring how many times a word was mentioned. One has to consider the ROI of the time spent on well written content vs the loss of visitors for content that's congested/repetitive. There are far more important KPIs to pay attention to, and for the most part these aren't even followed. It would seem that for now at least the beast that is Google is a very literal machine expecting content in an almost thesis-like structure, and slowly adapting to human dynamics.
0 # Gregory Cox 2012-01-20 00:33
Dude, you are a complete analytics dork and that's why I RSS'ed you.

But now that I am reading your article, I agree with what you're shouting. I love LSI, from an artistic perspective, something like "isn't that cool how language is constructed?" perspective instead of "let's kill all competitors with our insanely awesome LSI methods!" one.

And LSI is short-sighted to say the least. Human "robots" running around shouting about how important LSI is are missing a critical element in the entire game.

Brand. Message.

How about just a good f**king article worth reading, for Pete's sake. Which, BTW, also makes articles go - by getting them spread around by real, live humans. Sustainability through human popularity.

Thanks for putting it all in perspective, David.
0 # Gregory Cox 2012-01-20 00:35
BTW, I originally found your article through attempting to find information or a tool that amasses singular keyword densities by extracting keyphrases within Google Analytics - so I could THEN create "head terms" for GA advanced segments - in order to THEN analyze conversion ratios for said head terms.

So. Um, if you want to answer that, like, awesome.

