SEO Higher learning

Written by David Harry   
Monday, 05 January 2009 07:19

The SEO geek's guide to information retrieval

A while back I was ranting on Twitter (and elsewhere) about how few SEO peeps actually study the more technical side of how search engines work: information retrieval (IR), natural language processing (NLP), machine learning (ML) and related concepts. You see, it seems logical that those working with (marketing to) search engines might want to know a thing or two about how they work.

The goal of today’s ride isn’t as much to provide you with a framework for learning as it is to start you in a direction. As always, learning how to catch fish is more important than a free lunch. These are simply some interesting directions for SEO technicians looking to dig deeper into the world of knowledge management.

This thing of ours

To me this thing of ours is an art form. The pragmatic can argue the semantics of what art is; regardless, to me it is such. And a dimensionally singular SEO is one that is limited in that art. Much like a traditional artist progresses from still life and pencil to painting and sculpting, from brushes to chisels, an SEO should seek out other elements related to the discipline. We should learn knowledge management and algorithms with the same verve we apply to playing in social media.

The goal of this adventure is to hopefully spark the imagination of SEOs by finding out just how interesting the task of indexing the world’s information is. Many times the goodies listed here aren’t a direct line to ranking glory, but great for identifying future trends and having a feel for modern implementations…

And so today we head down the path leading to SEO geek nirvana…. Mount up and let’s ride!!


The big ass list of information retrieval resources

Right away I want to give a little luv out to a few folks that helped me with this. There is no way this humble SEO geek could get there without a little help from my friends:

Bill Slawski – search guru and all-around great guy
Marie-Claire Jenkins – my newest friend who brings science to SEO
And of course Charles – who sends way too many IR papers at me for a Hawaiian shirt dude

…and away…



Books are probably one of the least liked formats for the online addict, but we’ll get into some to get ya’ll started (and move along smartly to the juicy stuff). Here are a few of interest:

Readings in Information Retrieval. K. Sparck Jones, P. Willett. Morgan Kaufmann, 1997.

Managing Gigabytes. Compressing and Indexing Documents and Images - I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999.

Information Retrieval: Algorithms and Heuristics – David A. Grossman, Ophir Frieder

Information Retrieval: Data Structures & Algorithms. Frakes, W. and Baeza-Yates, R., Prentice Hall, 1992.

Mining the Web: Discovering Knowledge from Hypertext Data by Soumen Chakrabarti

Introduction to Information Retrieval (and online resources) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze - Cambridge University Press, 2008

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

Language Modeling for Information Retrieval - Croft, W. Bruce; Lafferty, John

Google's PageRank and Beyond: The Science of Search Engine Rankings - Amy N. Langville & Carl D. Meyer


(Free) Courses and learning

Next we have some free online courses that you can dig through at your leisure to help you truly get a feel for the issues, problems and methods used in search. This stuff definitely falls under the uber-geeky and, while valuable, isn’t required reading (just damned fun if U ask me).

Introduction to Information Retrieval - Stanford
This is an online version of the book from Cambridge University Press. From the authors: this book is the result of a series of courses we have taught at Stanford University and at the University of Stuttgart, in a range of durations including a single quarter, one semester and two quarters. These courses were aimed at early-stage graduate students in computer science.
(can also be found here)

Information Retrieval - A (free online) book by C. J. van Rijsbergen
The major change in the second edition of this book is the addition of a new chapter on probabilistic retrieval. This chapter has been included because I think this is one of the most interesting and active areas of research in information retrieval. There are still many problems to be solved so I hope that this particular chapter will be of some help to those who want to advance the state of knowledge in this area. All the other chapters have been updated by including some of the more recent work on the topics covered.

Introduction to Algorithms – MIT
Description: This course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice.

Information Retrieval Interaction. - P. Ingwersen. Taylor Graham, 1992.
Focuses on user interaction in IR. The aims of the book are to establish a unifying scientific approach to IR – a synthesis based on the concept of IR interaction and the Cognitive Viewpoint; to present research and developments in the field of information retrieval based on a new categorisation; and to generate a consolidated framework of functional requirements for intermediary analysis and design.

Knowledge technologies in context – OpenLearn
Aims to develop an understanding of the relationships between information, interpretation, knowledge and computer-based representations and to summarise the range of different technologies that are available and on the horizon, and how they relate to different kinds of knowledge processes.

Artificial Intelligence | Machine Learning – Stanford
Description: This course provides a broad introduction to machine learning and statistical pattern recognition. While not directly related to search, probability and machine learning are important aspects of modern indexing and retrieval.

Artificial Intelligence | Natural Language Processing - Stanford
Description: This course develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered.

Probabilistic Systems Analysis and Applied Probability – MIT
Description: Introduces students to the modeling, quantification, and analysis of uncertainty. Topics covered include: formulation and solution in sample space, random variables, transform techniques, simple random processes and their probability distributions, Markov processes, limit theorems, and elements of statistical inference.

Tutorial: Web Information Retrieval – Google - Monika Henzinger
A slide presentation of more than 150 slides that can give you a good overview and my fav gal CJ said “This presentation by Monika Henzinger is brilliant” – what more does one need to know?


Interesting Research papers

This area truly is endless as there is a never-ending supply of excellent research out there. For this area I thought we’d look at some goodies that cover a range of topics that are currently of importance in the IR world. Need more? Drop me a line…

Modern Information Retrieval: A Brief Overview - Amit Singhal - Google, Inc. (pdf)
The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval, and a description of where the state-of-the-art is at in the field.

PageRank: The Anatomy of a Large-Scale Hypertextual Web Search Engine – Stanford (pdf)
We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.

Learning Diverse Rankings with Multi-Armed Bandits – Cornell (pdf)
We propose a new learning to rank problem formulation that differs in three fundamental ways. First, unlike most previous methods, we learn from usage data rather than manually labelled relevance judgments. Usage data is available in much larger quantities and at much lower cost.

Google: Online Search & the Battle for Clicks – MIT (pdf)
This paper looks at how the search engine market evolved and how Google became the dominant player it is today.

Anchor Based Proximity Measures – Stanford (pdf)
Our measures are based on three different propagation schemes and two different uses of the connectivity structure of the graph. We then consider a web-specific application of the above measures with two disjoint anchors: good and bad. The key assumption is that good web pages are highly unlikely to link to bad web pages. The goal is to assign a goodness quality score to all web pages.

Link Spam Detection Based on Mass Estimation (pdf) - Stanford
This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page’s ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming.

Graph based algorithms in natural language processing – MIT (pdf)
A set of 75 slides from a presentation on NLP.


Query Analysis (research papers)

Mining Search Engine Query Logs via Suggestion (pdf)
In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases.

Information Re-Retrieval: Repeat Queries in Yahoo’s Logs - Yahoo (pdf)
This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done.


Page Segmentation (research papers)

Vision Based page segmentation algorithm (VIPS) – Microsoft (pdf)
This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception.

Block level link analysis - Microsoft (pdf)
In this paper, we proposed two novel link analysis algorithms called Block Level PageRank (BLPR) and Block Level HITS (BLHITS) which treat the semantic blocks as information units. By using vision-based page segmentation (VIPS) algorithm [4][5], we extract page-to-block and block-to-page relationships and then construct a page graph and a block graph. Based on this graph model, the new link analysis algorithms are capable of discovering the intrinsic semantic structure of the web.


Implicit and explicit user feedback (user performance)

Query Chains: Learning to Rank from Implicit Feedback - Cornell (pdf)
This paper presents a novel approach for using clickthrough data to learn ranked retrieval functions for web search results. We observe that users searching the web often perform a sequence, or chain, of queries with a similar information need.

Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search (partially funded via grant from Google)
This paper examines the reliability of implicit feedback generated from clickthrough data and query reformulations in WWW search. Analyzing the users’ decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased.

Improving Web Search Ranking by Incorporating User Behaviour Information - Microsoft (pdf)
We show that incorporating user behavior data can significantly improve the ordering of top results in a real web search setting. We examine alternatives for incorporating feedback into the ranking process and explore the contributions of user feedback compared to other common web search features.

Optimizing Search Engines using Click through Data – Cornell (pdf)
The goal of this paper is to develop a method that utilizes click through data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such click through data is available in abundance and can be recorded at very low cost.
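
For the geeks who like to see the idea in code, here is a minimal sketch of one way a click log can be turned into training signal. It assumes a simple heuristic: a clicked result is treated as preferred over any unclicked result that was ranked above it. The function name `preference_pairs` and the sample data are hypothetical, and this is an illustration under that assumption rather than the paper's actual implementation.

```python
def preference_pairs(ranking, clicked):
    """Derive (preferred, less_preferred) pairs from one query's click log.

    Assumed heuristic: a clicked result is preferred over every
    unclicked result that appeared above it in the presented ranking.
    """
    clicked = set(clicked)
    pairs = []
    for position, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Everything ranked above this click that the user skipped
        for skipped in ranking[:position]:
            if skipped not in clicked:
                pairs.append((doc, skipped))
    return pairs


# Hypothetical log entry: results shown in this order, user clicked a, c and e
ranking = ["a", "b", "c", "d", "e"]
clicked = ["a", "c", "e"]
for preferred, skipped in preference_pairs(ranking, clicked):
    print(f"{preferred} preferred over {skipped}")
```

Pairs like these are relative judgments, not absolute relevance scores, which is part of why click data is cheap to collect but needs careful handling.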


Object Level Search (research papers)

Search Objective Gets a Refined Approach – Microsoft (pdf)
Object-Level Vertical Search takes a refined approach that is a significant advance from traditional Web search. The latter paradigm is based on a page-level relevance ranking approach, in which pages that receive links from many other pages are adjudged to have more value by the very fact that they are popular. If more people link to a given page, it must have something to offer—that is the presumption.

Object Level Ranking: Bringing Order to Web Objects – Microsoft (pdf)
This paper introduces PopRank, a domain-independent object-level link analysis model to rank the objects within a specific domain. Specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors.

Web Object Retrieval – Microsoft (pdf)
In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performances. We conclude that the hybrid model is superior because it takes into account the extraction errors at varying levels.

Object Level Vertical Search – Microsoft (pdf)
In this paper, we introduce the overview and core technologies of object-level vertical search engines that have been implemented in two working systems: Libra Academic Search and Windows Live Product Search.

Corroborate and Learn Facts from the Web - Google (pdf)
This paper describes a robust bootstrapping approach to corroborate facts and learn more facts simultaneously. This approach starts with retrieving relevant pages from a crawl repository for each entity in the seed set. In each learning cycle, known facts of an entity are corroborated first in a relevant page to find fact mentions.

(and Bill has even more on Object level search)


Information retrieval related videos

This section has to be one of my favourites. I say that because I tend to have the videos running in the background and can listen to them while working. If you want to focus more, the video lectures from ‘Video Lectures’ come with accompanying slides, which is handy as most lectures we see don’t have that. There are also videos from IM Broadcast, YouTube and others...

The Future of Information Retrieval Part I and Part II
Interviews with prominent experts in the field of information retrieval on the future of IR.

Google search quality – Google Roundtable
Large Scale Search System Infrastructure and Search Quality - Fellows Jeff Dean and Amit Singhal share their insights into how search works at Google.

Overview of how Search Engines work – UC Berkeley
The World Wide Web brings much of the world's knowledge into the reach of nearly everyone with a computer and an internet connection. The availability of huge quantities of information at our fingertips is transforming government, business, and many other aspects of society.

Human-Computer Information Retrieval Lecture - Gary Marchionini
He discusses the intersection of information retrieval and human-computer interaction and the challenges of exploratory search.

Google Whacks for Profit and Fun – Google tech talks
A Google study of the number of search results returned for multi-word queries, based on the number of results returned when each word is searched individually.

Machine learning and translation – Google tech talks
This is an interesting presentation on probabilistic learning and dealing with better understandings of user intent. Kind of heavy lifting for the search geeks, but still worth watching for any SEO.

Information Retrieval and Text Mining - Thomas Hofmann, Brown University
This four hour course will provide an overview of applications of machine learning and statistics to problems in information retrieval and text mining. More specifically, it will cover tasks like document categorization, concept-based information retrieval, question-answering, topic detection and document clustering, information extraction, and recommender systems.

Faceted metadata in search engines – UC Berkeley
The availability of huge quantities of information at our fingertips is transforming government, business, and many other aspects of society. Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities.

Universal Modeling: Introduction to modern MDL
A tutorial introduction to the *modern* Minimum Description Length (MDL) Principle, taking into account the many refinements and developments that have taken place in the 1990s. These do not seem to be widely known outside the information theory community. We will especially emphasize the use of MDL in classification.

Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI)
Information retrieval in the vector space model is based on literal matching of terms in the documents and the queries. The model is implemented by creating the term-document matrix, which is formed on the basis of the frequencies of terms in documents.
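
If a term-document matrix sounds abstract, a tiny sketch may help. The code below builds one from raw term counts and scores a query by literal term overlap. The function names `term_document_matrix` and `literal_match_scores`, the whitespace tokenizer and the three toy documents are all assumptions made for illustration; real systems add weighting (such as tf-idf), normalization and much more.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is how often
    term vocab[i] occurs in document j (raw frequency)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def literal_match_scores(query, vocab, matrix):
    """Score each document by the dot product of query and document
    term counts, so only literally shared terms contribute."""
    q = Counter(query.lower().split())
    n_docs = len(matrix[0]) if matrix else 0
    return [
        sum(q[term] * row[j] for term, row in zip(vocab, matrix))
        for j in range(n_docs)
    ]


# Toy corpus (hypothetical)
docs = ["search engines index pages", "engines rank pages", "users search"]
vocab, matrix = term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:8} {row}")
print(literal_match_scores("search pages", vocab, matrix))
```

Notice that a query for "retrieval" would score zero everywhere even though the documents are about search. That limitation of literal matching is exactly what techniques like LSI try to address.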

Extracting Semantic Relations from Query Logs - Ricardo Baeza-Yates, Yahoo! Research
In this paper we study a large query log of more than twenty million queries with the goal of extracting the semantic relations that are implicitly captured in the actions of users submitting queries and clicking answers. Previous query log analyses were mostly done with just the queries and not the actions that followed after them.

Applications of Query Mining - Ricardo Baeza-Yates, Yahoo! Research
More from Yahoo to go along with the last one.

Dirichlet Processes, Chinese Restaurant Processes, and all that - Michael I. Jordan, University of California
Nonparametric Bayesian methods offer a way to make use of the Bayesian calculus without the parametric handcuffs. In this talk I describe several recent explorations in nonparametric Bayesian modeling and inference, including various versions of "Chinese restaurant process priors" that allow flexible structures to be learned and allow sharing of statistical strength among sets of related structures.

Interactively Optimizing Information Systems as a Dueling Bandits Problem - Yisong Yue, Department of Computer Science, Cornell University
We present an online learning framework tailored towards real-time learning from observed user behavior in search engines and other information access systems.

Implicit feedback learning in semantic and collaborative information retrieval systems - Gérard Dupont, EADS
This presentation tries to provide an overview of one way to resolve those gaps: using feedback learning. The aim is to make the system learn from user behaviour in order to better define the user's current needs. Machine learning algorithms applied to signals coming from the user while performing a search can lead to an understanding of what is really relevant to the user, which can then be exploited to help them during their tasks.

Practical Applications of Natural Language Processing in Assistive Technology - Google Tech Talks
Ken Ingham, Ph.D. will describe the architecture and motivation behind the development of Amazability, Inc.'s Adept1 product.

Current Approaches to Personalized Web Search - Paul-Alexandre Chirita, University of Hannover

Personalized Web Search Engine for Mobile Devices - Vasudeva Varma, IIIT Hyderabad

A Social Network Based Approach to Personalized Recommendation of Participatory Media Content - Aaditeshwar Seth, School of Computer Science; University of Waterloo

Machine Learning, Probability and Graphical Models - Sam Roweis, Department of Computer Science, University of Toronto

Proactive Information Retrieval by User Modeling from Eye Tracking - Jarkko Salojärvi, Helsinki University of Technology




0 # Jeffrey Smith 2009-01-05 09:04
Dave you weren't kidding, this is a monster post. You gave me enough material to read for weeks.

Watch out Wikipedia, with all of this info, once mastered folding up double rankings like Wiki will be like throwing back a few cold ones.

Great post and thanks for the research bro.
0 # Dave 2009-01-05 09:42
Lol... likewise. As I mentioned, the post came in large part from friends, so there is a great deal of new goodness there for me as well. An adventure became an odyssey.

If you're into it, join our merry band on the next adventure, I can keep you updated on things in the pipeline :0)
0 # waveshoppe 2009-01-05 10:19
Talk about a big breakfast dude!

Dave, if you can, lets keep this page current. While it summarizes our last 4 years of conversation (about IR stuff), I am sure that its not the end. You deserve extra credit for taking on the MS papers as only a few dare to venture into that territory.
+1 # Antares 2009-01-05 11:42
Colleagues greetings! I uzbek SEO master :-) Write still .....
0 # Hugo @ Zeta Interactive 2009-01-05 15:55
Great stuff! I've seen some of it, but a lot of it is new to me.

Looks like I have some reading to do...
0 # Lorna Li 2009-01-05 15:57
Whoa, this is serious stuff. It will take me months to get thru this and fully comprehend...without a day job. How much of a boost to one's earning potential as an SEO do you anticipate from this?
0 # Alex 2009-01-05 16:39
Hey this post is amazing!
Let me just add another great site from Dr. E. Garcia:

Mi Islita is a research site about information retrieval, data mining, and search engine technologies. Our content is frequently referenced or used by both IR scholars and search engine marketers.
0 # MikevanderHeijden 2009-01-05 17:36
Simply wow, awesome post. I have read some of the material but at least 80% of it is new to me.

Thanks for posting this!
0 # Jun 2009-01-05 19:15
Haven't finished reading the patents yet! LOL! Now more reading material. Thanks Dave!
+1 # Dave 2009-01-05 20:40
Well thanks ya'll - it was a labour of love among friends.

How much would one profit from the info? Lol... damned marketers, always after the buck huh? he he... To be honest it's all what one does with it - one important area is 'future-proofing' one's SEO efforts. What works today may not some day soon; watching search evolution ultimately helps one plan ahead in many ways.

As for senor Garcia, we were certainly aware of him (old school here ya know) - but I culled him from the list as his blog doesn't seem all that active anymore unfortunately.

And to the 'wow' 'amazing' stuff... not necessary, just drop by often, hook up on Twitter and just keep this SEO geek from getting lonely ok?

THANK YOU... all for taking the time to comment, makes the effort worth it!! (pretty good chit chat on Sphinn as well if anyone is interested)
0 # Ben McKay 2009-01-06 00:06
Phenomenal resource Dave - I've not seen a compilation of resources like this ever I don't think, so I'm guessing I'm not alone in thanking you for taking the (long) time in compiling it.

I had thought about doing something along the same lines following the guest post I kindly had from Marie-Claire Jenkins from Science for SEO, but the posts you've been putting out there lately are a tad more comprehensive than what I could muster.

I'm heading off to Sphinn to see what the take-up on it is...

Really appreciated, thanks a lot!

Ben ;-)
0 # Jagdeep Singh Pannu 2009-01-06 10:00
Thanks for this awesome post Dave, Bill, Marie and Charles. That's a haystack of information. Bookmarked and Sphinned :-) Will attack the unfamiliar ones one by one. My in-house consultancy stint has kind of thrown me out of sync with the industry as I have a whole lot of e-learning stuff to attend to. Your post is a place, which i can revisit to catch up.
0 # Chris McGiffen 2009-01-06 10:39
Great resource, and nice to see one of my favs up near the top - Managing Gigabytes :D
I would also recommend the less technical, fuzzier "Web Dragons" book by Witten et al as a good starting point for general understanding of IR/SEO issues - although I don't agree with all it says :-)
0 # Dave 2009-01-06 11:33
@jagdeep - THANK YOU for mentioning the rest of my merry band of IR mayhem makers. I will pass along your comment. Once it started, it really did need the help of others to ensure it was a balanced but useful resource. It wouldn't be what it is without them.

@Chris - well, as U might know (from the sounds of it) each study and research paper is skewed. Researchers have varied data sets and to be honest, may not always be that objective - there is plenty to debate.

Thanks for the lead, shall follow it up.
0 # Jagdeep Singh Pannu 2009-01-08 05:21
It would be awesome guys, if you can present and discuss every one of these topics publicly online. If you want to, just let me know and I will set it up in a virtual environment, where anyone can jump in and discuss in real-time. You can embed the event to announce it on your blogs. These sessions are recorded, so anyone can access them later also. You can email me directly if you want to do this. To have a feel, you can follow the link I provided for this post.
0 # Search Engine Optimization 2009-01-08 05:35
Hey thanks for so many resources. And yeah m going to download your SEO handbook. M sure as your articles have proven to be helpful, your book also will be.. :-)
-1 # Bottled aardvark Milk 2009-01-12 01:09
Very useful resources.I like the way you have presented your article.Thanks for sharing!
-1 # Bottled aardvark Milk 2009-01-12 01:11
Do you have any idea about black/white hat seo? where can I get those details?
0 # Robbert 2009-01-12 09:01
Wow, impressive list. I'll be checking it out tonight. Definitely!
0 # web design bournemouth 2009-01-15 17:02
thanks dave - but could you just summarise for me in a tidy email so i don't have to bother reading it all! :P
0 # Jack 2009-01-16 06:18
Thanks for the extensive list..
0 # Lucas Ng 2009-01-18 21:23
My work colleague pointed me to your site, specifically this monster treasure trove of IR papers and links!

Great stuff, love it when search marketers find resources to really sink their teeth into IR.

I'd like to plug the AIRWEB resource: Following the annual AIRWEB contest is a great way to keep tabs on spamming vs. anti-spam!

See-ya on twitter! @lucasng
0 # Dave 2009-01-19 06:31
Glad you found it useful... more search geeks out here than I thought, at least in this neck of the woods.

Nice call on the AIRWeb, I just so happened to mention the 07 and 08 stuff in a post the other day (Tweeted ye the link). AIR is kind of the yin and yang of search to me, what would they do without those nasty manipulators? The world is an imperfect place, what can one do?
+1 # Avi Rappoport, Search Tools Co 2009-05-15 17:17
I just wandered over here and am very impressed by your diligence in assembling all these links. You may also want to keep an eye on the article posts in citeulike, including their IR group

In any case, I'm linking/blogging/citing etc. because this is a very nice resource.
0 # college lamp guy 2009-12-16 15:16
This list has legs...I am still using this reference guide and it's almost a year old. Thank you, Thank you, Thank you!
0 # WaveShoppe 2010-02-03 20:43
By the way Dave, congratulations on your win!
0 # Ben Joven 2010-02-14 00:53
I LOVE MIT'S video courses online!!! =D

The physics courses are awesome! Also MIT offers great courses on Python too, they're available for download on iTunes and I've been learning (slowly but surely)on my iPhone while I run on the treadmill every day.
0 # Matt Pennebaker 2011-10-21 15:03
Good God Man, this is more information than I had in my senior thesis! Great stuff though. Thanks for the post.