The SEO geek's guide to information retrieval
A while back I was ranting on Twitter (and elsewhere) about how few SEO peeps actually study the more technical aspects of how search engines work, including IR concepts (NLP, ML, etc.). You see, it seems logical that those working with (marketing to) search engines might want to know a thing or two about how they work.
The goal of today’s ride isn’t as much to provide you with a framework for learning as it is to start you in a direction. As always, learning how to catch fish is more important than a free lunch. These are simply some interesting directions for SEO technicians looking to dig deeper into the world of knowledge management.
This thing of ours
To me this thing of ours is an art form. The pragmatic can argue the semantics of what art is; regardless, to me it is such. And a dimensionally singular SEO is one that is limited in that art. Much like a traditional artist progresses from still life and pencil to painting and sculpting, from brushes to chisels, an SEO should seek out other elements related to the discipline. We should learn knowledge management and algorithms with the same verve we apply to playing in social media.
The goal of this adventure is to hopefully spark the imagination of SEOs by finding out just how interesting the task of indexing the world’s information is. Many times the goodies listed here aren’t a direct line to ranking glory, but great for identifying future trends and having a feel for modern implementations…
And so today we head down the path leading to SEO geek nirvana…. Mount up and let’s ride!!
The big ass list of information retrieval resources
Right away I want to give a little luv out to a few folks that helped me with this. There is no way this humble SEO geek could get there without a little help from my friends;
Bill Slawski – Search guru and all around great guy
Marie-Claire Jenkins – my newest friend who brings science to SEO
And of course Charles – who sends way too many IR papers at me for a Hawaiian shirt dude
Probably one of the least liked formats of the online addict, we’ll get into some books to get ya’ll started (and move along smartly to the juicy stuff) – here’s a few of interest;
Readings in Information Retrieval. K. Sparck Jones, P. Willett. Morgan Kaufmann, 1997.
Managing Gigabytes: Compressing and Indexing Documents and Images - I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999.
Information Retrieval: Algorithms and Heuristics – David A. Grossman, Ophir Frieder
Information Retrieval: Data Structures & Algorithms. Frakes, W. and Baeza-Yates, R., Prentice Hall, 1992.
Mining the Web: Discovering Knowledge from Hypertext Data by Soumen Chakrabarti
Introduction to Information Retrieval, (and online resources ) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze - Cambridge University Press. 2008
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Language Modeling for Information Retrieval - Croft, W. Bruce; Lafferty, John
Google's PageRank and Beyond: The Science of Search Engine Rankings - Amy N. Langville & Carl D. Meyer
(Free) Courses and learning
Next we have some free online courses that you can dig through at your leisure to help you truly get a feel for the issues, problems and methods used in search. This stuff definitely falls under the uber-geeky and while valuable – isn’t required reading (just damned fun if U ask me).
Introduction to Information Retrieval - Stanford
This is an online version of the book from Cambridge University. This book is the result of a series of courses we have taught at Stanford University and at the University of Stuttgart, in a range of durations including a single quarter, one semester and two quarters. These courses were aimed at early-stage graduate students in computer science.
(can also be found
Information Retrieval - A (free online) book by C. J. van RIJSBERGEN
The major change in the second edition of this book is the addition of a new chapter on probabilistic retrieval. This chapter has been included because I think this is one of the most interesting and active areas of research in information retrieval. There are still many problems to be solved so I hope that this particular chapter will be of some help to those who want to advance the state of knowledge in this area. All the other chapters have been updated by including some of the more recent work on the topics covered.
Introduction to Algorithms – MIT
Description; this course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice.
Information Retrieval Interaction. - P. Ingwersen. Taylor Graham, 1992.
Focuses on user interaction in IR. The aims of the book are to establish a unifying scientific approach to IR – a synthesis based on the concept of IR interaction and the Cognitive Viewpoint; to present research and developments in the field of information retrieval based on a new categorisation; and to generate a consolidated framework of functional requirements for intermediary analysis and design
Knowledge technologies in context – OpenLearn
Aims to develop an understanding of the relationships between information, interpretation, knowledge and computer-based representations and to summarise the range of different technologies that are available and on the horizon, and how they relate to different kinds of knowledge processes
Artificial Intelligence | Machine Learning – Stanford
Description; This course provides a broad introduction to machine learning and statistical pattern recognition. While not directly related to search, probability and machine learning are important aspects of modern indexing and retrieval.
Artificial Intelligence | Natural Language Processing - Stanford
Description; this course develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered.
Probabilistic Systems Analysis and Applied Probability – MIT
Description; introduces students to the modeling, quantification, and analysis of uncertainty. Topics covered include: formulation and solution in sample space, random variables, transform techniques, simple random processes and their probability distributions, Markov processes, limit theorems, and elements of statistical inference.
Tutorial: Web Information Retrieval – Google - Monika Henzinger
A slide presentation of more than 150 slides that can give you a good overview and my fav gal CJ said “This presentation by Monika Henzinger is brilliant” – what more does one need to know?
Interesting Research papers
This area truly is endless as there is a never-ending supply of excellent research out there. For this area I thought we’d look at some goodies that cover a range of topics that are of importance in the IR world currently. Need more? Drop me a line…
Modern Information Retrieval: A Brief Overview - Amit Singhal - Google, Inc.(pdf)
The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval, and a description of where the state-of-the-art is at in the field.
PageRank; The Anatomy of a Large-Scale Hypertextual Web Search Engine – Stanford (pdf)
We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100 and fits well with our goal of building very large-scale search engines.
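For the curious, here is a minimal sketch of the PageRank idea in Python: a page's score is built up from the scores of the pages that link to it. This is a common textbook formulation and not the production system described in the paper. The toy_web graph, the pagerank function name and the iteration count are all made up for illustration; the 0.85 damping factor is the value the paper itself suggests.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Scores are normalised so that they sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site, purely for illustration.
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:7s} {score:.3f}")
```

In this toy graph "home" comes out on top because every other page links to it, while "orphan" sits at the bottom because nothing links to it at all.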
Learning Diverse Rankings with Multi-Armed Bandits – Cornell (pdf)
We propose a new learning to rank problem formulation that differs in three fundamental ways. First, unlike most previous methods, we learn from usage data rather than manually labelled relevance judgments. Usage data is available in much larger quantities and at much lower cost.
Google: Online Search & the Battle for Clicks – MIT (pdf)
This paper looks at how the search engine market evolved and how Google became the dominant player it is today.
Anchor Based Proximity Measures – Stanford (pdf)
Our measures are based on three different propagation schemes and two different uses of the connectivity structure of the graph. We then consider a web-specific application of the above measures with two disjoint anchors: good and bad. The key assumption is that good web pages are highly unlikely to link to bad web pages. The goal is to assign a goodness quality score to all web pages.
Link Spam Detection Based on Mass Estimation (pdf) - Stanford
This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page’s ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming.
Graph based algorithms in natural language processing – MIT (pdf)
A set of 75 slides from a presentation on NLP.
Query Analysis (research papers)
Mining Search Engine Query Logs via Suggestion (pdf)
In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden
Information Re-Retrieval: Repeat Queries in Yahoo’s Logs - Yahoo (pdf)
This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done.
Page Segmentation (research papers)
Vision Based page segmentation algorithm (VIPS) – Microsoft (pdf)
This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception.
Block level link analysis - Microsoft (pdf)
In this paper, we proposed two novel link analysis algorithms called Block Level PageRank (BLPR) and Block Level HITS (BLHITS) which treat the semantic blocks as information units. By using vision-based page segmentation (VIPS) algorithm, we extract page-to-block and block-to-page relationships and then construct a page graph and a block graph. Based on this graph model, the new link analysis algorithms are capable of discovering the intrinsic semantic structure of the web.
Implicit and explicit user feedback (user performance)
Query Chains: Learning to Rank from Implicit Feedback - Cornell (pdf)
This paper presents a novel approach for using clickthrough data to learn ranked retrieval functions for web search results. We observe that users searching the web often perform a sequence, or chain, of queries with a similar information need.
Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search (partially funded via grant from Google)
This paper examines the reliability of implicit feedback generated from clickthrough data and query reformulations in WWW search. Analyzing the users’ decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased.
Improving Web Search Ranking by Incorporating User Behaviour Information - Microsoft (pdf)
We show that incorporating user behavior data can significantly improve ordering of top results in real web search setting. We examine alternatives for incorporating feedback into the ranking process and explore the contributions of user feedback compared to other common web search features.
Optimizing Search Engines using Click through Data – Cornell (pdf)
The goal of this paper is to develop a method that utilizes click through data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such click through data is available in abundance and can be recorded at very low cost.
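To make the idea of training from clicks a bit more concrete, here is a rough Python sketch of one commonly described heuristic: if a user clicks a result but skips one ranked above it, treat that as a hint that the clicked result was preferred. That "clicked beats skipped-above" rule is my simplification for illustration, and the function name, document IDs and click log are all hypothetical. The preference pairs it produces are the sort of thing a ranking learner could then be trained on.

```python
def preferences_from_clicks(ranking, clicked):
    """Turn one query's click log into (preferred, less_preferred) pairs.

    ranking: document IDs in the order they were shown.
    clicked: document IDs the user clicked.
    Assumption: a clicked result is preferred over every unclicked
    result that was shown above it.
    """
    clicked = set(clicked)
    pairs = []
    for position, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        for above in ranking[:position]:
            if above not in clicked:
                pairs.append((doc, above))
    return pairs


# Hypothetical log entry: five results shown, three of them clicked.
shown = ["d1", "d2", "d3", "d4", "d5"]
clicks = ["d1", "d3", "d5"]
pairs = preferences_from_clicks(shown, clicks)
print(pairs)
```

Notice the click on d1 tells us nothing here: it was already at the top, so there was nothing above it to skip. That is one small taste of why clicks are a biased signal.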
Object Level Search (research papers)
Search Objective Gets a Refined Approach – Microsoft (pdf)
Object-Level Vertical Search takes a refined approach that is a significant advance from traditional Web search. The latter paradigm is based on a page-level relevance ranking approach, in which pages that receive links from many other pages are adjudged to have more value by the very fact that they are popular. If more people link to a given page, it must have something to offer—that is the presumption.
Object Level Ranking: Bringing Order to Web Objects – Microsoft (pdf)
This paper introduces PopRank, a domain-independent object-level link analysis model to rank the objects within a specific domain. Specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors.
Web Object Retrieval – Microsoft (pdf)
In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performances. We conclude that the hybrid model is the superior by taking into account the extraction errors at varying levels.
Object Level Vertical Search – Microsoft (pdf)
In this paper, we introduce the overview and core technologies of object-level vertical search engines that have been implemented in two working systems: Libra Academic Search (http://libra.msra.cn) and Windows Live Product Search (http://products.live.com ).
Corroborate and Learn Facts from the Web - Google (pdf)
This paper describes a robust bootstrapping approach to corroborate facts and learn more facts simultaneously. This approach starts with retrieving relevant pages from a crawl repository for each entity in the seed set. In each learning cycle, known facts of an entity are corroborated first in a relevant page to find fact mentions.
(and Bill has even more on Object level search)
Information retrieval related videos
This section had to be one of my favorites. I say that because I tend to have them running in the background and can listen to them while working. If you want to focus more, the video lectures from ‘Video Lectures’ come with accompanying slides, which is handy as most lectures we see don’t have that. There are also videos from IM Broadcast, YouTube and others...
The Future of Information Retrieval Part 1 and Part II
Interviews with prominent experts in the field of information retrieval on the future of IR.
Google search quality – Google Roundtable
Large Scale Search System Infrastructure and Search Quality - Fellows Jeff Dean and Amit Singhal share their insights into how search works at Google.
Overview of how Search Engines work – UC Berkeley
The World Wide Web brings much of the world's knowledge into the reach of nearly everyone with a computer and an internet connection. The availability of huge quantities of information at our fingertips is transforming government, business, and many other aspects of society.
Human-Computer Information Retrieval Lecture - Gary Marchionini
He discusses the intersection of information retrieval and human-computer interaction and the challenges of exploratory search.
Google Whacks for Profit and Fun – Google tech talks
A Google study of the number of Internet search results returned from multi-word queries, based on the number of results returned when each word is searched.
Machine learning and translation – Google tech talks
This is an interesting presentation on probabilistic learning and dealing with better understandings of user intent. Kind of heavy lifting for the search geeks, but still worth watching for any SEO.
Information Retrieval and Text Mining - Thomas Hofmann, Brown University
This four hour course will provide an overview of applications of machine learning and statistics to problems in information retrieval and text mining. More specifically, it will cover tasks like document categorization, concept-based information retrieval, question-answering, topic detection and document clustering, information extraction, and recommender systems.
Faceted metadata in search engines – UC Berkeley
The availability of huge quantities of information at our fingertips is transforming government, business, and many other aspects of society. Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities.
Universal Modeling: Introduction to modern MDL - A tutorial introduction to the *modern* Minimum Description Length (MDL) Principle, taking into account the many refinements and developments that have taken place in the 1990s. These do not seem to be widely known outside the information theory community. We will especially emphasize the use of MDL in classification.
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI)
Information retrieval in the vector space model is based on literal matching of terms in the documents and the queries. The model is implemented by creating the term-document matrix, which is formed on the base of frequencies of terms in documents.
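Here is a small Python sketch of what that term-document matrix looks like in practice, and why "literal matching" matters. It is only an illustration under simple assumptions: whitespace tokenising, raw term counts and cosine similarity as the scoring function. The three documents, the query and the function names are all invented for the example.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Build a matrix of raw term counts: one row per term, one column per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    counts = [Counter(doc.lower().split()) for doc in docs]
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocab]
    return vocab, matrix


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query, docs):
    """Score every document against the query in the shared term space."""
    vocab, matrix = term_document_matrix(docs)
    query_counts = Counter(query.lower().split())
    query_vector = [query_counts[term] for term in vocab]
    return [
        cosine(query_vector, [row[j] for row in matrix])
        for j in range(len(docs))
    ]


# Hypothetical mini-collection, purely for illustration.
docs = [
    "cheap car insurance quotes",
    "automobile insurance for new drivers",
    "car repair and car parts",
]
scores = score("car", docs)
for doc, value in zip(docs, scores):
    print(f"{value:.3f}  {doc}")
```

The second document scores zero for the query "car" even though it is clearly about the same thing, because "automobile" is a different literal term. That gap is exactly what techniques like LSI and concept indexing try to close.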
Extracting Semantic Relations from Query Logs - Ricardo Baeza-Yates, Yahoo! Research
In this paper we study a large query log of more than twenty million queries with the goal of extracting the semantic relations that are implicitly captured in the actions of users submitting queries and clicking answers. Previous query log analyses were mostly done with just the queries and not the actions that followed after them.
Applications of Query Mining - Ricardo Baeza-Yates, Yahoo! Research
More from Yahoo to go along with the last one.
Dirichlet Processes, Chinese Restaurant Processes, and all that - Michael I. Jordan, University of California
Nonparametric Bayesian methods offer a way to make use of the Bayesian calculus without the parametric handcuffs. In this talk I describe several recent explorations in nonparametric Bayesian modeling and inference, including various versions of "Chinese restaurant process priors" that allow flexible structures to be learned and allow sharing of statistical strength among sets of related structures.
Interactively Optimizing Information Systems as a Dueling Bandits Problem - Yisong Yue, Department of Computer Science, Cornell University
We present an online learning framework tailored towards real-time learning from observed user behavior in search engines and other information access systems.
Implicit feedback learning in semantic and collaborative information retrieval systems - Gérard Dupont, EADS
This presentation tries to provide an overview of one way to resolve those gaps: using feedback learning. The aim is to make the system learn from user behaviour in order to better define the user's current needs. Machine learning algorithms applied to signals coming from users while they perform a search can lead to an understanding of what is really relevant to them, which can then be exploited to help them during their tasks.
Practical Applications of Natural Language Processing in Assistive Technology - Google Tech Talks
Ken Ingham, Ph.D. will describe the architecture and motivation behind the development of Amazability, Inc.'s Adept1 product.
Current Approaches to Personalized Web Search - Paul-Alexandre Chirita, University of Hannover
Personalized Web Search Engine for Mobile Devices - Vasudeva Varma, IIIT Hyderabad
A Social Network Based Approach to Personalized Recommendation of Participatory Media Content - Aaditeshwar Seth, School of Computer Science; University of Waterloo
Machine Learning, Probability and Graphical Models - Sam Roweis, Department of Computer Science, University of Toronto
Proactive Information Retrieval by User Modeling from Eye Tracking - Jarkko Salojärvi, Helsinki University of Technology
Now you can call me biased (and you’d be right) but patents are always a great source of insight into how the gang at each of the big 3 may be thinking (now and historically). Bill has been a huge influence and help the last few years, and patents are where my growing fascination really started to kick into high gear. Unfortunately the list is endless and so we’ll keep it to some of the more fundamental and interesting ones worth looking into;
General IR related patents;
System and method for characterizing a web page using multiple anchor sets of web pages – Yahoo (sort of like TrustRank concepts)
Regression framework for learning ranking functions using relative preferences (machine learning) - Yahoo
System and method for determining semantically related terms using an active learning framework – Yahoo
Using link structure for suggesting related queries – Microsoft
Detecting Duplicate and near-duplicate files - Google
Searching to identify web page(s) – Microsoft
Method and system for creating improved search queries – Google
Extraction of information from documents - Microsoft
Personalized search/ behavioural signals
Systems and methods for analyzing a user's web history – Google
Systems and methods for modifying search results based on a user's history - Google
Method and apparatus for learning a probabilistic generative model for text - Google
Search system using user behaviour data - Microsoft
User sensitive (personalized) PageRank – Yahoo
Re-ranking search results based on query log – Microsoft
(covered by Bill )
User Distributed Search Results – Google
Search pogosticking benchmarks – Yahoo
Using search trails to provide enhanced search interaction - Microsoft
Personalization of web page search rankings – Microsoft
Accounting for behavioral variability in web search – Microsoft
User query data mining techniques – Yahoo
Bookmarks and Ranking - Google
Document segmentation based on visual gaps – Google
System and method for detecting a web page – Yahoo
Vision-based document segmentation – Microsoft
Retrieval of structured documents - Microsoft
Systems and methods for analyzing boilerplate - Google
Historical ranking factors
Information retrieval based on historical data - Google
Keyword usage score based on frequent impulse and frequency rate (and covered by Bill) - Microsoft
Calculating importance of documents factoring historical importance. - Microsoft
Temporal ranking of Search results - Microsoft
Document scoring based on query analysis – Google
System and method for providing preferred country biasing of search results – Google
System for providing geographically relevant content to a search query with local intent – Yahoo
System for determining local intent in a search query - Yahoo
Detecting a user's location, local intent and travel intent from search queries - Microsoft
Phrase Based IR and semantics
Phrase-based indexing in an information retrieval system - Google
Phrase Identification in an Information Retrieval System - Google
Phrase-Based Generation of Document Descriptions - Google
Phrase-Based Searching in an Information Retrieval System - Google
Phrase-based indexing in an information retrieval system - Google
Automatic taxonomy generation in search results using phrases - Google
Diverse Topic Phrase Extraction (using LSA) – Microsoft
Synonym and similar word page search (more semantics) - Microsoft
Fact Extraction / Object Level Search
Generating structured information - Google
Learning facts from semi-structured text - Google
Designating data objects for analysis - Google
Detecting spam documents in a phrase based information retrieval – Google
Discovering and determining characteristics of network proxies – Yahoo
Search Ranger System and Double-Funnel Model for Search Spam Analyses and Browser Protection (cloaking) – Microsoft
Web spam page classification using query dependent data – Microsoft
Detecting web spam from changes to links of websites – Microsoft
Method for node ranking in a linked database - Google
Method for scoring documents in a linked database - Google
Methods for ranking nodes in large directed graphs - Google -- covered on SEO by the Sea
You can find a list of more search patents from 2008 here; Google – Yahoo – Microsoft
ACM Transactions on Information Systems (TOIS):
Information Processing and Management (IP&M):
International Journal on Digital Libraries:
Journal of the American Society of Information Science and Technology (JASIST):
Journal of Documentation
Data & Knowledge Engineering:
Information Processing Letters:
Journal of Intelligent Information Systems:
Knowledge and Information Systems:
Foundations and Trends in Information Retrieval:
Resources and tools
10 free NLP tools for the SEO – Science for SEO
Glossary (Modern Information Retrieval) - Berkeley
Information retrieval research links @ Search Tools – Search Tools.com
Information Retrieval Links - BUBL
Information Retrieval Systems - LSU
Open Directory: Information Retrieval Links - ODP
Indexing Resources - UBC
IR & Neural Networks, Symbolic Learning, Genetic Algorithms
Stop list (a list of stop words) - MIT
NLP resources - Chris Manning
Text mining links - Weiguo Patrick Fan's
LSA/LSI source code & tools – Science for SEO
Blogs and websites
IR related Blogs
Geeking with Greg
Jeff's Search Engine Caffe
Daniel Lemire's blog
Natural language processing blog
SEO by the Sea
Huomah.com (careful, I hear he's whacked)
Know of others? Please let me know as I am admittedly light here....
Other IR websites;
Information Retrieval Specialist Group
Information Retrieval (newsletters- free and paid) Springer
Fast Search white papers - Fast Search
Wiki – Association for Computational linguistics
Latent Semantic Analysis – Colorado U
That should keep you busy, oui?
And there you have it. When this journey began it was in frustration but along the ride I fell deeper into the abyss that is my fascination (near obsession?) with all things search. Having the holidays to get immersed was timely and hopefully there were some finds that you take with you.
The ultimate goal of this exploration is to merely encourage a sense of understanding, fire up the imagination and hopefully stir the passion. When those in the SEO industry speak of ‘standards’ and pontificate the future, pausing to look at the IR community may better serve us all. Should we seek to avoid being labelled link whores and hype merchants, having deeper technical skills and knowledge may lead us to that end. This is my challenge to all those in the world of search engine optimization.
I can’t thank those that helped with this post enough. Each lead turned me in new directions and fired up my passions again. Gone is the angst of the bickering and drama, replaced by new paths into creative bliss :0)