How search engines could get granular
One area worth looking at for SEO in 2009 (for me at least) is page segmentation. The approach really isn't new; I came across papers dating as far back as 1997 and beyond. But unsurprisingly, most IR methods don't just appear overnight. The big three (of search) have each had various research papers and patents on it dating back to 2003-04. It just seems to be getting some traction now, and it's sensible as well.
Essentially, page segmentation is when a search engine breaks a given web page down into its component parts. The engine can analyze a page and assign relevance or importance scores to different regions of it. Some of the methods include fixed-length page segmentation (FixedPS), DOM-based segmentation (DomPS), segmentation based on location and white space (vision-based page segmentation, or VIPS), and even a combined approach (CombPS).
As with many IR methodologies, the goal is to improve the signal-to-noise ratio. In this case, by identifying the noisy segments, resources can be focused on the relevant areas of a web page. Furthermore, most people do tend to understand web pages in a segmented or structured way. When you arrived at this page, did you instinctively know where to find the main content? Are you aware of the common locations for navigation and other elements? Banner blindness? You get the idea.
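To make that concrete, here is a rough sketch (in Python, using BeautifulSoup) of how a DOM-style segmentation pass might flag noisy regions. The link-density heuristic, the 0.5 threshold and the tag list are my own assumptions for illustration, not anything the engines have confirmed using.

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def segment_page(html, noise_threshold=0.5):
    """Split a page into candidate blocks and flag likely-noisy ones by link density."""
    soup = BeautifulSoup(html, "html.parser")
    segments = []
    for block in soup.find_all(["nav", "aside", "article", "section", "footer", "header", "div"]):
        text = block.get_text(" ", strip=True)
        if not text:
            continue
        link_text = " ".join(a.get_text(" ", strip=True) for a in block.find_all("a"))
        link_density = len(link_text) / len(text)
        # Heuristic: blocks dominated by anchor text look like navigation/boilerplate.
        segments.append({
            "tag": block.name,
            "chars": len(text),
            "link_density": round(link_density, 2),
            "is_noise": link_density > noise_threshold,
        })
    return segments

if __name__ == "__main__":
    sample = """
    <html><body>
      <nav><a href="/">Home</a> <a href="/blog">Blog</a> <a href="/about">About</a></nav>
      <article>Page segmentation breaks a document into regions so an engine can
      weight the main content differently from the chrome that surrounds it.</article>
      <footer><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></footer>
    </body></html>"""
    for seg in segment_page(sample):
        print(seg)

Run against the sample markup, the nav and footer come back as noise and the article as content, which is roughly the separation a segmentation pass is after.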
Advantages of page segmentation
The main advantages are increased relevance and streamlined processing. Search engines hope to use page segmentation to assess a given page's relevance at a more granular level, but also (theoretically) to be capable of dealing with multi-topic pages, semantically related or not.
The second advantage, processing and resource management, comes from the fact that they could define site templates in an attempt to crawl/index only the relevant parts of a page and not the boilerplate elements.
Now, while there are a few ways of going about it, what's important here is that such systems are sensible not only from a relevancy perspective, but could also help with crawling and indexing resource management.
One has to imagine new ideas at the big three will be tempered by a volatile economy. Once a template has been established, indexing a site on a regular basis could be far easier on a search engine (and the site owner as well). Just have a little 'template bot' crawl a few pages now and again to ensure the profile is unchanged… but I'm rambling now…
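Here is a toy version of that template idea, just to show the shape of it: sample a few pages from a site, fingerprint each block of text, and treat blocks that repeat across most pages as template/boilerplate. The hashing approach, the tag list and the 80% cut-off are my own assumptions for the sake of a sketch, not how any engine actually does it.

import hashlib
from collections import Counter

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

BLOCK_TAGS = ["nav", "aside", "article", "section", "footer", "header"]

def block_fingerprints(html):
    """Hash the text of each candidate block on a page."""
    soup = BeautifulSoup(html, "html.parser")
    prints = set()
    for block in soup.find_all(BLOCK_TAGS):
        text = block.get_text(" ", strip=True)
        if text:
            prints.add(hashlib.md5(text.encode("utf-8")).hexdigest())
    return prints

def build_template_profile(pages, min_share=0.8):
    """Blocks whose text repeats on at least min_share of sampled pages count as template."""
    counts = Counter()
    for html in pages:
        counts.update(block_fingerprints(html))
    cutoff = len(pages) * min_share
    return {fp for fp, n in counts.items() if n >= cutoff}

def content_blocks(html, template):
    """Return the text of blocks that are NOT part of the site-wide template."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for block in soup.find_all(BLOCK_TAGS):
        text = block.get_text(" ", strip=True)
        if text and hashlib.md5(text.encode("utf-8")).hexdigest() not in template:
            out.append(text)
    return out

In this framing, the 'template bot' above would just re-run build_template_profile on a handful of pages every so often and check that the fingerprints haven't changed, while the full crawler only bothers with whatever content_blocks returns.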
Another implementation (as noted in the Google patent) could be for pages with a number of listings that are geographic in nature. A search for 'stone oven pizza, Toronto' could produce better results, as large listings of pizza shops in Toronto could be segmented and digested with finer-grained parameters than usual.
“The text associated with the smallest hierarchical level surrounding a business listing may be associated with that business listing” – from the patent, Document segmentation based on visual gaps
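One plausible reading of that line, sketched in Python: find the text node containing a listing's name, then associate the text of its smallest enclosing block-level element with that listing. The helper, the tag list and the sample listings are invented for illustration; this is not the patent's actual method.

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

BLOCK_TAGS = {"li", "td", "p", "div", "section", "article"}

def text_for_listing(html, listing_name):
    """Return the text of the smallest block-level element enclosing the listing name."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=lambda s: s and listing_name in s)
    if node is None:
        return None
    # Walk up from the matched text node to its nearest block-level ancestor,
    # i.e. the "smallest hierarchical level surrounding" the listing.
    for parent in node.parents:
        if parent.name in BLOCK_TAGS:
            return parent.get_text(" ", strip=True)
    return None

if __name__ == "__main__":
    page = """
    <ul>
      <li>Mario's Pizzeria - stone oven pizza - 123 Queen St W, Toronto</li>
      <li>Napoli Forno - wood-fired pies - 456 College St, Toronto</li>
    </ul>"""
    print(text_for_listing(page, "Mario's Pizzeria"))

The payoff is that the address, neighbourhood and descriptive text sitting in the same list item get tied to that one pizzeria rather than to the page as a whole, which is exactly what a local query like the one above would benefit from.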