How search engines could get granular
One area that is worth looking at for SEO in 2009 (for me at least) is page segmentation. This approach really isn’t new; I came across papers dating as far back as 1997 and earlier. Unsurprisingly, most IR methods don’t just appear overnight. The big three (of search) have each had various research papers and patents dating back as far as 2003-4. It just seems to be gaining some traction, and it is sensible as well.
Essentially, page segmentation is when a search engine looks to break a given web page down into its component parts. It could analyze a web page and assign relevance or importance scores to different regions of the page. Some of the methods include fixed-length page segmentation (FixedPS), DOM-based segmentation (DomPS), vision-based segmentation that uses location and white space (VIPS) and even a combined approach (CombPS).
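To make the simplest of these concrete, here is a minimal sketch of the fixed-length idea: cut the page text into passages of a set number of words. The function name, the window size and the overlap option are my own assumptions for illustration, not details taken from any of the papers.

```python
def fixed_length_segments(text, window=50, overlap=0):
    """Split text into passages of `window` words.

    Consecutive passages share `overlap` words. Both values are
    hypothetical dials, not settings from any search engine.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + window]))
        # Stop once the current passage has reached the end of the text.
        if start + window >= len(words):
            break
    return segments


if __name__ == "__main__":
    for passage in fixed_length_segments("a b c d e f g", window=3):
        print(passage)
```

The DOM and vision based methods try to do better than this by cutting along boundaries a reader would actually recognize, rather than at an arbitrary word count.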
As with many IR methodologies, the aim is to improve the signal-to-noise ratio. In this case, by identifying the noisy segments, resources can be focused on the relevant areas of a web page. Furthermore, most people do tend to understand web pages in a segmented or structured view. When you arrived at this page, did you instinctively know where to find the main content? Were you aware of the common locations for navigation and other elements? Banner blindness? You get the idea.
Advantages of page segmentation
The main advantages are increased relevance and more streamlined processing. Search engines hope to use page segmentation to assess a more fine-grained understanding of a given page’s relevancy, but also (theoretically) to be capable of dealing with multi-topic pages, semantically related or not.
The second advantage, processing and resource management, can be achieved as they could define site templates in an attempt to only crawl/index the relevant parts of the page and not the boilerplate elements.
Now, while there are a few ways of going about it, what’s important here is that such systems are sensible not only from a relevancy perspective, but could also help crawling and indexing resource management.
One has to imagine new ideas at the big three will be tempered in a volatile economy. Once a template has been established, indexing a site on a regular basis could be far easier on a search engine (and site owner as well). Just have a little ‘template bot’ crawl a few pages now and again to ensure the profile is unchanged.. but I’m rambling now…
Another implementation (as noted by the Google patent) could be pages that have a number of listings that are geographic in nature. A search for ‘stone oven pizza, Toronto’ could produce better results, as large listings of pizza shops in Toronto could be segmented and digested using finer-grained parameters than normal.
“The text associated with the smallest hierarchical level surrounding a business listing may be associated with that business listing” – Patent; Document segmentation based on visual gaps
Segmenting the page
The nuts and bolts I shan’t trouble you with (links later as always), but the methods vary from code analysis (DOM) approaches to vision-based ones. The main idea is establishing the common (boilerplate) segments of a web page… And from there the systems can be set to even more granular levels to find an optimal setting (playing with the dials).
Boilerplate-type elements can include headers, footers and navigational elements found throughout a site or on a single page. When looking at multiple pages, common elements can also be identified, such as the phrase ‘Copyright 2009’. Within a piece of content, common boilerplate elements such as a copyright notice or navigational links (‘Home’, ‘Contact Us’) can also be identified and disregarded as needed. The same can be said of advertisements and other blocks of information found throughout a website.
By disregarding such boilerplate text during indexation, the search engine can also attain greater relevance and save processing resources.
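As a rough illustration of the multi-page idea, the sketch below treats any text block that shows up on most pages of a site as boilerplate. The 80% threshold, the function names and the sample pages are all assumptions of mine; the patents describe far more involved approaches.

```python
from collections import Counter


def normalize(block):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(block.lower().split())


def find_boilerplate(pages, min_share=0.8):
    """pages: list of pages, each a list of text blocks.

    Returns the normalized blocks found on at least `min_share` of the
    pages. The threshold is a hypothetical dial, not a known value.
    """
    if not pages:
        return set()
    counts = Counter()
    for blocks in pages:
        # Count each block once per page, however often it repeats there.
        counts.update({normalize(b) for b in blocks})
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def strip_boilerplate(blocks, boilerplate):
    """Keep only the blocks that were not flagged as boilerplate."""
    return [b for b in blocks if normalize(b) not in boilerplate]


if __name__ == "__main__":
    site = [
        ["Home | Contact Us", "Pizza ovens explained", "Copyright 2009"],
        ["Home | Contact Us", "Choosing firewood", "Copyright 2009"],
        ["Home | Contact Us", "Dough basics", "Copyright 2009"],
    ]
    common = find_boilerplate(site)
    print(strip_boilerplate(site[0], common))
```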
Now, obviously we’re a content-centric bunch in the SEO world, and so understanding how they might look to identify the ‘content’ areas of a page is paramount (more later on the why). Elements often cited are:
- Number of images in the block
- Size of the images
- Number of links
- Anchor text length
- Number of words
- Length of form elements
- Text formatting elements (<strong> <text> <i> <em>)
- Other code elements (<table> <p> <hr> <ul> <td>)
- Background color of a node (or child node)
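A few of the signals in the list above can be counted with nothing more than Python’s standard HTML parser. This is only a sketch: the feature names, the choice of formatting tags and the ‘link density’ ratio are my own assumptions, and visual signals such as image size or background color would need a rendering engine.

```python
from html.parser import HTMLParser

# Formatting tags counted here are an assumed subset for illustration.
FORMAT_TAGS = {"strong", "b", "i", "em"}


class BlockFeatures(HTMLParser):
    """Counts simple per-block signals in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.links = 0
        self.words = 0
        self.anchor_words = 0
        self.format_tags = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self.links += 1
            self._anchor_depth += 1
        elif tag in FORMAT_TAGS:
            self.format_tags += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.split())
        self.words += count
        # Words inside an open <a> element count as anchor text.
        if self._anchor_depth:
            self.anchor_words += count


def block_features(html):
    """Return a dict of counts for one block of markup."""
    parser = BlockFeatures()
    parser.feed(html)
    parser.close()
    density = parser.anchor_words / parser.words if parser.words else 0.0
    return {
        "images": parser.images,
        "links": parser.links,
        "words": parser.words,
        "anchor_words": parser.anchor_words,
        "format_tags": parser.format_tags,
        "link_density": density,
    }


if __name__ == "__main__":
    nav = '<ul><li><a href="/">Home</a></li><li><a href="/contact">Contact Us</a></li></ul>'
    body = '<p>Stone ovens bake <em>fast</em> and hot. <a href="/more">Read more</a></p>'
    print(block_features(nav))
    print(block_features(body))
```

A navigation block tends to be almost all anchor text, while a content block is mostly plain words with the odd link, which is the kind of contrast these features are meant to surface.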
Also, the size and position of the block can give added signals as to where the core content of the page resides. In most situations, this is the space search optimizers deal in. This is where we really play; this isn’t ’03, and site-wide footer links sucked long ago.
Now it’s not all dill and pickles…there are a few potential issues with page segmentation systems.
Not all people value the same parts of a page, given the different types of sites out there. For example, some may look at the stories on the home page of a social media site while others may look to the latest comments… Because of this, setting the thresholds and valuations of segments is problematic. Even the boilerplate concepts suffer from this (as the ‘Top 10’ and comments are on every page at a site like Sphinn).
One might also have varied feelings of importance to navigational elements when not directly on a page of interest (thus editorial areas can also vary). As with the above example, navigating to the ‘search marketing’ segment may be of greatest interest to me. What of a side panel element of ‘related topics’ which may or may not be of great importance to the viewer? The point being that it’s not a golden egg entirely.
As you can see, adoption still has a few mountains to traverse, but even in a limited capacity there are uses… one really jumps to mind (and shame on you if you weren’t already asking)…
Yes, that’s right… you know we just had to end up here right? One of the things that really drew me into this topic more and more over the last year is the potential implication for links. One would have to imagine the link dominant search engines would welcome a system that could potentially provide more granular levels of link valuations (on site and off).
To begin with, page segmentation can help bolster link analysis methods such as PageRank, HITS and their ilk. Or so the story goes. Consider a page with a variety of content, semantically related or not, complete with links (internal or external). Traditional analysis treats the page as a whole, and thus link relevance can be affected by the lack of a focused theme. If search engines can begin to break out blocks of information, independent of the whole, new valuations can be had for links from within a single document. In short, there could be more link juice to go around.
Another interesting element would be the ability to build links to a multi-semantic page with diverse anchor texts. Many times in SEO one creates target page(s) built around terms and builds related links to that page. This has always made ecommerce SEO a struggle between clicks to purchase and SEO readiness as far as structural elements and ‘landing pages’ are concerned. Page segmentation methods mean we could build more diversified link profiles to a given page (such as a main category page in the case of the ecommerce example).
Think of links from a block-to-page level and page-to-block (instead of say PageRank which is page-to-page). One can see how greater relevance from link analysis can be had.
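To picture the block-to-page and page-to-block idea, here is a toy sketch: each block carries an importance score within its page and passes that score along, split evenly over the links it contains, and the resulting weighted page graph is fed into an ordinary PageRank-style iteration. This is loosely inspired by the block-level link analysis papers and is not their actual method; the importance numbers, function names and damping value are assumptions for illustration.

```python
def page_graph_from_blocks(blocks):
    """blocks: list of (source_page, importance, [target_pages]).

    Returns {source: {target: weight}}. Each block passes its importance
    (a hypothetical score) to its targets, split evenly across its links.
    """
    graph = {}
    for source, importance, targets in blocks:
        if not targets:
            continue
        share = importance / len(targets)
        row = graph.setdefault(source, {})
        for target in targets:
            row[target] = row.get(target, 0.0) + share
    return graph


def rank_pages(graph, damping=0.85, iterations=50):
    """Plain power iteration over the weighted page graph."""
    pages = set(graph)
    for row in graph.values():
        pages.update(row)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for source in pages:
            row = graph.get(source, {})
            total = sum(row.values())
            if total == 0:
                # A page with no outgoing weight spreads its rank evenly.
                for p in pages:
                    new[p] += damping * rank[source] / n
            else:
                for target, weight in row.items():
                    new[target] += damping * rank[source] * weight / total
        rank = new
    return rank


if __name__ == "__main__":
    blocks = [
        ("A", 0.9, ["B"]),        # main content block on page A
        ("A", 0.1, ["C", "D"]),   # footer block on page A
        ("B", 1.0, ["A"]),
        ("C", 1.0, ["A"]),
        ("D", 1.0, ["A"]),
    ]
    ranks = rank_pages(page_graph_from_blocks(blocks))
    for page in sorted(ranks):
        print(page, round(ranks[page], 3))
```

In the sample data, the single link from the main content block of page A ends up worth far more than either footer link, even though all three sit on the same page.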
Now this can be a double-edged sword, in that block-level link analysis (such as BLPR) could play into the valuation of links. This could mean wholesale devaluation or dampening of certain link types. These could include:
- Advertising blocks
- Blog comment links
- Header/footer links
- Site wide links
- Forum signatures
As you can imagine, this would put a high premium on editorial (diverse) links within content. The ‘boilerplate’ models could also easily pick out mass paid link programs and article marketing links as common (unnatural) elements.
To me, much of what has been looked at in page segmentation has interesting applications in spam detection. For starters, more complex template and content analysis is bound to turn up many boilerplate websites, such as those produced by web spammers. On a granular level, poorly generated content for spam sites could in itself create boilerplate text right within the content.
Beyond that, certain template systems employed by spammers over and over can be identified and cross-referenced with other factors (link spam analysis, IPs, whois) to profile spammers. And speaking of link spam, these systems could also identify common locations in which boilerplate link texts are showing up for a given link profile. Ultimately, any IR system should have some spam detection capabilities, and these methods satisfy that on a few levels.
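One simple way to picture the template matching is a ‘skeleton’ fingerprint: throw away the text, keep only the tags and class names, and hash what is left. Pages that share a fingerprint were very likely built from the same template. This is purely my own illustration under those assumptions, not something described in the patents, and a real system would need to tolerate small differences rather than demand an exact match.

```python
import hashlib
from html.parser import HTMLParser


class _Skeleton(HTMLParser):
    """Collects tags and class names, ignoring all text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.parts.append(f"<{tag} {css_class}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def template_fingerprint(html):
    """Hash of the page's markup skeleton (a hypothetical template ID)."""
    parser = _Skeleton()
    parser.feed(html)
    parser.close()
    skeleton = "".join(parser.parts)
    return hashlib.sha1(skeleton.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page_one = '<div class="post"><h1>Cheap pills</h1><p>Buy now</p></div>'
    page_two = '<div class="post"><h1>Cheap loans</h1><p>Apply today</p></div>'
    print(template_fingerprint(page_one) == template_fingerprint(page_two))
```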
As mentioned above some elements of block level link analysis could be used to identify linking schemes in concert with existing methods. Consider large scale paid blog reviews or article marketing campaigns where the template changes from site to site, blog to blog, but the main content (once identified) contains identical anchor texts/author bios. Analysis on a page by page basis wouldn’t be as effective as a block by block analysis.
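The paid review scenario can be sketched in a few lines, assuming the main content block of each page has already been isolated and its links extracted. The data layout, the function name and the three-site threshold are all hypothetical.

```python
from collections import defaultdict


def repeated_content_links(pages, min_sites=3):
    """pages: iterable of (site, [(anchor_text, target_url), ...]).

    The links are assumed to come from each page's main content block
    only. Returns {(anchor, target): sorted list of sites} for links
    seen on at least `min_sites` different sites (an assumed threshold).
    """
    seen = defaultdict(set)
    for site, links in pages:
        for anchor, target in links:
            key = (" ".join(anchor.lower().split()), target)
            seen[key].add(site)
    return {key: sorted(sites) for key, sites in seen.items()
            if len(sites) >= min_sites}


if __name__ == "__main__":
    crawl = [
        ("blog-one.example", [("Stone Oven Pizza Toronto", "http://pizza.example/")]),
        ("blog-two.example", [("stone oven pizza  Toronto", "http://pizza.example/")]),
        ("blog-three.example", [("stone oven pizza toronto", "http://pizza.example/"),
                                ("my other post", "http://blog-three.example/post")]),
    ]
    print(repeated_content_links(crawl))
```

The templates around the content can differ as much as they like; once the content blocks are compared directly, the identical anchor text and target stand out.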
What could it mean to your SEO?
And so what does any of this mean to you? Well, to be honest, we don’t know if any of the big three implement page segmentation concepts on any level, though Microsoft certainly has had a strong addiction to the topic over the years. As with many search applications, the end user experience is a running concern. Many of the adaptations we’d make with such methods in mind would ultimately make for a better and more concise end user experience. At the very least, we can improve usability and prepare for potential search evolutions all at the same time.
Some key takeaways could be:
- Create distinct blocks when constructing pages and ensure it is obvious where the content is.
- Use CSS to define text types; when possible ensure content text face is unique
- Use tables or unique div background colors for child (related) topics within content.
- Present content in semantic page formatting; titles, italics and bold to segment concepts
Ultimately, if such systems did gain traction, it would become increasingly possible to rank a single page for a variety of terms, beyond the abilities we currently have for targeting. One place these concepts might apply is ecommerce, with its varied (though semantically related) product lines. Let’s look at the following example:
*courtesy Wave Shoppe – Hawaiian clothing
In an instance such as this, formatting and targeting the text within each of the blocks becomes of great importance. We could also consider alternating the background colors of the product nodes to define them as unique segments. You can also ensure the upper and lower segments are properly targeted as a parent or child node of the overall presentation.
For all we know, search engines could look at top performing pages in a query space and analyze them for semantic blocks and other segmentation variables to create new signals for other pages in the set (query space). We can say that there is interest, potential and even likely advantages from processing and spam detection standpoints. What we can’t say is how deep or valuable page segmentation will be to search engines in the future.
So far Microsoft and Yahoo seem to have the most interest, although I wouldn’t count Google out as what I read of ‘block-level-pagerank’ seemed promising. I wouldn’t go changing how you optimize just yet, but tuck this one away. It is something to watch for in 2009… just in case.
VIPS a Vision-based Page Segmentation Algorithm – Microsoft : Nov. 1 2003 : Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma
Block Based Web Search – Microsoft research : 2004 : Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma
Block-level Link Analysis : Microsoft : 2004 : Deng Cai, Xiaofei He, Ji-Rong Wen, Wei-Ying Ma
Learning Block Importance Models for Web Pages - Microsoft : 2004 -
Page-level Template Detection via Isotonic Smoothing : Yahoo : May 2007 : Deepayan Chakrabarti, Ravi Kumar, Kunal Punera
Also somewhat related is the WebTables project, which is similar to Google Sets but looks for data or layout elements via HTML etc… (Hat tip to Bill on that angle):
Uncovering the relational Web - Google
WebTables: Exploring the Power of Tables on the Web - Google
Vision-based document segmentation – Microsoft : filed; July 28, 2003 and awarded Sept.23 2008 : Wen; Ji-Rong (Beijing, CN), Yu; Shipeng (Beijing, CN), Cai; Deng (Beijing, CN), Ma; Wei-Ying (Beijing, CN) - and on SEO by the Sea
Method and system for calculating importance of a block within a display page – Microsoft : filed Apr 2004 and assigned April 2008 : Ma; Wei-Ying (Beijing, CN), Wen; Ji-Rong (Beijing, CN), Song; Ruihua (Beijing, CN), Liu; Haifeng (Toronto, CA) : (great coverage from Bill as always)
Method and system for identifying object information : Microsoft: filed April 2005 and assigned June 2008 : Wen; Ji-Rong (Beijing, CN), Ma; Wei-Ying (Beijing, CN), Nie; Zaiqing (Beijing, CN)
Retrieval of structured documents – Microsoft : filed Mar. 2006 and awarded sept 23 2008 : Wen; Ji-Rong (Beijing, CN), Cui; Hang (National University of Singapore, SG)
System and method for detecting a web page template – Yahoo : filed May 2007 and assigned Nov 2008 : Chakrabarti; Deepayan; (Mountain View, CA) ; Punera; Kunal; (Austin, TX) ; Ravikumar; Shanmugasundaram; (Berkeley, CA)
System and method for smoothing hierarchical data using isotonic regression : Yahoo : filed May 2007 and assigned Nov 2008 : Chakrabarti; Deepayan; (Mountain View, CA) ; Punera; Kunal; (Austin, TX) ; Ravikumar; Shanmugasundaram; (Berkeley, CA)
Techniques for approximating the visual layout of a web page and determining the portion of the page containing the significant content – Yahoo – filed August 2006 and assigned Feb 2008 - Kesari; Anandsudhakar
Document segmentation based on visual gaps – Google : filed Dec 30 2004 and assigned Sept 2008 : Daniel Egnor ( and coverage on SEO by the Sea)
Systems and methods for analyzing boilerplate – Google : filed March 2004 and assigned Feb 2008 : Stephen R. Lawrence;
You can also check out a few posts recently made by an industry mate that was thinking about this at the same time as I was writing this (as we compared notes some) see; Page segmentation and a possible future and 4 types of Page Segmentation explained