SEO Blog - Internet marketing news and views  

Page segmentation and link building

Written by David Harry   
Monday, 13 July 2009 08:28

Do you know where to find prime link real estate?

Brother Bill expertly covered (yet another) patent on page segmentation the other day; this time from Yahoo. It is certainly an area that SEOs might want to pay more attention to these days. Considering that each of the big three has dabbled in it (to varying degrees) over the last half decade, there is every reason to believe that something might be here (all that ‘where there is smoke’… yada yada).

Page segmentation has been widely developed for use in OCR applications to better understand text/image relationships. More recently, we’ve seen this area of expertise brought to the online world. One aspect that should be of particular interest to those in the SEO world is the link-related ramifications.

Let us consider that Google, the king of link-reliant search engines, would certainly find a huge benefit in this approach for valuing links (or even for indexation decisions).

An interesting example is the recent update to the Google Blog search that fixed the issue of Google indexing blog roll links (problematic when using the link: command). As soon as that news broke I thought, ‘How’d they do that I wonder?’ – Obviously page segmentation came to mind.

Benefits of page segmentation

Now, as fastidious little detectives, the first thing we need to look for is motive. Why would a search engine want to do this? A few of the potential benefits include:

  1. Crawling/indexing resources – once the template structure of a site is understood and prioritized, crawling certain layers less frequently, or discarding them altogether, would save on computing and storage costs.
  2. Topic Drifting – some pages will have more than one topic and search engines can struggle to properly index/categorize them.
  3. False positives – it can also better deal with problems that arise when a page contains links/citations to other topics not directly related to the content of the page. This approach would help improve the quality of the results in avoiding such false positives.
  4. Spam Detection – obviously understanding boilerplate and other inherently spammy elements would be an important part of the value and attraction.
  5. Paid links – it could also go a long way in fighting link spam and even paid links. By devaluing segments or passing them over altogether, it makes the practice less enticing and handicaps the activity.
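
To make the template idea a little more concrete, here is a minimal sketch of one naive way to spot boilerplate: treat any block of text that repeats across most pages of a site as template, and whatever is left as editorial content. This is not how any of the engines say they do it; the function names, the 0.8 threshold and the sample pages are all assumptions made up for illustration.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def find_template_blocks(pages, min_share=0.8):
    """Return the normalized blocks that appear on at least `min_share` of pages.

    `pages` maps a URL to the list of text blocks found on that page.
    The 0.8 default is an arbitrary, illustrative threshold.
    """
    counts = Counter()
    for blocks in pages.values():
        # Count each block once per page, however often it repeats there.
        counts.update({normalize(block) for block in blocks})
    threshold = min_share * len(pages)
    return {block for block, seen in counts.items() if seen >= threshold}


def editorial_blocks(blocks, template):
    """Keep only the blocks that were not flagged as site-wide template."""
    return [block for block in blocks if normalize(block) not in template]


# Hypothetical three-page site with a shared menu and sponsor panel.
site = {
    "/post-1": ["Home | About | Contact", "Our Sponsors: Acme", "Post one body"],
    "/post-2": ["Home | About | Contact", "Our Sponsors: Acme", "Post two body"],
    "/post-3": ["Home | About | Contact", "Our Sponsors: Acme", "Post three body"],
}
template = find_template_blocks(site)
print(editorial_blocks(site["/post-1"], template))
```

A real system would work on rendered layout or DOM structure rather than exact text matches, but even this crude version shows why a sitewide ‘Our Sponsors’ panel is so easy to single out.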

These are but a few ways that page segmentation can be a useful tool for search engines. There are more, but I just wanted to get you up to speed (for more see last post on it). What we’re concerned with this time is how it has the potential to change the world of links as we know it… m’kay?

Page segmentation link values

Links in the chain

What is most important is how these types of approaches would affect link valuations and, by association, link building campaigns. There are a few distinct areas link builders might want to consider that can affect the strength of a given link location:

  • Segment indexation – one of the benefits of page segmentation is deciding which parts of the page should or should not be indexed. For example, we could suppose that a side panel segment titled ‘Our Sponsors’ may be passed over altogether when the page is indexed. As you might imagine, having a link in a part of the page that the search engine doesn’t index could be problematic.
  • Topical diversity – imagine if a page has multiple topics and multiple link types going to that page. If the search engine can establish which links are to which segments, some segments may be of more value than others.
  • Segment location – if search engines can understand the (boiler plate) template of a page as well as common segments, it would be easy to refine the values. We could find that various areas of the page would be valued more than others in a kind of PageRank dampening.

It should be noted that most of the patents/papers on the topic have not gone far into the whole link valuation area, but it seems quite intuitive that this would be a valuable tool. This means there would likely be a premium on the editorial/content areas of a given web page.

One paper that does go into it is Microsoft’s Block Level Link Analysis (PDF), where they discuss a Block Level PageRank (BLPR) approach.
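
For a feel of what block-level link analysis could look like, here is a simplified sketch in that spirit. It is not the paper’s exact formulation: the importance scores per block, the function name and the sample data are hypothetical. The only change from plain PageRank is that a page splits its vote across blocks in proportion to each block’s importance, and only then across the links inside each block.

```python
from collections import defaultdict


def block_weighted_pagerank(pages, damping=0.85, iterations=50):
    """PageRank where each outgoing link is weighted by its block's importance.

    `pages` maps a page to a list of (block_importance, [target_pages]).
    The importance values are assumed inputs, e.g. from a segmentation step.
    """
    nodes = set(pages)
    for blocks in pages.values():
        for _, targets in blocks:
            nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Turn block importance into per-page transition probabilities.
    transitions = {}
    for page in nodes:
        weights = defaultdict(float)
        for importance, targets in pages.get(page, []):
            if importance <= 0 or not targets:
                continue
            share = importance / len(targets)  # split the block's weight evenly
            for target in targets:
                weights[target] += share
        total = sum(weights.values())
        transitions[page] = (
            {t: w / total for t, w in weights.items()} if total else {}
        )

    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        # Pages without outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in nodes if not transitions[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in nodes}
        for page, outs in transitions.items():
            for target, prob in outs.items():
                new_rank[target] += damping * rank[page] * prob
        rank = new_rank
    return rank


# Hypothetical page: a content block (weight 0.9) and a sidebar block (0.1).
example = {
    "hub": [(0.9, ["editorial-target"]), (0.1, ["sidebar-target"])],
}
for page, score in sorted(block_weighted_pagerank(example).items()):
    print(page, round(score, 4))
```

Under these assumed weights the editorial target ends up with more rank than the sidebar target, even though each gets exactly one link from the same page.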

Some possible areas where a search engine might dampen the value of links are:

  1. Header/footer links – easily definable in most templates.
  2. Navigation links – internal site links in segments
  3. Blog rolls – as mentioned before, this could already be in play.
  4. Advertiser/Supporter side bar links
  5. Forum signatures – recurring template elements
  6. Blog comments – sorry mister comment spammer :0(
  7. Social user pages – profile pages, etc.; common elements
  8. Directory/link pages – very easy boilerplate to identify

As you can imagine, this would put a premium on links within the content of a page and varied degrees of devaluation for other placements.
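
If you want to audit where your own links sit, a rough first pass is to label each link by the page region that contains it. The sketch below uses Python’s standard html.parser and leans on HTML5 sectioning tags (header, nav, aside, footer, main, article) as a stand-in for real segmentation. That is an assumption: plenty of templates use plain divs, and nothing here reflects how a search engine actually segments a page. The class name and sample markup are made up for the example.

```python
from html.parser import HTMLParser

# Tags treated as page segments; an assumption, real templates vary widely.
SEGMENT_TAGS = {"header", "footer", "nav", "aside", "main", "article"}


class LinkPlacementParser(HTMLParser):
    """Collect (href, segment) pairs, where segment is the nearest enclosing
    sectioning tag, or "unknown" if the link sits outside all of them."""

    def __init__(self):
        super().__init__()
        self.open_segments = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENT_TAGS:
            self.open_segments.append(tag)
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                segment = self.open_segments[-1] if self.open_segments else "unknown"
                self.links.append((href, segment))

    def handle_endtag(self, tag):
        if tag in SEGMENT_TAGS and tag in self.open_segments:
            # Pop back to the matching opener, tolerating sloppy nesting.
            while self.open_segments.pop() != tag:
                pass


def classify_links(html):
    """Return every link in `html` with the segment it was found in."""
    parser = LinkPlacementParser()
    parser.feed(html)
    return parser.links


sample = """
<header><a href="/home">Home</a></header>
<main><article><p>See <a href="https://example.com/study">this study</a>.</p></article></main>
<aside><h3>Our Sponsors</h3><a href="https://example.com/sponsor">Sponsor</a></aside>
<footer><a href="/privacy">Privacy</a></footer>
"""
for href, segment in classify_links(sample):
    print(segment, href)
```

Links that come back labeled article or main are the editorial placements; anything in header, nav, aside or footer is a candidate for the kind of dampening listed above.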

Time to link ahead

Ok, sure… this is all interesting, but is it real? There certainly does seem to be ‘something’ going on with link valuations the last few years, but we can’t be sure it has anything to do with segmentation. But I can’t shake the feeling that this is an important discipline to be aware of. Another interesting tidbit that I came across, which further denotes Google’s interest in some form of segmentation, was a comment in a post about HTMM:

“(…) shifting topics within a document, and in so doing, provides a topic segmentation within the document,” – Google Research 2007

And they (partially) sponsored the ICDAR2009 Page Segmentation Competition.

Anyway, there is at least enough logic and anecdotal evidence to at least do some testing and see what we find. It is likely a better route for SEOs these days than ‘real time social search’ if you ask me. This is certainly something that we’ll try and get some of the Dojo warriors working on (I’ll report back on that).

For now, it is definitely an area to watch for and consider when future proofing your link building efforts. And really, aren’t the editorial links what one is after anyway? And why is that? How does Google know which links are which? …. Maybe they’re further ahead on this than we know… time to break out some fashionable tin foil I’d say… because where there is smoke… well, you know the rest.

If ‘Block Level PageRank’ (BLPR) isn’t in your lexicon, maybe it’s time it was…



Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS) – SEO by the Sea
Google and Document Segmentation Indexing for Local Search – SEO by the Sea
Page segmentation; ignore at your own peril - fantomaster

Research Papers
VIPS a Vision-based Page Segmentation Algorithm – Microsoft : Nov. 1 2003 : Deng Cai, Shipeng Yu, Ji-Rong Wen  and Wei-Ying Ma
Block Based Web Search – Microsoft research : 2004 : Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma
Block-level Link Analysis : Microsoft : 2004 : Deng Cai, Xiaofei He, Ji-Rong Wen,  Wei-Ying Ma
Learning Block Importance Models for Web Pages - Microsoft : 2004 -
Page-level Template Detection via Isotonic Smoothing : Yahoo : May 2007 : Deepayan Chakrabarti, Ravi Kumar, Kunal Punera

Also somewhat related is the WebTables project, which is similar to Google Sets but looks for data or layout elements via HTML etc… (Hat tip to Bill on that angle): Uncovering the relational Web - Google : WebTables: Exploring the Power of Tables on the Web - Google


Vision-based document segmentation – Microsoft : filed; July 28, 2003 and awarded Sept.23 2008 : Wen; Ji-Rong (Beijing, CN), Yu; Shipeng (Beijing, CN), Cai; Deng (Beijing, CN), Ma; Wei-Ying (Beijing, CN) - and on SEO by the Sea
Method and system for calculating importance of a block within a display page – Microsoft : filed Apr 2004 and assigned April 2008 : Ma; Wei-Ying (Beijing, CN), Wen; Ji-Rong (Beijing, CN), Song; Ruihua (Beijing, CN), Liu; Haifeng (Toronto, CA) : (great coverage from Bill as always)
Method and system for identifying object information : Microsoft: filed April 2005 and assigned June 2008 : Wen; Ji-Rong (Beijing, CN), Ma; Wei-Ying (Beijing, CN), Nie; Zaiqing (Beijing, CN)
Retrieval of structured documents – Microsoft : filed Mar. 2006 and awarded sept 23 2008 : Wen; Ji-Rong (Beijing, CN), Cui; Hang (National University of Singapore, SG)

System and method for detecting a web page template – Yahoo : filed May 2007 and assigned Nov 2008 : Chakrabarti; Deepayan; (Mountain View, CA) ; Punera; Kunal; (Austin, TX) ; Ravikumar; Shanmugasundaram; (Berkeley, CA)
System and method for smoothing hierarchical data using isotonic regression : Yahoo : filed May 2007 and assigned Nov 2008 : Chakrabarti; Deepayan; (Mountain View, CA) ; Punera; Kunal; (Austin, TX) ; Ravikumar; Shanmugasundaram; (Berkeley, CA)
Techniques for approximating the visual layout of a web page and determining the portion of the page containing the significant content – Yahoo – filed August 2006 and assigned Feb 2008 - Kesari; Anandsudhakar

Document segmentation based on visual gaps – Google : filed Dec 30 2004 and assigned Sept 2008 : Daniel Egnor
Systems and methods for analyzing boilerplate – Google : filed March 2004 and assigned Feb 2008 : Stephen R. Lawrence;

