Do you know where to find prime link real estate?
Brother Bill expertly covered (yet another) patent on page segmentation the other day, this time from Yahoo. It is certainly an area that SEOs might want to pay more attention to these days. Considering that each of the big three has dabbled in it (to varying degrees) over the last half decade, there is every reason to believe that something might be here (all that "where there is smoke..." yada yada).
Page segmentation was developed largely for use in OCR applications, to better understand text/image relationships. More recently, we've seen this area of expertise brought to the online world. One aspect that should be of particular interest to those in the SEO world is the link-related ramifications.
Consider that Google, the king of link-reliant search engines, would certainly find a huge benefit in this approach for valuing links (or even for indexation decisions).
An interesting example is the recent update to Google Blog Search that fixed the issue of Google indexing blogroll links (problematic when using the link: command). As soon as that news broke I thought, "How'd they do that, I wonder?" Obviously page segmentation came to mind.
Benefits of page segmentation
Now, as fastidious little detectives, the first thing we need to look for is motive. Why would a search engine want to do this? A few of the potential benefits include:
- Crawling/indexing resources: once the template structure of a site is understood and its layers prioritized, crawling certain layers less often, or discarding them, would save on computing and storage costs.
- Topic drifting: some pages have more than one topic, and search engines can struggle to properly index/categorize them.
- False positives: it can also better deal with problems that arise when a page contains links/citations to other topics not directly related to the content of the page. Avoiding such false positives would help improve the quality of the results.
- Spam detection: obviously, understanding boilerplate and other inherently spammy elements would be an important part of the value and attraction.
- Paid links: it could also go a long way in fighting link spam and even paid links. Devaluing certain segments, or passing them over altogether, makes the practice less enticing and handicaps the activity.
These are but a few ways that page segmentation can be a useful tool for search engines. There are more, but I just wanted to get you up to speed (for more, see the last post on it). What we're concerned with this time is how it has the potential to change the world of links as we know it... mkay?
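To make the template idea a little more concrete, here is a minimal sketch of one way boilerplate could be spotted: blocks that repeat across most pages of a site are probably template. This is purely an illustration, not how any search engine is known to do it. The `find_boilerplate` and `strip_boilerplate` helpers, the 0.8 threshold and the sample pages are all assumptions of mine.

```python
from collections import Counter


def find_boilerplate(pages, threshold=0.8):
    """Return the blocks that appear on at least `threshold` of the pages.

    `pages` is a list of pages, each page being a list of text blocks.
    The 0.8 threshold is an arbitrary assumption for illustration.
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block once per page
    cutoff = threshold * len(pages)
    return {block for block, seen in counts.items() if seen >= cutoff}


def strip_boilerplate(blocks, boilerplate):
    """Keep only the blocks that are not part of the detected template."""
    return [block for block in blocks if block not in boilerplate]


# Hypothetical site: three pages sharing a nav bar and a sponsor panel.
pages = [
    ["Home | About | Contact", "Post about link building", "Our Sponsors: A, B"],
    ["Home | About | Contact", "Post about page segmentation", "Our Sponsors: A, B"],
    ["Home | About | Contact", "Post about patents", "Our Sponsors: A, B"],
]

boilerplate = find_boilerplate(pages)
unique = [strip_boilerplate(page, boilerplate) for page in pages]
```

With the sample data, the nav bar and the sponsor panel end up in `boilerplate` and each page keeps only its own post text, which is the part a crawler would presumably care most about.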

Links in the chain
What is most important is how these types of approaches would affect link valuations and, by association, link building campaigns. There are a few distinct areas link builders might want to consider that can affect the strength of a given link location:
- Segment indexation: one of the benefits of page segmentation is deciding which parts of the page should or should not be indexed. For example, we could suppose that a side panel segment titled "Our Sponsors" may be passed over altogether when the page is indexed. As you might imagine, having a link in a part of the page that the search engine doesn't index could be problematic.
- Topical diversity: imagine a page that has multiple topics and multiple types of links going to that page. If the search engine can establish which links point to which segments, some segments may be of more value than others.
- Segment location: if search engines can understand the (boilerplate) template of a page as well as its common segments, it would be easy to refine the values. We could find that various areas of the page are valued more than others, in a kind of PageRank dampening.
It should be noted that most of the patents/papers on the topic have not gone far into the whole link valuation area, but it seems quite intuitive that this would be a valuable tool. This means there would likely be a premium on the editorial/content areas of a given web page.
One paper that does go into it is Microsoft's Block-level Link Analysis (PDF), where they discuss a Block Level PageRank (BLPR) approach.
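For a feel of what a block-aware PageRank might look like, here is a heavily simplified sketch. To be clear, this is not the formulation from the Microsoft paper; it is plain PageRank where each outgoing link is weighted by an importance score for the block it sits in. The function name, the weights (1.0 for content, 0.1 for footer) and the tiny graph are assumptions for illustration only.

```python
def block_weighted_pagerank(links, damping=0.85, iterations=50):
    """PageRank where each link carries the weight of its containing block.

    `links` maps a source page to a list of (target, block_weight) pairs.
    A simplified illustration, not the BLPR algorithm from the paper.
    """
    pages = set(links)
    for outlinks in links.values():
        pages.update(target for target, _ in outlinks)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for source in pages:
            outlinks = links.get(source, [])
            total = sum(weight for _, weight in outlinks)
            if total == 0:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[source] / n
                for page in pages:
                    new_rank[page] += share
                continue
            for target, weight in outlinks:
                # A link in a low-value block passes on a smaller share.
                new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


# Hypothetical graph: "hub" links to "a" from its content, to "b" from its footer.
ranks = block_weighted_pagerank({"hub": [("a", 1.0), ("b", 0.1)]})
```

In this toy graph, "a" ends up with a higher score than "b" even though both have exactly one inbound link from the same page. The only difference is where on the page the link lives.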
Some possible areas where a search engine might dampen the value of links are:
- Header/footer links: easily definable in most templates.
- Navigation links: internal site links in segments.
- Blogrolls: as mentioned before, this could already be in play.
- Advertiser/supporter sidebar links.
- Forum signatures: recurring template elements.
- Blog comments: sorry, mister comment spammer :0(
- Social user pages: profile pages etc., with common elements.
- Directory/link pages: very easy boilerplate to identify.
As you can imagine, this would put a premium on links within the content of a page and varied degrees of devaluation for other placements.
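As a rough sketch of how links could be bucketed by placement, the snippet below walks a page with Python's standard `html.parser` and labels each link by the nearest enclosing segment. The segment names, the keyword hints and especially the weights are made-up assumptions; nobody outside the search engines knows what real dampening values (if any) look like.

```python
from html.parser import HTMLParser

# Hypothetical dampening factors, invented for illustration only.
SEGMENT_WEIGHTS = {"content": 1.0, "nav": 0.3, "header": 0.2, "footer": 0.2,
                   "sidebar": 0.2, "blogroll": 0.1, "comments": 0.1}
TAG_SEGMENTS = {"nav": "nav", "header": "header", "footer": "footer",
                "aside": "sidebar"}
KEYWORD_SEGMENTS = {"blogroll": "blogroll", "comment": "comments",
                    "sponsor": "sidebar", "sidebar": "sidebar"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkSegmenter(HTMLParser):
    """Collect (href, segment, weight) for every link on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, segment label or None)
        self.links = []

    def _label(self, tag, attrs):
        # Guess a segment from the tag name, then from id/class keywords.
        if tag in TAG_SEGMENTS:
            return TAG_SEGMENTS[tag]
        hints = " ".join(v for k, v in attrs if k in ("id", "class") and v).lower()
        for keyword, segment in KEYWORD_SEGMENTS.items():
            if keyword in hints:
                return segment
        return None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Nearest labelled ancestor wins; default to main content.
                segment = next((label for _, label in reversed(self.stack) if label),
                               "content")
                self.links.append((href, segment, SEGMENT_WEIGHTS[segment]))
        if tag not in VOID_TAGS:
            self.stack.append((tag, self._label(tag, attrs)))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


sample_html = """
<nav><a href="/about">About</a></nav>
<div class="post"><p>See <a href="http://example.com/paper">this paper</a>.</p></div>
<div id="blogroll"><a href="http://example.org/">Friend</a></div>
<footer><a href="/privacy">Privacy</a></footer>
"""
segmenter = LinkSegmenter()
segmenter.feed(sample_html)
```

After feeding the sample, `segmenter.links` holds one tuple per link, with the in-content link labelled "content" at full weight and the nav, blogroll and footer links marked down. A real system would likely lean on visual layout and cross-page repetition rather than tag names alone, so treat this as a thinking aid and nothing more.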
Time to link ahead
Ok, sure... this is all interesting, but is it real? There certainly does seem to be something going on with link valuations the last few years, but we can't be sure it has anything to do with segmentation. Still, I can't shake the feeling that this is an important discipline to be aware of. Another interesting tidbit that I came across, which further shows Google's interest in some form of segmentation, was a comment in a post about HTMM:
"(...) shifting topics within a document, and in so doing, provides a topic segmentation within the document" - Google Research, 2007
And they (partially) sponsored the ICDAR2009 Page Segmentation Competition.

Anyway, there is at least enough logic and anecdotal evidence to do some testing and see what we find. It is likely a better route for SEOs these days than real-time social search, if you ask me. This is certainly something that we'll try to get some of the Dojo warriors working on (I'll report back on that).
For now, it is definitely an area to watch and consider when future-proofing your link building efforts. And really, aren't the editorial links what one is after anyway? And why is that? How does Google know which links are which? Maybe they're further ahead on this than we know... time to break out some fashionable tin foil, I'd say... because where there is smoke... well, you know the rest.
If Block Level PageRank (BLPR) isn't in your lexicon, maybe it's time it was.
Resources:
Posts:
Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS) - SEO by the Sea
Google and Document Segmentation Indexing for Local Search - SEO by the Sea
Page segmentation; ignore at your own peril - fantomaster
Research Papers
VIPS: a Vision-based Page Segmentation Algorithm : Microsoft : Nov. 1 2003 : Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma
Block Based Web Search : Microsoft Research : 2004 : Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma
Block-level Link Analysis : Microsoft : 2004 : Deng Cai, Xiaofei He, Ji-Rong Wen, Wei-Ying Ma
Learning Block Importance Models for Web Pages : Microsoft : 2004
Page-level Template Detection via Isotonic Smoothing : Yahoo : May 2007 : Deepayan Chakrabarti, Ravi Kumar, Kunal Punera
Also somewhat related is the WebTables project, which is similar to Google Sets but looks for data or layout elements via HTML etc. (hat tip to Bill on that angle):
Uncovering the Relational Web : Google
WebTables: Exploring the Power of Tables on the Web : Google
Patents
Microsoft
Vision-based document segmentation : Microsoft : filed July 28, 2003 and awarded Sept. 23, 2008 : Wen; Ji-Rong (Beijing, CN), Yu; Shipeng (Beijing, CN), Cai; Deng (Beijing, CN), Ma; Wei-Ying (Beijing, CN) : (also covered on SEO by the Sea)
Method and system for calculating importance of a block within a display page : Microsoft : filed Apr. 2004 and assigned April 2008 : Ma; Wei-Ying (Beijing, CN), Wen; Ji-Rong (Beijing, CN), Song; Ruihua (Beijing, CN), Liu; Haifeng (Toronto, CA) : (great coverage from Bill as always)
Method and system for identifying object information : Microsoft : filed April 2005 and assigned June 2008 : Wen; Ji-Rong (Beijing, CN), Ma; Wei-Ying (Beijing, CN), Nie; Zaiqing (Beijing, CN)
Retrieval of structured documents : Microsoft : filed Mar. 2006 and awarded Sept. 23, 2008 : Wen; Ji-Rong (Beijing, CN), Cui; Hang (National University of Singapore, SG)
Yahoo
System and method for detecting a web page template : Yahoo : filed May 2007 and assigned Nov. 2008 : Chakrabarti; Deepayan (Mountain View, CA), Punera; Kunal (Austin, TX), Ravikumar; Shanmugasundaram (Berkeley, CA)
System and method for smoothing hierarchical data using isotonic regression : Yahoo : filed May 2007 and assigned Nov. 2008 : Chakrabarti; Deepayan (Mountain View, CA), Punera; Kunal (Austin, TX), Ravikumar; Shanmugasundaram (Berkeley, CA)
Techniques for approximating the visual layout of a web page and determining the portion of the page containing the significant content : Yahoo : filed August 2006 and assigned Feb. 2008 : Kesari; Anandsudhakar
Google
Document segmentation based on visual gaps : Google : filed Dec. 30, 2004 and assigned Sept. 2008 : Daniel Egnor
Systems and methods for analyzing boilerplate : Google : filed March 2004 and assigned Feb. 2008 : Stephen R. Lawrence
Comments
1. Header/footer links
2. Navigation links
3. Blog rolls
4. Advertiser/Supporter side bar links
5. Forum signatures
6. Blog comments
I believe the nav links to inner pages will always be important to the SEs for internal linking; however, I think that external links will probably get a bigger bonus from being in a coherent paragraph, which Google will monitor for LSI etc. to see how relevant the text is.
@Cewek - well, those segments would likely be valued differently. It's hard to say; I am simply putting out some likely scenarios, really. I'd imagine the external links are the ones that would be treated differently. Internal... it's hard to say.
@Bob - Microsoft is the one that's done the most work in this area. To me, though, Google would certainly have an interest in such approaches, as links are their lifeblood. Combine that with potential improvements in relevance (multi-topic pages) and even resource management (indexation) and it does seem like a valuable tool. Furthermore, considering its history in OCR, Google would also make use of it in their Book Search efforts of late...
There is a Javascript application that I use that makes blocks of onpage text more readable.
Here: http://readable-app.appspot.com/
It finds the main section of text on the page and enlarges it.
In 9 out of 10 pages it successfully takes the text without the menu, footer and other unnecessary parts.
If these guys can do it with JQuery then I'm sure Google can with their search bot
With the recent change to nofollow, I think it's a possibility that it is a dying link attribute, and so Google has to, of course, evolve beyond it. Makes sense that devaluing blog comments would be one of the ways in which they aim to evolve.
@Noah - I'd also imagine that comment sections of blogs would be an area of interest for sure... as would forum signatures (not that forums are as relevant these days).
There is no clear evidence of how the most important block of a page will be identified, or which factors will decide it, unless Google starts showing specific sections of your website in the SERPs. You can still tweak your pages, though: identify the unimportant blocks, such as advertising blocks and the areas near them, then try changing the blocks of content on your page and analyze the results.
Also, those who choose to exchange/buy links should really more carefully consider where their links will be.
Semantically separated content could also be nested inside other semantic content. This would allow a page index-ranking algorithm to use pre-separated semantic content. Since the page-ranking algorithm was designed to do this on abstracts that were already separated, it seems logical that a publisher should be able to indicate how data is related and, consequently, how it is best found. If this tag were found on a page, it would then take precedence over the segmentation algorithm, unless the algorithm's result meets a better-fit criterion.
This idea clearly fits within the game theoretic model of allowing a self-determined social publisher/searcher vs. a hierarchically imposed model.