SEO Blog - Internet marketing news and views  

Page segmentation and link building

Written by David Harry   
Monday, 13 July 2009 08:28

Do you know where to find prime link real estate?

Brother Bill expertly covered (yet another) patent on page segmentation the other day; this time from Yahoo. It is certainly an area that SEOs might want to be paying more attention to these days. Considering that each of the big three has dabbled in it (to varying degrees) over the last half-decade, there is every reason to believe that something might be here (all that ‘where there is smoke’… yada yada).

Page segmentation was largely developed for use in OCR applications to better understand text/image relationships. More recently, we’ve seen this area of expertise brought to the online world. One aspect that should be of particular interest to those in the SEO world is the link-related ramifications.

Let us consider that Google, the king of link-reliant search engines, would certainly find a huge benefit in this approach for valuing links (or even indexation decisions).

An interesting example is the recent update to Google Blog Search that fixed the issue of Google indexing blogroll links (problematic when using the link: command). As soon as that news broke I thought, ‘How’d they do that, I wonder?’ – obviously page segmentation came to mind.


Benefits of page segmentation

Now, as fastidious little detectives, the first thing we need to look for is motive. Why would a search engine want to do this? A few of the potential benefits include:

  1. Crawling/indexing resources – once the template structure of a site is understood, crawling certain layers less frequently, or discarding them altogether, would save on computing and storage costs.
  2. Topic Drifting – some pages will have more than one topic and search engines can struggle to properly index/categorize them.
  3. False positives – it can also better deal with problems that arise when a page contains links/citations to topics not directly related to the content of the page. This approach would help improve the quality of the results by avoiding such false positives.
  4. Spam Detection – obviously understanding boilerplate and other inherently spammy elements would be an important part of the value and attraction.
  5. Paid links – it could also go a long way in fighting link spam and even paid links. By devaluing segments, or passing them over altogether, it makes the practice less enticing and handicaps the activity.

These are but a few of the ways page segmentation can be a useful tool for search engines. There are more, but I just wanted to get you up to speed (for more, see my last post on it). What we’re concerned with this time is how it has the potential to change the world of links as we know it… m’kay?
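Several of the benefits above hinge on the engine being able to label page regions in the first place. As a rough, hypothetical sketch of how a crawler might bucket segments by their markup — the labels and keyword lists here are my own invention for illustration, not anything taken from the patents:

```python
# Hypothetical rule-based segment labeler: maps a DOM node's id/class
# attributes to a coarse segment type. The keyword lists are illustrative
# guesses, not taken from any search engine patent.
SEGMENT_RULES = {
    "navigation": ("nav", "menu", "breadcrumb"),
    "boilerplate": ("header", "footer", "copyright"),
    "sidebar": ("sidebar", "blogroll", "sponsors", "widget"),
    "comments": ("comments", "respond"),
    "content": ("content", "article", "post", "entry"),
}

def label_segment(attr_text: str) -> str:
    """Return a coarse segment label for a node's id/class string."""
    attr_text = attr_text.lower()
    for label, keywords in SEGMENT_RULES.items():
        if any(k in attr_text for k in keywords):
            return label
    return "unknown"

print(label_segment("div#sidebar .blogroll"))   # sidebar
print(label_segment("div#content .post-body"))  # content
```

The real vision-based approaches (VIPS and friends) use rendered layout cues rather than markup keywords, but even a crude classifier like this shows how quickly a template’s segments fall out of the structure.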

Page segmentation link values

Links in the chain

What is most important is how these types of approaches would affect link valuations and, by association, link building campaigns. There are a few distinct areas link builders might want to consider that can affect the strength of a given link location:

  • Segment indexation – one of the benefits of page segmentation is deciding which parts of the page should or should not be indexed. For example, we could suppose that a side-panel segment titled ‘Our Sponsors’ may be passed over altogether when the page is indexed. As you might imagine, having a link in a part of the page that the search engine doesn’t index could be problematic.
  • Topical diversity – imagine if a page has multiple topics and multiple link types going to that page. If the search engine can establish which links are to which segments, some segments may be of more value than others.
  • Segment location – if search engines can understand the (boilerplate) template of a page as well as common segments, it would be easy to refine the values. We could find that various areas of the page would be valued more than others in a kind of PageRank dampening.

It should be noted that most of the patents/papers on the topic haven’t gone far into the whole link valuation area, but it seems quite intuitive that this would be a valuable tool. This means there would likely be a premium on the editorial/content areas of a given web page.

One paper that does go into it is Microsoft’s Block-Level Link Analysis (PDF), where they discuss a Block-Level PageRank (BLPR) approach.
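To make the block-level idea concrete, here is a toy sketch: an ordinary PageRank power iteration where the nodes are page blocks rather than whole pages. This is only an illustration of the concept, not the paper’s actual formulation (which, as I recall, derives the block graph from page-to-block and block-to-page relationships), and the four-block graph below is invented:

```python
# Toy block-level PageRank: standard power iteration, but over page
# *blocks* (e.g. "page A's content block") instead of whole pages.
def block_pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each block to the list of blocks it links to."""
    blocks = list(links)
    n = len(blocks)
    rank = {b: 1.0 / n for b in blocks}
    for _ in range(iters):
        new = {b: (1 - damping) / n for b in blocks}
        for b, outs in links.items():
            if not outs:  # dangling block: spread its rank evenly
                for t in blocks:
                    new[t] += damping * rank[b] / n
            else:
                for t in outs:
                    new[t] += damping * rank[b] / len(outs)
        rank = new
    return rank

# An invented mini-web of three pages broken into blocks.
graph = {
    "A:content": ["B:content"],
    "A:sidebar": ["B:content", "C:footer"],
    "B:content": ["A:content"],
    "C:footer": [],
}
ranks = block_pagerank(graph)
print(max(ranks, key=ranks.get))  # B:content – the content blocks accumulate the most rank
```

The point of the toy: once links are attributed to blocks, a sidebar block and a content block on the same page can end up passing very different amounts of value.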

Some possible areas where a search engine might dampen the value of links are:

  1. Header/footer links – easily definable in most templates
  2. Navigation links – internal site links in segments
  3. Blog rolls – as mentioned before, this could already be in play
  4. Advertiser/supporter sidebar links
  5. Forum signatures – recurring template elements
  6. Blog comments – sorry, mister comment spammer :0(
  7. Social user pages – profile pages etc., common elements
  8. Directory/link pages – very easy boilerplate to identify

As you can imagine, this would put a premium on links within the content of a page, with varying degrees of devaluation for other placements.
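One simple way to picture those ‘varying degrees of devaluation’ is a per-segment multiplier applied to whatever value a link would otherwise pass. The weights below are pure guesses for illustration; nothing in the patents or papers publishes actual numbers:

```python
# Hypothetical per-segment dampening weights; the numbers are invented
# to illustrate the idea, not taken from any patent or paper.
SEGMENT_WEIGHTS = {
    "content": 1.0,     # editorial links keep full value
    "navigation": 0.3,
    "sidebar": 0.1,     # blogrolls, sponsor lists
    "comments": 0.05,
    "footer": 0.0,      # passed over entirely
}

def damped_link_value(base_value: float, segment: str) -> float:
    """Scale a link's raw value by the weight of the segment it sits in."""
    return base_value * SEGMENT_WEIGHTS.get(segment, 0.5)  # 0.5 for unknown segments

print(damped_link_value(10.0, "content"))  # 10.0
print(damped_link_value(10.0, "sidebar"))  # 1.0
```

Even a blunt scheme like this would be enough to make a sponsored-sidebar link worth a fraction of the same link placed in body copy.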


Time to link ahead

Ok, sure… this is all interesting, but is it real? There certainly does seem to be ‘something’ going on with link valuations these last few years, but we can’t be sure it has anything to do with segmentation. Still, I can’t shake the feeling that this is an important discipline to be aware of. Another interesting tidbit that I came across, which further suggests Google’s interest in some form of segmentation, was a comment in a post about HTMM:

“(…) shifting topics within a document, and in so doing, provides a topic segmentation within the document,” – Google Research 2007

And they (partially) sponsored the ICDAR 2009 Page Segmentation Competition.


Anyway, there is at least enough logic and anecdotal evidence to at least do some testing and see what we find. It is likely a better route for SEOs these days than ‘real-time social search’, if you ask me. This is certainly something we’ll try to get some of the Dojo warriors working on (I’ll report back on that).

For now, it is definitely an area to watch and to consider when future-proofing your link building efforts. And really, aren’t editorial links what one is after anyway? And why is that? How does Google know which links are which? … Maybe they’re further ahead on this than we know… time to break out some fashionable tin foil, I’d say… because where there is smoke… well, you know the rest.

If ‘Block-Level PageRank’ (BLPR) isn’t in your lexicon, maybe it’s time it was…

 

Resources

Posts
Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS) – SEO by the Sea
Google and Document Segmentation Indexing for Local Search – SEO by the Sea
Page segmentation; ignore at your own peril – fantomaster

Research Papers
VIPS: a Vision-based Page Segmentation Algorithm – Microsoft : Nov. 1, 2003 : Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma
Block-Based Web Search – Microsoft Research : 2004 : Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma
Block-Level Link Analysis – Microsoft : 2004 : Deng Cai, Xiaofei He, Ji-Rong Wen, Wei-Ying Ma
Learning Block Importance Models for Web Pages – Microsoft : 2004
Page-Level Template Detection via Isotonic Smoothing – Yahoo : May 2007 : Deepayan Chakrabarti, Ravi Kumar, Kunal Punera

Also somewhat related is the WebTables project, which is similar to Google Sets but looks for data or layout elements via HTML etc. (hat tip to Bill on that angle): Uncovering the Relational Web – Google : WebTables: Exploring the Power of Tables on the Web – Google

Patents

Microsoft
Vision-based document segmentation – Microsoft : filed July 28, 2003, awarded Sept. 23, 2008 : Ji-Rong Wen, Shipeng Yu, Deng Cai, Wei-Ying Ma – and on SEO by the Sea
Method and system for calculating importance of a block within a display page – Microsoft : filed April 2004, assigned April 2008 : Wei-Ying Ma, Ji-Rong Wen, Ruihua Song, Haifeng Liu (great coverage from Bill, as always)
Method and system for identifying object information – Microsoft : filed April 2005, assigned June 2008 : Ji-Rong Wen, Wei-Ying Ma, Zaiqing Nie
Retrieval of structured documents – Microsoft : filed March 2006, awarded Sept. 23, 2008 : Ji-Rong Wen, Hang Cui

Yahoo
System and method for detecting a web page template – Yahoo : filed May 2007, assigned Nov. 2008 : Deepayan Chakrabarti, Kunal Punera, Shanmugasundaram Ravikumar
System and method for smoothing hierarchical data using isotonic regression – Yahoo : filed May 2007, assigned Nov. 2008 : Deepayan Chakrabarti, Kunal Punera, Shanmugasundaram Ravikumar
Techniques for approximating the visual layout of a web page and determining the portion of the page containing the significant content – Yahoo : filed August 2006, assigned Feb. 2008 : Anandsudhakar Kesari

Google
Document segmentation based on visual gaps – Google : filed Dec. 30, 2004, assigned Sept. 2008 : Daniel Egnor
Systems and methods for analyzing boilerplate – Google : filed March 2004, assigned Feb. 2008 : Stephen R. Lawrence

 

Comments  

 
0 # Michael Martinez 2009-07-13 15:40
I think the link spamming -- er, SEO community is about even with the times on this issue. Most people are now asking for links embedded in main body text.
 
 
0 # Cassiano Travareli 2009-07-14 07:40
Very interesting research and viewpoint! I think it will reduce a lot of the spam inside blogs!
 
 
0 # cewek gadis 2009-07-14 12:47
you say this is not valuable link :
1. Header/footer links
2. Navigation links
3. Blog rolls
4. Advertiser/Supporter side bar links
5. Forum signatures
6. Blog comments
 
 
+1 # Dudibob 2009-07-15 02:46
Makes perfect sense to me; there have been some suggestions of this in the past and I think Google will be, or is, going down this route.

I believe the nav links to inner pages will always be important to the SEs for internal linking; however, I think that external links will probably get a bigger bonus by being in a coherent paragraph, which Google will monitor for LSI, etc., to see how relevant the text is. :side:
 
 
0 # Dave 2009-07-15 08:53
@Michael - I'd have to agree that we're all aware of the power of (ahem) editorial links in content. I had written about this earlier in the year and the recent patent award simply had me wanting to review the approaches again in case some folks hadn't gotten the message.

@Cewek - well, those segments would likely be valued differently. It's hard to say; I am simply putting out some likely scenarios, really. I'd imagine the external links are the ones that would be treated differently; internal... it's hard to say.

@Bob - Microsoft is the one that's done the most work in this area. To me though, Google would certainly have an interest in such approaches, as links are their life blood. Combine that with potential improvements in relevance (multi-topic pages) and even resource management (indexation) and it does seem like a valuable tool. Furthermore, considering its history in OCR, Google could also make use of it in their Book Search efforts of late...
 
 
+2 # Andy 2009-07-17 12:55
Yep, makes sense to me.

There is a JavaScript application that I use that makes blocks of on-page text more readable.

Here: http://readable-app.appspot.com/

It finds the main section of text on the page and enlarges it.

In 9 out of 10 pages it successfully takes the text without the menu, footer and other unnecessary parts.

If these guys can do it with jQuery then I'm sure Google can with their search bot.
 
 
0 # Noah 2009-07-17 14:47
Veeeery interesting. It wouldn't surprise me to see comment links get hit with the sucky link stick.

With the recent change to nofollow, I think it's possible that it is a dying link attribute, and so Google has to, of course, evolve beyond it. It makes sense that devaluing blog comments would be one of the ways in which they aim to evolve.
 
 
0 # Dave 2009-07-18 09:16
@Andy, that's a very cool tool bro... thanks for dropping in with that. I'd certainly say they 'can' do it, it's always been more about what benefits there are to add another layer to the process. I'd say it's quite likely that some kind of related approach is out there... not that they'd tell us...

@Noah - I'd also imagine that comment sections of blogs would be an area of interest for sure... as would forum signatures (not that forums are as relevant these days).
 
 
+2 # seoz87 2009-07-21 05:15
Recently I researched that area a lot. I read all the patents and articles related to page segmentation. Unfortunately I can't share the actual research with you guys, but one thing I am telling you is that Google has started structuring websites. It now pays special attention to some specific areas of a website. Rich snippets, author names, post dates, and posts are signs clearly indicating that at the end of this year we will witness some big changes in the SERPs. Check the results for cnet.com or linkedin.com in Google; it's emphasizing some special sections of these big websites. This will soon be implemented for all websites.
Although there is no clear evidence of how and which area will be identified as the most important block of a website, or what factors will decide the important area, unless Google starts showing some specific section of your website in the SERPs, you can still tweak your pages and identify the unimportant blocks, like the advertising block and the area near it. Try changing the blocks of content on your webpage and analyze the results.
 
 
+1 # William Atkin 2009-08-27 17:43
Very good post idea, many people do not even think about link position and how it affects quality scores.
 
 
0 # S Emerson 2009-08-31 06:34
This is certainly something people should think about when building their website/blog theme.

Also, those who choose to exchange/buy links should really more carefully consider where their links will be.
 
 
0 # Michiel Van Kets 2009-09-28 20:15
ok, what about this one - keep messing around - at least it will look natural ;-)
 
 
+2 # Michael Baun 2009-09-29 23:52
I think there should be an HTML markup tag that groups semantic content together.
Semantically separated content could also be nested inside other semantic content. This would allow a page index-ranking algorithm to use pre-separated semantic content. Since the page-ranking algorithm was designed to do this on abstracts that were already separated, it seems logical that a publisher should be able to indicate how data is related and consequently is best found. If this tag were found on a page, the tag would then take precedence over the segmentation algorithm unless the algorithm meets a better-fit criterion.
This idea clearly fits within the game-theoretic model of allowing a self-determined social publisher/searcher vs. a hierarchically imposed model.
 
