SEO Blog - Internet marketing news and views  

Detecting and dealing with duplicate content

Written by David Harry   
Monday, 11 January 2010 15:08

(the following is a guest post from Mark Thompson)

Duplicate content and plagiarism are an easy way for a website to get penalized by the search engines, or (ED: in extreme cases) banned outright. The search engines have become much better at checking for duplicate content. If you are interested, here is what Google has to say about duplicate content. For website owners, bloggers, and writers, there are a number of tools you can use to identify duplicate content.

This post discusses tools used to identify plagiarism, how to deal with duplicate content, and what limits apply to having duplicate content on your site.

Duplicate Content Detection

 

Tools To Help Identify Duplicate Content

One of the best and fastest ways to check for duplicate content is to take a snippet of the content and enter it, in quotes, in a Google search. For example, here is an article taken from CNN. I have taken the first sentence and put it in Google.
"Debbie Burk books a four-star hotel in Chicago, hoping to avoid a particular property, which is rated a half star lower."

Google does a great job of finding other sources that have the exact same content on their site. In this case, since CNN is a major news outlet, a lot of other sites pick up its stories and syndicate them. However, if you take a snippet from one of your own product or service pages, the search should return no matches unless someone has copied it. If no results show up when you use quotes, try taking the quotes out. Sometimes you will find results that look extremely similar: usually someone has modified the content slightly in the hope that the search engines will not pick it up as duplicate content.
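If you check snippets regularly, building the search link by hand gets tedious. Here is a minimal sketch, assuming the familiar public Google search URL form with a `q` parameter; the function name `exact_phrase_search_url` is made up for illustration. It only builds a link for you to open in a browser and does not fetch or parse any results.

```python
from urllib.parse import urlencode


def exact_phrase_search_url(snippet: str, exact: bool = True) -> str:
    """Build a Google search URL for a snippet of page text (hypothetical helper)."""
    # Collapse line breaks and repeated spaces picked up when copying from a page.
    phrase = " ".join(snippet.split())
    # Remove double quotes inside the snippet so they cannot end the phrase early.
    phrase = phrase.replace('"', "")
    # Wrapping the phrase in quotes asks for an exact match; without them the
    # search is looser and can surface lightly reworded copies.
    query = f'"{phrase}"' if exact else phrase
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    sentence = "Debbie Burk books a four-star hotel in Chicago"
    print(exact_phrase_search_url(sentence))
    print(exact_phrase_search_url(sentence, exact=False))
```

Calling it once with `exact=True` and once with `exact=False` mirrors the two manual steps above: the quoted search first, then the unquoted one if nothing turns up.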

 

Duplicate Content Detection Tools

Here are some of the best duplicate content tools on the web. They will not only check for other copies of your content, but will also identify internal issues you may have with your site.

1. Copyscape

Copyscape is probably the most popular duplicate content tool out there. This free service detects copies of your web pages across the web. The free version returns no more than 10 results for any search, and it limits the number of searches you can perform. However, they do offer two other premium services for users who need more in-depth duplicate content research. Copyscape Premium offers a more comprehensive plagiarism search along with features like batch search (up to 10,000 pages), copy-and-paste searches, management of plagiarism cases, exclusion of certain sites, comparison of two URLs, and automatic checks using the API.

2. Plagiarism Checker

This tool lets you enter a keyword, phrase, or sentence into the search field and returns Google results for any other sites that contain the same words. One cool feature is the ability to set up a Google Alert, so you are notified if someone copies your content.

3. Plagiarism Detect

Plagiarism Detect offers a free and a premium version, similar to Copyscape. The free version lets you upload text and Word doc files for analysis and returns any detections found. The premium version has many other features, including side-by-side comparison of two documents, a more advanced algorithm, and a Microsoft Word plugin, so you can check for plagiarism directly from Word.

4. Plagium

This plagiarism tool displays a visually pleasing diagram of the other websites it detects copying your content, along with a calendar showing when each copy was discovered. You can search the entire web or strictly news sources. You can also refine by language and only check for duplicates in a specific language.

5. Virante

Virante offers a different type of duplicate content tool, one that focuses on internal duplicate content issues. It checks for www vs. non-www redirect issues, similar pages on your site, index.html vs. / issues, whether missing pages properly return 404 errors, and any PR (PageRank) differences between the www and non-www versions.
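To make the www vs. non-www and index.html vs. / problems concrete, here is a rough sketch of how you might spot them in a list of your own URLs. This is not how Virante's tool works internally (I have no idea how it is built); the helper names `canonical_url` and `group_duplicates`, the list of index file names, and the example.com addresses are all assumptions for illustration. It only compares URL strings and makes no requests, so it cannot tell you about redirects or 404 handling.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default-document names; adjust for your own server setup.
INDEX_NAMES = ("index.html", "index.htm")


def canonical_url(url: str) -> str:
    """Reduce common variants of the same page to one form (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat www and non-www hosts as the same site.
    if host.startswith("www."):
        host = host[4:]
    # An empty path and "/" both mean the home page.
    path = parts.path or "/"
    # Treat /folder/index.html and /folder/ as the same page.
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def group_duplicates(urls):
    """Return only the canonical forms that more than one input URL maps to."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    crawl = [
        "http://www.example.com/",
        "http://example.com/index.html",
        "http://example.com/about",
    ]
    for canonical, variants in group_duplicates(crawl).items():
        print(canonical, "<-", variants)
```

Any group this prints is a set of addresses that probably serve the same page, which makes them candidates for a single redirect target.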

 

6. WebConfs

The WebConfs tool takes two URLs and determines the percentage of similarity between the two pages. The lower the percentage, the less similar the two pages are.
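If you want a feel for what a similarity percentage means, here is a small sketch using Python's standard `difflib` module. I don't know what algorithm WebConfs actually uses, so treat this as one possible measure rather than theirs; the function name `similarity_percent` is made up, and it assumes you have already extracted the visible text of each page.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Word-level similarity of two texts as a percentage (hypothetical helper)."""
    # Compare lowercased words rather than characters so that spacing and
    # capitalization changes do not affect the score.
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # ratio() is 2 * matching words / total words in both texts.
    # autojunk is turned off so frequent words are not ignored on long pages.
    ratio = SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
    return round(ratio * 100, 1)


if __name__ == "__main__":
    original = "Our widgets are hand made in Raleigh from recycled steel"
    reworded = "Our widgets are hand crafted in Raleigh from recycled steel"
    print(similarity_percent(original, reworded))
```

A page where someone has only swapped a word or two will still score very high on a measure like this, which is why light rewording does not make a copy unique.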

 

Dealing With Duplicate Content

Rand Fishkin from SEOmoz recently did a Whiteboard Friday on duplicate content and how to deal with it. There are a number of good points that he brings up that I wanted to expand on.

"How much duplicate content is ok on my site?"

Exactly how much duplicate content you can have on your site is a grey area. If Google notices that your entire site is made up of duplicate content, they will most likely remove the majority of your pages from the index and/or penalize your site in the SERPs. However, if you use duplicate content in moderation (a quote, a section of a press release, a product description), you will not have to worry about any penalty. A rule of thumb is to use content from other sources when it makes sense for the user and relates to the other content on the page.

"What if someone else publishes my content and it gets indexed first, do they get credit?"

Google has many ways of identifying the original source of a piece of content. They look at domain trust/authority, PR (PageRank), inbound links, and contextual links back to the original source within the duplicate content. Say your article gets picked up by a number of mainstream news sources. Even though those sites are authoritative, they will most likely link back to the source, and that tells Google that your site is the original. As Rand said in the video, if a site syndicates your content, usually your domain trust will determine whether Google keeps that page in its index.

"Being Unique is Not Enough!"

To me this is by far the biggest point to make when talking about duplicate content. Many clients think that if you change the title or move a few words around on the page, the content is unique and that will be enough. This is simply not true. You need to add value and put your own spin on a topic/discussion. Content that goes viral is usually something that is completely unique, has exceptional value, and presents the information in a unique way.

 

What To Do About It

If someone copies your content, there are a few things you can do to have it removed.
  • Contact the Site: Email the website and politely ask them to remove it.
  • Submit a Spam Report: Send a spam report to Google, notifying them about a duplicate site or page.
  • File an Infringement Notification: Visit the DMCA page on Google's site and follow the instructions to properly file a notice of infringement.

Watch the Entire Video: SEOmoz Whiteboard Friday - Dealing with Duplicate Content

 

Mark Thompson

About the Author: Mark is the Internet Marketing Manager for Atlantic BT, a full-service Web Design & Marketing company located in Raleigh, North Carolina. He is also the creator of StayOnSearch, a search marketing blog dedicated to SEOs and Internet marketing professionals. Follow Mark on Twitter.

 

Comments  

 
0 # Gerald Weber 2010-01-11 16:17
Another way you can combat online plagiarism is to contact the thief's hosting company and file a DMCA complaint there. Most hosting companies already have a process in place to file such complaints and they are required by law to remove the plagiarized content. However, you must prove the content is plagiarized; they are not just going to take your word for it. You can use the Internet Archive to prove who published the content to the web first.
 
0 # Domenick 2010-01-11 17:28
Good post Mark, people should know what options are available to them. I came across a site last night, http://www.myows.com. I believe it's still in beta, but it's like a copyright content management solution where you can put all your articles or photos or whatever it is you created. Looks promising.
 
0 # Dana Lookadoo 2010-01-12 04:56
Mark, this is a very well done piece about duplicate content, tools, and addressing how much is enough, or not.

You also did well in laying to rest fears about syndication. The same is true regarding scraper sites. However, I often wonder at what point it is worth fighting the DMCA battle.

I'd be interested to hear your thoughts on scrapers and how you handle them ... or do you ignore them?
 
0 # Miguel 2010-01-12 17:46
Hi Mark, really great post. Well structured, good resources linked to.

But I have to argue your point that dupe content is so detrimental. Google is far less strict on dupe content these days. Check out my post on this a while back.

http://www.organicseoconsultant.com/duplicate-content-who-cares/

I really have not seen any instances of any ban or penalty for dupe content in about 5 years. And I have worked across hundreds of client sites.

I wanted to add one other tip here. You can also try contacting the host of the site that is stealing your content and have them remove the site. If the person is out of the US, which usually is the case, then you are screwed. The only thing you can do at that point is try to scare their host into taking down the site.
 
0 # Robert Bravery 2010-01-14 15:19
Great article, and great list of tools and resources.

Besides the correct credit due for duplicate content, one has to look at points of benefit. I do see a problem when a site takes full credit for content that it shouldn't.

But given the nature of the internet as a free medium, duplicate content is part and parcel of that.

I think that in some cases duplicate content can actually benefit you, provided it is done in the correct way.

I agree with @Miguel, Google is far less strict with duplicate content nowadays, and in fact in most cases can determine which is the original or first posted content. My article on this goes a long way to explain my thoughts on this. Google Does Not Penalise Duplicate Content. ("http://www.integralwebsolutions.co.za/Blog/EntryId/402/Can-content-scraped-from-your-site-actually-benefit-you.aspx")

Also there is a benefit to duplicate content, as long as the author does not believe that they are losing income or credibility as a result.

A lot of my articles are scraped and duplicated all over the internet. Some with my permission, some without. Some give me credit with a back link, some don't. But it is something that one has to live with.

I wrote an interesting article about this.
(Do you allow more than one URL link in the post? If not, I understand.) Can content scraped from your site actually benefit you? ("http://www.integralwebsolutions.co.za/Blog/EntryId/402/Can-content-scraped-from-your-site-actually-benefit-you.aspx")
 
+4 # waqas 2010-01-16 09:53
Excellent post, but in my personal experience almost 50% of the content of any web page or article may be copied or duplicated in most cases. Then why does Google rank those web pages?
 
0 # johndavid 2010-04-06 08:53
Duplicate content is definitely a drag for anyone who creates their own, legitimate content. This would seem like the perfect opportunity for a Service (Google) where you could submit/register a couple paragraphs and keywords from everything you post to this Service at the time you first publish it. The Service would then verify the existence of your content on your site based on date/time.

The Service would then search/spider the web for your content over time and generate a monthly report of any direct matches of your content and provide links to the potential duplicates. Kinda like Google Alerts on steroids. Hmm.

Texas SEO Company
 
+2 # Anubhav 2010-06-02 01:49
This is bullshit. I have a site which I have copied from many others. The best thing is I enjoy good SERP. No one can get me to remove those, although many tried. LOL. Try me if you can. That's a challenge. Your advice is not practical.