Sep 21

We publish a lot of web pages on RevaHealth.com, tens of millions of them in fact. One of the SEO problems we run into because of this is that many of them are very similar to other pages on the site. For example, our page for cosmetic surgeons in London is extremely similar to our page for cosmetic surgeons in the EC district of London. As a result, the search engines sometimes think that we are publishing duplicate pages, even though the pages are perfectly valid and distinct from a newly landed user’s perspective.

The SEO problem with duplicate pages is that Google doesn’t want to clog up its index with a whole bunch of duplicate content, so it tries to cull the duplicates from its index. In our case, it includes only the page for cosmetic surgeons in London in its search results.

In the past we included machine-generated text on each page to describe its content in a way people could easily understand, without our having to write tens of millions of descriptions by hand. However, because this block of text was quite similar from page to page, it hindered rather than helped us with duplicate content. So we set about trying to find a way to increase the originality of each page.
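To give a feel for how similar two templated pages can look to a machine, here is a minimal sketch that compares two texts by their overlapping word triples ("shingles") and the Jaccard similarity of the resulting sets. This is a common textbook technique, not a description of what Google actually does, and the two sample sentences are invented for illustration rather than taken from our pages.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words contribute a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical template text of the kind a listings site might generate.
page_london = "Find and compare cosmetic surgeons in London with prices and reviews"
page_ec = "Find and compare cosmetic surgeons in EC London with prices and reviews"

print(round(jaccard_similarity(page_london, page_ec), 2))
```

A score close to 1 means the two texts share almost all of their word sequences, which is the situation our generated descriptions put us in.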

On a side note, it is possible for you to take control over your own duplicate content and to tell the search engines which page is the original or most important version of a page rather than letting them make that decision for you. You can use either canonical URLs or 301 redirects, something we’ll discuss in another blog post. For now, this is something that we do already, but as the pages are actually valid, non-duplicate pages for our visitors, we think that this shouldn’t be necessary.
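As a quick illustration of the canonical URL option, the sketch below builds the `<link rel="canonical">` element that goes in the `<head>` of a near-duplicate page to point at the preferred version. The function name and the example.com URL are hypothetical, chosen only for this example.

```python
from html import escape


def canonical_link_tag(canonical_url):
    """Build the <link rel="canonical"> element for a page's <head>."""
    # Escape the URL so characters such as & and " are safe inside the attribute.
    return '<link rel="canonical" href="{}">'.format(escape(canonical_url, quote=True))


# Hypothetical URL of the page we would like search engines to treat as the original.
print(canonical_link_tag("https://www.example.com/cosmetic-surgeons/london"))
```

The same tag would be emitted on every variant of the page that should be folded into the preferred URL.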

So, going back to looking at how to increase the originality of the content on our pages, we took our search results for Dentists in Mexico as our test bed. For 50% of the locations in Mexico we added 2-3 paragraphs of location descriptions taken from Wikipedia. Wikipedia has relevant content that can be re-used on other sites thanks to the GNU Free Documentation License. The link to the original source of the text was included underneath.

We were hoping that syndicating content from Wikipedia could alleviate the duplicate content issue along with giving our visitors a better experience. We let the test run for three months.
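For anyone wanting to set up a similar test, a 50/50 split can be sketched as below. It assumes the locations are assigned to the two groups at random, which is one reasonable way to do it and not necessarily how we picked ours; the location list is just a sample and the function name is made up for this example.

```python
import random


def split_test_control(locations, seed=0):
    """Randomly assign half of the locations to the test group, the rest to control."""
    shuffled = sorted(locations)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


# Sample locations for illustration only.
locations = ["Cancun", "Guadalajara", "Mexico City", "Monterrey", "Puebla", "Tijuana"]
test_group, control_group = split_test_control(locations)

print("Add Wikipedia text to:", sorted(test_group))
print("Leave unchanged:", sorted(control_group))
```

Fixing the seed keeps the assignment stable, so the same pages stay in the same group for the whole duration of the test.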

The Results

Although our results shouldn’t be regarded as complete, we found that the inclusion of Wikipedia content on our search results pages had no effect on whether the page was included in the main Google index.

However, we also found that all pages with Wikipedia content that were already in the search results dropped by around 3 positions, while all control pages gained on average 2 positions!
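The comparison itself is simple arithmetic: take each page's position before and after the test and average the differences per group. Here is a minimal sketch; the page paths and positions are invented placeholders, not our measured data.

```python
from statistics import mean


def average_position_change(before, after):
    """Mean change in ranking position for pages present in both snapshots.

    A positive value means the pages moved down the results (a worse position).
    """
    shared = before.keys() & after.keys()
    if not shared:
        raise ValueError("no pages in common between the two snapshots")
    return mean(after[page] - before[page] for page in shared)


# Invented placeholder positions for two pages in one group.
positions_before = {"/dentists/mexico/a": 4, "/dentists/mexico/b": 10}
positions_after = {"/dentists/mexico/a": 6, "/dentists/mexico/b": 14}

print(average_position_change(positions_before, positions_after))
```

Running the same function over the test group and the control group separately gives the two averages to compare.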

Search engines want and reward original content. It is known that Google uses document similarity techniques to keep searchers from finding redundant content in search results, and our experiment left no doubt about it. I only wonder how Google will solve this problem in the current era of large-scale web syndication, when for many Google searches it is possible to find five exactly identical articles on the top five sites in the results, e.g.

http://www.google.com/#q=Get+Motivated+to+Create+New+AdSense+Content

4 Responses to “Using Wikipedia Content to Combat Duplicate Content”

  1. Bit of an old one, but on the Syndicated content issue Google recommend syndicated articles linking back to the source.
    http://www.youtube.com/watch?v=niINTKXT-zs

    I don’t think their view on this has moved on too much.

  2. Caelen King says:

    Hi John

    Yep, we did that. To be honest, we didn’t really find the results of this experiment to be a surprise; however, it was interesting to formalize it, in that the Wikipedia content reduced the position in the SERPs even though its inclusion improved the user experience.

  3. Caelen says:

    We deleted the Wikipedia content and after a week all the pages from the experiment recovered by 1-2 positions.

  4. [...] engine blog in an online marketing list, but it’s one of the best places to read about SEO, Duplicate Content or revenue [...]
