Google Content “Theft” Follow Up

About two weeks ago I posted a story seeded with a new and unique word that I invented, which would let me find every place that specific post was being re-aggregated and indexed within Google. That duplication leads to a duplicate content penalty and hurts my SEO.

The results are in and the problem is far worse than I expected.

For starters, I don’t think the term “theft” is anywhere near appropriate, but I can’t think of a better one. I have no problem with people syndicating my content, but I don’t want these sites republishing the full story where it will then be indexed by Google. They need to use robots.txt to block these pages (a sketch of what I mean is below).
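
For example, if one of these aggregators republishes posts under a hypothetical /posts/ path (every site has its own URL layout, so the path is made up), a robots.txt at its site root could keep crawlers away from those pages while leaving the rest of the site crawlable:

```
# robots.txt at the aggregator's site root; /posts/ is a
# hypothetical path for the republished full-text pages.
# Blocks all crawlers so Google never indexes the duplicates.
User-agent: *
Disallow: /posts/
```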

I’d really appreciate it if you guys could help me come up with a better term.

So who are the big troublemakers? Let’s see…

Rojo (ouch) is number one. Findory is next, followed by my category feed for the ‘Google’ tag on my blog (I’m going to have to fix that). Next is Jordo Media (never heard of these guys), followed by Kinja, myFeedz, feeds4all, Findory Blogs, and then Informatory.

Oddly enough the permalink for my post doesn’t show up at all. It might be pushed out by my category feed.

It would be nice if there were a noindex directive I could carry within my full-content feed, but that isn’t possible: the noindex meta tag has to go in the head of the generated HTML page, and the aggregator controls that, not the feed.
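
For reference, this is the tag in question. It only works when the publisher puts it in the HTML head of the page itself, which is exactly the part of an aggregator’s page that my feed can’t reach:

```
<head>
  <!-- Tells search engines not to index this page. It is only
       effective here, in the head of the generated HTML page. -->
  <meta name="robots" content="noindex" />
</head>
```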

I’ve attached a full copy of the Google result for historical purposes.

[Screenshot of the Google results, captured 2006-12-11.]


  1. Hi.
    informatory only aggregates up to 300 characters of text and always publishes BOTH the source link and the link to the homepage of the source.
    That must be OK, I think.

    Stefan Svartling / informatory

  2. Very interesting. It looks like Yahoo and MSN both do a better job of identifying your blog as the original source.

    Regarding noindex for feeds, there have been some attempts at standardization in the past; see http://www.rassoc.com/gregr/weblog/archive.aspx?post=791 and http://www.bloglines.com/about/specs/fac-1.0 . I don’t think there’s been much in the way of adoption, though.
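
    If I remember the Bloglines spec right, it boils down to a single extension
    element in the channel, roughly like this (a sketch from memory, so check
    the spec itself before relying on it):

    ```
    <rss version="2.0"
         xmlns:access="http://www.bloglines.com/about/specs/fac-1.0">
      <channel>
        <title>Example feed</title>
        <!-- "deny" asks aggregators and search engines not to
             redistribute or index this feed's content. -->
        <access:restriction relationship="deny" />
      </channel>
    </rss>
    ```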

  3. Hi Kevin,

    We surely don’t want to be seen as troublemakers in the blogosphere. I should make it clear that we aggregate whatever content we get from people’s RSS feeds (some choose to publish entire posts, while others publish just a short description of the article). Also, at all times, we keep the original URL of the post and the source feed and make them visible on our website.

    However, if I correctly understood the full argument you’re trying to make, we are hurting your page rank. Is this right? But isn’t this something all aggregators do? I mean, once an author decides to syndicate his articles, isn’t this one of the risks he must cope with? I don’t think it can even be seen as a risk. Rather, it’s something intrinsic to the whole concept of syndication.

    Don’t get me wrong, I’m not advocating Google scams or copyright infringements here. Are you saying we shouldn’t be publishing the entire content we get from the feed? Or just that we shouldn’t let Google re-index this content?

    All the best,
    Marius Zaharia, http://www.myFeedz.com

  4. Hey Marius. Thanks for the response.

    > We surely don’t want to be seen as troublemakers in the blogosphere. I should
    > make it clear that we aggregate whatever content we get from people’s RSS
    > feeds (some choose to publish entire posts, while others publish just a
    > short description of the article). Also, at all times, we keep the original
    > URL of the post and the source feed and make them visible on our website.

    Great… yeah. I really want to keep publishing my full-text RSS feed, and this
    problem has only now become obvious.

    > However, if I correctly understood the full argument you’re trying to make, we
    > are hurting your page rank. Is this right?

    Yup.

    > But isn’t this something all aggregators do?

    Rojo didn’t early on. We had a robots.txt which blocked most of the site;
    half the point was to protect publishers from having their content indexed.

    I left the company more than a year ago and I think the robots.txt came down
    since then. I still need to contact them about this but it might be redundant
    (see below).

    > I mean, once an author decides to syndicate his articles, isn’t this one of
    > the risks he must cope with? I don’t think it can even be seen as a
    > risk. Rather, it’s something intrinsic to the whole concept of syndication.

    Maybe… but I’d like to avoid it altogether if possible.

    What I’m going to play with is tuning my blog for better SEO so that Google
    can index the permalinks better. Then I’m going to make sure that each
    full-content feed item includes a link back to the original permalink (a
    rough sketch is below). I think it would then be possible for Google to
    figure out that my blog is the authoritative source.

    Might fix the whole problem but right now I’m not sure.
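
    Something like this, per item. The URL is a placeholder, and the attribution
    paragraph at the end of the description is the part I’d be adding:

    ```
    <item>
      <title>Example post</title>
      <!-- The canonical permalink on my blog. -->
      <link>http://www.example.com/2006/12/example-post</link>
      <description><![CDATA[
        Full post content goes here.
        <p>Originally published at
        <a href="http://www.example.com/2006/12/example-post">FeedBlog</a>.</p>
      ]]></description>
    </item>
    ```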

    > Don’t get me wrong, I’m not advocating Google scams or copyright infringements
    > here. Are you saying we shouldn’t be publishing the entire content we get from
    > the feed? Or just that we shouldn’t let Google re-index this content?

    I think ideally the content should still be published by your site but
    blocked by either robots.txt or a noindex meta tag. This way you can build
    your aggregator and I can blog without having to worry about people finding
    copies of my content through Google.

    I think the issue is that EVERY post is penalized so it just pushes my blog
    farther down the food chain. :-/

    Kevin

  5. I think informatory helps you get a better PageRank, because I link back to BOTH your start page and the article’s permalink.
    That way you end up with better PageRank.

  6. I forget where, but somebody, possibly Jakob Nielsen, possibly some other “expert”, said you should let people have your entire post via RSS. As a blog reader I like that better: I can read posts offline, and they download in the background automagically.

    However, the author should have ownership of and control over his work. Before this whole blogosphere, people stole content from my site without permission and put it up elsewhere. But now, with RSS feeds and scraper sites, my words are ending up all over the internet.

    I catch some of the scrapers because I have non-blog content I link to and they keep those links in. I catch them by having a Google Alert for link:www.mydomain.whatever. You could possibly do something similar with a Technorati alert that would target just the blogosphere; I believe Google also added an option to search just blogs with Google Alerts.

    Knowing that people are scraping my content and perverting it (they usually chomp just the text around the keyword they’re trying to optimize for) hasn’t led me to do much to stop them, though. I have done some SEO on my site; I’ve even gone a little overboard for some stupid keywords and phrases, which in all likelihood hasn’t helped.

    Creating a Google Alert for each permalink would be a bit much, but you could make one for just the feedblog.org portion. You could put in the year, and even the month, to limit the number of results returned, but honestly it shouldn’t be a big problem.

    If you believe Google’s PR, the algorithm will eventually sort itself out, but in the meantime it is stupid not to rank first for something you wrote. It happens, though… I’ve seen cases where a story was reblogged, the later blogger had more PageRank, and his quoted version outranked the original. So perhaps this isn’t just RSS aggregators, but a symptom of Google’s overall algorithm.

  7. Aggregating full content is wrong, but aggregating a snippet of text, just like Google and all the other search engines do, is not.
    It’s like quoting someone on your own blog.
    What’s wrong with that?
    Even if you get a better rank in search results?

    As I said, if you get linked, you’ll get better PageRank in the end.

  8. Hi Kevin,

    Feeds4all aggregates content from feeds (RSS, Atom, etc.) and makes this content available through a directory or through search. So content is never ‘grabbed’ from the blog or website itself.

    Feeds4all does more than replicate that feed content. It adds value to each feed and each article by serving links to related feeds and articles, thus guiding visitors to more of the kind of information they are looking for.

    This works both ways: Once YOUR feed/article is found, Feeds4all serves not only a lead to your article and site but also related feeds and articles. When other feeds/articles are found that are related to the content of your site, Feeds4all serves the feed and articles on your site.

    By the way, served articles are always accompanied by information about the source: (1) the article URL and a hyperlink leading to the full article on the actual blog/site, and (2) the name of the blog/site and a hyperlink to its home page. Lists of articles from a particular feed likewise carry the name of, and a hyperlink to, the originating site. The feed URL is available too.
    Each feed page contains the statement: “The content of this feed is property of the original publisher.” and each article/item page contains the statement: “The content of this article is property of the original publisher.”

    I hope you will reconsider your qualification of Feeds4all as ‘a troublemaker’.

    Regards,

    Fred

  9. Hi Kevin,

    Check this out and let me know what you think:
    http://blog.myfeedz.com/2007/02/16/myfeedz-released-on-adobe-labs/

    However, it will take a while before search engines revisit the website and reindex content.

    All the best,
    Marius
