Written on January 20th, 2008 at 12:01 am by Darren Rowse
Fighting Scrapers With Your Left Jab
This guest post was submitted by Patrick who blogs at Piggy Bank Pie Writing Services.
I started 2008 with a post that went viral on StumbleUpon and BloggingZoom. Even Skellie and Caroline Middlebrook took the story back to their respective blogs. For those interested, the post in question was How I Received 850 Visitors Without Using Social Media Sites.
Image by Dave Hogg
But high visibility also comes at a price. Once a few bad guys hear about your blog, they hit you from behind like sore losers, running away with your content to monetize their own sites. These opponents are called scrapers. Let’s challenge them to a five-round match to see who deserves the title. Gentlemen, let’s have a clean fight: play by the rules, no punches below the belt, and no hitting behind the head.
Definition Of A Scraper
Using hacking tools, scrapers subscribe to your site with the intention of stealing your content right off your syndicated feed. Once you publish a new article, the program fetches your entire post from the RSS feed and publishes a carbon copy on the scraper’s site. If you haven’t taken any precautions, search engine crawlers can index the scraper’s copy before yours, and may even penalize you for duplicate content.
Why Do They Do This?
Ever heard of Made For AdSense? Scrapers need content to feed their contextual ads, such as Google AdSense. Since they are unable to write their own (hey, no hitting below the belt), they steal yours and publish it on their blog, where most of the time AdSense is used heavily. Scrapers often target specific keywords, and their goal is to steal articles that help them rank high in search engine results pages. With better ranking comes better traffic and, obviously, a better click-through rate on their ads.
What Can You Do?
I would LOVE to write the ultimate solution for preventing scrapers from playing against the rules. However, it is not that easy, and the process can be time-consuming. Still, if you are minded to jump into the ring, here’s a five-round fight strategy that could potentially bring your opponent down.
Let’s get ready to rumble.
1. License Your Content
The very first step I would recommend is to use a licensing service such as Creative Commons. By licensing your content, you at least inform visitors that articles published on your site are subject to copyright laws. This allows you to specify under which conditions your work can be distributed. You can visit this page to choose the proper license for your work.
2. Add a Link To Your Original Post in Your RSS Feed
Joost de Valk, Shoemoney’s well-known web developer, recently wrote a WordPress plugin called RSS Footer that automatically adds a link to your RSS feed pointing back to the original source of each post. Here’s what Matt Cutts, a Google engineer, said recently in an interview about linking to the original source of an article:
“…if the syndicated article has a link to the original source of that article, then it is pretty much guaranteed the original home of that article will always have the higher PageRank, compared to all the syndicated copies. And that just makes it that much easier for us to do duplicate content detection and say: ‘You know what, this is the original article; this is the good one, so go with that.’”
Installing and configuring RSS Footer is a piece of cake, and I highly recommend you give it a shot.
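The idea behind a feed footer can be sketched in a few lines of Python. This is only an illustration, not how RSS Footer itself works (that plugin runs inside WordPress). It assumes a plain RSS 2.0 feed whose items carry the post body in the description element, and the function name, footer wording and sample feed are all made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical footer wording; a real plugin lets you configure this text.
FOOTER_TEMPLATE = (
    '<p>This post originally appeared at <a href="{link}">{title}</a>.</p>'
)

# Made-up sample feed used only to demonstrate the function.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Post One</title>
      <link>http://example.com/post-1</link>
      <description>Hello readers.</description>
    </item>
  </channel>
</rss>"""


def add_source_footer(rss_xml: str) -> str:
    """Append a link back to the original post to every item in the feed."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        title = item.findtext("title", default="").strip()
        description = item.find("description")
        # Skip items that have no permalink or no body to append to.
        if not link or description is None:
            continue
        footer = FOOTER_TEMPLATE.format(
            link=html.escape(link, quote=True),
            title=html.escape(title),
        )
        description.text = (description.text or "") + footer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_source_footer(SAMPLE_FEED))
```

Because the footer travels inside each feed item, a scraper that republishes the item verbatim republishes the link back to your site along with it.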
3. Report Scrapers To AdSense
Visiting your scraper’s site could help you gain a few points in the fight. Have you ever clicked the Ads by Google link on AdSense ads? This opens up a page where you can subscribe to both AdWords and AdSense. However, if you look at the bottom of the page you will notice a link that says Send Google your thoughts on the site or the ads you just saw. The beauty of this link is that it knows where you are coming from -your scraper- and it fires up a questionnaire regarding the relevance of your scraper’s ads. Now is the time to throw a left jab:
- Click Report a Violation?
- This brings up a question asking if the issue is with the website or ads, select website;
- You will now be asked which policy is violated, select The site is hosting/distributing my copyrighted content;
- Finally, use the text box under Add additional information here to explain your story to the referee.
4. Report Scrapers To Google
Now is the time to send your opponent to the floor for a first count of 8. Go to google.com, type your scraper’s domain in the search field and hit the Google Search button. If it finds your scraper’s site, this means his website is indexed by Google. Now go to Google’s page to Report a Spam Result and proceed as follows:
- Exact query that shows a problem: Type what you entered in Google’s search box to find your scraper’s site
- Resulting Google page that shows problem: Enter the complete URL of the Google page returning the search result
- The specific web page or site that is misbehaving: Type your scraper’s domain name
- Type(s) of problem (check all that apply): Select Duplicate site or pages
- Enter your story in the Additional details text box and click the Submit button.
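If you report scrapers regularly, it can help to collect the form answers in one place before pasting them in. The sketch below only assembles the values listed above; it does not submit anything to Google. The class and function names are hypothetical, and the only outside fact it relies on is that a Google results URL carries the search text in its q parameter.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class SpamReport:
    """Values to paste into the spam report form (hypothetical helper type)."""
    query: str            # Exact query that shows a problem
    results_url: str      # Resulting Google page that shows problem
    offending_site: str   # The specific web page or site that is misbehaving
    problem_type: str     # Type(s) of problem
    details: str          # Additional details


def build_spam_report(scraper_domain: str, story: str) -> SpamReport:
    """Gather the form answers for one scraper domain."""
    query = scraper_domain
    results_url = "https://www.google.com/search?" + urlencode({"q": query})
    return SpamReport(
        query=query,
        results_url=results_url,
        offending_site=scraper_domain,
        problem_type="Duplicate site or pages",
        details=story,
    )


if __name__ == "__main__":
    # "scraper.example" is a placeholder domain, not a real offender.
    report = build_spam_report(
        "scraper.example",
        "This site republishes my posts in full without permission.",
    )
    print(report)
```

Keeping the answers in a small record like this also gives you a log of what you reported and when, should you need to follow up.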
5. Report Scrapers To Their Web Hosting Service
This is the ultimate opportunity to hit with a multi-punch combination. Go to whoishostingthis.com and type your scraper’s domain in the search box. This brings up a link to your scraper’s web hosting company. Once you are on the provider’s home page, look for a contact page. Use online chat, email or a contact form to explain the situation. If you are required to provide a full and complete DMCA notice, I suggest you visit this page to get the DMCA form. If you go through all of this and your scraper gets kicked out by his web hosting service, consider the fight won by unanimous decision.
Summary
While this may not be an instant solution for preventing scrapers from stealing content, it can surely make their lives more difficult. If everything goes well and the scraper gets banned from AdSense, Google and his service provider, well, that’s a technical knockout. Now let’s just hope he’ll be out of the ring once and for all.
Has your content ever been stolen by scrapers? Have you tried some of the above strategies? Do you have other ideas to share? Please join the conversation in the comments.



110 Responses to “Fighting Scrapers With Your Left Jab”
Vikingblogger
January 20th, 2008 12:32 am
As I’m still really small I don’t have this problem – but I can really see it being a problem for bigger sites.
Really good tip about the RSS Footer – thank you, I will check it out.
Staale aka Vikingblogger
Markus
January 20th, 2008 12:34 am
Although my German blog is fairly new (one week) my content has already been stolen :(
But the most common is, that only a small excerpt is “stolen”, usually with the backlink to my page.
But I don’t even like this. The whole blog that does this consists only of these small excerpts (from other blogs, too) and is sometimes ranked higher than mine.
My question: Is it legal to copy almost every beginning (and the headline of course) of my content?
I know, that you can quote someone and usually it’s cool when you’re featured by another “regular” blogger. But a collection of excerpts? Is that O.K.?
Lex G
January 20th, 2008 12:59 am
And you can use http://www.copyscape.com to find duplicate pages on the web …
Scrapers aren’t the only problem though. Some people just steal your concepts and turn them into their own article. If you’re a starting blogger who doesn’t have enough back links or ranking then these thieves basically benefit from YOUR work, because they will rank more easily and have more exposure than the original blogger has …
All I can say … be careful ….
Other Joost
January 20th, 2008 1:07 am
Useful post. Thanks.
The nice thing about MFA sites though is that their trust rank is usually so low that they are hardly ever seen as the writer of quality content. And Google is usually pretty good at seeing through them.
The worrisome scrapers are the ones that actually have a worthwhile site with some original content and call themselves ‘aggregators’. They are harder to catch by Google et al. and they could be ranked above the original article.
Thomas (twofatbrothers.com)
January 20th, 2008 1:29 am
Wow! This is an awesome post. I’ve had a little bit of trouble with scrapers in the past. This looks like a 50-45 fight to me! Thanks a lot!
Denis
January 20th, 2008 1:31 am
Nice article. Motivated me enough to install the plugin :)
Tejvan Pettinger
January 20th, 2008 1:44 am
I have a lot of scrapers taking content. Mostly they get ignored, but sometimes Google ranks scrapers higher than the original. The worst was when a scraper took an article and got on the homepage of Digg with it. That was the worst experience.
Caryn
January 20th, 2008 1:51 am
I had my content stolen by scrapers when I had my previous blog. I was so furious, but I felt helpless, too. I had no idea what I could do about it other than to send them nasty comments, which of course doesn’t help. I’ll have to keep these tips in mind. Thanks!
BW
January 20th, 2008 2:02 am
Nicely worked out article.
Gives some great steps to take if you find someone has stolen your work.
This will come in handy I am sure
Thanks
Dan Cole
January 20th, 2008 2:04 am
When I write posts, I often times include my website name within the post. Then when it gets scraped, it doesn’t make any sense. But I also like the RSS footer idea.
David Airey wrote a similar post about people stealing your images. One person stole his post and images, but now their site looks really bad. The post is How to deter thieves from stealing your images and server bandwidth.
Mike Goad
January 20th, 2008 2:09 am
Great post. Now I have some work to do internalizing it and getting my blogs “plugged in.”
McClellanville Real Estate
January 20th, 2008 2:16 am
I’m saving this for later, because I know I’ll need it, THANK YOU!
Ruchir
January 20th, 2008 2:17 am
Following all these steps is just too tiresome, especially if you get scraped by a thousand and one scrapers. It might be viable for small blogs but it’s just impossible for the big blogs. Seriously, rather than wasting my time on contacting AdSense, Google and their hosts, I’d rather spend my time on marketing my blog….
David
January 20th, 2008 2:24 am
>> they steel yours and publish it on their blog
*steal*
Gilbert
January 20th, 2008 2:24 am
Setting up Google Alerts on article titles can be quite useful as well, and there’s also a handy site called http://www.copyscape.com which will let you run a few free queries each day to help find content which is so similar to yours that it must have been either scraped or copied and then rewritten.
David LaFerney
January 20th, 2008 2:30 am
This is great information, but the paranoid in me is wondering what happens if someone used these recourses to wrongly accuse you of content theft. Does the accused get a chance to defend themselves or could this be a great/awful dirty trick of conviction by accusation? If I were a thieving scraper looking at losing my precious Adsense account I would probably swear that YOU were the real villain.
Maritzia
January 20th, 2008 2:42 am
Great info! I keep you on my rss feed just for all the great tips like this.
The Geek
January 20th, 2008 2:58 am
Here’s what I’ve done:
1) Modified related posts plugin to show related posts in the feed.
2) Added a link back to my homepage in the RSS.
This gives me a minimum of 6 links back to my page on every scraped article. I’ll probably add the link to the original source as well… good idea for the RSS footer plugin.
When it comes right down to it, at some point it’s not worth worrying about scrapers anymore… it’s a losing battle.
Unless you are losing traffic because of the scrapers, Google is pretty good about detecting them and removing them from the index.
The tips in this article are good, though… I’ve had a couple of instances where fairly big sites were outright stealing my content, and one instance a long time ago where they outranked me with stolen content… so I used some of these tips with good results.
Sue
January 20th, 2008 3:01 am
Unfortunately, I couldn’t get RSS Feed to work with Feedburner, even after resyncing. And yes, my little site gets scraped. :(
The other alternative is Angsuman’s Feed Copyrighter, which does work well with Feedburner, both through the feed and email. Its only limitation is that it doesn’t add the post URL, which if I get brave enough, I might try to figure out how to add.
Also, be aware you WILL have to file a DMCA report with Google. And it can’t be through email. It must be faxed or snail mailed in.
And one of the best resources for help is through Plagiarism Today, Jonathan Bailey’s site.
Mathias
January 20th, 2008 3:04 am
Hi Guys!
I have had exactly the problem you’re describing here. The first thing I did was write a story about that guy and his “work”. But that didn’t work very well… So there were more scrapers…
Some days ago I found the story at Chris Gerret’s Blog about RSS-Sticky and implemented it on my blog. That way it should be a minimum of help.
Mathias
germany
Stock Trader Guy
January 20th, 2008 3:20 am
Went through and added my code to all my first paragraphs of my $$ site
Melanie Langenhan
January 20th, 2008 3:46 am
Darren, Patrick,
thank you for this valuable information. I’ll install the plugin soon, too.
It is offensive when one writes a unique article with absolutely unique and maybe personal content and some scrapers, as you call them, take advantage of you and steal your work.
Fortunately I kind of see a growing community of fair online users, supporting each other and still making online profits.
Shawn
January 20th, 2008 3:58 am
Very nice.
One point to add: I read recently that a lot of scraper sites are self-hosted, meaning that when you complain to their web hosting service, you are complaining directly to the scraper himself, masquerading as a hosting company through a reseller account. They will then “suspend” the account for a few months, then bring it back when no one is looking. They host hundreds of similar accounts so that there will always be accounts that are not “suspended.”
alanj878
January 20th, 2008 4:06 am
I am glad no one has taken any content from http://www.livelymoney.blogspot.com and thank you for those tips and i hope you become like problogger one day
JW
January 20th, 2008 4:27 am
I find a lot of my posts get “comments” from other sites who essentially will automatically pingback to my posts with an excerpt and a link to the original.
Is that considered being scraped? I always delete the comments because the sites who pingback in that manner never have original content, they seem to just be compilations of many sources. Why is this done?
TIA for any help!!
mgroves
January 20th, 2008 4:29 am
I wouldn’t call them “hacking” tools. There’s no hacking going on. They are just taking your content. That’s like calling a Xerox machine or a CD burner a “hacking tool”.
Wrong and bad? Yes. Hacking? No.
jon
January 20th, 2008 4:33 am
I am very happy no one has taken any content from http://www.bloggers-help.blogspot.com
http://www.flexsamples.blogspot.com
http://www.flexexamples.blogspot.com
these blog are great..
thanks
Google Tutor
January 20th, 2008 4:34 am
I’ll second Joost’s plugin; it works great and it’s a nice solution.
mgroves
January 20th, 2008 4:39 am
I am glad no one has taken any content from [insert my own spamming URL here].
Trula
January 20th, 2008 4:43 am
#2 is awesome, I’m going to do that ASAP. my fitness blog got scraped and so did my garden blog, humph.
Frugal Dad
January 20th, 2008 5:04 am
Thanks for the tip on Joost’s RSS footer plug-in. I’ll add it as a precaution.
Sharon Hurley Hall
January 20th, 2008 5:18 am
Great post, Patrick. You’ve reminded me to install that RSS Footer plugin, which is languishing in my downloads folder.
JEMi
January 20th, 2008 5:26 am
very useful posting
I’ll be sure to d/l that plugin
Thanks!
Jon Symons
January 20th, 2008 5:27 am
Just install Joost’s plugin, or better yet make a habit of deep linking your content from within your own posts.
Besides that the other steps are a waste of time and not productive. Get over it, would be my advice.
Costa Rica SEO in Paradise
January 20th, 2008 5:46 am
My blog came standard with follow comments and the footer link in the RSS. Serendipity ( http://s9y.org/ ) is my favorite blogging platform for these and many other reasons. All good advice though.
Ron
January 20th, 2008 5:55 am
Great post since I didn’t even know that such a program existed to scrape my content.
I look forward to the day when I am getting enough traffic for this plugin to be useful.
Patrick @ Piggy Bank Pie
January 20th, 2008 5:56 am
@alanj878: Thanks for the kind words. I hope too, you can help me to get there by subscribing to my RSS feed at http://feeds.feedburner.com/piggybankpie :-)
Anyone using a Creative Commons license?
Jake
January 20th, 2008 6:08 am
The problem is that Google Adsense doesn’t really care. I get scraped often and have reported to Google. They have emailed me back a canned response and never followed through with a ban.
The Shoemoney option is definitely the best that I see.
Stephan Miller
January 20th, 2008 6:36 am
Adding a Related Posts for Feeds plugin also helps by giving you more links back to articles on your site from the scraper site.
Some of these sites have the gall to give your original post a pingback.
Sarah (Real Life)
January 20th, 2008 7:00 am
Thanks so much for collecting all this information in one place. This happened to me a while back, and I got them to stop, but next time, I’ll be able to review this info, too!
The site actually published the part of the feed that says, “unsubscribe,” so I unsubscribed them from mine and the other blogs they were scraping! Ha! Take that!
Iesha
January 20th, 2008 7:02 am
Good info. Adding to ‘To Do’ list….
Mohsin
January 20th, 2008 7:09 am
This guide is worth saving to delicious Patrick. Excellent post.
John
January 20th, 2008 7:39 am
Excellent plugin by Joost. Thanks for that tip since I am getting killed by other plugs ripping my feed this may help me a lot.
Mandy - The Photographer Blog
January 20th, 2008 7:46 am
Thanks for this I had no idea how I could fight this, I’ll be going to sort all this out asap…
Mark@mytropicalescape
January 20th, 2008 8:39 am
Nice job Patrick! Very timely piece for me…I just checked my Technorati information to find this link:
http://cestgratuit.hautetfort.com/archive/2008/01/17/the-ten-most-inspirational-bloggers-of-2007.html
They stole my post, word for word. What do you do?
Marija
January 20th, 2008 8:41 am
Great post Darren, so far I was hoping that my scraper would get off my back some day, but I see it’s time to take action. Although, many people say that I should just ignore it. Hmmm. BTW, you haven’t mentioned sending an email to a scraper, or posting a comment? You think they wouldn’t listen?
Nikole Gipps
January 20th, 2008 9:07 am
I am happy to say that the only people scraping my feed are posting some lame excerpts. I don’t really think that is hurting me, as the excerpt link goes back to my site too … but then again I don’t run on advertising revenue so it may be a different story if you were.
So what do you think? Is the syndication of an excerpt posting copyrighted content too? Or would they be in violation only for full posts?
I don’t get how anyone would make money off an adsense site that just posts 2-sentence excerpts anyway.
Nikole Gipps
January 20th, 2008 9:42 am
I looked at the license, but the part that bothered me was modifications. It seems like your choices were to allow modifications totally, or not at all. Would not allowing modifications prevent people from posting excerpts of it for discussion’s sake, like if they wanted to make a long quote?
Patrick @ Piggy Bank Pie
January 20th, 2008 10:10 am
@ Marija: If Comments are turned on, scrapers obviously moderate them, so yes the guy gets your comment but you in return get an instant delete. As for the email, most of the scrapers do not have a contact page. I found an email address via the scraper’s domain whois information, tried emailing the owner of the domain, but again, obviously zero answer.
@ Nikole Gipps: Either full or partial scraping is the same, no doubt about it.
Thanks to all for the kind words.
Patrick
Nikole Gipps
January 20th, 2008 10:20 am
JW – I get a lot of the same thing.
wordvixen
January 20th, 2008 10:43 am
Oh my. I just started a brand new blog. I mean brandnew blog. The very first post was scraped within minutes!
Strangely, they only posted the first paragraph and linked back to me…. but not as in “full post can be found at”. The link back appeared as though I was a contributor to their site.
I haven’t done anything about it yet except to grouse, but I plan to use a few plugins I’ve heard about recently and then do something about the content stealer.
Frank C
January 20th, 2008 1:34 pm
Remember that most excerpt scrapers never see your feed. They get a feed from Google Blog Search, BlogTopList, Technorati or some other aggregation/search site that you ping when you make a post. Use the RSS-Footer plugin to ensure that your link is in the feed. You can also use RSS tagging plugins to cram extra tags into the feed. This will make the excerpt splogger post a keyword-rich post that highlights your site as the source. Make enough of them your ‘friend’ and Google will think your site is an authority site on that keyword.
Serge Kozak
January 20th, 2008 1:51 pm
Everybody pushes the boundaries of legally publishing content. Even with CC. I give you an example; the photo from Flickr used here has no Model Release. Virgin Mobile just had the same issue of lifting a CC image but not having a Model Release. So nobody is a saint here.
Some more info on this here: http://danheller.blogspot.com/2008/01/creative-commons-and-photography.html
David
January 20th, 2008 2:07 pm
A superb post – I get scraped constantly and I love the RSS footer plugin, great stuff!
techminds
January 20th, 2008 5:09 pm
great post Patrick.
Maybe I’m just tired but, how did the above comment by “wordvixen” happen?
New blog, scraped in minutes and already saw the content on another site?
How did she know she was scraped and saw on the other site so fast?
If this is not odd to you, don’t mind me. It is just that I have only seen this happen on not so new sites. Oh well, just the same, it sucks when you get scraped. Having a good following w/ loyal readers helps and it’s definitely not the kind of flattery from mimicry that anyone wants.
JW
January 20th, 2008 5:22 pm
* Are there any plugins out there to prevent “excerpt scraping” of 1-2 sentences?
….and along those lines…
*Are there any plugins to prevent those pingbacks from appearing as comments in my posts?
Seems there are always a few sites I can count on to borrow pieces of my posts.
I’d love to see the “partial scraping” issue tackled somewhere, as I have found no good resources out there for battling this.
Thanks for any help!
Karin
January 20th, 2008 11:51 pm
As far as I know, I was scraped at least 4 times last year and fought back. My experience too is that the Google Adsense people don’t care.
A couple of months ago I found a site republishing my content, hosted by freehostia.com. I wrote an abuse mail. They terminated the site immediately.
The easiest and quickest way to find out if your content is republished without your consent is to copy and paste a unique line from the beginning of your blog into Google.
.
Famous Quotes
January 21st, 2008 12:58 am
This post is very timely and helpful for me. My blog is a little over 2 months old, Google likes it and has given it a PR1. Unfortunately, scrapers also seem to like it and are scraping my content. Now I know of a few options to fight back!
Stephen Hopson/Adversity University
January 21st, 2008 1:06 am
The link you provided to the RSS footer plug-in isn’t working. Tried to google for the page but cannot find my way there.
Where is it?
Thanks.
Frank C
January 21st, 2008 2:43 am
@JW – You can prevent excerpt scraping by not submitting your blog to Google or any other blog aggregation service. These scrapers don’t get their sentences from your feed, they get them from blog search engine feeds. You could even turn off your RSS feed entirely. Of course, these steps are highly counterproductive if you want a blog that someone other than yourself reads.
@TechMinds – Aggregated excerpt feed scraping can happen as soon as your blog entry is indexed by Google. If you do things right, on purpose or by accident, you can actually get indexed very quickly and thus scraped quickly, even if you’re a brand new blog.
SusanneUK
January 21st, 2008 2:53 am
Well I have just learnt a thing or two.. didn’t even know what a scraper was until I ran across this article.. thanks very much for the concise information, very easy to understand and I am off to install that plugin right now..
Thanks again
Sue
Guardian Angel
January 21st, 2008 3:51 am
Although I am just a small-time blogger, I am using Copyscape with their widget on my blogs. But my friends advised me to also use Creative Commons, and since it’s Darren now who is mentioning it as Tip #1, I will check it out at once.
Thanks for being a consistent guide to us, Darren.
David Bradley
January 21st, 2008 5:53 am
Joost’s plugin, RSS footer, doesn’t solve the problem, but it is a great way to take the piss out of scrapers…
…going straight to the webhost/domain registrar with a cease&desist is still the only sure-fire way to stop any particular scraper
db
Selene
January 21st, 2008 6:32 am
I had never heard of Creative Commons before and will definitely check the site out even though my blog is fairly new. Do you think it is ever too early to get your material copyrighted?
Hagrin
January 21st, 2008 7:46 am
@JW & Frank C –
Actually, Frank’s suggestion is poor for exactly the reasons he mentioned so I wouldn’t even take it as a suggestion.
As someone who has written scrapers before (i.e. I wrote a scraper for lottery numbers and sports scores), the only way to prevent partial scraping (especially “screen” scraping) is to analyze your site’s logs.
If you believe scraping is going on (or you have identified a scraper), you can probably determine who it is by taking a look at the items scraped, seeing what time they are posted at that site and then looking for that pattern in your logs. Then, if you see it coming from the same IP, subnet or a unique user-agent, you can prevent them from accessing your content. You can usually do that through your CMS or by creating an ACL at the domain level.
Simple Mindz
January 21st, 2008 10:52 am
I never heard of this before. Thanks for writing about it. Since I am paranoid – I will now go install the plugin. ;)
Patrick @ Piggy Bank Pie
January 21st, 2008 12:10 pm
@ Selene: It’s never too early. And adding a Creative Commons license is such an easy process that I would do it if I were you.
@ JW: There are no plugins that prevent either full or partial scraping.
Patrick
Frank C
January 21st, 2008 12:24 pm
@Hagrin – Most scraping, particularly excerpt scraping, is very lazy, PHP 101 kind of programming. They just read the feed from Google Blog Search, or another major aggregator, and republish that. They’ll never even come to your site and thus won’t show up in your logs. RSS allows it, but the benefits of syndication greatly outweigh the cost of those who abuse it.
Fiar
January 21st, 2008 2:38 pm
I don’t get what the downside is. I want as many people as possible to share my thoughts. I get a backlink from the spammer, and there is a copy of my work. It spreads and multiplies, just as I intend.
So where’s the problem? What, because someone else is better at profiting off my work than I am? That’s the issue. Not reuse of my content, but my own inability to convert.
I don’t own my content. If I wanted control of my thoughts… If I wanted to keep them to myself, I would keep them to myself, and not post them on a public and indexed webpage.
I suggest others do the same. It’s about sharing, and the moment of creation, and the joy of creation. Not about control of distribution while at the same time – Ironically, and stupidly – hoping to expand distribution.
You can’t do both. If you want to control distribution, go ahead and try, but don’t expect it to reach a large scale distribution.
If your ideas are so important that you want the world to see them, which is more important: You controlling the way it is distributed, or the world being exposed to the idea? Hmm.
Oh, right. It’s how you can profit from it, versus how someone else can profit from it.
I guess the sharing of the idea takes a back seat.
Dave H.
January 21st, 2008 4:32 pm
“I give you an example; the photo from Flickr used here has no Model Release.”
It doesn’t need one – it’s a picture of a public sporting event.
“(Creative Commons) This allows you to specify under which conditions your work can be distributed.”
The problem is that people can ignore the conditions you specify. I suspect it was entirely unintentional, but this entry is actually a perfect example. The Flickr picture used above has an “attribution use” CC license, but there’s no attribution given.
proinvestorsblog
January 21st, 2008 4:37 pm
Great information. I am new to blogging and did not know about scrapers, and I will be sure to keep an eye out for them using my RSS to feed their site. Thanks, great info.
Darren Rowse
January 21st, 2008 4:58 pm
Dave – actually the attribution is that the image is a live link. Click it and you go to the source.
wordvixen
January 21st, 2008 5:21 pm
Techminds- no problem. I have that blog set to accept trackbacks, but to moderate all comments. Since the scraped feed (technically an excerpt) linked back to me, it showed up as a trackback. WordPress emailed me that I had a comment to moderate. I checked it, followed the trackback link, and voila! There was my content.
Rog
January 21st, 2008 6:01 pm
Woah, back the truck up!!! Licensing your work under a Creative Commons license gives explicit permission for people to reproduce your work somewhere else. It does so with certain specific requirements, but the fundamental right that all CC licences allow is the right to copy! I’m not an expert, so make sure you’re not handing the rights over when you apply a CC license. Simply asserting copyright is a much better approach if you don’t want your work copied.
Dave H.
January 21st, 2008 6:04 pm
From CreativeCommons.org:
… the proper way of accrediting your use of a work when you’re making a verbatim use is: (1) to keep intact any copyright notices for the Work; (2) credit the author, licensor and/or other parties (such as a wiki or journal) in the manner they specify; (3) the title of the Work; and (4) the Uniform Resource Identifier for the work if specified by the author and/or licensor.
While your “live link” fulfills point #4, it doesn’t fulfill any of the first 3. That’s why Flickr’s “blog this” button provides the photographer’s name and the title of the picture.
Mike Goad
January 21st, 2008 6:27 pm
Fiar: You said: “I don’t own my content. If I wanted control of my thoughts… If I wanted to keep them to myself, I would keep them to myself, and not post them on a public and indexed webpage.”
I have to disagree. You DO own your own content. It’s called intellectual property and it belongs to the person who created it unless it is done as a work for hire, in which case the people paying you own it. As soon as what you write is saved in any way, it’s yours. If anyone else uses it, then they are infringing upon your copyright, unless their use of your material is covered under the applicable copyright laws or you have given them permission. Copyright law under international treaties provides the author of materials with specific rights. One of these rights is the right to determine where one’s copyrighted material will be published. When I publish material on my blog, I know it will be going to my website, as well as out to feeds for others to read. When those feeds get hijacked and republished on someone else’s web site, they are infringing on my “copy right” to determine where copies will be published. Copy Right, Copy Sense
David Bradley
January 21st, 2008 7:53 pm
@Mike G
I don’t like scrapers any more than the next guy, but if you’re publishing an RSS feed then by one definition it’s intrinsically up for grabs – really simple *syndication*. By the other definition – RDF site summary – perhaps it isn’t. I guess you need to define RSS on your site and specify whether other sites are allowed to syndicate it. If you say no, then you would probably have a stronger legal case if you wanted to go that far.
However, there’s a world of difference between a site syndicating your feed (it’s not actually scraping at all, that’s pulling content when there is no feed) and one that tries to pass off as the original site republishing content, sidebars, banners etc etc with their own ads.
Passing off is fraud and so illegal in anyone’s book. I’ve had several sites shut down for trying to pass off Sciencebase.com! Go straight to their host or domain registrar with a stiff letter of complaint. They’ll invariably deny it’s their responsibility, but usually within 3-4 days you’ll find the scraper has morphed into a host’s holding page. They’ll eventually stop ranking for your keywords in the SERPs too.
db
Darren Rowse
January 21st, 2008 11:18 pm
Point taken Dave H – actually I usually include a text link and live link the image but this time around got lazy. Fixed it.
Fiar
January 21st, 2008 11:27 pm
@Mike Goad. I have to disagree. Just because there is an idea called “Intellectual property” doesn’t make it correct. I don’t own my content. Plain and simple, or maybe it wasn’t clear the first time around. If I wanted to “own” my thoughts, then I would keep them to myself, safely locked inside my own head, where no one would have access to them.
That would defeat the purpose of creation.
Jennifer Kyrnin
January 22nd, 2008 5:40 am
This is a great article, and let me tell you that it does work. It can be slow (Google requires that all complaints of this nature be sent via snail mail), but the stolen content does get taken down.
Patrick @ Piggy Bank Pie
January 22nd, 2008 6:42 am
@ Rog: Read Dave H.’s comment right next to yours. You’ll find the reason why Creative Commons makes sense.
@ Fiar: I understand your point about distributing/syndicating your content: you get “publicity” and a backlink. However, if your scraper gets visitors from the SERPs using *your* content and he’s making AdSense revenue out of it, then this is not right. That would be like recording the latest Celine Dion single and selling it, you know what I mean? p.s.: I do not own any of her CDs ;-)
Patrick
Shane
January 22nd, 2008 9:48 am
Thanks for the RSS Footer plugin tip.
techminds
January 22nd, 2008 1:02 pm
Don’t hate me but, is the point getting a bit blurred or is it me? The discussion seems to be typical, with a lot of thank u’s and concerned publishers linking back to their sites and then gets healthy when we see the question/point of “Fiar” about, what the point is and the promotion aspect.
I agree that it is a problem, but what problem specifically?
Excerpts that give a link to the source from lazy & tacky sites that scrape feeds, or ppl that take an entire article/post and make it their own? The latter I can see as a problem with respect to copyright, but the former seems a bit in the neutral area, if distribution is factored in etc.
Maybe it’s just me but when I read from other well known sites that have used content (even headlines/titles) from other well known site(s) to make content (for their own benefit monetary or not) without regard for linking to the actual source from which it came, seems just as wrong. (i.e. If a b5media site writes about something that came from a news site like Financial Times, then the source should not be Engadget)
As for Original thought, this could be debatable these days. If we search for it, I am sure we will even see this subject is being repeated, not to say that the plugin is not new and useful but that the need for such a plugin shows concern for a repeating problem.
A variable that seems to factor in heavily is google and the ranking of such content being from the original author or not. Like it or not, dependence on google has skewed our thinking somewhat in my opinion. Not to sound naive, but before google how did you deal with this issue? Before rss or Al Gore’s internet ? – lol
Is not Google scraping? Do they have any original content? Not to leave out Yahoo and/or any others benefiting from content they did not create. Isn’t a no-name site w/ an excerpt doing the same thing. We are happy that google does it b/c we might get lots of traffic, but pissed at the no-name b/c we don’t?
Oh, and what about video? that can be embedded, giving the full video content and not just a summary of the video, is the same in the sense that it could be a problem.
*M.Goad is right, in that intellectual property rights do exist b/c everyone has the right to control what they create. Control being the key.
*Fiar is right, in that he has the right to NOT control the distribution of his/her own content/ideas, but to be known as an authority he needs credit for the work he’s creating.
A last thought is, what the correct response would be to an abuser vs. jumping the gun and taking actions suggested in the post. If I have a registered user make a post while I’m on vacation and it doesn’t please someone; I would not be pleased to know that other actions had been made without contacting me or the site. This doesn’t happen off the net, so it doesn’t seem logical to do it on the net.
OK, I’m done now. I think I have confused myself more by trying to write about this, but hopefully this will spur a continued healthy discussion.
Thanks for your time, and please don’t hate me for expressing a view and hoping to learn more about this important issue.
jax @ techminds
Sebastian
January 23rd, 2008 12:55 am
Sites that live off scraped content from blogs are pretty much short-lived. Chances are that before search engines and a host comply with a DMCA complaint, the domain has already paid for itself and the scraper laughs at you all the way to the bank.
Also, some scrapers replace all HREF values from scraped posts with affiliate links, so the backlink trick doesn’t work.
You can try to block the scraper’s bots, but then the scrapers will just take the content from Google Reader or similar.
You can try to delay your feed output –if your content isn’t that timely– to make sure that search engines index the original first. With Google’s BlitzIndexing™, that’s a matter of minutes, or an hour at most, depending on the blog’s popularity.
In fact, there’s no defense.
You can out the scrapers, if you find them behind several proxy layers that hide their identity, but that comes with risks.
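The feed-delay idea mentioned in this comment can be illustrated with a minimal Python sketch. It assumes each post is a dict with a timezone-aware "published" datetime; the function name delayed_feed_items and the one-hour delay are hypothetical choices for illustration, not something any blogging platform is known to provide under that name.

```python
from datetime import datetime, timedelta, timezone


def delayed_feed_items(items, delay, now=None):
    """Return only the posts that were published at least `delay` ago.

    `items` is assumed to be a list of dicts with a timezone-aware
    "published" datetime (a hypothetical structure for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - delay
    # Posts newer than the cutoff stay on the site but out of the feed,
    # which gives search engines time to index the original first.
    return [item for item in items if item["published"] <= cutoff]


posts = [
    {"title": "Old post",
     "published": datetime(2008, 1, 23, 8, 0, tzinfo=timezone.utc)},
    {"title": "Fresh post",
     "published": datetime(2008, 1, 23, 11, 30, tzinfo=timezone.utc)},
]
now = datetime(2008, 1, 23, 12, 0, tzinfo=timezone.utc)
in_feed = delayed_feed_items(posts, timedelta(hours=1), now=now)
print([p["title"] for p in in_feed])
```

As the comment itself points out, this only helps when the content is not time-sensitive, and it does nothing against scrapers that copy from the site rather than the feed.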
Aaron Hall
January 23rd, 2008 8:19 am
Regarding copyright, Mike Goad made some good points and refuted some misguided information.
I state this as a copyright attorney. Further, my comments here relate only to copyright law in the United States.
If the content that a blog author drafts is original, the author owns the copyright on the content regardless of whether a copyright symbol is included.
Also, there is no need to register a copyright to have a copyright on material, but registering a copyright gives the author additional protection under the law.
Two useful explanations on this topic can be found here:
1) Can I Copy Another Blogger’s Text?
2) Copyright and Trademark for Blogs and Domain Names
jonathon
January 24th, 2008 4:07 am
Very useful information. I hope I never get hit by a scraper, but you never know who’s watching your site 24/7.
Internet Hunger
January 24th, 2008 6:02 am
Really helpful info, thanks for posting this Darren. I can’t believe how many scrapers have been jumping on my content lately. I guess that’s what I get for updating almost daily.
Frank C
January 24th, 2008 8:04 am
@ Jonathon and Internet Hunger – It’s all in the keywords. That’s what the scraper software looks for. For example, if you post about ‘iPhone’ and the scraper has set up the program to look for blogs that use that keyword in a post, most likely through an aggregator like Google Blog Search, it’s scraped.
I did a little experiment on this a couple of months ago where I did three consecutive posts about the iPhone over about an hour and all three got scraped by the same excerpt splogs. When you know this is happening that’s when you can turn this to your advantage using feed plugins.
Advertising Photographer
January 24th, 2008 4:23 pm
Dave H: The model release is not for the photographer, who has already given up the rights (by choosing CC, which is up to the photographer after all), but for the model (the boxer). This is not editorial usage (the article is not about that match or boxer), so it is considered commercial use, which makes the owner of this blog liable.
jonathon
January 31st, 2008 6:47 pm
Thanks, Frank C, very helpful.
Troy
February 6th, 2008 10:16 am
Good list, I’ve linked to it!
You can first check if your site is being scraped using a service like copyscape (copyscape.com) that will search for duplicate content from your URL.
david
February 6th, 2008 11:46 am
It’s not just blog posts they’re scraping. If your site ranks on the first page, they’ll rip off your title, scrape your content and put it in the source code description, and use your ranking to boost their crap link farms or crap product. Type “garcia weightloss” into Google and look for dietadvices.com or any site ending in paran.com. They’ve knocked out my second and third page listings and are posting duplicate content. I’ve reported these guys but they keep coming back. If I find them, I’m going to beat the hell out of them.
pKay
April 5th, 2008 11:25 pm
Excellent!! I would like to keep this in mind when the inevitable occurs (and someone steals stuff from my site)
Yes, while it won’t stop these shitheads from scraping your blog, at least it will make their life difficult!!
Keep up the great work!
Cheers!
James Joyner
June 4th, 2008 6:05 am
I’ve pretty much given up on Google doing anything about the problem. Every time I fill out the form they want me to do an exhaustive DMCA filing and fax it back to them. Given how quickly these things pop up, it’s just not worth the effort.
And it’s completely untrue that Google figures out which one is the original. I’ve got some scrapers out there scraping my content from other scrapers!
Veron
June 17th, 2008 2:49 am
I just realised that content scrapers duplicating content off my blog are showing up in Google searches, while my site is nowhere to be found! It’s hurting my site traffic really badly.
See what I mean. For example: In Google Images, when searching for “site:sparklette.net graffiti”, all the results that show up are other websites rather than the original http://sparklette.net/
It’s frustrating and almost heartbreaking. It’s puzzling why Google is recognising these content scrapers rather than my blog, which has a PR 5.
What’s even more puzzling is that just barely 2 weeks ago, the exact same keyword searches had shown my site as one of the search results. Yet today, this is no longer true.
I am lost.
Mandy
June 20th, 2008 6:43 pm
Thank you so much for explaining the means to at least be able to fight back.
Over the last few weeks I too have suffered. It has taken 2 years of hard work to build up a reputable website, and within 2 weeks a relatively new ‘news source’ copies our RSS feed and, to add insult to injury, scrapes the site using mega amounts of my bandwidth.
We were doing fine in search engines up until then; our titles were top of the list within minutes of posting.
Not now. Since the ‘news source’ appeared, theirs are.
quote:
“Global Social News Platform aspires to connect people through news and create global communities of interest. Where topics and ideas: offer continuously updated news from thousands of sources”:
The ultimate snub is you no longer need to directly access the ‘News site’ as you can now install their widgets.
Thanks once again for the heads up. The gloves are on, round one begins. :o)
David
June 21st, 2008 6:05 am
Darren,
Just wanted to say thanks. It took me a little over a month to get my scraper banned off google. I used most of your tactics, including calling the guy on the phone.
Yes, it was quite time consuming, and I’m sure it’ll happen again, but the payoff was SO worth it. The bastard thought he was invincible, until his affiliate partner was notified (by me) and his site suddenly went poof.
Not quite as satisfying as a punch in the face, but better than nothing!
Veron
June 21st, 2008 1:09 pm
For self-hosted sites that would like to combat content scrapers linking directly to your images, I would like to share this solution that I recently found and implemented on my blog. Basically, it uses .htaccess to prevent any site other than your own from linking to your images (or any file, for that matter).
I’m not sure if I’m allowed to paste in the URL here. So just google for “Preventing Image Bandwidth Theft With .htaccess”. There would be a whole list of sites that give the detailed instructions.
Voila! No more bandwidth stolen!
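The comment does not reproduce the .htaccess rules themselves, so they are not guessed at here. The underlying idea, though, is a check on the HTTP Referer header of each image request, and that logic can be sketched in Python. The function name is_hotlink and the domain example.com are hypothetical; this illustrates the decision being made and is not a replacement for the server configuration the commenter used.

```python
from urllib.parse import urlparse


def is_hotlink(referer, allowed_hosts):
    """Return True when an image request appears to come from a foreign site.

    `referer` is the raw Referer header value (may be empty or None).
    `allowed_hosts` lists the domains you own, e.g. ["example.com"].
    """
    if not referer:
        # No Referer at all: direct visits and some feed readers or
        # privacy tools send none, so this sketch lets them through.
        return False
    host = (urlparse(referer).hostname or "").lower()
    # Accept the domain itself and any of its subdomains (www., blog., ...).
    return not any(host == h or host.endswith("." + h) for h in allowed_hosts)


print(is_hotlink("http://www.example.com/my-post/", ["example.com"]))
print(is_hotlink("http://scraper.example.net/copy/", ["example.com"]))
```

A server would refuse or swap out the image whenever this check returns True. Because the Referer header is easy to omit or fake, this protects bandwidth against lazy hotlinking rather than against a determined scraper.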
Soy
July 2nd, 2008 6:11 am
I just got scraped and have followed the steps you outlined!
Now I hope the scumbags get booted from Google, dropped by their host, and mauled by a wild, diseased kangaroo all in the same day.
I’ll report back with my results.
Thanks!
TrafficNymphomaniac.com
July 4th, 2008 11:06 am
I had an excerpt from one of my articles on link popularity scraped with no attribution. But, because the content contained a link to one of my Web sites I did not take action.
But, if I did not benefit from the backlink, I would have complained directly to the Web site host.
Robert A. Kearse
Neecy
July 18th, 2008 7:42 am
I thought written words, i.e. books etc., were automatically copyrighted? Or is that only art?
Also, how can you tell if you have been scraped?
Thank you
Neecy
July 18th, 2008 7:43 am
DISEASED KANGAROO…THAT is hilarious!
Online Internet News
July 24th, 2008 5:18 am
I just started a blog. I have a total of 13 posts on it and have already been scraped by someone or something. This is very annoying and I will follow the items listed above and hopefully the issue will be resolved.
I wonder if putting a link to the blog homepage in every post will help?
Thanks for the tips.
John
Fiar
July 24th, 2008 10:19 am
That’s actually a good strategy to take advantage of the scrapers. Most scrapers will only post snippets, so put the link into the first sentence or two. Actually, chock your posts full of internal links, but always fit one in the first or second sentence, preferably with a keyword you are targeting as the anchor.
Check out this post on duplicate content and especially the comments for some help. There are some really good tips on using scrapers to your advantage there.
Neecy
July 25th, 2008 10:28 am
Is there an RSS footer for Blogger?
Daniel, Fashionising.com
November 2nd, 2008 6:20 am
We had exactly this issue with content from Fashionising (http://www.fashionising.com). In the end I custom built our RSS feed, which now publishes only
* The first half of the article
* Two different links back to the original article, with a clearly stated “Read the rest here” type byline
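The two points above can be sketched in Python as follows. This is not Fashionising’s actual feed code, which the comment does not show. It assumes the article is available as a list of plain-text paragraphs, and the function name feed_excerpt, the example URL, and the wording of the second link are all hypothetical.

```python
from html import escape


def feed_excerpt(paragraphs, post_url, site_name):
    """Build an HTML feed body from the first half of a post plus two backlinks.

    `paragraphs` is assumed to be a list of plain-text paragraphs.
    """
    # Keep the first half, rounding up, and always at least one paragraph.
    keep = paragraphs[: max(1, (len(paragraphs) + 1) // 2)]
    url = escape(post_url, quote=True)
    parts = [f"<p>{escape(p)}</p>" for p in keep]
    # Two separate links back to the original, so a scraped copy still
    # credits the source even if one of them gets stripped.
    parts.append(f'<p><a href="{url}">Read the rest here</a></p>')
    parts.append(
        f'<p>Originally published at <a href="{url}">{escape(site_name)}</a>.</p>'
    )
    return "\n".join(parts)


body = feed_excerpt(
    ["Paragraph one.", "Paragraph two.", "Paragraph three.", "Paragraph four."],
    "http://www.example.com/some-post/",
    "Example Blog",
)
print(body)
```

As noted earlier in the thread, some scrapers rewrite every link in a scraped post, so the backlinks are a deterrent rather than a guarantee.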
Alison
January 16th, 2009 12:38 am
Great article! I just discovered someone had taken one of my blog posts and I’ve implemented 4. and 5. Now I’m off to do 2. Whew, I had no clue how to handle this. Thanks for saving my time and energy with this great list!!!!
p1nk g33k
January 21st, 2009 12:31 am
I only get upset when they try to pass the content off as their own or they don’t link back to my site.
The first thing that I do is find out if they advertise with AdSense, and then I report them.
The Blogspot blogs are the worst. And, that’s funny, because they’re owned by Google. You would think that Google would be able to get rid of those blogs the quickest.
Jessica
April 9th, 2009 3:46 am
Thank you SO much for this. Thanks to you, I just got a site deindexed from Google within 2 or 3 days for stealing my content. (They plagiarized my entire post, did not give me credit, and to put the icing on the cake, their site was ranking for the keywords in my article while my own site wasn’t!) I am brand new to internet marketing and had no clue how to handle scrapers until I read this post.
netbook
April 16th, 2009 1:34 pm
Wow, great read. I just got scraped and found them ranked better than me. That burns big time. I will take the steps you mention here. Very handy. I have this bookmarked!!