
Controlling the Googlebot

How to control what gets indexed by Google and when? That is the question. Most of the time, we want Google to snarf up as many pages as possible, but I can think of a few times when I did not want something indexed and had to go back to Google to have pages removed.

In the immortal words of Darren Rowse, there’s a tangent to follow:
In my case, a church website I run inadvertently published information about missionaries in sensitive areas of the world, information that actually placed them in danger. While I typically wanted information about the church and its functions indexed for people to find, I did not want this information indexable. While I spun my wheels trying to correct the sensitive information, I realized that anyone in the world could find enough information about these people via Google that I had to resort to Google's URL removal tool.

Matt Cutts provides a concise, link-filled guide to controlling the Googlebot's indexing. Though the details can be found at Matt's site, here is a short rundown:

  1. At a site or directory level, use .htaccess to add password protection.
  2. At a site or directory level, make use of a robots.txt file.
  3. At a page level, use the noindex <meta> tag.
  4. At a link level, use a nofollow attribute.
  5. If the content has already been crawled, use Google's URL removal tool as a last resort.
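To make those options concrete, here is a rough sketch of what each one looks like. The directory names, file paths, and example URL below are placeholders of my own, not anything from Matt's guide:

```apache
# 1. .htaccess in the protected directory (Apache) -- require a
#    password before anything is served, so crawlers never see it
AuthType Basic
AuthName "Members Only"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

```text
# 2. robots.txt at the site root -- ask compliant crawlers to
#    skip a directory (/private/ here is a placeholder)
User-agent: *
Disallow: /private/
```

```html
<!-- 3. Page level: the noindex meta tag goes in the page's <head> -->
<meta name="robots" content="noindex">

<!-- 4. Link level: rel="nofollow" on an individual link -->
<a href="http://example.com/some-page" rel="nofollow">a link</a>
```

Keep in mind that robots.txt and the noindex tag are requests, not access control; only the password protection in option 1 actually keeps content out of reach.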

I would add just a point on common sense and intelligent web practice. There is a saying that nothing you do on the internet is anonymous, and there is something to be said for thinking before you act. It's harder to clean up after a boneheaded mistake, such as the one I made for the church whose site I ran, than it is to think before posting anything online. If you don't want the world to see it, then don't rely on the mechanisms listed above. Simply don't post it.


Comments

  1. raj says:

    There is an old saying that if two people know something, it’s a secret. If three people know it, it’s public knowledge. If you need to publish sensitive info, you might want to try a password-protected site. If the info is truly life-threatening, there’s no way you should post it on the site at all. Because what if someone else links to it? [Provided you haven't changed your .htaccess file or used Google's URL remover.]

  2. Brad says:

    If I don’t want anyone to read something, then I don’t publish it on any of my websites! But, since my blog is still fairly new, I’m happy anytime I get a visit from the googlebot in hopes it helps my search engine ranking.

  3. “…If you don’t want the world to see it, then don’t rely on the mechanisms listed above. Simply don’t post it.”

    This phrase is paramount.

  4. I always operate on the assumption that there are only two possibilities: something is private (unpublished) or it's public (published). If I publish, I assume that everyone in the world will have a chance to read everything on the site at the worst possible moment.

    I haven’t had to beg Google to remove a URL yet…

    Okay, that’s a lie, I’ve used Google’s URL Removal Tool a few times to correct a less dangerous mistake, but it’s one you should also watch out for:

    * Always password-protect “beta” or “temporary” versions of a site.

    It’s convenient to have a beta site online to test a new design or function, but be careful lest Google spider the whole thing and start substituting it for the results on your original site…

  5. Matt Mullenweg actually had the noindex tag put on his site maliciously. Now that’s a dismal thought.

    Aaron, any info on when this delayed PR update is going to happen? Matt Cutts is back. What’s the holdup?

  6. John, I’ll presume that you realize I have no clue about Google updates. I wait like anyone else. I’ll also presume that you were not aware that there was a PR update last week.

  7. Very interesting article.
    How long does Google take to remove pages via the URL removal tool?
    bye

  8. adfi says:

    Yes. Very good site!

  9. Brian says:
  10. Mark says:

    flys – Google can take quite a while to remove pages, but this of course depends on the frequency of its visits.

  11. Netbooks says:

    I am a big believer in using robots.txt

  12. I think robots.txt is a good tool, but be careful what you restrict!

  13. Netbooks UK says:

    Sure, robots.txt might work, but what about after the pages have already been indexed? How do you get Google to remove them?

Trackbacks

  1. [...] Controlling the Googlebot (tags: blogging Search Tools Google Bots) [...]

  2. [...] One webmasters comments and help when things are posted by mistake has been written about on Problogger. [...]