DotDragnet
Author Topic: Help with robots.txt  (Read 2749 times)
Chris H
Resident God Botherer
Global Moderator
Hero Member
*****
Posts: 2291



chrishall57
« on: June 16, 2007, 05:27:27 PM »

May have mentioned a problem with Googlebot leeching all the bandwidth off one of me sites last month.

Set me robots.txt up as follows

Quote
User-agent: *
Disallow: /administrator/
Disallow: etc, etc etc.....

User-agent: Slurp
Crawl-delay: 20

User-agent: Googlebot
Crawl-delay: 20

This month I was expecting better things, but have now found that less than halfway through the month Googlebot has eaten 92% of me 6 GB bandwidth.

Looking at me sitemaps report, it tells me that the crawl-delay is 'Rule ignored by Googlebot'.

Any ideas? Don't really want to shut out google and would like to know why it's ignoring me robots.txt.

I'm going to have to scrap the crawl delay and just disallow googlebot for the next couple of weeks.
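
For anyone wanting to check how a parser reads the file quoted above, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It assumes only the `/administrator/` rule from the quoted file (the elided "etc" lines are left out), and the test path is made up. This is one parser's reading, not Google's own code, but it illustrates a point that is easy to miss: a crawler that finds a group naming it follows that group only, so a `User-agent: Googlebot` group holding nothing but `Crawl-delay` leaves Googlebot with no Disallow rules at all.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post, minus the elided "etc" Disallow lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/

User-agent: Slurp
Crawl-delay: 20

User-agent: Googlebot
Crawl-delay: 20
"""


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_rules(ROBOTS_TXT)

# Hypothetical path, used only to exercise the Disallow rule.
ADMIN_PATH = "/administrator/index.php"

for agent in ("Googlebot", "Slurp", "MSNBot"):
    print(agent, parser.crawl_delay(agent), parser.can_fetch(agent, ADMIN_PATH))
# Googlebot 20 True
# Slurp 20 True
# MSNBot None False
```

If that reading is right, the Disallow lines would need repeating inside the Googlebot and Slurp groups to stay in force for those two bots.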

slaughteredlamb
DDN Contributor
Hero Member
*****
Posts: 1089



peakoverload
« Reply #1 on: June 16, 2007, 10:02:28 PM »

Please bear in mind that I know less than nothing about web design and how Google works, but I thought that Google and indeed all search engines now ignored the robots.txt file, so putting anything in there is just a waste of time.

If you just want to exclude everything inside your administrator folder can't you just password protect it which would stop the googlebot and make your site that little bit more secure?

Like I say, I know nothing about this.
Chris H
« Reply #2 on: June 16, 2007, 10:14:29 PM »

Disallowing /admin etc isn't the issue. As far as I know those folders aren't indexed, as I haven't seen any of their content on the search engines.

It's just that googlebot is visiting so often that it's using all my bandwidth.

Here's a list of bots and bandwidth.

Googlebot 4.20 GB
Inktomi Slurp  380.46 MB
Unknown robot (identified by 'robot') 210.94 MB
MSNBot 194.94 MB
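
A per-bot breakdown like the one above can also be pulled straight from the raw access log. The following is a rough Python sketch, assuming the server writes Apache "combined" format logs; the sample lines, IP addresses, and the substring markers used to recognise each bot are invented for illustration and are not taken from the site's real logs.

```python
import re
from collections import Counter

# Apache "combined" log format (assumed): host ident user [time] "request"
# status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

# Illustrative user-agent substrings; a real stats package uses longer lists.
BOT_MARKERS = {
    "Googlebot": "googlebot",
    "Inktomi Slurp": "slurp",
    "MSNBot": "msnbot",
}


def bytes_per_bot(lines):
    """Sum response bytes per known bot; other traffic is ignored."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match["size"] == "-":
            continue  # unparsable line, or a response with no body (e.g. 304)
        agent = match["agent"].lower()
        for name, marker in BOT_MARKERS.items():
            if marker in agent:
                totals[name] += int(match["size"])
                break
    return totals


# Invented sample lines using documentation-range IP addresses.
SAMPLE = [
    '192.0.2.1 - - [16/Jun/2007:17:00:00 +0000] "GET /index.php HTTP/1.1" '
    '200 5000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [16/Jun/2007:17:00:05 +0000] "GET /a.php HTTP/1.0" '
    '200 700 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '192.0.2.1 - - [16/Jun/2007:17:00:09 +0000] "GET /b.php HTTP/1.1" '
    '304 - "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.3 - - [16/Jun/2007:17:00:12 +0000] "GET / HTTP/1.1" '
    '200 1200 "-" "Mozilla/5.0 (Windows; U)"',
]

totals = bytes_per_bot(SAMPLE)
for name, size in totals.most_common():
    print(f"{name}: {size} bytes")
```

Grouping by requested URL instead of by bot would show whether the traffic is concentrated on a handful of pages, which ties in with the crawl-loop idea raised later in the thread.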

Mike@TheWhippinpost
Global Moderator
Hero Member
*****
Posts: 705



« Reply #3 on: June 17, 2007, 04:37:10 AM »

That is extreme.

I'd first check the IP addy from your logs to make sure it is actually Google, then maybe try a [site:yoursite.com] to see what pages it has - It could be there's something about the site menu-construction that is sending it into an ascending loop from index.php?id=x, for instance.

Maybe a search function is getting triggered?

You got Xenu Linksleuth? Try throwing it at your site.
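
On the first suggestion, checking that the IP really is Google: the usual approach is a reverse DNS lookup on the address, a check that the hostname ends in googlebot.com or google.com, and then a forward lookup to confirm the hostname maps back to the same IP. Below is a minimal Python sketch of that idea. The function name and the injectable `reverse` / `forward` parameters are my own additions so the logic can be tried without network access, and `socket.gethostbyname_ex` covers IPv4 only.

```python
import socket

# Hostname suffixes expected for genuine Googlebot addresses.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS record
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # claims to be Googlebot but resolves elsewhere
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must include the original address, otherwise
    # the reverse record could simply be forged.
    return ip in addresses


if __name__ == "__main__":
    # Replace with an address taken from your own access log.
    print(is_real_googlebot("192.0.2.10"))
```

Run against the handful of IPs responsible for most of the traffic, this should show quickly whether the 4.20 GB is Google or something borrowing its user-agent string.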

Chris H
« Reply #4 on: June 17, 2007, 07:00:08 AM »

Cheers Mike.

Google lists over 3000 entries for the site. Not surprising as it's been running since Mambo 4.5 and uses joom!fish to translate into about 3 other languages.

However, Google is listing everything, even the kitchen sink, all graphics, everything that I've set to disallow in robots.txt. Does Google ignore robots.txt?

I do have this in the head:

<meta name="robots" content="index, follow" />

And also this:

<meta name="revisit-after" value="2 days">

which may have been an effort at some time to slow the beastie down.

Chris H
« Reply #5 on: June 17, 2007, 07:05:49 AM »

Looking back over the year, I have noticed that in one or two months last year it went close to the bandwidth limit. Other months it used maybe 900 MB for all spiders.

Mr Anderson
DDN Contribs
Hero Member
*****
Posts: 2267



ap4a.uk ap4a
« Reply #6 on: June 17, 2007, 10:42:52 AM »

http://www.google.com/support/webmasters/bin/answer.py?answer=33571&topic=8460

Chris H
« Reply #7 on: June 17, 2007, 01:11:28 PM »



Good thinking. Will do that tonight.

Mike@TheWhippinpost
« Reply #8 on: June 17, 2007, 06:35:49 PM »

No it doesn't ignore robots.txt but your meta tags might be over-riding them.

Translation will be a heavy hit.

In addition to the above, check links sitewide (as I said) including www and non-www variations. Printer pages. Error pages.

Oh yeah, does your server issue a 304 "Not Modified" when the bot sends an If-Modified-Since header?

I'm working in the dark here so I dunno.
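
To make the 304 question concrete: a crawler that already has a copy of a page can send an `If-Modified-Since` request header, and a server that honours it answers `304 Not Modified` with no body, which costs next to no bandwidth. A CMS that builds every page dynamically will often answer `200` with the full page each time. Here is a small Python sketch of the decision itself; the function name is hypothetical and this is not how Joomla or any particular server implements it.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the header date, else 200.

    if_modified_since: raw header value, or None if the header was absent.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable date: fall back to sending the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates carry one-second resolution, so drop microseconds first.
    unchanged = last_modified.replace(microsecond=0) <= since
    return 304 if unchanged else 200


# Made-up example: page last edited on 1 June 2007, crawler revisits later.
page_changed = datetime(2007, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status("Sat, 16 Jun 2007 17:27:27 GMT", page_changed))
print(conditional_status("Tue, 01 May 2007 00:00:00 GMT", page_changed))
```

A quick way to see what the live site does is to request a page twice with a tool that lets you set the header and compare the status codes.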


Mr Anderson
« Reply #9 on: June 17, 2007, 07:58:15 PM »

No it doesn't ignore robots.txt

But does ignore crawl delay. Google expects you to contact them to ask them to change the crawl frequency.

Chris H
« Reply #10 on: June 17, 2007, 09:17:23 PM »

Quote
No it doesn't ignore robots.txt

But does ignore crawl delay. Google expects you to contact them to ask them to change the crawl frequency.

I did set crawl delay to slow at the beginning of the month via the sitemapper pages.

civ
Full Member
***
Posts: 135



« Reply #11 on: June 18, 2007, 12:24:45 AM »

Quote
No it doesn't ignore robots.txt

But does ignore crawl delay. Google expects you to contact them to ask them to change the crawl frequency.

I did set crawl delay to slow at the beginning of the month via the sitemapper pages.

Might take a while to take effect? I guess only G can tell you for sure.

Oli
Chris H
« Reply #12 on: June 29, 2007, 05:59:10 AM »

Setting the crawl delay in sitemapper tools has slowed G down.

Now getting through about 2 GB per 10 days.

An improvement I suppose.

Had to throw some money at the host for extra bandwidth to get the site back. Wonder if Google will reimburse me?
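
For a sense of scale, here is the before and after worked through in a few lines of Python. The figures come from this thread (92% of a 6 GB allowance used less than halfway through the month, then about 2 GB per 10 days); the 15-day period for the "before" figure and the 30-day month are assumptions, since the exact dates aren't given.

```python
CAP_GB = 6.0            # monthly allowance mentioned in the first post
DAYS_IN_MONTH = 30      # assumption

# Before: 92% of the cap, assumed to cover roughly the first 15 days.
before_gb = 0.92 * CAP_GB
before_days = 15        # assumption ("less than halfway through the month")
before_rate = before_gb / before_days

# After: about 2 GB per 10 days with the slower crawl setting.
after_rate = 2.0 / 10

projected_month_gb = after_rate * DAYS_IN_MONTH

print(f"before: {before_rate:.2f} GB/day")
print(f"after:  {after_rate:.2f} GB/day")
print(f"projected month at the new rate: {projected_month_gb:.1f} GB "
      f"of a {CAP_GB:.0f} GB cap")
```

On those assumptions the slower crawl still projects to roughly the whole original allowance over a month, so the underlying cause of the heavy crawling is probably still worth chasing.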
