Gigabot is mega-clueless

source link: http://rachelbythebay.com/w/2011/12/17/gigabot/

Whatever "Gigabot" is, it's clueless.

Exhibit 1:

64.22.106.82 - - [17/Dec/2011:18:31:18 -0800] "GET /robots.txt HTTP/1.0" 200 26 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"
64.22.106.82 - - [17/Dec/2011:18:31:19 -0800] "GET /wtf.html HTTP/1.0" 200 1200 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"

The fred robots.txt is simple enough:

User-agent: *
Disallow: /

Translation: if it's a URL on this site, you aren't supposed to fetch it with a spider. Reality: they requested something anyway. Duh?
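For comparison, here's a minimal sketch of the check a well-behaved crawler performs before fetching anything. This uses Python's stdlib parser, not Gigabot's actual code, and feeds it the same two lines fred serves:

#!/usr/bin/env python3
# Minimal sketch: what a compliant crawler should conclude from
# fred's robots.txt. The stdlib parser does plain prefix matching,
# per the original robots.txt convention.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# "Disallow: /" is a prefix that matches every path on the host.
print(rp.can_fetch("Gigabot", "http://fred.rachelbythebay.com/wtf.html"))
# -> False: the fetch Gigabot made anyway was off-limits.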

Exhibit 2:

64.22.106.82 - - [17/Dec/2011:18:31:01 -0800] "GET /robots.txt HTTP/1.0" 200 65 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"
64.22.106.82 - - [17/Dec/2011:18:31:02 -0800] "GET /main?335095 HTTP/1.0" 200 10234 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"

The scanner robots.txt is a little more complicated, but it should still be unambiguous:

User-agent: *
Disallow: /main
Disallow: /main/
Disallow: /main/*
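Running the same stdlib check against scanner's rules shows the request was just as clearly off-limits. One aside: under plain prefix matching, the first line alone already covers the other two, and the trailing "*" is a nonstandard extension that only some crawlers support, so the extra lines are belt and suspenders:

#!/usr/bin/env python3
# Sketch continued: scanner's rules against the URL Gigabot fetched.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *",
          "Disallow: /main",
          "Disallow: /main/",
          "Disallow: /main/*"])

# The query string changes nothing: "/main?335095" still starts
# with the disallowed prefix "/main".
print(rp.can_fetch("Gigabot", "http://scanner.rachelbythebay.com/main?335095"))
# -> False, yet Gigabot fetched it one second after reading the rules.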

If this keeps up, I think I'll start seeding pages with links that go nowhere useful but are well covered by robots.txt. It should show exactly who honors this sort of thing and who just thumbs their nose at it.
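One way to wire that up, sketched with assumptions of my own: the /trap/ path, the hidden link, and the script name below are hypothetical, not something this site actually runs. First, forbid the trap in robots.txt:

User-agent: *
Disallow: /trap/

Then link to /trap/ from real pages in a way no human will follow, and mine the access logs for anyone who fetched it anyway:

#!/usr/bin/env python3
# trap_report.py (hypothetical name): reads access log lines in the
# combined format shown above on stdin and prints every distinct
# user agent that requested the forbidden /trap/ path.
import re
import sys

# The request is the first quoted field; the referer and user agent
# are the next two quoted fields after the status and byte count.
HIT = re.compile(r'"GET /trap/\S* HTTP/[\d.]+" .*? "[^"]*" "([^"]*)"')

offenders = set()
for line in sys.stdin:
    match = HIT.search(line)
    if match:
        offenders.add(match.group(1))

for agent in sorted(offenders):
    print("ignores robots.txt:", agent)

Feed it a log (python3 trap_report.py < access.log) and anything it prints asked for a URL robots.txt explicitly forbade: exactly the nose-thumbing the logs above caught Gigabot doing.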

