Gigabot is mega-clueless

source link: http://rachelbythebay.com/w/2011/12/17/gigabot/

Whatever "Gigabot" is, it's clueless.

Exhibit 1:

64.22.106.82 - - [17/Dec/2011:18:31:18 -0800] "GET /robots.txt HTTP/1.0" 200 26 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"
64.22.106.82 - - [17/Dec/2011:18:31:19 -0800] "GET /wtf.html HTTP/1.0" 200 1200 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "fred.rachelbythebay.com" "-"

The fred robots.txt is simple enough:

User-agent: *
Disallow: /

Translation: if it's a URL on this site, you aren't supposed to fetch it with a spider. Reality: they requested something anyway. Duh?
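For comparison, here's a minimal sketch of the check a well-behaved crawler performs before fetching anything. This uses Python's stdlib parser, not Gigabot's actual code, and feeds it the same two lines fred serves:

#!/usr/bin/env python3
# Minimal sketch: what a compliant crawler should conclude from
# fred's robots.txt. The stdlib parser does plain prefix matching,
# per the original robots.txt convention.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# "Disallow: /" is a prefix that matches every path on the host.
print(rp.can_fetch("Gigabot", "http://fred.rachelbythebay.com/wtf.html"))
# -> False: the fetch Gigabot made anyway was off-limits.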

Exhibit 2:

64.22.106.82 - - [17/Dec/2011:18:31:01 -0800] "GET /robots.txt HTTP/1.0" 200 65 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"
64.22.106.82 - - [17/Dec/2011:18:31:02 -0800] "GET /main?335095 HTTP/1.0" 200 10234 "-" "Gigabot/3.0 (http://www.gigablast.com/spider.html)" "scanner.rachelbythebay.com" "-"

The scanner robots.txt is a little more complicated, but it should still be unambiguous:

User-agent: *
Disallow: /main
Disallow: /main/
Disallow: /main/*
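Running the same stdlib check against scanner's rules shows the request was just as clearly off-limits. One aside: under plain prefix matching, the first line alone already covers the other two, and the trailing "*" is a nonstandard extension that only some crawlers support, so the extra lines are belt and suspenders:

#!/usr/bin/env python3
# Sketch continued: scanner's rules against the URL Gigabot fetched.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *",
          "Disallow: /main",
          "Disallow: /main/",
          "Disallow: /main/*"])

# The query string changes nothing: "/main?335095" still starts
# with the disallowed prefix "/main".
print(rp.can_fetch("Gigabot", "http://scanner.rachelbythebay.com/main?335095"))
# -> False, yet Gigabot fetched it one second after reading the rules.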

If this keeps up, I think I'll start seeding pages with links that go nowhere useful but are well covered by robots.txt. It should show exactly who honors this sort of thing and who just thumbs their nose at it.
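One way to wire that up, sketched with assumptions of my own: the /trap/ path, the hidden link, and the script name below are hypothetical, not something this site actually runs. First, forbid the trap in robots.txt:

User-agent: *
Disallow: /trap/

Then link to /trap/ from real pages in a way no human will follow, and mine the access logs for anyone who fetched it anyway:

#!/usr/bin/env python3
# trap_report.py (hypothetical name): reads access log lines in the
# combined format shown above on stdin and prints every distinct
# user agent that requested the forbidden /trap/ path.
import re
import sys

# The request is the first quoted field; the referer and user agent
# are the next two quoted fields after the status and byte count.
HIT = re.compile(r'"GET /trap/\S* HTTP/[\d.]+" .*? "[^"]*" "([^"]*)"')

offenders = set()
for line in sys.stdin:
    match = HIT.search(line)
    if match:
        offenders.add(match.group(1))

for agent in sorted(offenders):
    print("ignores robots.txt:", agent)

Feed it a log (python3 trap_report.py < access.log) and anything it prints asked for a URL robots.txt explicitly forbade: exactly the nose-thumbing the logs above caught Gigabot doing.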

