More broken web robots

If you use "zite", also known as "woriobot" (whatever those are) and don't get images for my posts, it's because their robot is badly broken. Check this out.

"GET /dialup1.jpg HTTP/1.1" 404
"GET /dialup2.jpg HTTP/1.1" 404
"GET /tc.jpg HTTP/1.1" 404

Those three images are part of my terminal server post from Wednesday. When that post is retrieved directly (that is, at /w/2013/04/17/slow/), it refers to those images using relative paths. That is, they are just "dialup1.jpg", with no dots, slashes, colons, hostnames, ports, protocols, or anything of the sort.

That means each image lives at the same place as the base URL. Since I don't twiddle that setting in my pages, the base is the same path that they just fetched, and the image URL is that path plus "dialup1.jpg". Easy. You'd think this fundamental tenet of the web would be well-understood by now, but apparently it is not.
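For anyone who wants to see the rule in action, here is a minimal sketch using Python's standard urllib.parse.urljoin. The hostname "example.com" is a placeholder I made up for illustration, and the second call is only a guess at what the robot is doing, based on the paths in the log lines above.

```python
from urllib.parse import urljoin

# Hypothetical hostname; the real host does not appear in the log lines.
page_url = "http://example.com/w/2013/04/17/slow/"

# With no base override in the page, the base URL is the page's own URL,
# so a bare filename resolves into the same directory.
resolved = urljoin(page_url, "dialup1.jpg")
print(resolved)

# Assumption: the robot seems to resolve against the site root instead,
# which would produce the "GET /dialup1.jpg" requests seen in the logs.
broken = urljoin("http://example.com/", "dialup1.jpg")
print(broken)
```

The first result keeps the /w/2013/04/17/slow/ prefix. The second drops it, which is consistent with the 404s.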

Now, let's say they're actually crawling my Atom feed. That feed purposely spells everything out in long-form: protocol, hostname, path, filename. This has been the case since September when I declared that my "protocol-relative URL experiment" was over.

They aren't alone in their brokenness, though. There's another one from "Sosospider" which has its own flavor of insanity:

"GET /w/2013/04/02/maps/img_0725.png HTTP/1.1" 404
"GET /w/2013/04/02/maps/img_0730.png HTTP/1.1" 404

These are from my first post this month about bad Apple maps. The problem is that these files are *uppercase*. They are the same filenames which came straight out of my old iPhone's screenshot facility. The file isn't called img_0725.png. It's IMG_0725.PNG.

I'm serving up HTML with the proper filenames. Nearly everyone manages to get this right. These guys, however, squash the case and so miss out. Who thought that was a good idea? Wouldn't it take more code to purposely squash case, and properly at that? Getting uppercase and lowercase right is just like date handling. Both are hard problems.
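If a crawler really wants to normalize URLs, the safe parts to lowercase are the scheme and the hostname, since those are case-insensitive. The path is not. Here is a rough sketch of that idea in Python. The function name normalize_url and the "Example.COM" hostname are mine, invented for illustration, and the sketch assumes there is no user:password portion in the URL, since that part is case-sensitive too.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase only the case-insensitive parts: scheme and host.

    Hypothetical helper. Assumes the netloc carries no userinfo,
    which would be case-sensitive and must not be lowercased.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,      # left alone: IMG_0725.PNG is not img_0725.png
        parts.query,
        parts.fragment,
    ))

normalized = normalize_url("HTTP://Example.COM/w/2013/04/02/maps/IMG_0725.PNG")
print(normalized)
```

The filename comes out the other side still in uppercase, which is what the server needs to see.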

This is all on top of the well-known Java crawler stupidity where someone has decided that parsing the URLs which are targeted by SCRIPT tags is a good idea, even though those are not even HTML.

Welcome to the web, where any idiot can program for it, and probably does.
