Broken crawler behavior with my binary protofeed file

I detected a disturbing uptick in the number of 404s coming from a certain big web indexing robot in recent days. It was completely nonsensical stuff like this:

"GET /w/2013/03/31/snark/filesystem.png"><img HTTP/1.1"

Got that? It's actually picking up a quotation mark, a greater-than, and then a less-than, and the beginning of an "img src"!

I finally figured out what was going on this morning. They've been fetching my protofeed file and parsing it as if it were HTML! Yes, my half-baked protobuf-based feed file from last month has been linked a few times, and it started being indexed. Then, for some reason, they decided the blobs of text within that binary protobuf were indexable, and went to it. The result is that mess above.
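To make the failure mode concrete, here is a minimal sketch in Python of how a sloppy link extractor could produce exactly that kind of request. This is a guess at the mechanism, not the crawler's actual code: the blob layout, the thumbnail path, and both function names are made up for illustration. The only thing taken from the logs is the mangled path itself.

```python
import re

# Made-up HTML fragment of the sort a feed entry might carry as a string.
# The thumbnail path is purely illustrative.
FRAGMENT = (
    b'<a href="/w/2013/03/31/snark/filesystem.png">'
    b'<img src="/w/example-thumb.png"></a>'
)

# Hypothetical binary wrapper: a few non-text bytes, a length byte, then
# the fragment. It only roughly mimics a length-prefixed protobuf field.
BLOB = b"\x08\x96\x01\x12" + bytes([len(FRAGMENT)]) + FRAGMENT


def naive_links(blob: bytes) -> list[str]:
    """Treat the bytes as text and grab everything after href=" up to the
    next whitespace, ignoring the closing quote entirely."""
    text = blob.decode("latin-1")  # latin-1 never fails on arbitrary bytes
    return re.findall(r'href="(\S+)', text)


def careful_links(blob: bytes) -> list[str]:
    """Same idea, but stop at the closing quote as an HTML parser would."""
    text = blob.decode("latin-1")
    return re.findall(r'href="([^"]*)"', text)


if __name__ == "__main__":
    print(naive_links(BLOB))    # the mangled path, trailing "><img included
    print(careful_links(BLOB))  # the clean path
```

The naive version stops at the first space, so it swallows the closing quote, the greater-than, and the start of the next tag, which matches the shape of the request in my logs.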

I will note that I have been serving it as "text/plain" for lack of a better MIME type. It's definitely not going out as "text/html", in other words.

For now, I've "solved" it by blocking this file in robots.txt. Let this be a warning to anyone who links to binary data from their web pages. If you have something resembling HTML in that binary blob, they might start following the links, and this is probably not what you want.
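If you want to sanity-check a rule like that before trusting it, Python's standard urllib.robotparser can evaluate it offline. This is only a sketch: the path /w/protofeed.bin and the agent name ExampleBot are placeholders, not the real location of my feed file or the name of the robot in question.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules: the Disallow path is hypothetical, not the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/protofeed.bin
"""


def is_allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(is_allowed("/w/protofeed.bin"))       # the blocked binary file
    print(is_allowed("/w/2013/03/31/snark/"))   # an ordinary post
```

Of course, this only tells you what a well-behaved robot should do with the rule. Whether a given crawler actually honors it is another matter.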
