Broken crawler behavior with my binary protofeed file

 3 years ago
source link: http://rachelbythebay.com/w/2013/04/14/protofeed/

I detected a disturbing uptick in the number of 404s coming from a certain big web indexing robot in recent days. It was completely nonsensical stuff like this:

"GET /w/2013/03/31/snark/filesystem.png"><img HTTP/1.1"

Got that? It's actually picking up a quotation mark, a greater-than, and then a less-than, and the beginning of an "img src"!

I finally figured out what was going on this morning. They've been fetching my protofeed file and parsing it as if it were HTML! Yes, my half-baked protobuf-based feed file from last month has been linked a few times, and it started being indexed. Then, for some reason, they decided the blobs of text within that binary protobuf were indexable, and went to it. The result is that mess above.
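To make the failure mode concrete, here is a rough sketch of how a link extractor could end up with a URL like that. It is a guess at the mechanism, not the crawler's actual code: the blob contents, its framing bytes, the example.com host, and the naive_links function are all made up for illustration. The only assumption it needs is that the extractor treats the payload as text and reads each URL up to the next whitespace.

```python
import re

# Hypothetical stand-in for one protobuf-encoded feed entry: a few made-up
# framing bytes around a string field that happens to carry post HTML.
BLOB = (
    b"\x0a\x5d"
    b'<a href="http://example.com/w/2013/03/31/snark/filesystem.png"><img '
    b'src="thumb.png"></a>'
    b"\x12\x04\x08\x01\x10\x02"
)


def naive_links(data: bytes) -> list[str]:
    """Treat the payload as text and take every http:// run up to whitespace."""
    text = data.decode("latin-1")  # latin-1 maps every byte, so this never fails
    return re.findall(r"http://\S+", text)


if __name__ == "__main__":
    for link in naive_links(BLOB):
        print(link)
```

Because nothing stops the match at the closing quotation mark, the extracted URL carries the quote, the angle brackets, and the start of the img tag along with it, which is the same shape as the request in my logs.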

I will note that I have been serving it as "text/plain" for lack of a better MIME type. It's definitely not going out as "text/html", in other words.

For now, I've "solved" it by blocking this file in robots.txt. Let this be a warning to anyone who links to binary data from their web pages. If you have something resembling HTML in that binary blob, they might start following the links, and this is probably not what you want.
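For anyone wanting to do the same, this is a small sketch of that kind of robots.txt rule, checked with Python's standard urllib.robotparser. The file path /w/protofeed.bin, the example.com host, and the ExampleBot user agent are placeholders I picked for the example, not the real values, and it only shows what a well-behaved crawler should conclude from the rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Disallow path is a placeholder for wherever
# the binary feed file actually lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/protofeed.bin
"""


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    print(allowed("/w/protofeed.bin"))          # the blocked binary feed
    print(allowed("/w/2013/04/14/protofeed/"))  # an ordinary post
```

The rule only covers the one file, so regular posts stay crawlable. It also depends entirely on the robot honoring robots.txt in the first place.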

