Many popular HTML rewriters break protocol-relative URLs

If you're going to rewrite my HTML, you might as well do it correctly.

A fair amount of traffic has been heading this way as a result of Thursday's and Friday's posts about Mac development and tools. Both of those posts have images, and thus have IMG SRC tags in them. This has exposed a high quantity of insanity in my logs.

What's going on, you might ask. The answer is that I started using URLs without protocols in them to make sure that if you hit me in https mode, you get https resources and likewise for http. I talked about this about two weeks ago in yet another post in which I described the problems when Atom feeds get involved, and talked a bit about what I was doing about it.

I changed the feeds, but I didn't change the web pages. I figured, who has a browser that's dumb enough to not work right given a link like //rachelbythebay.com/foo/bar? Well, it turns out, browsers are generally not my problem. It's all of these other scraper-ish things which Do Not Get It that are the problem.
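For anyone who hasn't seen one before: a link starting with // is a "network-path reference" straight out of RFC 3986, and it means "same scheme as the current page, then this host and path". Any real URL resolver handles it. A minimal demonstration with Python's standard library (my choice for illustration; it's obviously not what any of these scrapers actually run):

from urllib.parse import urljoin

ref = "//rachelbythebay.com/w/2012/09/13/opt/meter.jpg"

# The reference picks up whatever scheme the page was fetched with.
print(urljoin("https://rachelbythebay.com/w/2012/09/13/opt/", ref))
# https://rachelbythebay.com/w/2012/09/13/opt/meter.jpg

print(urljoin("http://rachelbythebay.com/w/2012/09/13/opt/", ref))
# http://rachelbythebay.com/w/2012/09/13/opt/meter.jpg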

Here, have a look:

x.x.x.x - - [15/Sep/2012:01:49:58 -0700] "GET //rachelbythebay.com/w/2012/09/13/opt/meter.jpg HTTP/1.1" 404 328 "http://viewtext.org/article?url=https://rachelbythebay.com/w/2012/09/13/opt/"

That's some site which effectively proxies the page and rewrites it, presumably to remove cruft (not that my pages need it). Anyway, look what they end up supplying...

$ curl -s "http://viewtext.org/article?url=https://rachelbythebay.com/w/2012/09/13/opt/" | grep "img src"
<a href="http://viewtext.org/article?url=https%3a%2f%2frachelbythebay.com%2f%2frachelbythebay.com%2fw%2f2012%2f09%2f13%2fopt%2fmeter.jpg"><img src="https://rachelbythebay.com//rachelbythebay.com/w/2012/09/13/opt/meter.jpg" width="300" height="225" alt="Power meter" align="middle"></a>

Squint your eyes and you'll see it: they are rewriting the IMG SRC to actually generate the ridiculous "https://site//site/path" thing! Clearly, they do not understand how to handle a // URL reference.
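My guess, and it is only a guess, is that their rewriter treats anything that doesn't start with a scheme as a path and just glues the page's origin onto the front of it. In Python terms, something like:

from urllib.parse import urljoin

base = "https://rachelbythebay.com/w/2012/09/13/opt/"
src = "//rachelbythebay.com/w/2012/09/13/opt/meter.jpg"

# Broken: prepend the origin to anything without a scheme.
# A network-path reference becomes https://host//host/path.
print("https://rachelbythebay.com" + src)

# Correct: hand it to a real resolver, as shown earlier.
print(urljoin(base, src))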

Oh, but let's not pick on just one site. Here's another.

x.x.x.x - - [15/Sep/2012:01:52:21 -0700] "GET //rachelbythebay.com/w/2012/09/13/opt/meter.jpg HTTP/1.1" 404 328 "http://www.instapaper.com/m?u=https%3A%2F%2Frachelbythebay.com%2Fw%2F2012%2F09%2F13%2Fopt%2F"

Instapaper, OK? Let's look at that one.

$ curl -s "http://www.instapaper.com/m?u=https%3A%2F%2Frachelbythebay.com%2Fw%2F2012%2F09%2F13%2Fopt%2F" | grep "img src"
<a href="https://rachelbythebay.com//rachelbythebay.com/w/2012/09/13/opt/meter.jpg"><img src="https://rachelbythebay.com//rachelbythebay.com/w/2012/09/13/opt/meter.jpg" alt="Power meter" /></a>

It's the same deal here. It does not get the // thing, either.

viewtext and Instapaper are not alone in this. There are bunches of sites showing up in here. mapidea.com. feedly.com. GoogleProducer. gethifi.com. They all trip over it. There are many, many more hits with no User-Agent data at all which trigger it, too.

Just for the record, as of the time I'm writing this in the wee hours of Saturday morning, here is what it looks like in the actual page, straight from my web server:

$ curl -s "https://rachelbythebay.com/w/2012/09/13/opt/" | grep "img src" | grep -vi feed.png
<a href="//rachelbythebay.com/w/2012/09/13/opt/meter.jpg"><img src="//rachelbythebay.com/w/2012/09/13/opt/meter.jpg" width="300" height="225" alt="Power meter" align="middle"></a>

Okay? All of these posts are static files. It's a "baked" web log. There is no dynamic content, which is partly how I can be on the front page of Hacker News all day and not fall over under the load. These pages are leaving here with the correct URL and are being broken by these rewriting services.

At this point I could pull the typical Valley "engineer" thing and say "screw 'em, they deserve it". Wrong. The users of these services -- you, my readers -- do not deserve to see broken pages, even if their access method is wonky. So, I am going to retool things to drop the // stuff and that will be it. The experiment is over.
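Every one of these references points back at my own host anyway, so one obvious retool is to make them root-relative (/w/...), which still matches the page's scheme without giving the rewriters anything to choke on. A sketch of that pass over the baked files; the host, paths, and exact substitution here are illustrative, not my actual build:

import re
from pathlib import Path

# Turn src="//rachelbythebay.com/..." into src="/..." in every
# baked page. Directory layout is an assumption.
pattern = re.compile(r'(src|href)="//rachelbythebay\.com(/[^"]*)"')

for page in Path("w").rglob("*.html"):
    html = page.read_text()
    fixed = pattern.sub(r'\1="\2"', html)
    if fixed != html:
        page.write_text(fixed)

(The Atom feeds are a separate story, since feed readers need absolute URLs. Those were already changed.)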

I'll post an update once this has happened and I'm not serving up any posts in this format. Then it'll be a matter of seeing who has cached their scraped pages and who's still getting them "fresh". That should be interesting.

See, I told you there were corner cases out there.


September 16, 2012: This post has an update.

