http://http://http://@http://http://?http://#http://

 2 years ago
source link: https://daniel.haxx.se/blog/2022/09/08/http-http-http-http-http-http-http/

The other day I sent out this tweet

As it took off, got an amazing amount of attention, and drew many different comments and replies, I felt a need to elaborate a little and add some meat to this.

Is this string really a legitimate URL? What is a URL? How is it parsed?

http://http://http://@http://http://?http://#http://

Let’s start with curl. It parses the string as a valid URL and it does it like this. I have color coded the sections below a bit to help illustrate:

http://http://http://@http://http://?http://#http://

Going from left to right.

http – the scheme of the URL. This speaks HTTP. The “://” separates the scheme from the following authority part (that’s basically everything up to the path).

http – the user name up to the first colon

//http:// – the password up to the at-sign (@)

http: – the host name, with a trailing colon. The colon is a port number separator, but a missing number will be treated by curl as the default number. Browsers also treat missing port numbers like this. The default port number for the http scheme is 80.

//http:// – the path. Multiple leading slashes are fine, and so is using a colon in the path. It ends at the question mark separator (?).

http:// – the query part. To the right of the question mark, but before the hash (#).

http:// – the fragment, also known as “anchor”. Everything to the right of the hash sign (#).
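
The breakdown above can be sanity-checked by gluing the pieces back together. What follows is a minimal Python sketch, not a picture of how curl works internally. The `parts` list and its labels are illustrative names only; the values are copied from the walk-through.

```python
# Components as listed in the walk-through above, each paired with the
# delimiter that follows it in the URL string.
parts = [
    ("scheme",   "http",      "://"),
    ("user",     "http",      ":"),
    ("password", "//http://", "@"),
    ("host",     "http",      ":"),
    ("port",     "",          ""),   # empty: treated as the default port, 80
    ("path",     "//http://", "?"),
    ("query",    "http://",   "#"),
    ("fragment", "http://",   ""),
]

# Concatenate every value with its trailing delimiter, left to right.
rebuilt = "".join(value + delimiter for _name, value, delimiter in parts)
original = "http://http://http://@http://http://?http://#http://"

for name, value, _delimiter in parts:
    print(f"{name:9} {value!r}")
print("round-trips:", rebuilt == original)
```

If the pieces and separators are right, the rebuilt string is identical to the URL from the tweet.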

To use this URL with curl and serve it on your localhost try this command line:

curl "http://http://http://@http://http://?http://#http://" --resolve http:80:127.0.0.1

The curl parser

The curl parser has been around for decades already, and “do not break existing scripts and applications” is one of our core principles. Thus, we made some of the choices in the URL parser a long time ago and we stand by them, even if they might be slightly questionable standards-wise. As if any standard meant much here.

The curl parser started its life as lenient as it could possibly be. While it has been made stricter over the years, traces of the original design still show. In addition to that, we have also felt forced to make adaptations to the parser to make it work with the real-world URLs thrown at it. URLs that maybe once were not deemed fine, but that have become “fine” because they are accepted in other tools and perhaps primarily in browsers.

URL standards

I have mentioned it many times before. What we all colloquially refer to as a URL is not something that has a firm definition:

There is the URI (not URL!) definition from IETF RFC 3986, there is the WHATWG URL Specification that browsers (try to) adhere to, and there are numerous different parser implementations that are more or less strict in following one or both of the above-mentioned specifications.

You will find that when you scrutinize them in detail, hardly any two URL parsers agree on every little character.

Therefore, if you throw the above-mentioned URL at any random URL parser, it may reject it (the Twitter parser didn’t seem to think it was a URL) or it might come to a different conclusion about the different parts than curl does. In fact, it is likely that it will not do exactly as curl does.

Python’s urllib

April King threw it at Python’s urllib. A valid URL! While it accepted it as a URL, it split it differently:

ParseResult(scheme='http',
            netloc='http:',
            path='//http://@http://http://',
            params='',
            query='http://',
            fragment='http://')
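
That result can be reproduced with a few lines. This sketch assumes “urllib” here means the standard library’s `urllib.parse.urlparse` in Python 3; the variable names are illustrative.

```python
from urllib.parse import urlparse

url = "http://http://http://@http://http://?http://#http://"
result = urlparse(url)

# The authority (netloc) stops at the first slash after "http://",
# so the "@" ends up in the path and no user name or password is found.
print(result)
print("netloc:  ", repr(result.netloc))
print("username:", repr(result.username))
print("hostname:", repr(result.hostname))
```

Because the authority is cut at the first slash, urllib sees no userinfo at all, which is the main difference from how curl splits the string.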

Given the replies to my tweet, several other parsers made a similar interpretation. Possibly because they use the same underlying parser…

JavaScript

Meduz showed how the JavaScript URL object treats it, and it looks quite similar to the Python above. Still a valid URL.

Firefox and Chrome

I added the host entry ‘127.0.0.1 http’ into /etc/hosts and pasted the URL into the address bar of Firefox. It rewrote it to

http://http//http://@http://http://?http://#http://

(the second colon from the left is removed, everything else is the same)

… but still considered it a valid URL and showed the contents from my web server.

Chrome behaved exactly the same. A valid URL according to it, and a rewrite of the URL like Firefox does.

RFC 3986

Some commenters mentioned that the unencoded “reserved” characters in the authority part make it not RFC 3986 compliant. Section 3.2 says:

The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

Meaning that the password should have its slashes URL-encoded as %2f to make it valid. At least. So maybe not a valid URL?
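
To illustrate what encoding the password changes, here is a small Python sketch. It is an illustration under assumptions, not a statement about what any particular tool requires: it percent-encodes the password with `urllib.parse.quote` and then splits the result with `urllib.parse.urlsplit`. Note that `quote` emits uppercase hex (`%2F`), which is equivalent to `%2f`, and with `safe=""` it also encodes the colon.

```python
from urllib.parse import quote, unquote, urlsplit

password = "//http://"
# safe="" means slashes (and the colon) are percent-encoded too.
encoded = quote(password, safe="")

url = f"http://http:{encoded}@http://http://?http://#http://"
parts = urlsplit(url)

print(url)
print("user:    ", parts.username)
print("password:", unquote(parts.password))  # decode to get the original back
print("host:    ", parts.hostname)
print("path:    ", parts.path)
```

With the slashes encoded, the authority no longer ends early, and this parser arrives at the same user name, password, host name and path that curl reports for the original string.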

HTTPS

The URL works equally fine with https.

https://https://https://@https://https://?https://#https://

The two reasons I did not use https in the tweet:

  1. It made it 7 characters longer for no obvious extra fun
  2. It is harder to prove that it works by serving your own content, as you would need curl -k or similar for the host name ‘https’ to be accepted despite not matching the name used by the actual server you would target.

The URL Buffalo buffalo

A surprisingly large number of people thought it reminded them of the old buffalo buffalo thing.

2 thoughts on “http://http://http://@http://http://?http://#http://”

  1. Willy

    I guess you had quite some fun at this, Daniel!

    Just tested through haproxy and it’s properly parsed as well, however we do drop the userinfo part before forwarding it, as permitted by RFC3986 (and likely suggested in 7230 IIRC).

    That could make a good interview test, though not very nice 🙂

  2. It is a valid URL/URI, yes, but your curl parser is just plain wrong. The authority component is terminated by the next slash, as required by RFC3986 and shown by urllib. That has been the case since the very first definition of web addresses, no matter what acronym is used. You can see it in RFC1808 as well.
