Improving open-uri

 3 years ago
source link: https://janko.io/improving-open-uri/
When working on the Shrine library for handling file uploads, I needed to download a file from a URL in multiple places. If you know the Ruby standard library well, the solution might be obvious to you: open-uri.

require "open-uri"
result = open("http://example.com/image.jpg")
result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>

I very much wanted to use open-uri for my use case. It ships with Ruby, so there are no external dependencies (it is built on Net::HTTP), and it has many benefits:

  • downloads to a unique filesystem location (using Tempfile)
  • supports HTTP/HTTPS/FTP links
  • follows redirects
  • memory efficient
  • easy basic authentication
  • easy proxy

However, considering that in my case the URL could come from user input, open-uri turned out to have many limitations and quirks:

  • Using Kernel#open makes you vulnerable to remote code execution
  • If the remote file is smaller than 10KB, open-uri actually returns a StringIO instead of a Tempfile
  • URL’s file extension isn’t preserved in downloaded Tempfile
  • You cannot limit maximum number of redirects
  • You cannot limit maximum filesize

I thought about alternatives: rest-client, curl or wget. However, rest-client was too heavy a dependency just for downloading, and I didn’t want to depend on external CLI tools. Also, none of them could properly limit the maximum filesize, which I found important in the context of Shrine.

So, realizing that I still wanted to use open-uri, I decided to make a wrapper around it that addresses these limitations. I want to guide you through my journey, fixing one issue at a time.

Improvements

Kernel#open

Ruby has a Kernel#open method which, given a file path, acts as File.open. But given a string that starts with “|”, it interprets the rest as a shell command and returns an IO connected to the spawned subprocess:

open("| ls") # returns an IO connected to the `ls` shell command

Open-uri extends Kernel#open with the ability to accept URLs. However, if the URL is coming from user input, we should never pass it to Kernel#open, because different users have different ideas on what is a “URL”; someone might think that | rm -rf ~ is a nice looking URL.

A little known fact is that Kernel#open just delegates to URI::(HTTP|HTTPS|FTP)#open, and we can simply use that instead:

uri = URI.parse("http://example.com/image.jpg") #=> #<URI::HTTP>
uri.open #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>
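
As a minimal sketch of this idea, user input could be checked before #open is ever called. The helper name parse_http_uri is hypothetical (it is not part of open-uri), and restricting input to HTTP and HTTPS is an assumption made for the example:

```ruby
require "uri"

# Hypothetical helper: returns a URI::HTTP or URI::HTTPS object, and raises
# ArgumentError for anything else (shell pipes, file paths, other schemes).
def parse_http_uri(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this check covers both
  raise ArgumentError, "not an HTTP(S) URL: #{url.inspect}" unless uri.is_a?(URI::HTTP)
  uri
rescue URI::InvalidURIError
  raise ArgumentError, "not an HTTP(S) URL: #{url.inspect}"
end
```

The returned object only responds to URI#open, so a string like "| ls" never gets anywhere near Kernel#open.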

StringIO

Strangely, if the remote file is smaller than 10KB, open-uri will actually return a StringIO instead of a Tempfile.

uri.open #=> #<StringIO>

In the context of Shrine I wanted the returned IO to always be a file, both for consistency and because it could later be passed on for processing. We can easily fix that:

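io = uri.open

if io.is_a?(StringIO)
  # small responses come back as a StringIO, so copy them into a Tempfile
  downloaded = Tempfile.new
  File.binwrite(downloaded.path, io.string) # binary-safe write
else
  downloaded = io # already a Tempfile
end

downloaded # now always a Tempfile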
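
The same conversion can be wrapped in a small method so it can be tried without a network request. This is only a sketch: ensure_tempfile is a hypothetical name, and the "download" prefix is an arbitrary choice.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper: always returns a Tempfile, copying a StringIO's
# content into a new one when necessary.
def ensure_tempfile(io)
  return io unless io.is_a?(StringIO)

  tempfile = Tempfile.new("download")
  tempfile.binmode          # avoid newline and encoding translation
  tempfile.write(io.string)
  tempfile.rewind           # leave it ready for reading from the start
  tempfile
end
```

Calling ensure_tempfile(StringIO.new("small body")) gives back a Tempfile with the same content, while an existing Tempfile is returned untouched.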

File extension

Surprisingly, open-uri always creates a Tempfile without a file extension, even if the URL has one. In Shrine I wanted downloaded files (which will later be uploaded) to always have an extension if it’s known.

So let’s copy the downloaded IO to a new Tempfile which has a file extension, but use mv when we can, so that we don’t pay any performance penalty (and so that the old file also gets removed):

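require "fileutils"

io = uri.open
# name without its extension as the prefix, the extension as the suffix
downloaded = Tempfile.new([File.basename(uri.path, ".*"), File.extname(uri.path)])

if io.is_a?(Tempfile)
  FileUtils.mv io.path, downloaded.path # a rename when on the same filesystem
else # StringIO
  File.binwrite(downloaded.path, io.string)
end

downloaded.open # reopen, since mv replaced the file the old handle pointed to

File.extname(downloaded.path) #=> ".jpg"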
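
Two edge cases are worth checking here: URLs with a query string and URLs without any extension. The sketch below assumes a hypothetical tempfile_for helper and a fixed "download" prefix. It relies on URI#path excluding the query string, and on File.extname returning an empty string when there is no extension:

```ruby
require "tempfile"
require "uri"

# Hypothetical helper: creates an empty Tempfile whose name carries over
# the extension found in the URL path (if any).
def tempfile_for(uri)
  extension = File.extname(uri.path) # "" when the path has no extension
  Tempfile.new(["download", extension])
end

with_query = tempfile_for(URI.parse("http://example.com/image.jpg?size=large"))
no_ext     = tempfile_for(URI.parse("http://example.com/download"))

File.extname(with_query.path) #=> ".jpg"
File.extname(no_ext.path)     #=> ""
```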

Redirects

What’s good is that open-uri can automatically follow redirects. What’s bad is that we cannot limit the maximum number of redirects. This allows the attacker to give a URL which causes a redirect loop, and open-uri would continue making requests forever. To be fair, open-uri has a detection for redirect loops, but only if URLs repeat.

So we disable open-uri’s following of redirects, which now raises OpenURI::HTTPRedirect on redirects, allowing us to reimplement it:

tries = 3

begin
  uri.open(redirect: false)
rescue OpenURI::HTTPRedirect => redirect
  uri = redirect.uri # assigned from the "Location" response header
  retry if (tries -= 1) > 0
  raise
end
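
The retry logic can also be written as a reusable method with a configurable limit. This is a sketch rather than how any particular library does it: the name open_following_redirects and the max_redirects keyword are assumptions, and the optional block exists only so the request can be replaced with a stub.

```ruby
require "open-uri"

# Hypothetical helper: follows at most `max_redirects` redirects. The block
# performs the actual request and defaults to URI#open with redirects disabled.
def open_following_redirects(uri, max_redirects: 2, &opener)
  opener ||= ->(u) { u.open(redirect: false) }
  redirects_left = max_redirects

  begin
    opener.call(uri)
  rescue OpenURI::HTTPRedirect => redirect
    raise if redirects_left.zero? # limit reached: re-raise the last redirect
    redirects_left -= 1
    uri = redirect.uri            # taken from the "Location" response header
    retry
  end
end
```

With max_redirects: 2 the method makes at most three requests, and the third redirect surfaces as OpenURI::HTTPRedirect for the caller to handle.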

Maximum filesize

Since the URL can sometimes come from user input, I wanted to give Shrine users the ability to limit the maximum filesize of the remote file. Specifically, I wanted the download to abort as soon as the “Content-Length” header reveals that the file will be too large. Luckily, open-uri has the :content_length_proc option, which calls the given proc as soon as open-uri reads “Content-Length”:

uri.open(
  content_length_proc: ->(size) { raise FileTooLarge if size > max_size },
)

However, an attacker could theoretically create an app which returns large files, but where the “Content-Length” response header is omitted on purpose (in that case :content_length_proc is called with nil). Luckily, open-uri has got our back on this one too with :progress_proc, which calls the given proc whenever a chunk is downloaded, with the current size. That means we can add it as a fallback in case “Content-Length” is missing:

uri.open(
  content_length_proc: ->(size) { raise FileTooLarge if size && size > max_size },
  progress_proc:       ->(size) { raise FileTooLarge if size > max_size },
)
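
Since both procs perform the same check, they can share one lambda. In this sketch FileTooLarge and size_limit_options are hypothetical names introduced for illustration:

```ruby
# Hypothetical error class for the example
class FileTooLarge < StandardError; end

# Hypothetical helper: builds the two open-uri options from a single limit.
def size_limit_options(max_size)
  check = lambda do |size|
    # size is nil when "Content-Length" is missing, so guard against that
    raise FileTooLarge, "file is larger than #{max_size} bytes" if size && size > max_size
  end

  { content_length_proc: check, progress_proc: check }
end

# Usage, assuming `uri` is a URI::HTTP object:
#   uri.open(size_limit_options(5 * 1024 * 1024))
```

Raising from :progress_proc aborts the download midway, so at most roughly max_size bytes plus one chunk get written before the error is raised.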

User agent

It turns out that when we’re making requests to an application, but we don’t include a “User-Agent” header, most applications will start rejecting our requests after some time.

Open-uri doesn’t include a “User-Agent” by default, but allows us to easily add one, since open-uri treats any unknown option as a request header:

uri.open("User-Agent" => "MyApp/1.0")

Result

The result of this investigation is the Down gem, which incorporates all of these improvements, and more. You can use it like this:

require "down"
result = Down.download("http://example.com/image.jpg")
result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz.jpg>

More advanced downloading could look something like this:

Down.download "http://example.com/image.jpg",
  max_size: 20*1024*1024,   # 20 MB
  max_redirects: 5,         # default is 2
  proxy: "http://proxy.com" # delegates to open-uri

Conclusion

I like that I was able to make a lightweight wrapper around open-uri, which already had most of the features that I wanted, and which let me add the ones that were missing. If you want to use open-uri, but without any of the mentioned quirks, consider using Down.

