Improving open-uri
source link: https://janko.io/improving-open-uri/
When working on the Shrine library for handling file uploads, in multiple places I needed to be able to download a file from URL. If you know the Ruby standard library well, the solution might be obvious to you: open-uri.
require "open-uri"
result = open("http://example.com/image.jpg")
result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>
Open-uri is something that I indeed very much wanted to use for my use case. It ships with Ruby, so there are no external dependencies (just Net::HTTP), and it has many benefits:
- downloads to a unique filesystem location (using Tempfile)
- supports HTTP/HTTPS/FTP links
- follows redirects
- memory efficient
- easy basic authentication
- easy proxy
However, considering that in my case the URL could come from user input, open-uri turned out to have many limitations and quirks:
- Using Kernel#open makes you vulnerable to remote code execution
- If the remote file is smaller than 10KB, open-uri actually returns a StringIO instead of a Tempfile
- URL’s file extension isn’t preserved in downloaded Tempfile
- You cannot limit maximum number of redirects
- You cannot limit maximum filesize
I’ve thought about alternatives: rest-client, curl or wget. However, rest-client was too heavy a dependency just for downloading, and I didn’t want to depend on external CLI tools. Also, none of them were able to properly limit the maximum filesize, which I found important in the context of Shrine.
So, realizing that I still wanted to use open-uri, I decided to make a wrapper around it that addresses these limitations. I want to guide you through my journey, fixing one issue at a time.
Improvements
Kernel#open
Ruby has a Kernel#open method, which given a file path acts as File.open, but given a string that starts with “|”, it interprets it as a shell command and returns an IO connected to the spawned subprocess:
open("| ls") # returns an IO connected to the `ls` shell command
Open-uri extends Kernel#open with the ability to accept URLs. However, if the URL is coming from user input, we should never pass it to Kernel#open, because different users have different ideas on what is a “URL”; someone might think that | rm -rf ~ is a nice looking URL.
A little known fact is that, for URLs, Kernel#open just delegates to URI::(HTTP|HTTPS|FTP)#open, and we can simply use that instead:
uri = URI.parse("http://example.com/image.jpg") #=> #<URI::HTTP>
uri.open #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>
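Since the point of this section is to keep user input away from anything that could treat it as a shell command, it can help to validate the input before opening it. The following is a minimal sketch, not part of open-uri: `parse_remote_url` is a hypothetical helper name, and restricting input to HTTP(S) is an assumption (you may also want to allow FTP).

```ruby
require "uri"

# Hypothetical helper: returns a URI::HTTP or URI::HTTPS object,
# and raises ArgumentError for anything else (shell pipes, file paths, etc.).
def parse_remote_url(input)
  uri = URI.parse(input)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  raise ArgumentError, "not an HTTP(S) URL: #{input.inspect}" unless uri.is_a?(URI::HTTP)
  uri
rescue URI::InvalidURIError
  raise ArgumentError, "not a valid URL: #{input.inspect}"
end
```

The returned object responds to `#open` once open-uri is loaded, so the call never goes through Kernel#open.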
StringIO
Strangely, if the remote file is smaller than 10KB, open-uri will actually return a StringIO instead of a Tempfile.
uri.open #=> #<StringIO>
In the context of Shrine I wanted the returned IO to always be a file, for consistency and because it could later be passed on for processing. We can easily fix that:
require "tempfile"

io = uri.open
if io.is_a?(StringIO)
  downloaded = Tempfile.new
  File.binwrite(downloaded.path, io.string) # binary-safe copy of the in-memory body
else
  downloaded = io
end
downloaded # now always a Tempfile
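The same conversion can be wrapped in a small method so it can be tried without a network request. This is only a sketch: `ensure_tempfile` and the "download" basename are names I made up for illustration.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper: returns the IO unchanged unless it is a StringIO,
# in which case its content is copied into a new Tempfile.
def ensure_tempfile(io)
  return io unless io.is_a?(StringIO)

  tempfile = Tempfile.new("download")
  tempfile.binmode            # avoid newline/encoding conversion for binary content
  tempfile.write(io.string)
  tempfile.rewind             # leave the file ready to be read from the start
  tempfile
end
```

Calling `ensure_tempfile(uri.open)` would then always hand back a Tempfile, regardless of the size of the remote file.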
File extension
Surprisingly, open-uri always creates a Tempfile without a file extension, even if the URL has one. In Shrine I wanted downloaded files (which will later be uploaded) to always have an extension if it’s known.
So let’s copy the downloaded IO to a new Tempfile which has a file extension, but use mv if we can so that we don’t pay any performance penalty (and so that the old file also gets deleted):
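require "fileutils"
require "tempfile"

io = uri.open
extension = File.extname(uri.path)
# Strip the extension from the basename so it isn't repeated in the generated name
downloaded = Tempfile.new([File.basename(uri.path, extension), extension])
if io.is_a?(Tempfile)
  FileUtils.mv io.path, downloaded.path # a cheap rename when both are on the same filesystem
else # StringIO
  File.binwrite(downloaded.path, io.string)
end
File.extname(downloaded.path) #=> ".jpg"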
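One detail worth noting about the approach above: after moving a file over `downloaded.path`, the already open `downloaded` handle still points at the old, empty file, so reading through the object (rather than through its path) needs a reopen. Below is a sketch of one way to handle that; `tempfile_with_extension` and the fixed "download" prefix are assumptions of mine, not open-uri API.

```ruby
require "fileutils"
require "stringio"
require "tempfile"
require "uri"

# Hypothetical helper: returns a Tempfile with the URL's extension that
# holds the content of `io` (a Tempfile or a StringIO).
def tempfile_with_extension(io, uri)
  target = Tempfile.new(["download", File.extname(uri.path)], binmode: true)

  if io.is_a?(Tempfile)
    io.close                            # flush pending writes before moving
    target.close
    FileUtils.mv(io.path, target.path)  # replaces the empty target file
    target.open                         # reopen so the handle sees the moved content
  else # StringIO
    target.write(io.string)
    target.rewind
  end

  target
end
```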
Redirects
What’s good is that open-uri can automatically follow redirects. What’s bad is that we cannot limit the maximum number of redirects. This allows the attacker to give a URL which causes a redirect loop, and open-uri would continue making requests forever. To be fair, open-uri has a detection for redirect loops, but only if URLs repeat.
So we disable open-uri’s following of redirects, which makes it raise OpenURI::HTTPRedirect on redirects, allowing us to reimplement it:
tries = 3
begin
  uri.open(redirect: false)
rescue OpenURI::HTTPRedirect => redirect
  uri = redirect.uri # assigned from the "Location" response header
  retry if (tries -= 1) > 0
  raise
end
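The retry loop can also be packaged as a method with an explicit limit and a dedicated error. This is a sketch under a few assumptions: `open_with_redirect_limit`, `TooManyRedirects` and the `opener:` parameter are hypothetical names, and `opener:` exists only so the loop can be exercised without a real HTTP request.

```ruby
require "open-uri"

# Hypothetical error class for this sketch.
class TooManyRedirects < StandardError; end

# Follows at most `max_redirects` redirects, then raises TooManyRedirects.
# By default the URI is opened with open-uri's own redirect following disabled.
def open_with_redirect_limit(uri, max_redirects: 2, opener: ->(u) { u.open(redirect: false) })
  redirects_left = max_redirects
  begin
    opener.call(uri)
  rescue OpenURI::HTTPRedirect => redirect
    raise TooManyRedirects, "too many redirects" if redirects_left.zero?
    redirects_left -= 1
    uri = redirect.uri # the redirect target reported by open-uri
    retry
  end
end
```

With `max_redirects: 2` the method makes at most three requests: the original one plus two followed redirects.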
Maximum filesize
Since the URL can sometimes come from user input, I wanted to give Shrine users the ability to limit the maximum filesize of the remote file. Specifically, I wanted the download to abort as soon as the “Content-Length” header reveals that the file will be too large. Luckily, open-uri has the :content_length_proc option, which calls the given proc as soon as open-uri reads “Content-Length”:
uri.open(
  content_length_proc: ->(size) { raise FileTooLarge if size > max_size },
)
However, an attacker could theoretically create an app which returns large files, but where the “Content-Length” response header is omitted on purpose. Luckily, open-uri has got our back on this one too with :progress_proc, which calls the given proc whenever a chunk is downloaded, with the current size. That means we can add it as a fallback in case “Content-Length” is missing (in which case :content_length_proc receives nil):
uri.open(
  content_length_proc: ->(size) { raise FileTooLarge if size && size > max_size },
  progress_proc: ->(size) { raise FileTooLarge if size > max_size },
)
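The snippets above assume that `FileTooLarge` and `max_size` are defined somewhere. A minimal self-contained sketch of both checks could look like this; `size_limit_options` is a hypothetical helper name, and `FileTooLarge` is defined here only for illustration.

```ruby
# Hypothetical error class; the snippets above assume it exists.
class FileTooLarge < StandardError; end

# Builds the two open-uri options that enforce a maximum download size.
def size_limit_options(max_size)
  check = lambda do |size|
    # size is nil when the "Content-Length" header is missing
    raise FileTooLarge, "file is larger than #{max_size} bytes" if size && size > max_size
  end

  { content_length_proc: check, progress_proc: check }
end

# Usage (performs a real request, so it is left commented out):
#   uri.open(**size_limit_options(20 * 1024 * 1024))
```

Sharing one lambda between both options keeps the header check and the per-chunk fallback consistent.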
User agent
It turns out that when we’re making requests to an application, but we don’t include a “User-Agent” header, most applications will start rejecting our requests after some time.
Open-uri doesn’t include a “User-Agent” by default, but allows us to easily add one, since open-uri treats any unknown option as a request header:
uri.open("User-Agent" => "MyApp/1.0")
Result
The result of this investigation is the Down gem, which incorporates all of these improvements, and more. You can use it like this:
require "down"
result = Down.download("http://example.com/image.jpg")
result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz.jpg>
More advanced downloading could look something like this:
Down.download "http://example.com/image.jpg",
  max_size: 20*1024*1024,   # 20 MB
  max_redirects: 5,         # default is 2
  proxy: "http://proxy.com" # delegates to open-uri
Conclusion
I like that I was able to make a lightweight wrapper around open-uri, which already had most of the features that I wanted, but allowed me to complete the ones that I was missing. If you want to use open-uri, but without any of the mentioned quirks, consider using Down.