How To Harden And Simplify Urlopen Function In Python

 3 years ago
source link: https://hackernoon.com/how-to-harden-and-simplify-urlopen-function-in-python-69w33wc
Jonathan Bowman (@bowmanjd)

Constantly learning software development, Python, Linux, containers, and a little Rust. Thinks regex is fun.

Python has a useful high-level HTTP client built into the standard library: urllib.request.urlopen(). While HTTP libraries such as requests and HTTPX are excellent, urlopen gets things done without external dependencies.

However, there is a notable security concern with urlopen: it accepts URLs for a variety of protocols, even allowing local filesystem access with file:/// URLs. Of course, we should always sanitize user input, but what if we could configure urlopen to only load, say, https URLs?

The built-in OpenerDirector class offers just such an opportunity to streamline urlopen, make it more secure, and provide custom error handling.

Security problems with the default urlopen

You might try the following on a Linux system:

from urllib.request import urlopen

# Nothing here restricts the scheme, so a file: URL reads a local file
url = "file:///etc/passwd"

with urlopen(url) as response:
    print(response.read().decode())

A little disturbing, yes? Bandit agrees. (Perhaps you want to consider scanning with Bandit or its related flake8 plugin.)

The code above certainly communicates the lesson "sanitize your user inputs". Of course, if you control that "url" string, or can ensure that it starts with the correct "https://" scheme, then things are looking better. If this url comes from user input, though, it would be good to check for protocol at least, and, better yet, ensure that the domain is as expected.
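
A minimal sketch of such a check, using only urllib.parse from the standard library. The function name is_allowed_url and the host allowlist are hypothetical, chosen for illustration; substitute the hosts your own application trusts.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: replace with the hosts your application trusts
ALLOWED_HOSTS = {"api.example.com"}


def is_allowed_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for https URLs whose host is in the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed URLs
        return False
    # hostname is lowercased and excludes any userinfo and port
    return parts.scheme == "https" and parts.hostname in allowed_hosts


print(is_allowed_url("https://api.example.com/v1/items"))  # True
print(is_allowed_url("file:///etc/passwd"))  # False
print(is_allowed_url("https://api.example.com@evil.example/"))  # False
```

Comparing the parsed hostname, rather than testing whether the string starts with an expected prefix, avoids being fooled by URLs that hide the real host behind a userinfo section, as in the last example.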

A suggested solution

The following code results in a urlopen function that only opens "https://" URLs by default:
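import typing
import urllib.request


class SafeOpener(urllib.request.OpenerDirector):
    """An opener that handles only https: URLs unless told otherwise."""

    def __init__(self, handlers: typing.Optional[typing.Iterable] = None):
        super().__init__()
        # Default to a minimal set: no file:, ftp:, data:, or plain http: support
        handlers = handlers or (
            urllib.request.UnknownHandler,
            urllib.request.HTTPDefaultErrorHandler,
            urllib.request.HTTPRedirectHandler,
            urllib.request.HTTPSHandler,
            urllib.request.HTTPErrorProcessor,
        )

        # Each entry is a handler class; instantiate it and register it
        for handler_class in handlers:
            self.add_handler(handler_class())


opener = SafeOpener()
# From here on, urllib.request.urlopen() uses this opener
urllib.request.install_opener(opener)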

After running the above, urllib.request.urlopen() should fail with a URLError if you attempt to open an http, ftp, file, or data URL, or any other URL that doesn't have "https:" at the beginning. It will still follow redirects automatically, just as the default urlopen does, and raise an HTTPError for any response with a status outside the 200s that is not a redirect it can follow.
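
One way to check this behavior without touching the network is to try a few non-https URLs and confirm that each one raises URLError. This is only a sketch: SafeOpener is repeated in condensed form so the snippet runs on its own, and is_rejected is a hypothetical helper, not part of urllib.

```python
import urllib.error
import urllib.request


class SafeOpener(urllib.request.OpenerDirector):
    """Condensed copy of the SafeOpener class above."""

    def __init__(self):
        super().__init__()
        for handler_class in (
            urllib.request.UnknownHandler,
            urllib.request.HTTPDefaultErrorHandler,
            urllib.request.HTTPRedirectHandler,
            urllib.request.HTTPSHandler,
            urllib.request.HTTPErrorProcessor,
        ):
            self.add_handler(handler_class())


def is_rejected(opener, url):
    """Return True if the opener refuses the URL with a URLError."""
    try:
        response = opener.open(url)
    except urllib.error.URLError:
        return True
    response.close()
    return False


opener = SafeOpener()
# Only non-https URLs are tried, so no network request is made
for url in (
    "file:///etc/passwd",
    "http://example.com/",
    "ftp://example.com/x",
    "data:text/plain,hi",
):
    print(url, "rejected:", is_rejected(opener, url))
```

Each of these schemes has no matching handler in the opener, so the request falls through to UnknownHandler, which raises the URLError.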

By the way, if you prefer not to override the opener in the urlopen function, you could remove the install_opener(opener) line. Then call opener.open() anywhere you would have previously used urlopen().

The above code assumes that all HTTP calls will be encrypted with TLS (aka "SSL", with "https:" at the beginning of the URL). That also means that tests will need to use "https:" URLs as well. Consider using the vcr library to record and replay HTTP calls, or the trustme library to set up real certificates for testing. If you need to use unencrypted "http:" URLs, though, you can simply add urllib.request.HTTPHandler to the handlers iterable.
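
As a sketch of that last option, the handlers argument of SafeOpener can be given the same list with HTTPHandler added. The class is repeated in condensed form so the snippet runs on its own; the variable name http_and_https_opener is just an illustrative choice.

```python
import typing
import urllib.error
import urllib.request


class SafeOpener(urllib.request.OpenerDirector):
    """Condensed copy of the SafeOpener class above."""

    def __init__(self, handlers: typing.Optional[typing.Iterable] = None):
        super().__init__()
        handlers = handlers or (
            urllib.request.UnknownHandler,
            urllib.request.HTTPDefaultErrorHandler,
            urllib.request.HTTPRedirectHandler,
            urllib.request.HTTPSHandler,
            urllib.request.HTTPErrorProcessor,
        )
        for handler_class in handlers:
            self.add_handler(handler_class())


# Same defaults plus HTTPHandler, so http: URLs are accepted as well
http_and_https_opener = SafeOpener(
    handlers=(
        urllib.request.UnknownHandler,
        urllib.request.HTTPHandler,
        urllib.request.HTTPDefaultErrorHandler,
        urllib.request.HTTPRedirectHandler,
        urllib.request.HTTPSHandler,
        urllib.request.HTTPErrorProcessor,
    )
)

# file: URLs are still refused because no FileHandler was added
try:
    http_and_https_opener.open("file:///etc/passwd")
except urllib.error.URLError as err:
    print("still rejected:", err.reason)
```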

The handlers, defined

This is the chain of default handlers normally used by urlopen (HTTPSHandler is added to it as well when Python is built with SSL support):
  • ProxyHandler: searches system settings for proxies. If you are 100% sure that your tool or library will never be used with a proxy, then this is not necessary.
  • UnknownHandler: raises a URLError if the protocol requested in the URL is not supported by a handler in this chain. Very helpful and recommended.
  • HTTPHandler: handles unencrypted HTTP connections. Only add this if you are sure you need URL support other than HTTPS.
  • HTTPDefaultErrorHandler: turns HTTP error responses into HTTPError exceptions when no more specific handler has dealt with them. This is necessary unless you plan on handling statuses, exceptions, and redirects yourself.
  • HTTPRedirectHandler: handles redirects (status codes 301, 302, 303, or 307, plus 308 on Python 3.11 and later) and is necessary if automatic following of redirects is desired.
  • FTPHandler: handles ftp: URLs. Not necessary for HTTP calls.
  • FileHandler: handles file: URLs, and poses security risks. Rarely should this be necessary, if ever.
  • HTTPErrorProcessor: the final response processor. It passes responses with a status in the 200s through and routes every other response to the error handlers above.
  • DataHandler: handles data: URLs. Hard to imagine why this would be necessary in normal use, and it could pose security risks with user input.
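
To see which handlers your own Python installation puts in the default chain, you can inspect an opener returned by build_opener(). Treat this as a debugging sketch: it relies on the handlers attribute of OpenerDirector, which is an implementation detail rather than documented API, and the exact list may vary (for example, ProxyHandler may only appear when proxy settings are detected).

```python
import urllib.request

# build_opener() with no arguments returns an opener with the default chain
default_opener = urllib.request.build_opener()

# `handlers` is an undocumented attribute, used here only for inspection
default_names = sorted(type(handler).__name__ for handler in default_opener.handlers)

for name in default_names:
    print(name)
```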

As may be apparent, several of the above are rarely necessary, if ever, for HTTP API work. Instead, I recommend this list as a happy medium between security and usability:

  • ProxyHandler
  • UnknownHandler
  • HTTPHandler
  • HTTPDefaultErrorHandler
  • HTTPRedirectHandler
  • HTTPSHandler
  • HTTPErrorProcessor

Of the above, ProxyHandler could possibly be removed if you know you don't need it, and even HTTPHandler could be removed if you know that only HTTPS URLs will be called. Actually, removing both is a pretty good combination: the point of HTTPS is to ensure that nothing is intercepting the connection, so leaving out proxy lookup and unencrypted HTTP fits that goal. So a most-secure list is the same as what is in the example code above:
  • UnknownHandler
  • HTTPDefaultErrorHandler
  • HTTPRedirectHandler
  • HTTPSHandler
  • HTTPErrorProcessor

Five handlers, not the original nine.

Flexibility

The use cases for custom OpenerDirector instances go beyond just security and simplicity. By subclassing BaseHandler and adding custom status handlers with names like http_error_401, you create your own handlers that can then be appended to the handler list. These can be used for authorization, retry cadence, and other goals.
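
A minimal sketch of such a handler, assuming the goal is simply to raise a clearer application-level error on 401 responses. AuthRequired and UnauthorizedHandler are hypothetical names invented for this example. To avoid a network call, the 401 is simulated by invoking the opener's error() method directly.

```python
import email.message
import io
import urllib.request


class AuthRequired(Exception):
    """Hypothetical application-level error for 401 responses."""


class UnauthorizedHandler(urllib.request.BaseHandler):
    """Raise AuthRequired for 401 responses instead of a generic HTTPError."""

    def http_error_401(self, req, fp, code, msg, headers):
        raise AuthRequired(f"{req.full_url} needs credentials ({code} {msg})")


opener = urllib.request.OpenerDirector()
for handler_class in (
    urllib.request.UnknownHandler,
    urllib.request.HTTPDefaultErrorHandler,
    urllib.request.HTTPRedirectHandler,
    urllib.request.HTTPSHandler,
    urllib.request.HTTPErrorProcessor,
    UnauthorizedHandler,  # custom handler appended to the list
):
    opener.add_handler(handler_class())

# Simulate a 401 response by calling the error chain directly
request = urllib.request.Request("https://example.com/private")
try:
    opener.error(
        "http", request, io.BytesIO(b""), 401, "Unauthorized", email.message.Message()
    )
except AuthRequired as err:
    print(err)
```

In real use, HTTPErrorProcessor calls the same error chain for any response outside the 200s, so http_error_401 would run whenever a server answers with 401.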

I hope these suggestions open up possibilities and peace of mind for you.

Also published at https://dev.to/bowmanjd/hardening-and-simplifying-python-s-urlopen-4gee
