
How to Download Files From URLs With Python


Facilitating File Downloads With Python

While it’s possible to download files from URLs using traditional command-line tools, Python provides several libraries that facilitate file retrieval. Using Python to download files offers several advantages.

One advantage is flexibility, as Python has a rich ecosystem of libraries, including ones that offer efficient ways to handle different file formats, protocols, and authentication methods. You can choose the most suitable Python tools to accomplish the task at hand and fulfill your specific requirements, whether you’re downloading from a plain-text CSV file or a complex binary file.

Another reason is portability. You may encounter situations where you’re working on cross-platform applications. In such cases, using Python is a good choice because it’s a cross-platform programming language. This means that Python code can run consistently across different operating systems, such as Windows, Linux, and macOS.

Using Python also offers the possibility of automating your processes, saving you time and effort. Some examples include automating retries if a download fails, retrieving and saving multiple files from URLs, and processing and storing your data in designated locations.

These are just a few reasons why downloading files using Python is better than using traditional command-line tools. Depending on your project requirements, you can choose the approach and library that best suit your needs. In this tutorial, you’ll learn approaches to some common scenarios requiring file retrieval.

Downloading a File From a URL in Python

In this section, you’ll learn the basics of downloading a ZIP file containing gross domestic product (GDP) data from the World Bank Open Data platform. You’ll use two common tools in Python, urllib and requests, to download GDP by country.

While the urllib package comes with Python in its standard library, it has some limitations. So, you’ll also learn to use a popular third-party library, requests, that offers more features for making HTTP requests. Later in the tutorial, you’ll see additional functionalities and use cases.

Using urllib From the Standard Library

Python ships with a package called urllib, which provides a convenient way to interact with web resources. It has a straightforward and user-friendly interface, making it suitable for quick prototyping and smaller projects. With urllib, you can perform different tasks dealing with network communication, such as parsing URLs, sending HTTP requests, downloading files, and handling errors related to network operations.

As a standard library package, urllib has no external dependencies and doesn’t require installing additional packages, making it a convenient choice. For the same reason, it’s readily accessible for development and deployment. It’s also cross-platform compatible, meaning you can write and run code seamlessly using the urllib package across different operating systems without additional dependencies or configuration.

The urllib package is also very versatile. It integrates well with other modules in the Python standard library, such as re for building and manipulating regular expressions, as well as json for working with JSON data. The latter is particularly handy when you need to consume JSON APIs.
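For example, here’s a minimal sketch of consuming a JSON API with urllib and the json module. The endpoint below and the shape of its response are assumptions made purely for illustration:

>>> import json
>>> from urllib.request import urlopen

>>> # Hypothetical JSON endpoint, used only to illustrate the
>>> # integration between urllib and the json module
>>> url = "https://api.worldbank.org/v2/country/US?format=json"
>>> with urlopen(url) as response:
...     # json.load() reads the file-like response object directly
...     data = json.load(response)
...

Because urlopen() returns a file-like object, you can pass it straight to json.load() without saving anything to disk first.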

In addition, you can extend the urllib package and use it with other third-party libraries, like requests, BeautifulSoup, and Scrapy. This offers the possibility for more advanced operations in web scraping and interacting with web APIs.

To download a file from a URL using the urllib package, you can call urlretrieve() from the urllib.request module. This function fetches a web resource from the specified URL and then saves the response to a local file. To start, import urlretrieve() from urllib.request:

>>> from urllib.request import urlretrieve

Next, define the URL that you want to retrieve data from. If you don’t specify a path to a local file where you want to save the data, then the function will create a temporary file for you. Since you know that you’ll be downloading a ZIP file from that URL, go ahead and provide an optional path to the target file:

>>> url = (
...     "https://api.worldbank.org/v2/en/indicator/"
...     "NY.GDP.MKTP.CD?downloadformat=csv"
... )
>>> filename = "gdp_by_country.zip"

Because your URL is quite long, you rely on Python’s implicit concatenation by splitting the string literal over multiple lines inside parentheses. The Python interpreter will automatically join the separate strings on different lines into a single string. You also define the location where you wish to save the file. When you only provide a filename without a path, Python will save the resulting file in your current working directory.

Then, you can download and save the file by calling urlretrieve() and passing in the URL and optionally your filename:

>>> urlretrieve(url, filename)
('gdp_by_country.zip', <http.client.HTTPMessage object at 0x7f06ee7527d0>)

The function returns a tuple of two objects: the path to your output file and an HTTP message object. When you don’t specify a custom filename, then you’ll see a path to a temporary file that might look like this: /tmp/tmps7qjl1tj. The HTTPMessage object represents the HTTP headers returned by the server for the request, which can contain information like content type, content length, and other metadata.

You can unpack the tuple into the individual variables using an assignment statement and iterate over the headers as though they were a Python dictionary:

>>> path, headers = urlretrieve(url, filename)
>>> for name, value in headers.items():
...     print(name, value)
...
Date Wed, 28 Jun 2023 11:26:18 GMT
Content-Type application/zip
Content-Length 128310
Connection close
Set-Cookie api_https.cookieCORS=76a6c6567ab12cea5dac4942d8df71cc; Path=/; SameSite=None; Secure
Set-Cookie api_https.cookie=76a6c6567ab12cea5dac4942d8df71cc; Path=/
Cache-Control public, must-revalidate, max-age=1
Expires Wed, 28 Jun 2023 11:26:19 GMT
Last-Modified Wed, 28 Jun 2023 11:26:18 GMT
Content-Disposition attachment; filename=API_NY.GDP.MKTP.CD_DS2_en_csv_v2_5551501.zip
Request-Context appId=cid-v1:da002513-bd8b-4441-9f30-737944134422

This information might be helpful when you’re unsure about which file format you’ve just downloaded and how you’re supposed to interpret its content. In this case, it’s a ZIP file that’s about 128 kilobytes in size. You can also deduce the original filename, which was API_NY.GDP.MKTP.CD_DS2_en_csv_v2_5551501.zip.
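Because HTTPMessage derives from Python’s email.message.Message class, you don’t have to parse the Content-Disposition header by hand. Here’s a short sketch that reuses the headers object from above:

>>> # Parse the filename out of the Content-Disposition header
>>> headers.get_filename()
'API_NY.GDP.MKTP.CD_DS2_en_csv_v2_5551501.zip'

>>> # Header values are strings, so convert the size explicitly
>>> int(headers["Content-Length"])
128310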

Now that you’ve seen how to download a file from a URL using Python’s urllib package, it’s time to tackle the same task using a third-party library. You’ll find out which way is more convenient for you.

Using the Third-Party requests Library

While urllib is a good built-in option, there may be scenarios where you need to use third-party libraries to make more advanced HTTP requests, such as those requiring some form of authentication. The requests library is a popular, user-friendly, and Pythonic API for making HTTP requests in Python. It can handle the complexities of low-level network communication behind the scenes.

The requests library is also known for its flexibility and offers tighter control over the download process, allowing you to customize it according to your project requirements. Some examples include the ability to specify request headers, handle cookies, access data behind login-gated web pages, stream data in chunks, and more.

In addition, the library is designed to be efficient and performant by supporting various features that enhance the overall download performance. Its ability to automatically handle connection pooling and reuse optimizes network utilization and reduces overhead.
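To see what that connection reuse looks like in practice, here’s a minimal sketch using requests.Session(), which keeps the underlying TCP connection open across requests to the same host. You’ll install the library in a moment, the second indicator code is a hypothetical example, and the output assumes both requests succeed:

>>> import requests

>>> # Example indicator codes; swap in whichever datasets you need
>>> indicators = ["NY.GDP.MKTP.CD", "SP.POP.TOTL"]
>>> with requests.Session() as session:
...     for indicator in indicators:
...         # The session reuses one connection for both downloads
...         response = session.get(
...             f"https://api.worldbank.org/v2/en/indicator/{indicator}",
...             params={"downloadformat": "csv"},
...         )
...         print(indicator, response.status_code)
...
NY.GDP.MKTP.CD 200
SP.POP.TOTL 200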

Now, you’ll look into using the requests library to download that same ZIP file with GDP by country data from the World Bank Open Data platform. To begin, install the requests library into your active virtual environment using pip:

(venv) $ python -m pip install requests

This command installs the latest release of the requests library into your virtual environment. Afterward, you can start a new Python REPL session and import the requests library:

>>> import requests

Before moving further, it’s worth recalling the available HTTP methods because the requests library exposes them to you through Python functions. When you make HTTP requests to web servers, you have two commonly used methods to choose from:

You’ll use the GET method to retrieve data by fetching a representation of the remote resource without modifying the server’s state. Therefore, you’ll commonly use it to retrieve files like images, HTML web pages, or raw data.

The POST method allows you to send data for the server to process or use in creating or updating a resource. In POST requests, the data is typically sent in the request body in various formats like JSON or XML, and it’s not visible in the URL. You can use POST requests for operations that modify server data, such as creating, updating, or submitting existing or new resources.

In this tutorial, you’ll only use GET requests for downloading files.
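To make the difference concrete, here’s a minimal sketch of both kinds of calls against a hypothetical API. The URL and payload are placeholders, not part of this tutorial’s example:

>>> import requests

>>> # GET: any parameters travel in the URL's query string
>>> response = requests.get(
...     "https://example.com/api/items", params={"page": 1}
... )

>>> # POST: the payload travels in the request body, here as JSON
>>> response = requests.post(
...     "https://example.com/api/items", json={"name": "new-item"}
... )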

Next, define the URL of the file that you want to download. To include additional query parameters in the URL, you’ll pass in a dictionary of strings as key-value pairs:

>>> url = "https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD"
>>> query_parameters = {"downloadformat": "csv"}

In the example above, you define the same URL as before but specify the downloadformat=csv parameter separately using a Python dictionary. The library will append those parameters to the URL after you pass them to requests.get() using an optional params argument:

>>> response = requests.get(url, params=query_parameters)

This makes a GET request to retrieve data from the constructed URL with optional query parameters. The function returns an HTTP response object with the server’s response to the request. If you’d like to see the constructed URL with the optional parameters included, then use the response object’s .url attribute:

>>> response.url
'https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv'

The response object provides several other convenient attributes that you can check out. For example, these two will let you determine if the request was successful and what HTTP status code the server returned:

>>> response.ok
True

>>> response.status_code
200

A status code of 200 indicates that your request has been completed successfully. Okay, but how do you access the data payload that you’ve retrieved with the requests library? Read on to answer this question in the next section.
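As a sneak peek, one minimal approach is to write the raw bytes exposed by response.content to a local file. The filename is the same one you used earlier, and the echoed byte count assumes the download matches the Content-Length you saw before:

>>> # response.content holds the raw bytes of the downloaded ZIP file
>>> with open("gdp_by_country.zip", mode="wb") as output_file:
...     output_file.write(response.content)
...
128310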

