
Downloading files from Xeno-Canto

 11 months ago
source link: https://gist.github.com/rhine3/4829bf66381c7aa05c1f656cec4fa040

Bulk XC downloads

Three steps:

  1. Use Xeno-Canto to generate a list of all the records
  2. Convert records into a bulk-downloadable format
  3. Download all files

1. List all records

Use Xeno-Canto's web API to generate a list of all the records. For instance, https://www.xeno-canto.org/api/2/recordings?query=northern%20cardinal gives a list of all Northern Cardinal recordings.
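Spaces in the search text have to be percent-encoded as %20, as in the URL above. A minimal Python sketch for building such URLs, assuming the endpoint shown above; the helper name build_query_url is hypothetical and not part of any Xeno-Canto tooling:

```python
from urllib.parse import quote

# Endpoint copied from the example URL above
API_URL = "https://www.xeno-canto.org/api/2/recordings"

def build_query_url(query):
    """Return the API URL for a search string (hypothetical helper)."""
    # quote() percent-encodes spaces as %20
    return "{}?query={}".format(API_URL, quote(query))

print(build_query_url("northern cardinal"))
```

Pasting the printed URL into a browser is a quick way to confirm the query returns the records you expect.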

Some species have multiple "pages" in their recording lists. You can tell by checking whether "numPages" in the response is greater than 1. If it is, you won't get all of the records unless you request each page explicitly with the page parameter, e.g., page 3 of the Red Crossbill results: https://www.xeno-canto.org/api/2/recordings?query=red%20crossbill&page=3
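For multi-page species, the page-by-page requests can be scripted. The following is a rough sketch rather than a tested client: it assumes each response carries the "numPages" and "recordings" keys described in this guide, and the function names fetch_page and fetch_all_recordings are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint copied from the example URLs in this guide
API_URL = "https://www.xeno-canto.org/api/2/recordings"

def fetch_page(query, page):
    """Download one page of results and parse it as JSON (hypothetical helper)."""
    url = "{}?query={}&page={}".format(API_URL, quote(query), page)
    with urlopen(url) as response:
        return json.load(response)

def fetch_all_recordings(query, fetch=fetch_page):
    """Collect the 'recordings' lists from every page of a query."""
    recordings = []
    page = 1
    while True:
        result = fetch(query, page)
        recordings.extend(result["recordings"])
        # Stop once the last page reported by "numPages" has been read
        if page >= int(result["numPages"]):
            break
        page += 1
    return recordings
```

Calling fetch_all_recordings("red crossbill") would return one combined list, which could be passed to pd.DataFrame in step 2 in place of values['recordings']. The fetch argument exists only so the paging logic can be tried with a stand-in function instead of live requests.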

Check your record list by visiting the URL in a browser. When you're satisfied with the records you see, use wget to download the list of records (technically a "JSON document") onto your computer. Save the download into a file with the extension .json, e.g., noca-query.json, as in the command below:

wget -O noca-query.json "https://www.xeno-canto.org/api/2/recordings?query=northern%20cardinal"

2. Prepare for bulk downloading

Use a few lines of Python to transform the .json into a .csv file, and then the .csv into a .txt file containing only the download URLs. The .txt file of download URLs is all you'll use to actually get the files, but the .csv will help you keep track of which files you downloaded, what license each one has, and who created it. (You need to give attribution to the creator wherever you use these files!)

import json
import pandas as pd

# Load the entries from your downloaded JSON file
with open('/Users/tessa/Documents/noca-query.json', 'r') as json_file:
    values = json.load(json_file)

# Create a pandas dataframe of records & convert to .csv file
record_df = pd.DataFrame(values['recordings'])
record_df.to_csv('xc-noca.csv', index=False)

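# Make wget input file: one download URL per line
url_list = []
for file_url in record_df['file'].tolist():
    # Prepend the scheme only if the URL is protocol-relative ('//...')
    if file_url.startswith('//'):
        file_url = 'https:' + file_url
    url_list.append(file_url)
with open('xc-noca-urls.txt', 'w') as f:
    for item in url_list:
        f.write("{}\n".format(item))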
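Since attribution is required, it can help to generate credit lines from the same records. This is only a sketch: the field names 'id', 'en', 'rec', and 'lic' are assumptions about the columns in your .csv (check them against your own file), the function name attribution_line is hypothetical, and the sample record is made up.

```python
def attribution_line(record):
    """Format one credit line from a record dict (field names are assumed)."""
    return "XC{}: {}, recorded by {} (license: {})".format(
        record.get("id", "unknown"),   # assumed: catalogue number
        record.get("en", "unknown"),   # assumed: English species name
        record.get("rec", "unknown"),  # assumed: recordist's name
        record.get("lic", "unknown"),  # assumed: license URL
    )

# Made-up record for illustration only, not real Xeno-Canto data
sample = {
    "id": "000000",
    "en": "Northern Cardinal",
    "rec": "Example Recordist",
    "lic": "//creativecommons.org/licenses/by-nc-sa/4.0/",
}
print(attribution_line(sample))
```

With the dataframe from the script above, the same function could be applied to each row via record_df.to_dict('records').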

3. Download files

Use wget in your terminal to download the recordings. You may want to save the files in their own directory; to do so, use the -P flag as below:

mkdir /Users/tessa/Downloads/XC
wget -P /Users/tessa/Downloads/XC --trust-server-names -i xc-noca-urls.txt
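After a large download it is worth checking that nothing was skipped. One simple check, sketched here under the assumption that the download directory contains only the recordings, is to compare the number of URLs in the list with the number of files on disk. The function names are hypothetical.

```python
import os

def count_urls(url_file):
    """Count the non-empty lines in the wget input file."""
    with open(url_file) as f:
        return sum(1 for line in f if line.strip())

def count_files(directory):
    """Count the regular files directly inside a directory."""
    return sum(
        1 for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )

def download_shortfall(url_file, directory):
    """Return how many more URLs there are than downloaded files."""
    return count_urls(url_file) - count_files(directory)
```

For the example above, download_shortfall('xc-noca-urls.txt', '/Users/tessa/Downloads/XC') returning 0 would suggest every URL produced a file; a positive number means some downloads are missing and wget may need to be rerun.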
