

HTTP Error 403: Forbidden when using urlretrieve
source link: http://coding2live.com/detail/78/HTTP%E9%94%99%E8%AF%AF403-%E7%A6%81%E6%AD%A2%E4%BD%BF%E7%94%A8urlretrieve
coding2live 2021-01-29 15:27:18 0 154 python, http, python-requests, urllib
I am downloading a PDF and get this error: HTTP Error 403: Forbidden
My guess is that the request is being rejected, but I have not found a solution.
Here is my code:
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge IGCSE/Mathematics (0580)/0580_s03_qp_1.pdf')
    print('initialized')
    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR recieved')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
The following answer is for reference only.
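Your guess is right.
The remote server is apparently checking the User-Agent header and rejecting requests that come from Python's urllib. In your code, the urlopen(req) call carries the custom header, but the later urlretrieve() call inside download_pdf() does not, so it is presumably the call that fails.
urllib.request.urlretrieve() does not let you change the HTTP request headers. However, you can use urllib.request.URLopener.retrieve():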
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
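To see why the server can tell that a request comes from urllib, it may help to look at the header urllib sends when none is supplied. The snippet below is only an illustrative sketch; the variable name default_headers is not from the original post.

```python
import urllib.request

# build_opener() returns an OpenerDirector. Its addheaders list holds the
# headers that are added to every request made through that opener.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default User-agent value starts with "Python-urllib/", which is easy
# for a server to recognize and block.
print(default_headers)
```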
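Note: you are using Python 3, where these functions are now considered part of the "legacy interface", and URLopener has been deprecated (since Python 3.3).
For that reason you should not keep using them in new code.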
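A non-legacy way to do the same thing with urllib could look roughly like the sketch below, which builds a Request with a custom User-Agent and copies the response to a file. The names build_request and download and the USER_AGENT value are hypothetical, chosen for illustration only. The sketch also percent-encodes the URL, because current Python versions refuse URLs that contain spaces; this assumes the URL has no query string and is not already percent-encoded.

```python
import shutil
import urllib.parse
import urllib.request

# Hypothetical browser-like value; any string the server accepts would do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request with a percent-encoded URL and a custom User-Agent."""
    # Encode spaces and other unsafe characters, but leave ":" and "/" alone
    # so the scheme and the path separators stay intact.
    safe_url = urllib.parse.quote(url, safe=":/")
    return urllib.request.Request(safe_url, headers={"User-Agent": user_agent})


def download(url, filename):
    """Copy the response body to filename without reading it all into memory."""
    with urllib.request.urlopen(build_request(url)) as resp, \
            open(filename, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Calling download(url, 'Test.pdf') with the URL from the question would then send the browser-like header on the request that actually fetches the file.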
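That aside, you are going to a lot of trouble just to access a URL.
Your code already imports requests without using it, so you should probably use requests instead of urllib.
requests is simpler to use: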
import requests
url = 'http://papers.xtremepapers.com/CIE/Cambridge IGCSE/Mathematics (0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
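One caveat with the short version above is that it writes whatever the server returns, so a 403 error page would still be saved as a .pdf file. A slightly more defensive variant might look like the following sketch. The function name download_pdf is reused from the question, while the default user_agent value, the timeout, and the chunk size are assumptions made here for illustration.

```python
import requests


def download_pdf(url, filename, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Download url to filename, raising on HTTP errors such as 403."""
    headers = {"User-Agent": user_agent}  # hypothetical browser-like value
    # stream=True defers fetching the body so it can be written in chunks.
    with requests.get(url, headers=headers, stream=True, timeout=30) as r:
        # Raises requests.HTTPError for 4xx/5xx responses instead of
        # silently saving an error page to disk.
        r.raise_for_status()
        with open(filename, "wb") as outfile:
            for chunk in r.iter_content(chunk_size=8192):
                outfile.write(chunk)
    return filename
```

Wrapping the call in try/except requests.HTTPError would let the caller report a 403 clearly rather than discovering a corrupt PDF later.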