Mirroring an Entire Site with wget

3 years ago

source link: http://blog.ilibrary.me/2020/01/08/wget%E6%95%B4%E7%AB%99%E7%88%AC%E5%8F%96

Reposting is welcome. Please support original work and keep the link to the original: blog.ilibrary.me

Crawling a whole site with wget

wget can crawl and mirror an entire site:

$ wget -m -k (-H) http://www.example.com/

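This command mirrors a website. -m turns on mirroring (recursive download), and -k makes wget convert the links in the downloaded pages so that they work locally. If the site's images are hosted on a different site, add the -H option so that wget also follows links to other hosts.

You can browse the downloaded folder with Python, which serves the current directory on port 8000 by default: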
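To make the flags easier to read, here is a minimal sketch that builds the same command using wget's long option names (--mirror for -m, --convert-links for -k, --span-hosts for -H). The function names build_wget_command and mirror_site are hypothetical helpers made up for this illustration, not part of wget.

```python
import subprocess


def build_wget_command(url, span_hosts=False):
    """Return the argument list for mirroring `url` with wget."""
    # --mirror is the long form of -m, --convert-links of -k.
    cmd = ["wget", "--mirror", "--convert-links"]
    if span_hosts:
        # --span-hosts is the long form of -H: also fetch from other hosts.
        cmd.append("--span-hosts")
    cmd.append(url)
    return cmd


def mirror_site(url, span_hosts=False):
    """Run wget and return its exit code (assumes wget is on PATH)."""
    return subprocess.run(build_wget_command(url, span_hosts)).returncode
```

Calling mirror_site("http://www.example.com/", span_hosts=True) would run the command from above with -H included.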

python -m http.server

Alternatively, you can host the folder with nginx (here published on port 8081):

docker run --name static-nginx -p 8081:80 -v $PWD:/usr/share/nginx/html:ro -d nginx

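When wget saves a page, it makes the query string part of the file name. For example, for /index.html?l=en-US, nginx tries to find index.html, whereas wget saved the file as 'index.html?l=en-US'.

In principle, try_files $uri$is_args$args /404.html makes nginx look for a file whose name includes the query string, and add_header Content-Type 'text/html' then returns it as HTML. (I did not get this step to work, so treat it only as a direction to explore.)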
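As an alternative to the nginx route, the same lookup idea can be sketched on top of Python's http.server. This is not the author's method and is untested against a real mirror. It assumes Python 3.7+ and a Unix-like file system where wget kept the '?' in file names. QueryAwareHandler and serve are hypothetical names made up for this illustration.

```python
import functools
import http.server
import os
import urllib.parse


class QueryAwareHandler(http.server.SimpleHTTPRequestHandler):
    """Hypothetical handler: serve files whose names contain the query string."""

    def translate_path(self, path):
        # The stock implementation drops the query string before mapping to disk.
        plain = super().translate_path(path)
        query = urllib.parse.urlsplit(path).query
        if query:
            # Try the query both as sent and percent-decoded, since the exact
            # on-disk spelling depends on how wget wrote the name.
            for q in (query, urllib.parse.unquote(query)):
                candidate = plain + "?" + q
                if os.path.isfile(candidate):
                    return candidate
        # No matching "name?query" file: fall back to the plain path.
        return plain

    def guess_type(self, path):
        # Guess from the name without its "?query" suffix, so that
        # "index.html?l=en-US" is still served as text/html.
        name = os.path.basename(path).split("?", 1)[0]
        return super().guess_type(name)


def serve(directory=".", port=8000):
    """Serve `directory` on `port` until interrupted."""
    handler = functools.partial(QueryAwareHandler, directory=directory)
    with http.server.ThreadingHTTPServer(("", port), handler) as httpd:
        httpd.serve_forever()
```

Calling serve() from inside the mirrored folder would then answer /index.html?l=en-US with the file 'index.html?l=en-US' when it exists, and with index.html otherwise.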

Nginx try_files directive: usage examples
