GNU Wget 1.16 built on linux-gnueabihf, running on a Raspberry Pi 3.
How can I force wget to fetch the whole site (following links, like a robot) rather than just the first index page?
I tried:
wget -r http://aol.com
wget -r -l0 http://aol.com
wget -r -m -l0 http://aol.com
Every command finishes the same way:
--2017-11-29 08:05:42-- http://aol.com/
Resolving aol.com (aol.com)... 149.174.149.73, 64.12.249.135, 149.174.110.105, ...
Connecting to aol.com (aol.com)|149.174.149.73|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.aol.com/ [following]
--2017-11-29 08:05:42-- https://www.aol.com/
Resolving www.aol.com (www.aol.com)... 34.233.220.13, 34.235.7.32, 52.6.64.98, ...
Connecting to www.aol.com (www.aol.com)|34.233.220.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Last-modified header missing -- time-stamps turned off.
--2017-11-29 08:05:44-- https://www.aol.com/
Reusing existing connection to www.aol.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aol.com/index.html’
aol.com/index.html [ <=> ] 359.95K 751KB/s in 0.5s
2017-11-29 08:05:45 (751 KB/s) - ‘aol.com/index.html’ saved [368585]
FINISHED --2017-11-29 08:05:45--
Total wall clock time: 2.8s
Downloaded: 1 files, 360K in 0.5s (751 KB/s)
What exactly am I doing wrong?
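Your problem occurs because all the links in aol.com/index.html point to a different host. To download recursively from all hosts, you can add the option --span-hosts. To allow all aol hosts, adding --span-hosts '*.aol.com' seems to work for me. You can list the links in the downloaded index.html; you will see that most of them point to www.aol.com, so you could also start the recursive download from www.aol.com directly.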
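
A minimal sketch of the check described above, assuming a POSIX shell with grep, sed, sort and uniq. The sample HTML is made up for illustration (it is not the real AOL page), and list_link_hosts is a hypothetical helper name. Note that in the wget manual --span-hosts (-H) is a switch that takes no argument; the usual way to limit which hosts are followed is --domains (-D). The final command is left commented out because it would start a real crawl.

```shell
#!/bin/sh
# Hypothetical helper: count how often each host appears in the href
# attributes of an HTML file (absolute http/https links only).
list_link_hosts() {
    grep -oE 'href="https?://[^/"]+' "$1" |
        sed -E 's|^href="https?://||' |
        sort | uniq -c | sort -rn
}

# Made-up sample standing in for the saved aol.com/index.html.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="https://www.aol.com/news/">News</a>
<a href="https://www.aol.com/sports/">Sports</a>
<a href="https://mail.aol.com/">Mail</a>
<a href="/relative/path">Relative</a>
EOF

hosts=$(list_link_hosts "$sample")
echo "$hosts"
rm -f "$sample"

# Recursive download restricted to aol.com and its subdomains
# (uncomment to run; this contacts the network):
# wget -r -H -D aol.com http://aol.com
```

If most links turn out to live on a single host, starting the crawl from that host avoids the need for -H altogether.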
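Using the following approach, wget can recursively download all pages linked from a website. Replace the example website with the one you want. It works like a depth-first search in a graph.
How it works: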
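curl fetches index.html. Its output is piped into grep, which finds all links by matching href. The resulting list is handed to wget as a variable, and wget fetches the links from that variable one by one.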
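
The steps above can be sketched as follows. This is one possible reading of the pipeline, not necessarily the exact command the answer used. extract_links and fetch_linked_pages are hypothetical helper names, http://example.com is a placeholder, and only absolute http(s) links in double-quoted href attributes are picked up.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin, print one absolute link per line.
extract_links() {
    grep -oE 'href="https?://[^"]+"' |
        sed -E 's|^href="||; s|"$||'
}

# Hypothetical helper: fetch the index page with curl, keep its links in a
# variable, then let wget download each link recursively, one at a time.
fetch_linked_pages() {
    links=$(curl -sL "$1" | extract_links)
    printf '%s\n' "$links" | while IFS= read -r link; do
        if [ -n "$link" ]; then
            wget -r "$link"
        fi
    done
}

# Example call (commented out because it contacts the network):
# fetch_linked_pages http://example.com
```

Reading the variable line by line, rather than expanding it unquoted, keeps characters such as ? in URLs from being treated as shell glob patterns. wget can also read a URL list from standard input with -i -, which would replace the loop if per-link control is not needed.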