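I wrote a small Python scraper (using the Scrapy framework). The scraper requires headless browsing... I am using ChromeDriver.
Because I run this code on an Ubuntu server without any GUI, I had to install Xvfb to be able to run ChromeDriver there (I followed this guide).
Here is my code: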
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        super().__init__()  # let scrapy.Spider finish its own setup
        # self.driver = webdriver.Chrome(ChromeDriverManager().install())
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)
I can run the code above from the Ubuntu shell, and it executes without any errors:
ubuntu@ip-1-2-3-4:~/scrapers/my_scraper$ scrapy crawl my_spider
Now I want to set up a cron job to run the above command every day:
# m h dom mon dow command
PATH=/usr/local/bin:/home/ubuntu/.local/bin/
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1
But the crontab job gives me the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 86, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 98, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 19, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/ubuntu/scrapers/my_scraper/my_scraper/spiders/spider.py", line 27, in __init__
    self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
  (Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)
Update
This answer helped me solve the problem (but I don't quite understand why).
I ran echo $PATH in my Ubuntu shell and copied the value into the crontab:
PATH=/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1
Note: Since I created a bounty for this question, I am happy to award it to any answer that explains why changing PATH solved the problem.
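This is the cause of almost every case where a cron job does not seem to run. Cron always runs with a nearly empty environment: HOME, LOGNAME, and SHELL are set, along with a very limited PATH. It is therefore advisable to use full paths to executables and to set the environment the job needs explicitly when using cron. You can also:
Use the environment variables you use in your shell.
Simulate it: temporarily add an entry to your crontab and wait a minute so that the cron environment is saved to ~/cronenv (you can remove the entry afterwards), then test by running a shell (SHELL=/bin/sh by default) under that environment.
Force the crontab to run.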
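To see what actually differs between the two environments, the saved dump can be compared with the login shell's variables. The following is only a minimal sketch: it assumes the temporary crontab entry was something like `* * * * * env > ~/cronenv` (the exact entry is not shown above), and the helper names `parse_env_dump` and `diff_env` are made up for illustration. Values that span several lines in the dump are not handled.

```python
import os


def parse_env_dump(text):
    """Parse `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(shell_env, cron_env):
    """Return (names missing under cron, names whose values differ)."""
    missing = sorted(set(shell_env) - set(cron_env))
    changed = sorted(
        name for name in shell_env.keys() & cron_env.keys()
        if shell_env[name] != cron_env[name]
    )
    return missing, changed


if __name__ == "__main__":
    # Assumed location of the dump written by the temporary cron entry.
    dump_path = os.path.expanduser("~/cronenv")
    if os.path.exists(dump_path):
        with open(dump_path) as fh:
            cron_env = parse_env_dump(fh.read())
        missing, changed = diff_env(dict(os.environ), cron_env)
        print("missing under cron:", missing)
        print("different under cron:", changed)
```

Run from an interactive shell, this would typically list PATH among the differing variables, which is the one that matters for this question.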
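Also, you cannot use variable substitution in a crontab the way you would in a shell, so a declaration like PATH=/usr/local/bin:$PATH is interpreted literally. Because /bin is not included in the PATH environment variable, the commands readlink, dirname, and cat cannot be found.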
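The lookup failure can be reproduced without cron by resolving the same command names against a given search path. This is a rough sketch using Python's `shutil.which`; the helper `resolve_commands` is a hypothetical name, and the two path strings are the ones quoted in this thread. The results depend on what is installed on the machine, so no output is shown.

```python
import shutil


def resolve_commands(names, path):
    """Map each command name to the file found on `path`, or None."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    helpers = ["readlink", "dirname", "cat"]
    # PATH from the failing crontab in the question.
    failing_path = "/usr/local/bin:/home/ubuntu/.local/bin/"
    # Not expanded in a crontab, so "$PATH" is just an odd directory name.
    literal_path = "/usr/local/bin:$PATH"
    for label, path in [("failing", failing_path), ("literal", literal_path)]:
        print(label, resolve_commands(helpers, path))
```

Any entry that comes back as None is a command that a script started from cron with that PATH would not be able to find by name.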
Explanation
Try setting
PATH=/usr/local/bin:/home/ubuntu/.local/bin/
and executing /usr/bin/google-chrome --no-sandbox --headless --disable-dev-shm-usage
You will get an error. You can also try this. Crontab opens a new shell for the user ubuntu.
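The same experiment can be scripted so that the restricted PATH does not leak into your interactive shell. This is a sketch under assumptions: `run_with_path` is a hypothetical helper, the environment it builds only approximates what cron provides, and the 30 second timeout is an arbitrary safeguard in case Chrome does start and keeps running.

```python
import os
import subprocess
import sys


def run_with_path(argv, path, timeout=30):
    """Run argv in a small, cron-like environment and capture its output."""
    env = {
        "PATH": path,
        "HOME": os.environ.get("HOME", "/"),
        "LOGNAME": os.environ.get("LOGNAME", "nobody"),
        "SHELL": "/bin/sh",
    }
    return subprocess.run(
        argv, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, timeout=timeout,
    )


if __name__ == "__main__":
    chrome = ["/usr/bin/google-chrome", "--no-sandbox", "--headless",
              "--disable-dev-shm-usage"]
    failing_path = "/usr/local/bin:/home/ubuntu/.local/bin/"
    if os.path.exists(chrome[0]):
        try:
            result = run_with_path(chrome, failing_path)
            print("exit code:", result.returncode)
            print(result.stderr, file=sys.stderr)
        except subprocess.TimeoutExpired:
            print("Chrome was still running when the timeout expired")
```

Calling it a second time with the full PATH copied from `echo $PATH` gives a direct before and after comparison.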