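Nagios upgrade from 3.5.1 to 4.0.8
I wanted to ask this on the Nagios support forum, but an hour later I still hadn't received the confirmation email for setting up my account...
Nagios seems to run fine as a service, but the web CGIs don't work, and there are no errors in either Apache's error.log or nagios.log. I checked permissions and looked through some of the C code where this error message is embedded:
Whoops! Error: Could not read host and service status information!
Almost every menu on the left side of the Nagios home page shows that same error.
nagios.log looks like this on start and stop, launched from init: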
[1431102009] Nagios 4.0.8 starting... (PID=27779)
[1431102009] Local time is Fri May 08 13:20:09 ADT 2015
[1431102009] LOG VERSION: 2.0
[1431102009] qh: Socket '/usr/local/nagios/var/rw/query.sh' successfully initialized
[1431102009] qh: core query handler registered
[1431102009] nerd: Channel hostchecks registered successfully
[1431102009] nerd: Channel servicechecks registered successfully
[1431102009] nerd: Channel opathchecks registered successfully
[1431102009] nerd: Fully initialized and ready to rock!
[1431102009] wproc: Successfully registered manager as @wproc with query handler
[1431102009] wproc: Registry request: name=Core Worker 27785;pid=27785
[1431102009] wproc: Registry request: name=Core Worker 27786;pid=27786
[1431102009] wproc: Registry request: name=Core Worker 27782;pid=27782
[1431102009] wproc: Registry request: name=Core Worker 27781;pid=27781
[1431102009] wproc: Registry request: name=Core Worker 27783;pid=27783
[1431102009] wproc: Registry request: name=Core Worker 27784;pid=27784
[1431102009] Successfully launched command file worker with pid 27787
[1431102022] Caught SIGTERM, shutting down...
[1431102022] Successfully shutdown... (PID=27779)
[1431102022] Event broker module 'NERD' deinitialized successfully.
Running with -v is clean:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.0.8
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2014
License: GPL
Website: http://www.nagios.org
Reading configuration data...
Read main config file okay...
Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
Checked 816 services.
Checked 826 hosts.
Checked 11 host groups.
Checked 0 service groups.
Checked 18 contacts.
Checked 13 contact groups.
Checked 61 commands.
Checked 6 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 826 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 6 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
Also, check_nagios says we are running fine:
# /usr/local/nagios/libexec/check_nagios /var/log/nagios.log 5 '/usr/local/nagios/bin/nagios'
NAGIOS OK: 8 processes, status log updated 11 seconds ago
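One possibility is that the error means the CGIs can't access the nagios.cfg file. I have checked r-x for "other" (to cover the apache user) on every directory along the path. In any case, a permissions problem should produce an Apache error. I've been working on this for hours but can't find the point of failure or what changed.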
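A minimal sketch of that permission check, assuming bash and GNU coreutils stat (the -c '%A' format differs on BSD). It walks from the file's directory up to / and flags any directory where "other" lacks read or execute. The function name check_other_rx is made up for illustration. On Linux, util-linux's namei -l /usr/local/nagios/etc/nagios.cfg gives a similar per-directory listing without a script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check_other_rx FILE
# Walks every directory from FILE's parent up to / and prints the ones that
# "other" (which covers the apache user here) cannot read (r) or traverse (x).
# Assumes GNU coreutils `stat -c '%A'`. Returns 1 if any directory fails,
# 2 on bad input or a stat error, 0 otherwise.
check_other_rx() {
  [[ $1 == /* ]] || return 2          # absolute paths only, so the loop ends at /
  local path bad=0 mode
  path=$(dirname -- "$1")
  while :; do
    mode=$(stat -c '%A' -- "$path") || return 2
    # %A prints e.g. drwxr-x---; 0-based offsets 7 and 9 are other's r and x.
    # A sticky directory with x set shows 't' in the x slot, so accept that too.
    if [[ ${mode:7:1} != r || ${mode:9:1} != [xt] ]]; then
      echo "no other r-x: $path ($mode)"
      bad=1
    fi
    [[ $path == / ]] && break
    path=$(dirname -- "$path")
  done
  return "$bad"
}

# Example:
# check_other_rx /usr/local/nagios/etc/nagios.cfg
```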
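The main page also shows "Unable to get process status" under the Nagios Core logo. That comes from main.php calling statusjson.cgi. I'm not sure what the CGI is looking at, but when I manually run the query that main.php uses (cgi-bin/statusjson.cgi?query=programstatus), the page is blank. I've googled this and searched the Nagios forums, but everyone else seems to have some log error that gives them more of a clue.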
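When the browser shows only a blank page, running the CGI directly from a shell shows its raw output, including HTTP headers and any error text, without Apache in between. A CGI program reads its request from environment variables, so a sketch like the one below can invoke statusjson.cgi by hand. The function name run_cgi, the default REMOTE_USER of nagiosadmin, and the CGI location in the example are assumptions; substitute the user authorized in your cgi.cfg and your actual sbin path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run_cgi CGI_PATH QUERY_STRING [REMOTE_USER]
# Invokes a CGI program outside Apache by setting the standard CGI request
# variables it would normally receive from the web server.
run_cgi() {
  local cgi=$1 query=$2 user=${3:-nagiosadmin}   # nagiosadmin is an assumed default
  env REQUEST_METHOD=GET \
      GATEWAY_INTERFACE=CGI/1.1 \
      QUERY_STRING="$query" \
      REMOTE_USER="$user" \
      "$cgi"
}

# Example (path assumes a default source install under /usr/local/nagios):
# run_cgi /usr/local/nagios/sbin/statusjson.cgi 'query=programstatus'
```

To get closer to what Apache sees, run the same env command as the web server user, for example via sudo -u apache; that user name is itself an assumption and varies by distribution (www-data on Debian).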
I do have one anomaly...
I found another nagios.log that gets just a couple of lines each time I start the service:
# cat /usr/local/nagios/var/nagios.log
[1431088940] Error: Cannot open main configuration file '/' for reading!
[1431088940] Error: Failed to process config file '/'. Aborting
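The path '/' in that error suggests the daemon was handed an empty or mangled config path. One plausible cause (an assumption, not confirmed in this thread) is a path assembled from variables that were empty at that point. Tracing the init script with bash -x /etc/init.d/nagios start (the script location is an assumption) shows the exact command line it runs. A hypothetical guard like the one below could also be placed before the daemon is launched, so a bad path fails loudly instead of only showing up in a second log file.

```shell
#!/usr/bin/env bash
# Hypothetical init-script guard: check_cfg_path PATH
# Rejects an empty, relative, or unreadable main config path before it is
# passed to the nagios binary. Returns 0 if the path looks usable, 1 otherwise.
check_cfg_path() {
  local cfg=$1
  if [[ -z $cfg ]]; then
    echo "nagios config path is empty" >&2; return 1
  fi
  if [[ $cfg != /* ]]; then
    echo "nagios config path is not absolute: $cfg" >&2; return 1
  fi
  # '/' itself fails here because it is a directory, not a regular file.
  if [[ ! -f $cfg || ! -r $cfg ]]; then
    echo "nagios config file is missing or unreadable: $cfg" >&2; return 1
  fi
  return 0
}

# Example use inside an init script (variable name is illustrative):
# check_cfg_path "$NagiosCfgFile" || exit 1
```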
Maybe something is odd in the init script or the cfg file, but I can't find it. As another test, I can su to nagios and run the daemon manually:
su - nagios
[nagios@atlas ~]$ /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.0.8
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2014
License: GPL
Website: http://www.nagios.org
Nagios 4.0.8 starting... (PID=23234)
Local time is Fri May 08 13:45:12 ADT 2015
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Channel opathchecks registered successfully
nerd: Fully initialized and ready to rock!
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 23235;pid=23235
wproc: Registry request: name=Core Worker 23236;pid=23236
wproc: Registry request: name=Core Worker 23237;pid=23237
wproc: Registry request: name=Core Worker 23238;pid=23238
wproc: Registry request: name=Core Worker 23239;pid=23239
wproc: Registry request: name=Core Worker 23240;pid=23240
Successfully launched command file worker with pid 23241
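I hoped this would sidestep anything odd in the init script. It doesn't touch /usr/local/nagios/var/nagios.log (as expected), but it doesn't change the errors from the website CGIs either. Another clue: when I start nagios manually like this, I see no on-screen logging for host and service status items. If I start it via init, the nagios log shows some host performance warnings, flapping, and the usual chatter, but when started from the command line as the nagios user, there is nothing beyond the output above.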
This question did eventually make it to the Nagios Core support forum, where it was resolved:
http://support.nagios.com/forum/viewtopic.php?f=7&t=32795
In this particular case, we were missing the configuration entries state_retention_file and status_file.
However, many different kinds of problems can produce web interface errors that begin with "Whoops!".
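For the specific cause found in that thread, a quick check of the main config file can rule it in or out. The sketch below assumes the directive names are state_retention_file and status_file (the standard nagios.cfg names) and that the file uses name=value lines with # comments; the function name missing_directives is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: missing_directives CFG_FILE DIRECTIVE...
# Prints "missing: NAME" for each directive that is absent or only present
# in a commented-out line. Returns 1 if anything is missing, 0 otherwise.
missing_directives() {
  local cfg=$1; shift
  local key rc=0
  for key in "$@"; do
    # A directive counts only if it starts a line (optionally after spaces)
    # and is followed by '='; lines beginning with '#' are comments.
    if ! grep -Eq -e "^[[:space:]]*${key}[[:space:]]*=" -- "$cfg"; then
      echo "missing: $key"
      rc=1
    fi
  done
  return "$rc"
}

# Example:
# missing_directives /usr/local/nagios/etc/nagios.cfg status_file state_retention_file
```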