我正在尝试设置一个简单的流浪盒来使用 FreeIPA 进行测试。我使用的是 CentOS 7 映像,并在盒子中安装了最少的额外东西,并使用一个非常简单的 FreeIPA 定义开始。我尝试过使用简单的 shell 命令,也尝试过使用ansible-freeipa。在这两种情况下,我都看到了相同的错误,尽管它发生的频率似乎不同。在简单的 shell 命令中,它只有大约 50% 的时间失败,但使用 Ansible 似乎是 100%。
失败给了我一个类似下面的错误。
fatal: [ipaserver.test.hadoop.com]: FAILED! => {"changed": false, "module_stderr": "Shared connection to ipaserver.test.hadoop.com closed.\r\n", "module_stdout": "\u001b[?1034hTraceback (most recent call last):\r\n File "/root/.ansible/tmp/ansible-tmp-1583188576.27-186488091977372/AnsiballZ_ipaserver_setup_ca.py", line 102, in \r\n _ansiballz_main()\r\n File "/root/.ansible/tmp/ansible-tmp-1583188576.27-186488091977372/AnsiballZ_ipaserver_setup_ca.py", line 94, in _ansiballz_main\r\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\r\n File "/root/.ansible/tmp/ansible-tmp-1583188576.27-186488091977372/AnsiballZ_ipaserver_setup_ca.py", line 40, in invoke_module\r\n runpy.run_module(mod_name='ansible.modules.ipaserver_setup_ca', init_globals=None, run_name='main', alter_sys=True)\r\n File "/usr/lib64/python2.7/runpy.py", line 176, in run_module\r\n fname, loader, pkg_name)\r\n File "/usr/lib64/python2.7/runpy.py", line 82, in _run_module_code\r\n mod_name, mod_fname, mod_loader, pkg_name)\r\n File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code\r\n exec code in run_globals\r\n File "/tmp/ansible_ipaserver_setup_ca_payload_Pc9wnM/ansible_ipaserver_setup_ca_payload.zip/ansible/modules/ipaserver_setup_ca.py", line 354, in \r\n File "/tmp/ansible_ipaserver_setup_ca_payload_Pc9wnM/ansible_ipaserver_setup_ca_payload.zip/ansible/modules/ipaserver_setup_ca.py", line 345, in main\r\n File "/usr/lib/python2.7/site-packages/ipaserver/install/ca.py", line 391, in install_step_1\r\n ca.start('pki-tomcat')\r\n File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 464, in start\r\n self.service.start(instance_name, capture_output=capture_output, wait=wait)\r\n File "/usr/lib/python2.7/site-packages/ipaplatform/redhat/services.py", line 192, in start\r\n self.wait_until_running()\r\n File "/usr/lib/python2.7/site-packages/ipaplatform/redhat/services.py", line 186, in wait_until_running\r\n raise RuntimeError('CA did not start in %ss' % timeout)\r\nRuntimeError: CA did not start in 300.0s\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
查看 /var/log/messages,我看到错误发生在系统时间 23:25 到 23:27 之间的某个时间。根据错误,在重新启动 CA 时会发生这种情况。它似乎在第一次启动时就可以正常启动。
Mar 2 23:25:42 localhost systemd: Stopped PKI Tomcat Server pki-tomcat.
Mar 2 23:25:43 localhost systemd: Starting PKI Tomcat Server pki-tomcat...
Mar 2 23:26:13 localhost pkidaemon: -----------------------
Mar 2 23:26:13 localhost pkidaemon: Banner is not installed
Mar 2 23:26:13 localhost pkidaemon: -----------------------
Mar 2 23:27:07 localhost pkidaemon: ----------------------
Mar 2 23:27:08 localhost pkidaemon: Enabled all subsystems
Mar 2 23:27:08 localhost pkidaemon: ----------------------
Mar 2 23:27:18 localhost systemd: [email protected] start-pre operation timed out. Terminating.
Mar 2 23:27:18 localhost systemd: Failed to start PKI Tomcat Server pki-tomcat.
Mar 2 23:27:18 localhost systemd: Unit [email protected] entered failed state.
Mar 2 23:27:18 localhost systemd: [email protected] failed.
但是,当我查看 PKI 日志时,在此时间范围内没有任何内容。这些是 /var/log/pki/pki-ca-spawn.20200302231442.log 中的最后几行:
2020-03-02 23:18:32 pkispawn : INFO END spawning subsystem 'CA' of instance 'pki-tomcat'
2020-03-02 23:18:32 pkispawn : INFO ... archiving configuration into '/var/log/pki/pki-tomcat/ca/archive/spawn_deployment.cfg.20200302231442'
2020-03-02 23:18:32 pkispawn : INFO ....... cp -p /etc/sysconfig/pki/tomcat/pki-tomcat/ca/deployment.cfg /var/log/pki/pki-tomcat/ca/archive/spawn_deployment.cfg.20200302231442
2020-03-02 23:18:32 pkispawn : DEBUG ........... chmod 660 /var/log/pki/pki-tomcat/ca/archive/spawn_deployment.cfg.20200302231442
2020-03-02 23:18:32 pkispawn : DEBUG ........... chown 17:17 /var/log/pki/pki-tomcat/ca/archive/spawn_deployment.cfg.20200302231442
2020-03-02 23:18:32 pkispawn : INFO ... archiving manifest into '/var/log/pki/pki-tomcat/ca/archive/spawn_manifest.20200302231442'
2020-03-02 23:18:32 pkispawn : INFO ....... cp -p /etc/sysconfig/pki/tomcat/pki-tomcat/ca/manifest /var/log/pki/pki-tomcat/ca/archive/spawn_manifest.20200302231442
2020-03-02 23:18:32 pkispawn : DEBUG ........... chmod 660 /var/log/pki/pki-tomcat/ca/archive/spawn_manifest.20200302231442
2020-03-02 23:18:32 pkispawn : DEBUG ........... chown 17:17 /var/log/pki/pki-tomcat/ca/archive/spawn_manifest.20200302231442
/var/log/pki/pki-tomcat/ca/debug 相同:
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: getConn: mNumConns now 4
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: returnConn: mNumConns now 5
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: In LdapBoundConnFactory::getConn()
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: masterConn is connected: true
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: getConn: conn is connected true
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: getConn: mNumConns now 4
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: returnConn: mNumConns now 5
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: CMSServlet.java: renderTemplate
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: CMSServlet.java: xml parameter detected, returning xml
[02/Mar/2020:23:25:00][http-bio-8080-exec-14]: CMSServlet: curDate=Mon Mar 02 23:25:00 UTC 2020 id=caDisplayCertFromRequest time=144
/var/log/pki/pki-tomcat/ca/system 有一些错误,但在 23:25 之后没有:
0.localhost-startStop-1 - [02/Mar/2020:23:15:08 UTC] [13] [3] authz instance DirAclAuthz initialization failed and skipped, error=Property internaldb.ldapconn.port missing value
0.http-bio-8443-exec-3 - [02/Mar/2020:23:17:53 UTC] [3] [3] CASigningUnit: Object certificate not found. Error Certificate object not found
0.http-bio-8443-exec-3 - [02/Mar/2020:23:17:54 UTC] [11] [3] UGSubsystem: Get User Error netscape.ldap.LDAPException: error result (32); matchedDN = ou=People,o=ipaca
0.Thread-16 - [02/Mar/2020:23:25:00 UTC] [8] [3] Publishing: Could not publish certificate serial number 0x7. Error Failed to publish using rule: No rules enabled
我不知道是什么原因造成的。有任何想法吗?vagrant 文件和 hosts 文件都位于下面的 GitHub 存储库中:https ://github.com/davidov541/HadoopOnVagrant/tree/AnsibleRetrofit/FreeIPA
最后的ansible错误:
这让我想到了它出现在邮件列表中的一些东西,并且通过将脚本上的starting_timeout 变量设置为更高的值来修复它。
解决方案在这里描述:
https://www.freeipa.org/page/HowTo/FreeIPA_on_banana_pi
也许可以尝试一下,我不能保证它会为你解决它,但是......
我最终将我的 vagrant box 大小增加到 2 GB 以解决我看到的另一个问题,并且自昨天以来重新创建了 10 到 20 次后,我再也没有看到这个问题。基于此,我认为问题是由于 tomcat 没有足够的内存来启动它需要的,导致我们看到的行为。