The person who administered our cluster recently passed away unexpectedly, so we have to operate it ourselves until someone new joins. We want to change the hard runtime limit on the cluster's nodes. For some reason, all nodes in the queue have the desired hard runtime limit except for one.
How do I set h_rt=x for a given x?
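A minimal sketch of where I believe this is configured, assuming a (Son of) Grid Engine setup where h_rt is a limit in the queue configuration; the queue name all.q, the host name node13 and the value 72:00:00 are placeholders:
# Show the h_rt limit currently defined on the queue
qconf -sq all.q | grep h_rt
# Edit the queue interactively; a per-host override looks like
#   h_rt   72:00:00,[node13=72:00:00]
qconf -mq all.q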
I set up a cluster using Slurm, consisting of a head node, 16 compute nodes, and a NAS with NFS-4 network-shared storage. I recently installed Slurm via apt on Ubuntu 22 (sinfo -V reports slurm-wlm 21.08.5). I have tested some single-node and multi-node jobs, and I can get jobs to run to completion as expected.
However, for certain simulations, some nodes keep changing their state to down mid-simulation. Although it seems random, it is the same two nodes that show this behavior. It happens often, yet I believe we have completed some simulations on these nodes before. On the nodes whose state changes to down, the slurmd daemon is still active -- that is, whatever is failing is not due to the daemon going down.
Overall: why do these nodes kill jobs and get their state set to down?
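For reference, the reason the controller records when it marks a node down can be read with the standard Slurm commands below (a minimal sketch; cn4 is one of the affected nodes):
# State and reason recorded by slurmctld for the node
scontrol show node cn4 | grep -i -E 'State|Reason'
# All nodes currently down or drained, with their reasons
sinfo -R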
More info: I checked the slurmd log on one of the failing nodes, and this is what we get (covering roughly the span from job submission to the failure/node crash). Note this is for a job (ID=64) submitted to 4 nodes with all (64) processors per node:
[2023-12-07T16:48:29.487] [64.extern] debug2: setup for a launch_task
[2023-12-07T16:48:29.487] [64.extern] debug2: hwloc_topology_init
[2023-12-07T16:48:29.491] [64.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.493] [64.extern] debug: CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.494] [64.extern] debug: cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:29.499] [64.extern] debug: Message thread started pid = 4176
[2023-12-07T16:48:29.503] [64.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.507] [64.extern] debug: task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.507] [64.extern] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.507] [64.extern] debug: task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.509] [64.extern] debug: task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.510] [64.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug: job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug: job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-12-07T16:48:29.513] [64.extern] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.513] [64.extern] debug: /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-12-07T16:48:29.555] [64.extern] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] debug: cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.579] [64.extern] debug2: adding task 3 pid 4185 on node 3 to jobacct
[2023-12-07T16:48:29.582] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-12-07T16:48:29.830] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] launch task StepId=64.0 request from UID:1000 GID:1000 HOST:10.115.79.9 PORT:48642
[2023-12-07T16:48:29.830] debug: Checking credential with 868 bytes of sig data
[2023-12-07T16:48:29.830] task/affinity: lllp_distribution: JobId=64 manual binding: none,one_thread
[2023-12-07T16:48:29.830] debug: Waiting for job 64's prolog to complete
[2023-12-07T16:48:29.830] debug: Finished wait for job 64's prolog to complete
[2023-12-07T16:48:29.839] debug2: debug level read from slurmd is 'debug2'.
[2023-12-07T16:48:29.839] debug2: read_slurmd_conf_lite: slurmd sent 8 TRES.
[2023-12-07T16:48:29.839] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:48:29.839] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:48:29.839] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:48:29.839] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:48:29.839] debug2: Received CPU frequency information for 64 CPUs
[2023-12-07T16:48:29.840] debug: switch/none: init: switch NONE plugin loaded
[2023-12-07T16:48:29.840] debug: switch Cray/Aries plugin loaded.
[2023-12-07T16:48:29.840] [64.0] debug2: setup for a launch_task
[2023-12-07T16:48:29.840] [64.0] debug2: hwloc_topology_init
[2023-12-07T16:48:29.845] [64.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.846] [64.0] debug: CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.847] [64.0] debug: cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.851] [64.0] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:48:29.852] [64.0] debug: Message thread started pid = 4188
[2023-12-07T16:48:29.852] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.857] [64.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.861] [64.0] debug: task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.861] [64.0] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.861] [64.0] debug: task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.863] [64.0] debug: task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.863] [64.0] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug: job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug: mpi type = none
[2023-12-07T16:48:29.863] [64.0] debug2: Before call to spank_init()
[2023-12-07T16:48:29.863] [64.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.864] [64.0] debug: /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.864] [64.0] debug2: After call to spank_init()
[2023-12-07T16:48:29.864] [64.0] debug: mpi type = (null)
[2023-12-07T16:48:29.864] [64.0] debug: mpi/none: p_mpi_hook_slurmstepd_prefork: mpi/none: slurmstepd prefork
[2023-12-07T16:48:29.864] [64.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-07T16:48:29.883] [64.0] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] debug: cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.886] [64.0] debug2: hwloc_topology_load
[2023-12-07T16:48:29.918] [64.0] debug2: hwloc_topology_export_xml
[2023-12-07T16:48:29.922] [64.0] debug2: Entering _setup_normal_io
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: entering
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 2
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: leaving
[2023-12-07T16:48:29.923] [64.0] debug2: Leaving _setup_normal_io
[2023-12-07T16:48:29.923] [64.0] debug levels are stderr='error', logfile='debug2', syslog='quiet'
[2023-12-07T16:48:29.923] [64.0] debug: IO handler started pid=4188
[2023-12-07T16:48:29.925] [64.0] starting 1 tasks
[2023-12-07T16:48:29.925] [64.0] task 2 (4194) started 2023-12-07T16:48:29
[2023-12-07T16:48:29.926] [64.0] debug: Setting slurmstepd oom_adj to -1000
[2023-12-07T16:48:29.926] [64.0] debug: job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.959] [64.0] debug2: adding task 2 pid 4194 on node 2 to jobacct
[2023-12-07T16:48:29.960] [64.0] debug: Sending launch resp rc=0
[2023-12-07T16:48:29.961] [64.0] debug: mpi type = (null)
[2023-12-07T16:48:29.961] [64.0] debug: mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2023-12-07T16:48:29.961] [64.0] debug: task/affinity: task_p_pre_launch: affinity StepId=64.0, task:2 bind:none,one_thread
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_STACK : max:inf cur:inf req:8388608
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_CORE : max:inf cur:inf req:0
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NPROC : max:1030021 cur:1030021 req:1030020
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:4096 req:1024
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:33761472512
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2023-12-07T16:48:59.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:59.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:51:03.457] debug: Log file re-opened
[2023-12-07T16:51:03.457] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.457] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.457] debug2: hwloc_topology_init
[2023-12-07T16:51:03.462] debug2: hwloc_topology_load
[2023-12-07T16:51:03.480] debug2: hwloc_topology_export_xml
[2023-12-07T16:51:03.482] debug: CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.483] debug2: hwloc_topology_init
[2023-12-07T16:51:03.484] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:51:03.485] debug: CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.485] topology/none: init: topology NONE plugin loaded
[2023-12-07T16:51:03.485] route/default: init: route default plugin loaded
[2023-12-07T16:51:03.485] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-07T16:51:03.487] debug: Resource spec: No specialized cores configured by default on this node
[2023-12-07T16:51:03.487] debug: Resource spec: Reserved system memory limit not configured for this node
[2023-12-07T16:51:03.490] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:51:03.490] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:51:03.490] debug: auth/munge: init: Munge authentication plugin loaded
[2023-12-07T16:51:03.490] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:51:03.490] debug: /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:51:03.491] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:51:03.491] slurmd version 21.08.5 started
[2023-12-07T16:51:03.491] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:51:03.491] debug: job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:51:03.491] debug: job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:51:03.492] debug: job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-07T16:51:03.492] debug: job_container/tmpfs: _restore_ns: _restore_ns: Job 58 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug: job_container/tmpfs: _restore_ns: _restore_ns: Job 56 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/56/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.492] error: _restore_ns: failed to connect to stepd for 64.
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug: job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug: job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.
[2023-12-07T16:51:03.493] debug: switch/none: init: switch NONE plugin loaded
[2023-12-07T16:51:03.493] debug: switch Cray/Aries plugin loaded.
[2023-12-07T16:51:03.493] slurmd started on Thu, 07 Dec 2023 16:51:03 -0600
[2023-12-07T16:51:03.493] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-07T16:51:03.494] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:51:03.494] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:51:03.495] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:51:03.495] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:51:03.495] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-07T16:51:03.499] debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-07T16:51:03.500] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug: _rpc_terminate_job: uid = 64030 JobId=64
[2023-12-07T16:51:03.500] debug: credential for job 64 revoked
[2023-12-07T16:51:03.500] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug: signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug: signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 998
[2023-12-07T16:51:03.500] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug: signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug: signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 18
[2023-12-07T16:51:03.500] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug: signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug: _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug: signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 15
[2023-12-07T16:51:03.500] debug2: set revoke expiration for jobid 64 to 1701989583 UTS
[2023-12-07T16:51:03.501] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.501] error: container_g_delete(64): Invalid argument
[2023-12-07T16:51:03.501] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
I immediately see some errors related to /var/nvme/storage (this is a local folder on each node, not a network-shared location on the NAS), but it is set up identically on all nodes and only causes problems on a couple of them. Note that this is the base path set in job_container.conf:
AutoBasePath=true
BasePath=/var/nvme/storage
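Given the umount2 ... Invalid argument errors in the log, a quick sanity check of that base path on the failing node looks like this (a minimal sketch, run on cn4):
# Verify the job_container base path exists and what it is mounted on
df -h /var/nvme/storage
ls -ld /var/nvme/storage
mount | grep /var/nvme/storage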
Also, here is cgroup.conf:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup
...and slurm.conf:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cauchy
SlurmctldHost=cauchy
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=1000000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=contain
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=120
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
MaxArraySize=100000
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_job_part_count_reserve=5,bf_continue
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
####### Priority Begin ##################
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightAge=100
PriorityWeightPartition=10000
PriorityWeightJobSize=0
PriorityMaxAge=14-0
PriorityFavorSmall=YES
#PriorityWeightQOS=10000
#PriorityWeightTRES=cpu=2000,mem=1,gres/gpu=400
#AccountingStorageTRES=gres/gpu
#AccountingStorageEnforce=all
#FairShareDampeningFactor=5
####### Priority End ##################
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobContainerType=job_container/tmpfs
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn1 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn2 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn3 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn4 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn5 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn6 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn7 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn8 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn9 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn10 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn11 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn12 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn13 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn14 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn15 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn16 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP PriorityJobFactor=2000
PartitionName=low Nodes=ALL MaxTime=INFINITE State=UP PriorityJobFactor=1000
EDIT: I cancelled a job (submitted to 4 nodes, cn1 - cn4) that showed this error before it could reallocate to new nodes / overwrite the Slurm error file. Here are the contents of the error file:
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
slurmstepd-cn1: error: *** JOB 74 ON cn1 CANCELLED AT 2023-12-10T10:05:57 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
The Authorization required... error is pervasive on all nodes / all simulations, so I'm not sure it is of particular consequence for the consistent failure of the cn4 node. The segfault doesn't appear when the same job is run on other nodes, so this is new info / anomalous. The slurmctld log is not particularly illuminating:
[2023-12-09T19:54:46.401] _slurm_rpc_submit_batch_job: JobId=65 InitPrio=10000 usec=493
[2023-12-09T19:54:46.826] sched/backfill: _start_job: Started JobId=65 in main on cn[1-4]
[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702173678 last response=1702173587
[2023-12-09T20:01:33.529] requeue job JobId=65 due to failure of node cn4
[2023-12-09T20:01:38.334] Requeuing JobId=65
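Since the controller requeued the job because cn4 unexpectedly rebooted, the next step I can think of (a minimal sketch; assumes persistent journald logging on cn4) is to look at what the node itself logged right before it went down:
# Reboot history on the node
last -x reboot | head
# Kernel messages from the previous boot (requires a persistent journal)
journalctl -k -b -1 | tail -n 50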
I have a DRBD cluster in which one node was switched off for a few days. Running on the single node worked fine without any issues. When I switched the other node back on, I ran into a situation where all resources were stopped, one DRBD volume was Secondary and the others Primary, because the cluster apparently tried to perform a role swap onto the node that had just been switched on (ha1 was active, then I switched ha2 on at 08:06, to make the logs easier to follow).
My question:
Below is all the information I can think of that might be useful.
bash-5.1# cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
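For completeness, the per-resource DRBD roles can also be queried with drbdadm (a sketch; the resource names lv_blob, lv_postgres and lv_home are taken from the pcs config further below):
drbdadm role lv_blob
drbdadm role lv_postgres
drbdadm role lv_home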
The cluster status ends up as
bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
* Stack: corosync
* Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
* Last updated: Thu Aug 10 08:38:40 2023
* Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
* 2 nodes configured
* 14 resource instances configured
Node List:
* Online: [ ha1.local ha2.local ]
Full List of Resources:
* Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
* Promoted: [ ha2.local ]
* Unpromoted: [ ha1.local ]
* Resource Group: nsdrbd:
* LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
* LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
* LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
* ClusterIP (ocf:heartbeat:IPaddr2): Stopped
* Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
* Promoted: [ ha1.local ]
* Unpromoted: [ ha2.local ]
* postgresql (systemd:postgresql): Stopped
* Clone Set: LV_HOME-clone [LV_HOME] (promotable):
* Promoted: [ ha1.local ]
* Unpromoted: [ ha2.local ]
* ns_mhswdog (lsb:mhswdog): Stopped
* Clone Set: pingd-clone [pingd]:
* Started: [ ha1.local ha2.local ]
Failed Resource Actions:
* LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
* LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
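Since the two failed actions above are promote timeouts, my understanding is that they can be cleared so Pacemaker retries the promotion (a minimal sketch using standard pcs commands):
pcs resource cleanup LV_BLOB-clone
pcs resource cleanup LV_POSTGRES-clone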
I am attaching the logs from both nodes
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 0 is up
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
and
Aug 10 08:06:55 [1128] ha2.local corosync notice [MAIN ] Corosync Cluster Engine 3.1.5 starting up
Aug 10 08:06:55 [1128] ha2.local corosync info [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 10 08:06:56 [1128] ha2.local corosync notice [TOTEM ] Initializing transport (Kronosnet).
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] totemknet initialized
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cmap
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cfg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cpg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Using quorum provider corosync_votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: quorum
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 0
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 0: local addr: 192.168.51.216, port=5405
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 1
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 1: local addr: 10.0.0.2, port=5406
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [TOTEM ] A new membership (2.126) was formed. Members joined: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 0 is up
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 469
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] This node is within the primary component and will provide service.
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
I am also attaching the output of pcs config show; I can provide pcs cluster cib on request as well
Cluster Name: HA
Corosync Nodes:
ha1.local ha2.local
Pacemaker Nodes:
ha1.local ha2.local
Resources:
Resource: postgresql (class=systemd type=postgresql)
Operations:
monitor: postgresql-monitor-interval-60s
interval=60s
start: postgresql-start-interval-0s
interval=0s
timeout=100
stop: postgresql-stop-interval-0s
interval=0s
timeout=100
Resource: ns_mhswdog (class=lsb type=mhswdog)
Operations:
force-reload: ns_mhswdog-force-reload-interval-0s
interval=0s
timeout=15
monitor: ns_mhswdog-monitor-interval-60s
interval=60s
timeout=10s
on-fail=standby
restart: ns_mhswdog-restart-interval-0s
interval=0s
timeout=140s
start: ns_mhswdog-start-interval-0s
interval=0s
timeout=80s
stop: ns_mhswdog-stop-interval-0s
interval=0s
timeout=80s
Group: nsdrbd
Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_BLOBFS-instance_attributes
device=/dev/drbd0
directory=/data
fstype=ext4
Operations:
monitor: LV_BLOBFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_BLOBFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_BLOBFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_POSTGRESFS-instance_attributes
device=/dev/drbd1
directory=/var/lib/pgsql
fstype=ext4
Operations:
monitor: LV_POSTGRESFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_POSTGRESFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_POSTGRESFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_HOMEFS-instance_attributes
device=/dev/drbd2
directory=/home
fstype=ext4
Operations:
monitor: LV_HOMEFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_HOMEFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_HOMEFS-stop-interval-0s
interval=0s
timeout=60s
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ClusterIP-instance_attributes
cidr_netmask=32
ip=192.168.51.75
Operations:
monitor: ClusterIP-monitor-interval-60s
interval=60s
start: ClusterIP-start-interval-0s
interval=0s
timeout=20s
stop: ClusterIP-stop-interval-0s
interval=0s
timeout=20s
Clone: LV_BLOB-clone
Meta Attributes: LV_BLOB-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
Attributes: LV_BLOB-instance_attributes
drbd_resource=lv_blob
Operations:
demote: LV_BLOB-demote-interval-0s
interval=0s
timeout=90
monitor: LV_BLOB-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_BLOB-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_BLOB-notify-interval-0s
interval=0s
timeout=90
promote: LV_BLOB-promote-interval-0s
interval=0s
timeout=90
reload: LV_BLOB-reload-interval-0s
interval=0s
timeout=30
start: LV_BLOB-start-interval-0s
interval=0s
timeout=240
stop: LV_BLOB-stop-interval-0s
interval=0s
timeout=100
Clone: LV_POSTGRES-clone
Meta Attributes: LV_POSTGRES-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
Attributes: LV_POSTGRES-instance_attributes
drbd_resource=lv_postgres
Operations:
demote: LV_POSTGRES-demote-interval-0s
interval=0s
timeout=90
monitor: LV_POSTGRES-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_POSTGRES-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_POSTGRES-notify-interval-0s
interval=0s
timeout=90
promote: LV_POSTGRES-promote-interval-0s
interval=0s
timeout=90
reload: LV_POSTGRES-reload-interval-0s
interval=0s
timeout=30
start: LV_POSTGRES-start-interval-0s
interval=0s
timeout=240
stop: LV_POSTGRES-stop-interval-0s
interval=0s
timeout=100
Clone: LV_HOME-clone
Meta Attributes: LV_HOME-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_HOME (class=ocf provider=linbit type=drbd)
Attributes: LV_HOME-instance_attributes
drbd_resource=lv_home
Operations:
demote: LV_HOME-demote-interval-0s
interval=0s
timeout=90
monitor: LV_HOME-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_HOME-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_HOME-notify-interval-0s
interval=0s
timeout=90
promote: LV_HOME-promote-interval-0s
interval=0s
timeout=90
reload: LV_HOME-reload-interval-0s
interval=0s
timeout=30
start: LV_HOME-start-interval-0s
interval=0s
timeout=240
stop: LV_HOME-stop-interval-0s
interval=0s
timeout=100
Clone: pingd-clone
Resource: pingd (class=ocf provider=pacemaker type=ping)
Attributes: pingd-instance_attributes
dampen=6s
host_list=192.168.51.251
multiplier=1000
Operations:
monitor: pingd-monitor-interval-10s
interval=10s
timeout=60s
reload-agent: pingd-reload-agent-interval-0s
interval=0s
timeout=20s
start: pingd-start-interval-0s
interval=0s
timeout=60s
stop: pingd-stop-interval-0s
interval=0s
timeout=20s
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: ClusterIP
Constraint: location-ClusterIP
Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
Meta Attrs: build-resource-defaults
resource-stickiness=INFINITY
Operations Defaults:
Meta Attrs: op_defaults-meta_attributes
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: HA
dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
have-watchdog: false
last-lrm-refresh: 1688971748
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false
Tags:
No tags defined
Quorum:
Options:
The command ceph-deploy install admin datanode_dn2
fails with the output:
[ceph_deploy.install][INFO ] Distro info: rocky 9.1 blue onyx
[admin][INFO ] installing Ceph on admin
[admin][INFO ] Running command: sudo yum clean all
[admin][DEBUG ] 57 files removed
[admin][INFO ] Running command: sudo yum -y install epel-release
[admin][DEBUG ] CentOS-9-stream - Ceph Quincy 113 kB/s | 474 kB 00:04
[admin][DEBUG ] Ceph aarch64 87 B/s | 257 B 00:02
[admin][DEBUG ] Ceph noarch 2.4 kB/s | 8.8 kB 00:03
[admin][DEBUG ] Ceph SRPMS 629 B/s | 1.8 kB 00:02
[admin][DEBUG ] Extra Packages for Enterprise Linux 9 - aarch64 3.7 MB/s | 14 MB 00:03
[admin][DEBUG ] Rocky Linux 9 - BaseOS 544 kB/s | 1.4 MB 00:02
[admin][DEBUG ] Rocky Linux 9 - AppStream 2.0 MB/s | 5.5 MB 00:02
[admin][DEBUG ] Rocky Linux 9 - Extras 3.1 kB/s | 9.1 kB 00:02
[admin][DEBUG ] Package epel-release-9-4.el9.noarch is already installed.
[admin][DEBUG ] Dependencies resolved.
[admin][DEBUG ] Nothing to do.
[admin][DEBUG ] Complete!
[ceph_deploy][ERROR ] RuntimeError: configparser.NoSectionError: No section: 'main'
I am not quite sure which file ceph-deploy is complaining about: it is definitely not ~/.cephdeploy.conf or ceph.conf. I could also use a debugger, except that running under a debugger loses the information about where the config file is located.
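A minimal sketch of how the two candidate files mentioned above can be checked directly with Python's configparser (the same parser the error comes from), to see which file it can read and which sections it finds; the ./ceph.conf path is just a guess at where the ceph.conf in question lives:
for f in ~/.cephdeploy.conf ./ceph.conf; do
  echo "== $f"
  python3 -c "import configparser; c = configparser.ConfigParser(); print(c.read('$f')); print(c.sections())"
done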
Since auto-scaling is used, it looks like there is not much difference from a cluster, except that the scale changes automatically.
I think what they have in common is that both improve availability by distributing traffic across multiple instances.
The difference is that auto-scaling can add and remove instances.
I would like to know whether my understanding is correct.
I would be grateful if you could attach links for reference.
I have configured three hypervisors with OnApp. Since all hypervisors use SAN storage, if any hypervisor goes down, the VPSs hosted on it are booted on the other two hypervisors.
Each hypervisor has 12 cores, so the main question is: can I assign the CPU cores of 2 hypervisors to a single VPS?
For example, hypervisor 1 has 12 cores and hypervisor 2 has 12 cores, so can I assign 24 cores to 1 VPS on this cluster?
Any answer or clarification would be helpful.
Environment
Error in crash.log
2022-02-08 22:42:45 =CRASH REPORT====
  crasher:
    initial call: pgsql_proto:init/1
    pid: <0.27318.6018>
    registered_name: []
    exception exit: {{init,{error,timeout}},
                     [{gen_server,init_it,6,[{file,"gen_server.erl"},{line,349}]},
                      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: ['ejabberd_sql_vhost1.xmpp_12','ejabberd_sql_sup_vhost1.xmpp',ejabberd_db_sup,ejabberd_sup,<0.87.0>]
    message_queue_len: 0
    messages: []
    links: []
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 997
  neighbours:
Error description: I am trying to upgrade from eJabberd 20.04 to 20.07. My cluster setup has three nodes. Two nodes were rolling-upgraded successfully. When node1 tried to leave the cluster for the upgrade, it gave the following error:
Failed RPC connection to the node '[email protected]': timeout
When I try ejabberdctl status, it returns the following: The node '[email protected]' is started with status: started Failed RPC connection to the node '[email protected]': {'EXIT', {timeout, {gen_server, call, [application_controller, which_applications]}}}
On the Erlang shell, the node still shows up as part of the cluster:
nodes().
['[email protected]','[email protected]']
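For reference, the cluster membership as ejabberd sees it can also be checked, and a stuck node forced out, with the stock ejabberdctl commands (a sketch; the node name below is a placeholder, not my real node name):
# List the nodes ejabberd believes are in the cluster
ejabberdctl list_cluster
# From a healthy node, force another node out of the cluster
ejabberdctl leave_cluster 'ejabberd@node1.example.com'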
Can you help me resolve this issue?
Hello, when I create an EMR cluster, the status says it is being created, but after 58 minutes it throws an error saying Master - 1: Error provisioning instances. Error message (error screenshot attached). I tried multiple times, but all attempts failed.
I am following the AWS documentation on how to create an EMR cluster
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html
Creating an EMR cluster on AWS (image from the attached documentation)
Where am I going wrong? I want to create the EMR cluster successfully and attach a Jupyter notebook to it. Is there documentation for creating the cluster successfully and keeping it running instead of being terminated after 58 minutes?
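For what it's worth, the provisioning failure reason can usually be pulled with the AWS CLI (a minimal sketch; assumes the CLI is configured, and j-XXXXXXXXXXXXX is a placeholder cluster id):
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.StateChangeReason'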
Please suggest what I have to do.
Thank you.
I am running a simple bare-metal multi-master "high availability" environment with 2 masters and 2 workers, plus another VM running HAProxy as an external load balancer.
My question is: is it possible to access services (dashboard, nginx, mysql (especially mysql), etc...) from outside the cluster, exposing them to the network with this setup I am running?
I have tried using MetalLB in this environment to expose services as LoadBalancer, but it did not seem to work, and since I am somewhat new to Kubernetes, I do not know why.
EDIT: It is working now. Following @c4f4t0r's suggestion, instead of an external HAProxy load balancer, that same VM became a third master node; like the other masters, each of them now runs an internal instance of HAProxy and Keepalived, the VM that used to be the external LB is now the endpoint host through which the other nodes join the cluster, MetalLB runs inside the cluster, and the nginx ingress controller directs requests to the requested service.
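In case it helps others, a MetalLB Layer 2 address pool in this kind of bare-metal setup looks roughly like the sketch below (this uses the legacy ConfigMap format of MetalLB <= 0.12; the address range is a placeholder and must be free IPs on the node network):
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.1.240-192.168.1.250
EOF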
Follow this documentation to set up a highly available Kubernetes cluster using Ubuntu 20.04 LTS.
This documentation guides you through setting up a cluster with two master nodes, one worker node and a load balancer node using HAProxy.
Role | FQDN | IP | OS | RAM | CPU |
---|---|---|---|---|---|
Load Balancer | loadbalancer.example.com | 192.168.44.100 | Ubuntu 21.04 | 1G | 1 |
Master | kmaster1.example.com | 10.84.44.51 | Ubuntu 21.04 | 2G | 2 |
Master | kmaster2.example.com | 192.168.44.50 | Ubuntu 21.04 | 2G | 2 |
Worker | kworker1.example.com | 10.84.44.50 | Ubuntu 21.04 | 2G | 2 |
Worker | kworker2.example.com | 192.168.44.51 | Ubuntu 21.04 | 2G | 2 |
- The password for the root account on all these virtual machines is kubeadmin
- Perform all commands as the root user unless otherwise specified
If you want to try this in a virtualized environment on your workstation
apt update && apt install -y haproxy
Append the following lines to /etc/haproxy/haproxy.cfg
frontend kubernetes-frontend
bind 192.168.44.100:6443
mode tcp
option tcplog
default_backend kubernetes-backend
backend kubernetes-backend
mode tcp
option tcp-check
balance roundrobin
server kmaster1 10.84.44.51:6443 check fall 3 rise 2
server kmaster2 192.168.44.50:6443 check fall 3 rise 2
systemctl restart haproxy
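A quick check that HAProxy is actually listening on the frontend and that the API server port is reachable through it (a sketch; nc assumes netcat is installed):
# On the load balancer: confirm haproxy is bound to 6443
ss -tlnp | grep 6443
# From a master or worker: confirm the frontend is reachable
nc -zv 192.168.44.100 6443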
ufw disable
swapoff -a; sed -i '/swap/d' /etc/fstab
cat >>/etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
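If sysctl reports that the bridge-nf-call keys do not exist, the br_netfilter module is likely not loaded yet; a minimal sketch to load it and keep it loaded across reboots:
modprobe br_netfilter
echo br_netfilter > /etc/modules-load.d/br_netfilter.conf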
{
apt install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt update && apt install -y docker-ce containerd.io
}
{
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
}
apt update && apt install -y kubeadm=1.19.2-00 kubelet=1.19.2-00 kubectl=1.19.2-00
kubeadm init --control-plane-endpoint="192.168.44.100:6443" --upload-certs
Copy the commands for joining the other master nodes and worker nodes.
kubectl --kubeconfig=/etc/kubernetes/admin.conf create -f https://docs.projectcalico.org/v3.15/manifests/calico.yaml
Use the corresponding kubeadm join command that you copied from the output of the kubeadm init command on the first master.
IMPORTANT: when joining another master node, you also need to pass --apiserver-advertise-address to the join command.
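For illustration, joining kmaster2 as a control-plane node would look roughly like this (a sketch; the token, CA cert hash and certificate key are placeholders that come from the kubeadm init output above):
kubeadm join 192.168.44.100:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <certificate-key> \
  --apiserver-advertise-address=192.168.44.50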
I have 2 Kubernetes clusters in the IBM cloud, one with 2 nodes, the other with 4.
The one with 4 nodes is working fine, but on the other one I had to temporarily remove the worker nodes for monetary reasons (they shouldn't be paid for while sitting idle).
When I reactivated the two nodes, everything seemed to start up fine, and as long as I do not try to interact with Pods it still looks fine on the surface: no messages about unavailability or critical health status. OK, I did delete two obsolete Namespaces that were stuck in the Terminating state, but I could resolve that issue by restarting a cluster node (I no longer know exactly which one it was).
When everything looked fine, I tried to access the kubernetes dashboard (everything done so far was done at the IBM management level or on the command line), but surprisingly I found it unreachable, with an error page in the browser saying:
503 Service Unavailable
At the bottom of that page there is a small JSON message that says:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": { },
"status": "Failure",
"message": "error trying to reach service: read tcp 172.18.190.60:39946-\u003e172.19.151.38:8090: read: connection reset by peer",
"reason": "ServiceUnavailable",
"code": 503
}
I issued a kubectl logs kubernetes-dashboard-54674bdd65-nf6w7 --namespace=kube-system, where the Pod is shown as running, but the result was not the logs I wanted to see but this message instead:
Error from server: Get "https://10.215.17.75:10250/containerLogs/kube-system/kubernetes-dashboard-54674bdd65-nf6w7/kubernetes-dashboard":
read tcp 172.18.135.195:56882->172.19.151.38:8090:
read: connection reset by peer
Then I found out that I can neither get the logs of any Pod running in that cluster, nor deploy any new custom kubernetes object that needs to be scheduled (I can actually apply Services or ConfigMaps, but no Pod, ReplicaSet, Deployment, or similar).
I have already tried a few things, including restarting the Deployment, but unfortunately none of those actions changed anything about the Pods.
There is one more thing that might be related (though I am not quite sure it really is):
In the other, well-functioning cluster there are three calico Pods running, and all three are up, whereas in the problematic cluster only two of the three calico Pods are up and running; the third stays in Pending state, and a kubectl describe pod calico-blablabla-blabla reveals the reason in an Event:
Warning FailedScheduling 13s default-scheduler
0/2 nodes are available: 2 node(s) didn't have free ports for the requested pod ports.
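For reference, this is how I can see which Pods are sitting on those host ports on the two nodes (a minimal sketch; calico-node is assumed to be the name of the Calico DaemonSet in kube-system):
kubectl get pods -n kube-system -o wide | grep calico
kubectl get daemonset calico-node -n kube-system -o wide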
Does anyone have an idea what is going on in that cluster, and can you point me to possible solutions? I really do not want to delete the cluster and spin up a new one.
The result of kubectl describe pod kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system:
Name: kubernetes-dashboard-54674bdd65-4m2ch
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: 10.215.17.82/10.215.17.82
Start Time: Mon, 15 Nov 2021 09:01:30 +0100
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=54674bdd65
Annotations: cni.projectcalico.org/containerID: ca52cefaae58d8e5ce6d54883cb6a6135318c8db53d231dc645a5cf2e67d821e
cni.projectcalico.org/podIP: 172.30.184.2/32
cni.projectcalico.org/podIPs: 172.30.184.2/32
container.seccomp.security.alpha.kubernetes.io/kubernetes-dashboard: runtime/default
kubectl.kubernetes.io/restartedAt: 2021-11-10T15:47:14+01:00
kubernetes.io/psp: ibm-privileged-psp
Status: Running
IP: 172.30.184.2
IPs:
IP: 172.30.184.2
Controlled By: ReplicaSet/kubernetes-dashboard-54674bdd65
Containers:
kubernetes-dashboard:
Container ID: containerd://bac57850055cd6bb944c4d893a5d315c659fd7d4935fe49083d9ef8ae03e5c31
Image: registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard:v2.3.1
Image ID: registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard@sha256:f14f581d36b83fc9c1cfa3b0609e7788017ecada1f3106fab1c9db35295fe523
Port: 8443/TCP
Host Port: 0/TCP
Args:
--auto-generate-certificates
--namespace=kube-system
State: Running
Started: Mon, 15 Nov 2021 09:01:37 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 50m
memory: 100Mi
Liveness: http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get https://:8443/ delay=10s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc9kw (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-sc9kw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 600s
node.kubernetes.io/unreachable:NoExecute op=Exists for 600s
Events: <none>