mkasemer

Asked: 2023-12-10 08:56:29 +0800 CST

Slurm nodes randomly going down

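I have a cluster set up with Slurm, consisting of one head node, 16 compute nodes, and a NAS with NFS-4 network shared storage. I recently installed Slurm via apt on Ubuntu v22 (sinfo -V reports slurm-wlm 21.08.5). I have tested some single-node and multi-node jobs, and I can get jobs to run to completion as expected. However, for certain simulations, some nodes keep changing state to down in the middle of the simulation. Although it seems random, it is always the same two nodes that show this behavior. It happens more often than not, but I believe we have completed some simulations using these nodes. On the nodes whose state changes to down, the slurmd daemon is still active; that is, whatever failure occurs is not due to the daemon shutting down.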

Overall: why are these nodes terminating jobs and having their state set to down?

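More info: I checked the slurmd log on one of the failing nodes, and this is what we get (roughly from the time of job submission to the failure / node going down). Note that this is a job (ID=64) submitted to 4 nodes, using all (64) processors on each node: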

[2023-12-07T16:48:29.487] [64.extern] debug2: setup for a launch_task
[2023-12-07T16:48:29.487] [64.extern] debug2: hwloc_topology_init
[2023-12-07T16:48:29.491] [64.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.493] [64.extern] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.494] [64.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:29.499] [64.extern] debug:  Message thread started pid = 4176
[2023-12-07T16:48:29.503] [64.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.510] [64.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-12-07T16:48:29.513] [64.extern] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.513] [64.extern] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.579] [64.extern] debug2: adding task 3 pid 4185 on node 3 to jobacct
[2023-12-07T16:48:29.582] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-12-07T16:48:29.830] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] launch task StepId=64.0 request from UID:1000 GID:1000 HOST:10.115.79.9 PORT:48642
[2023-12-07T16:48:29.830] debug:  Checking credential with 868 bytes of sig data
[2023-12-07T16:48:29.830] task/affinity: lllp_distribution: JobId=64 manual binding: none,one_thread
[2023-12-07T16:48:29.830] debug:  Waiting for job 64's prolog to complete
[2023-12-07T16:48:29.830] debug:  Finished wait for job 64's prolog to complete
[2023-12-07T16:48:29.839] debug2: debug level read from slurmd is 'debug2'.
[2023-12-07T16:48:29.839] debug2: read_slurmd_conf_lite: slurmd sent 8 TRES.
[2023-12-07T16:48:29.839] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:48:29.839] debug2: Received CPU frequency information for 64 CPUs
[2023-12-07T16:48:29.840] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:48:29.840] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:48:29.840] [64.0] debug2: setup for a launch_task
[2023-12-07T16:48:29.840] [64.0] debug2: hwloc_topology_init
[2023-12-07T16:48:29.845] [64.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.846] [64.0] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.847] [64.0] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.851] [64.0] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:48:29.852] [64.0] debug:  Message thread started pid = 4188
[2023-12-07T16:48:29.852] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.857] [64.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.863] [64.0] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  mpi type = none
[2023-12-07T16:48:29.863] [64.0] debug2: Before call to spank_init()
[2023-12-07T16:48:29.863] [64.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.864] [64.0] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.864] [64.0] debug2: After call to spank_init()
[2023-12-07T16:48:29.864] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.864] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_prefork: mpi/none: slurmstepd prefork
[2023-12-07T16:48:29.864] [64.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.886] [64.0] debug2: hwloc_topology_load
[2023-12-07T16:48:29.918] [64.0] debug2: hwloc_topology_export_xml
[2023-12-07T16:48:29.922] [64.0] debug2: Entering _setup_normal_io
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: entering
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 2
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: leaving
[2023-12-07T16:48:29.923] [64.0] debug2: Leaving  _setup_normal_io
[2023-12-07T16:48:29.923] [64.0] debug levels are stderr='error', logfile='debug2', syslog='quiet'
[2023-12-07T16:48:29.923] [64.0] debug:  IO handler started pid=4188
[2023-12-07T16:48:29.925] [64.0] starting 1 tasks
[2023-12-07T16:48:29.925] [64.0] task 2 (4194) started 2023-12-07T16:48:29
[2023-12-07T16:48:29.926] [64.0] debug:  Setting slurmstepd oom_adj to -1000
[2023-12-07T16:48:29.926] [64.0] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.959] [64.0] debug2: adding task 2 pid 4194 on node 2 to jobacct
[2023-12-07T16:48:29.960] [64.0] debug:  Sending launch resp rc=0
[2023-12-07T16:48:29.961] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.961] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2023-12-07T16:48:29.961] [64.0] debug:  task/affinity: task_p_pre_launch: affinity StepId=64.0, task:2 bind:none,one_thread
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NPROC  : max:1030021 cur:1030021 req:1030020
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:4096 req:1024
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:33761472512
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2023-12-07T16:48:59.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:59.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:51:03.457] debug:  Log file re-opened
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.457] debug2: hwloc_topology_init
[2023-12-07T16:51:03.462] debug2: hwloc_topology_load
[2023-12-07T16:51:03.480] debug2: hwloc_topology_export_xml
[2023-12-07T16:51:03.482] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.483] debug2: hwloc_topology_init
[2023-12-07T16:51:03.484] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:51:03.485] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.485] topology/none: init: topology NONE plugin loaded
[2023-12-07T16:51:03.485] route/default: init: route default plugin loaded
[2023-12-07T16:51:03.485] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-07T16:51:03.487] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-07T16:51:03.487] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-07T16:51:03.490] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:51:03.490] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:51:03.490] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-07T16:51:03.490] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:51:03.490] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:51:03.491] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:51:03.491] slurmd version 21.08.5 started
[2023-12-07T16:51:03.491] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 58 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 56 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/56/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.492] error: _restore_ns: failed to connect to stepd for 64.
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.
[2023-12-07T16:51:03.493] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:51:03.493] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:51:03.493] slurmd started on Thu, 07 Dec 2023 16:51:03 -0600
[2023-12-07T16:51:03.493] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:51:03.494] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:51:03.495] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-07T16:51:03.499] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-07T16:51:03.500] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug:  _rpc_terminate_job: uid = 64030 JobId=64
[2023-12-07T16:51:03.500] debug:  credential for job 64 revoked
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 998
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 18
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 15
[2023-12-07T16:51:03.500] debug2: set revoke expiration for jobid 64 to 1701989583 UTS
[2023-12-07T16:51:03.501] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.501] error: container_g_delete(64): Invalid argument
[2023-12-07T16:51:03.501] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
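
For a log this long it can help to reduce it to a few signals. The following is a minimal sketch, assuming only the line format visible above (a bracketed timestamp, an optional bracketed step ID, then the message). The function names, the 60-second gap threshold, and the choice of signals are my own assumptions, not anything Slurm provides.

```python
import re
from datetime import datetime

# Matches "[timestamp] [optional step] message" as seen in the log above.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:\[(?P<step>[^\]]+)\]\s+)?(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, step, message) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
    return ts, m.group("step"), m.group("msg")


def summarize(lines, gap_seconds=60.0):
    """Collect errors, silent gaps, slurmd start-ups and reported Uptime values."""
    summary = {"errors": [], "gaps": [], "restarts": [], "uptimes": []}
    prev = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _step, msg = parsed
        # A long silence between consecutive lines (threshold is arbitrary).
        if prev is not None and (ts - prev).total_seconds() > gap_seconds:
            summary["gaps"].append((prev, ts))
        prev = ts
        if msg.startswith("error:"):
            summary["errors"].append((ts, msg))
        if re.match(r"slurmd version \S+ started", msg):
            summary["restarts"].append(ts)
        uptime = re.search(r"\bUptime=(\d+)\b", msg)
        if uptime:
            summary["uptimes"].append(int(uptime.group(1)))
    return summary


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "[2023-12-07T16:48:59.852] [64.0] debug2: profile signaling type Task",
    "[2023-12-07T16:51:03.457] debug:  Log file re-opened",
    "[2023-12-07T16:51:03.491] slurmd version 21.08.5 started",
    "[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument",
    "[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null)",
]

if __name__ == "__main__":
    result = summarize(SAMPLE)
    print("gaps:", result["gaps"])
    print("restarts:", result["restarts"])
    print("uptimes:", result["uptimes"])
    print("errors:", len(result["errors"]))
```

To run it against the real file, pass an open handle on /var/log/slurmd.log (the SlurmdLogFile in the slurm.conf below) to summarize(). On the excerpt above it would flag the silence between 16:48:59 and 16:51:03, the slurmd start-up right after it, and the Uptime=14 value in the registration line.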

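I immediately see some errors concerning /var/nvme/storage (this is a local folder on each node, not a network shared location on the NAS), but this is the same on all nodes and only causes problems on a couple of them. Note that this is the base path set in job_container.conf: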

AutoBasePath=true
BasePath=/var/nvme/storage
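
One documented cause of EINVAL from umount2 is that the target is not a mount point. A minimal sketch for checking that on a node follows, assuming the BasePath/<node>/<jobid>/.ns layout seen in the log paths above; the function names are hypothetical, and the mount table is read as plain text so the check does not depend on whether .ns is a file or a directory.

```python
from pathlib import Path


def mounted_targets(mounts_text):
    """Return the mount points listed in /proc/self/mounts-style text."""
    targets = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            targets.add(fields[1])  # second field is the mount point
    return targets


def unmounted_namespaces(base_path, node, mounts_text):
    """List <base_path>/<node>/<jobid>/.ns entries that exist on disk
    but do not appear as mount points in the given mount table."""
    targets = mounted_targets(mounts_text)
    node_dir = Path(base_path) / node
    if not node_dir.is_dir():
        return []
    leftover = []
    for job_dir in sorted(node_dir.iterdir()):
        ns = job_dir / ".ns"
        if ns.exists() and str(ns) not in targets:
            leftover.append(str(ns))
    return leftover


if __name__ == "__main__":
    # Linux only: read the current mount table of this process.
    mounts = Path("/proc/self/mounts").read_text()
    for path in unmounted_namespaces("/var/nvme/storage", "cn4", mounts):
        print("exists but not mounted:", path)
```

If the paths from the error lines show up in this list, the umount2 failures may just be leftovers from earlier jobs rather than the cause of the node going down.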

Also, here is cgroup.conf:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup

...and slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cauchy
SlurmctldHost=cauchy
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=1000000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=contain
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=120
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
MaxArraySize=100000
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_job_part_count_reserve=5,bf_continue
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
####### Priority Begin ##################
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightAge=100
PriorityWeightPartition=10000
PriorityWeightJobSize=0
PriorityMaxAge=14-0
PriorityFavorSmall=YES
#PriorityWeightQOS=10000
#PriorityWeightTRES=cpu=2000,mem=1,gres/gpu=400
#AccountingStorageTRES=gres/gpu
#AccountingStorageEnforce=all
#FairShareDampeningFactor=5
####### Priority End ##################
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobContainerType=job_container/tmpfs
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log

#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn1 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn2 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn3 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn4 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn5 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn6 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn7 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn8 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn9 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn10 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn11 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn12 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn13 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn14 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn15 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn16 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000

PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP PriorityJobFactor=2000
PartitionName=low Nodes=ALL MaxTime=INFINITE State=UP PriorityJobFactor=1000

EDIT: I cancelled a job (submitted to 4 nodes, cn1 - cn4) that showed this error before it could reallocate to new nodes / overwrite the Slurm error file. Here are the contents of the error file:

Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
slurmstepd-cn1: error: *** JOB 74 ON cn1 CANCELLED AT 2023-12-10T10:05:57 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

The Authorization required... error is pervasive on all nodes / all simulations, so I'm not sure it is of particular consequence for the consistent failure of the cn4 node. The segfault doesn't appear when the same job is run on other nodes, so this is new info / anomalous. The slurmctld log is not particularly illuminating:

[2023-12-09T19:54:46.401] _slurm_rpc_submit_batch_job: JobId=65 InitPrio=10000 usec=493
[2023-12-09T19:54:46.826] sched/backfill: _start_job: Started JobId=65 in main on cn[1-4]
[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702173678 last response=1702173587
[2023-12-09T20:01:33.529] requeue job JobId=65 due to failure of node cn4
[2023-12-09T20:01:38.334] Requeuing JobId=65
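
The validate_node_specs line carries two raw numbers. Here is a minimal sketch for decoding them, assuming boot_time and last response are Unix epoch seconds (that is my reading of the values, not something the log states); parse_reboot is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Pattern for the "unexpectedly rebooted" line shown in the slurmctld log above.
REBOOT_RE = re.compile(
    r"Node (?P<node>\S+) unexpectedly rebooted "
    r"boot_time=(?P<boot>\d+) last response=(?P<last>\d+)"
)


def parse_reboot(line):
    """Return node name, both times as UTC datetimes, and the seconds between
    them. Assumes both values are Unix epoch seconds."""
    m = REBOOT_RE.search(line)
    if not m:
        return None
    boot = int(m.group("boot"))
    last = int(m.group("last"))
    return {
        "node": m.group("node"),
        "boot_time_utc": datetime.fromtimestamp(boot, tz=timezone.utc),
        "last_response_utc": datetime.fromtimestamp(last, tz=timezone.utc),
        "seconds_between": boot - last,
    }


LINE = (
    "[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly "
    "rebooted boot_time=1702173678 last response=1702173587"
)

if __name__ == "__main__":
    info = parse_reboot(LINE)
    print(info["node"], info["boot_time_utc"].isoformat(), info["seconds_between"])
```

Under that assumption, the node's last response and its reported boot time are 91 seconds apart, and the boot time falls shortly before the 20:01:33 log entry once the -0600 offset seen in the slurmd log is applied.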
