
Questions tagged [cluster] (server)

Mikke Mus
Asked: 2024-02-19 20:35:00 +0800 CST

SGE: how do I set a hard runtime limit on a node?

  • 7

The person who administered our cluster recently passed away unexpectedly, so we now have to operate it ourselves until someone new joins. We want to change the hard runtime limit of the nodes on the cluster. For some reason, all nodes in the queue have the desired hard runtime limit, but one of them does not.

How do I set h_rt=x for a given node?
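In Grid Engine the hard runtime limit is a queue attribute that can be overridden per host inside the queue definition. A minimal sketch, assuming the queue is called all.q and the odd node out is node13 (both names are placeholders):

# open the queue definition in an editor
qconf -mq all.q

# in the h_rt line, keep the default and append a host-specific override, e.g.:
#   h_rt   72:00:00,[node13=72:00:00]

# verify the result
qconf -sq all.q | grep h_rt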

cluster
  • 1 answer
  • 212 Views
mkasemer
Asked: 2023-12-10 08:56:29 +0800 CST

Slurm nodes randomly going down

  • 6

I set up a cluster using Slurm, consisting of a head node, 16 compute nodes, and a NAS with NFS-4 network-shared storage. I recently installed Slurm via apt on Ubuntu 22 (sinfo -V reveals slurm-wlm 21.08.5). I have tested some single-node and multi-node jobs and can get jobs to run to completion as expected. However, for certain simulations, some nodes keep changing their state to down during the simulation. Although it seems random, it is the same two nodes that show this behavior. It happens frequently, yet I believe we have completed some simulations on these nodes. On a node whose state has changed to down, the slurmd daemon is still active — that is, whatever fails is not due to the daemon going down.

Overall: why do these nodes kill the job and get their state set to down?

More information: I checked the slurmd log on one of the failing nodes, and this is what we get (from job submission to roughly the time of the failure / node failure). Note that this is a job (ID=64) submitted to 4 nodes with all (64) processors per node:

[2023-12-07T16:48:29.487] [64.extern] debug2: setup for a launch_task
[2023-12-07T16:48:29.487] [64.extern] debug2: hwloc_topology_init
[2023-12-07T16:48:29.491] [64.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.493] [64.extern] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.494] [64.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:29.499] [64.extern] debug:  Message thread started pid = 4176
[2023-12-07T16:48:29.503] [64.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.510] [64.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-12-07T16:48:29.513] [64.extern] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.513] [64.extern] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.579] [64.extern] debug2: adding task 3 pid 4185 on node 3 to jobacct
[2023-12-07T16:48:29.582] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-12-07T16:48:29.830] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] launch task StepId=64.0 request from UID:1000 GID:1000 HOST:10.115.79.9 PORT:48642
[2023-12-07T16:48:29.830] debug:  Checking credential with 868 bytes of sig data
[2023-12-07T16:48:29.830] task/affinity: lllp_distribution: JobId=64 manual binding: none,one_thread
[2023-12-07T16:48:29.830] debug:  Waiting for job 64's prolog to complete
[2023-12-07T16:48:29.830] debug:  Finished wait for job 64's prolog to complete
[2023-12-07T16:48:29.839] debug2: debug level read from slurmd is 'debug2'.
[2023-12-07T16:48:29.839] debug2: read_slurmd_conf_lite: slurmd sent 8 TRES.
[2023-12-07T16:48:29.839] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:48:29.839] debug2: Received CPU frequency information for 64 CPUs
[2023-12-07T16:48:29.840] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:48:29.840] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:48:29.840] [64.0] debug2: setup for a launch_task
[2023-12-07T16:48:29.840] [64.0] debug2: hwloc_topology_init
[2023-12-07T16:48:29.845] [64.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.846] [64.0] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.847] [64.0] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.851] [64.0] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:48:29.852] [64.0] debug:  Message thread started pid = 4188
[2023-12-07T16:48:29.852] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.857] [64.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.863] [64.0] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  mpi type = none
[2023-12-07T16:48:29.863] [64.0] debug2: Before call to spank_init()
[2023-12-07T16:48:29.863] [64.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.864] [64.0] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.864] [64.0] debug2: After call to spank_init()
[2023-12-07T16:48:29.864] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.864] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_prefork: mpi/none: slurmstepd prefork
[2023-12-07T16:48:29.864] [64.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.886] [64.0] debug2: hwloc_topology_load
[2023-12-07T16:48:29.918] [64.0] debug2: hwloc_topology_export_xml
[2023-12-07T16:48:29.922] [64.0] debug2: Entering _setup_normal_io
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: entering
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 2
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: leaving
[2023-12-07T16:48:29.923] [64.0] debug2: Leaving  _setup_normal_io
[2023-12-07T16:48:29.923] [64.0] debug levels are stderr='error', logfile='debug2', syslog='quiet'
[2023-12-07T16:48:29.923] [64.0] debug:  IO handler started pid=4188
[2023-12-07T16:48:29.925] [64.0] starting 1 tasks
[2023-12-07T16:48:29.925] [64.0] task 2 (4194) started 2023-12-07T16:48:29
[2023-12-07T16:48:29.926] [64.0] debug:  Setting slurmstepd oom_adj to -1000
[2023-12-07T16:48:29.926] [64.0] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.959] [64.0] debug2: adding task 2 pid 4194 on node 2 to jobacct
[2023-12-07T16:48:29.960] [64.0] debug:  Sending launch resp rc=0
[2023-12-07T16:48:29.961] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.961] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2023-12-07T16:48:29.961] [64.0] debug:  task/affinity: task_p_pre_launch: affinity StepId=64.0, task:2 bind:none,one_thread
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NPROC  : max:1030021 cur:1030021 req:1030020
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:4096 req:1024
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:33761472512
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2023-12-07T16:48:59.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:59.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:51:03.457] debug:  Log file re-opened
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.457] debug2: hwloc_topology_init
[2023-12-07T16:51:03.462] debug2: hwloc_topology_load
[2023-12-07T16:51:03.480] debug2: hwloc_topology_export_xml
[2023-12-07T16:51:03.482] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.483] debug2: hwloc_topology_init
[2023-12-07T16:51:03.484] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:51:03.485] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.485] topology/none: init: topology NONE plugin loaded
[2023-12-07T16:51:03.485] route/default: init: route default plugin loaded
[2023-12-07T16:51:03.485] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-07T16:51:03.487] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-07T16:51:03.487] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-07T16:51:03.490] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:51:03.490] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:51:03.490] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-07T16:51:03.490] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:51:03.490] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:51:03.491] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:51:03.491] slurmd version 21.08.5 started
[2023-12-07T16:51:03.491] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 58 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 56 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/56/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.492] error: _restore_ns: failed to connect to stepd for 64.
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.
[2023-12-07T16:51:03.493] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:51:03.493] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:51:03.493] slurmd started on Thu, 07 Dec 2023 16:51:03 -0600
[2023-12-07T16:51:03.493] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:51:03.494] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:51:03.495] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-07T16:51:03.499] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-07T16:51:03.500] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug:  _rpc_terminate_job: uid = 64030 JobId=64
[2023-12-07T16:51:03.500] debug:  credential for job 64 revoked
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 998
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 18
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 15
[2023-12-07T16:51:03.500] debug2: set revoke expiration for jobid 64 to 1701989583 UTS
[2023-12-07T16:51:03.501] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.501] error: container_g_delete(64): Invalid argument
[2023-12-07T16:51:03.501] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB

I immediately see some errors concerning /var/nvme/storage (this is a local folder on each node, not a network-shared location on the NAS), but it is identical on all nodes and only causes problems on a couple of them. Note that this is the base path set in job_container.conf:

AutoBasePath=true
BasePath=/var/nvme/storage

Also, here is cgroup.conf:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup

...and slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cauchy
SlurmctldHost=cauchy
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=1000000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=contain
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=120
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
MaxArraySize=100000
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_job_part_count_reserve=5,bf_continue
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
####### Priority Begin ##################
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightAge=100
PriorityWeightPartition=10000
PriorityWeightJobSize=0
PriorityMaxAge=14-0
PriorityFavorSmall=YES
#PriorityWeightQOS=10000
#PriorityWeightTRES=cpu=2000,mem=1,gres/gpu=400
#AccountingStorageTRES=gres/gpu
#AccountingStorageEnforce=all
#FairShareDampeningFactor=5
####### Priority End ##################
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobContainerType=job_container/tmpfs
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log

#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn1 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn2 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn3 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn4 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn5 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn6 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn7 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn8 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn9 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn10 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn11 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn12 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn13 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn14 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn15 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn16 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000

PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP PriorityJobFactor=2000
PartitionName=low Nodes=ALL MaxTime=INFINITE State=UP PriorityJobFactor=1000

EDIT: I cancelled a job (submitted to 4 nodes, cn1 - cn4) that showed this error before it could reallocate to new nodes / overwrite the Slurm error file. Here are the contents of the error file:

Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
slurmstepd-cn1: error: *** JOB 74 ON cn1 CANCELLED AT 2023-12-10T10:05:57 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

The Authorization required... error is pervasive on all nodes / all simulations, so I'm not sure it is of particular consequence for the consistent failure of the cn4 node. The segfault doesn't appear when the same job is run on other nodes, so this is new info / anomalous. The slurmctld log is not particularly illuminating:

[2023-12-09T19:54:46.401] _slurm_rpc_submit_batch_job: JobId=65 InitPrio=10000 usec=493
[2023-12-09T19:54:46.826] sched/backfill: _start_job: Started JobId=65 in main on cn[1-4]
[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702173678 last response=1702173587
[2023-12-09T20:01:33.529] requeue job JobId=65 due to failure of node cn4
[2023-12-09T20:01:38.334] Requeuing JobId=65
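The "unexpectedly rebooted" line in slurmctld.log suggests the node itself is rebooting (for example a kernel panic or hardware reset) rather than slurmd failing. A minimal diagnostic sketch using standard Slurm and systemd tools; cn4 is the node from the logs above:

# show the reason Slurm recorded for every down/drained node
sinfo -R

# full record for the node, including State and Reason
scontrol show node cn4

# on cn4 itself: look for errors logged just before the previous boot ended,
# and list recent reboots
journalctl --boot=-1 -p err | tail -n 100
last reboot | head

# once the cause is fixed, return the node to service
scontrol update NodeName=cn4 State=RESUME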
cluster
  • 1 answer
  • 228 Views
Waslap
Asked: 2023-08-11 19:59:36 +0800 CST

Corosync/Pacemaker/DRBD resilience tuning

  • 10

I have a DRBD cluster in which one node was switched off for a few days. The single remaining node ran fine without any problems. When I switched the other node back on, I ended up in a situation where all resources were stopped, one DRBD volume was Secondary and the others Primary, because the cluster apparently tried to swap roles onto the node that had just been switched on (ha1 was active, then I switched on ha2 at 08:06, for ease of reading the logs).

My questions:

  • Can anyone help me figure out what happened here? (If this question is considered too much effort, I am open to paid consulting to get the configuration right.)
  • As a side question, is there a way to get pcs/Pacemaker to clean up the resources by itself once the situation resolves itself? Linux-HA clusters needed no intervention if the failure condition cleared after a failover, so either I have been spoiled or I do not know how to achieve this (see the sketch below).
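For the side question, a minimal sketch of the usual cleanup commands, using resource names from the status output further down; whether failures ever clear automatically depends on the failure-timeout meta attribute, which is not set in this configuration:

# clear the failed-action history for one resource (omit the name to clean up everything)
pcs resource cleanup LV_POSTGRES-clone

# let recorded failures expire on their own after 5 minutes instead of sticking forever
pcs resource meta LV_POSTGRES-clone failure-timeout=300s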

Below is all the information I can think of that might be useful.

bash-5.1# cat /proc/drbd 
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50 
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The cluster status eventually ends up as

bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
  * Stack: corosync
  * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
  * Last updated: Thu Aug 10 08:38:40 2023
  * Last change:  Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
  * 2 nodes configured
  * 14 resource instances configured

Node List:
  * Online: [ ha1.local ha2.local ]

Full List of Resources:
  * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
    * Promoted: [ ha2.local ]
    * Unpromoted: [ ha1.local ]
  * Resource Group: nsdrbd:
    * LV_BLOBFS (ocf:heartbeat:Filesystem):  Started ha2.local
    * LV_POSTGRESFS (ocf:heartbeat:Filesystem):  Stopped
    * LV_HOMEFS (ocf:heartbeat:Filesystem):  Stopped
    * ClusterIP (ocf:heartbeat:IPaddr2):     Stopped
  * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * postgresql  (systemd:postgresql):    Stopped
  * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * ns_mhswdog  (lsb:mhswdog):   Stopped
  * Clone Set: pingd-clone [pingd]:
    * Started: [ ha1.local ha2.local ]

Failed Resource Actions:
  * LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
  * LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I am attaching the logs from both nodes

Aug 10 08:07:00 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 0 is up
Aug 10 08:07:00 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1032387] ha1.local corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [QUORUM] Sync joined[1]: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [TOTEM ] A new membership (1.12d) was formed. Members joined: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 08:07:07 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 1 is up
Aug 10 08:07:07 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:48 [1032387] ha1.local corosync info    [KNET  ] link: host: 2 link: 1 is down
Aug 10 08:11:48 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:50 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 1 is up
Aug 10 08:11:50 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:22 [1032387] ha1.local corosync info    [KNET  ] link: host: 2 link: 1 is down
Aug 10 08:12:22 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:23 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 1 is up
Aug 10 08:12:23 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

and

Aug 10 08:06:55 [1128] ha2.local corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Aug 10 08:06:55 [1128] ha2.local corosync info    [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 10 08:06:56 [1128] ha2.local corosync notice  [TOTEM ] Initializing transport (Kronosnet).
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] totemknet initialized
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: cmap
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: cfg
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: cpg
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: quorum
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configuring link 0
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configured link number 0: local addr: 192.168.51.216, port=5405
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configuring link 1
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configured link number 1: local addr: 10.0.0.2, port=5406
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Sync members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Sync joined[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [TOTEM ] A new membership (2.126) was formed. Members joined: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 08:07:00 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 0 is up
Aug 10 08:07:00 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1128] ha2.local corosync info    [KNET  ] pmtud: Global data MTU changed to: 469
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] Sync joined[1]: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice  [TOTEM ] A new membership (1.12d) was formed. Members joined: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] This node is within the primary component and will provide service.
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 08:07:05 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:07:05 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:08 [1128] ha2.local corosync info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 10 08:07:08 [1128] ha2.local corosync info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Aug 10 08:07:08 [1128] ha2.local corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 10 08:14:13 [1128] ha2.local corosync info    [KNET  ] link: host: 1 link: 1 is down
Aug 10 08:14:13 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:14:15 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:14:15 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:53 [1128] ha2.local corosync info    [KNET  ] link: host: 1 link: 1 is down
Aug 10 08:19:53 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:54 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:19:54 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:18 [1128] ha2.local corosync info    [KNET  ] link: host: 1 link: 1 is down
Aug 10 08:23:18 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:19 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:23:19 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

I am also attaching the output of pcs config show; the pcs cluster cib output is available on request as well

Cluster Name: HA
Corosync Nodes:
 ha1.local ha2.local
Pacemaker Nodes:
 ha1.local ha2.local

Resources:
  Resource: postgresql (class=systemd type=postgresql)
    Operations:
      monitor: postgresql-monitor-interval-60s
        interval=60s
      start: postgresql-start-interval-0s
        interval=0s
        timeout=100
      stop: postgresql-stop-interval-0s
        interval=0s
        timeout=100
  Resource: ns_mhswdog (class=lsb type=mhswdog)
    Operations:
      force-reload: ns_mhswdog-force-reload-interval-0s
        interval=0s
        timeout=15
      monitor: ns_mhswdog-monitor-interval-60s
        interval=60s
        timeout=10s
        on-fail=standby
      restart: ns_mhswdog-restart-interval-0s
        interval=0s
        timeout=140s
      start: ns_mhswdog-start-interval-0s
        interval=0s
        timeout=80s
      stop: ns_mhswdog-stop-interval-0s
        interval=0s
        timeout=80s
  Group: nsdrbd
    Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: LV_BLOBFS-instance_attributes
        device=/dev/drbd0
        directory=/data
        fstype=ext4
      Operations:
        monitor: LV_BLOBFS-monitor-interval-20s
          interval=20s
          timeout=40s
        start: LV_BLOBFS-start-interval-0s
          interval=0s
          timeout=60s
        stop: LV_BLOBFS-stop-interval-0s
          interval=0s
          timeout=60s
    Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: LV_POSTGRESFS-instance_attributes
        device=/dev/drbd1
        directory=/var/lib/pgsql
        fstype=ext4
      Operations:
        monitor: LV_POSTGRESFS-monitor-interval-20s
          interval=20s
          timeout=40s
        start: LV_POSTGRESFS-start-interval-0s
          interval=0s
          timeout=60s
        stop: LV_POSTGRESFS-stop-interval-0s
          interval=0s
          timeout=60s
    Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: LV_HOMEFS-instance_attributes
        device=/dev/drbd2
        directory=/home
        fstype=ext4
      Operations:
        monitor: LV_HOMEFS-monitor-interval-20s
          interval=20s
          timeout=40s
        start: LV_HOMEFS-start-interval-0s
          interval=0s
          timeout=60s
        stop: LV_HOMEFS-stop-interval-0s
          interval=0s
          timeout=60s
    Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
      Attributes: ClusterIP-instance_attributes
        cidr_netmask=32
        ip=192.168.51.75
      Operations:
        monitor: ClusterIP-monitor-interval-60s
          interval=60s
        start: ClusterIP-start-interval-0s
          interval=0s
          timeout=20s
        stop: ClusterIP-stop-interval-0s
          interval=0s
          timeout=20s
  Clone: LV_BLOB-clone
    Meta Attributes: LV_BLOB-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=1
      promoted-node-max=1
    Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
      Attributes: LV_BLOB-instance_attributes
        drbd_resource=lv_blob
      Operations:
        demote: LV_BLOB-demote-interval-0s
          interval=0s
          timeout=90
        monitor: LV_BLOB-monitor-interval-60s
          interval=60s
          role=Promoted
        monitor: LV_BLOB-monitor-interval-63s
          interval=63s
          role=Unpromoted
        notify: LV_BLOB-notify-interval-0s
          interval=0s
          timeout=90
        promote: LV_BLOB-promote-interval-0s
          interval=0s
          timeout=90
        reload: LV_BLOB-reload-interval-0s
          interval=0s
          timeout=30
        start: LV_BLOB-start-interval-0s
          interval=0s
          timeout=240
        stop: LV_BLOB-stop-interval-0s
          interval=0s
          timeout=100
  Clone: LV_POSTGRES-clone
    Meta Attributes: LV_POSTGRES-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=1
      promoted-node-max=1
    Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
      Attributes: LV_POSTGRES-instance_attributes
        drbd_resource=lv_postgres
      Operations:
        demote: LV_POSTGRES-demote-interval-0s
          interval=0s
          timeout=90
        monitor: LV_POSTGRES-monitor-interval-60s
          interval=60s
          role=Promoted
        monitor: LV_POSTGRES-monitor-interval-63s
          interval=63s
          role=Unpromoted
        notify: LV_POSTGRES-notify-interval-0s
          interval=0s
          timeout=90
        promote: LV_POSTGRES-promote-interval-0s
          interval=0s
          timeout=90
        reload: LV_POSTGRES-reload-interval-0s
          interval=0s
          timeout=30
        start: LV_POSTGRES-start-interval-0s
          interval=0s
          timeout=240
        stop: LV_POSTGRES-stop-interval-0s
          interval=0s
          timeout=100
  Clone: LV_HOME-clone
    Meta Attributes: LV_HOME-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=1
      promoted-node-max=1
    Resource: LV_HOME (class=ocf provider=linbit type=drbd)
      Attributes: LV_HOME-instance_attributes
        drbd_resource=lv_home
      Operations:
        demote: LV_HOME-demote-interval-0s
          interval=0s
          timeout=90
        monitor: LV_HOME-monitor-interval-60s
          interval=60s
          role=Promoted
        monitor: LV_HOME-monitor-interval-63s
          interval=63s
          role=Unpromoted
        notify: LV_HOME-notify-interval-0s
          interval=0s
          timeout=90
        promote: LV_HOME-promote-interval-0s
          interval=0s
          timeout=90
        reload: LV_HOME-reload-interval-0s
          interval=0s
          timeout=30
        start: LV_HOME-start-interval-0s
          interval=0s
          timeout=240
        stop: LV_HOME-stop-interval-0s
          interval=0s
          timeout=100
  Clone: pingd-clone
    Resource: pingd (class=ocf provider=pacemaker type=ping)
      Attributes: pingd-instance_attributes
        dampen=6s
        host_list=192.168.51.251
        multiplier=1000
      Operations:
        monitor: pingd-monitor-interval-10s
          interval=10s
          timeout=60s
        reload-agent: pingd-reload-agent-interval-0s
          interval=0s
          timeout=20s
        start: pingd-start-interval-0s
          interval=0s
          timeout=60s
        stop: pingd-stop-interval-0s
          interval=0s
          timeout=20s

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: ClusterIP
    Constraint: location-ClusterIP
      Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
        Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
        Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
  promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
  promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
  start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
  promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
  start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
  start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
  start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
  start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
  LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
  LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
  postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
  LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
  ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
  ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
  ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
  ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
  Meta Attrs: build-resource-defaults
    resource-stickiness=INFINITY
Operations Defaults:
  Meta Attrs: op_defaults-meta_attributes
    timeout=240s

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: HA
 dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
 have-watchdog: false
 last-lrm-refresh: 1688971748
 maintenance-mode: false
 no-quorum-policy: ignore
 stonith-enabled: false

Tags:
 No tags defined

Quorum:
  Options:
cluster
  • 2 answers
  • 72 Views
Sandeep
Asked: 2023-02-16 11:35:04 +0800 CST

ceph-deploy install command fails with [ceph_deploy][ERROR] RuntimeError: configparser.NoSectionError: No section: 'main'

  • 5

The command ceph-deploy install admin datanode_dn2 fails with the output:

[ceph_deploy.install][INFO  ] Distro info: rocky 9.1 blue onyx
[admin][INFO  ] installing Ceph on admin
[admin][INFO  ] Running command: sudo yum clean all
[admin][DEBUG ] 57 files removed
[admin][INFO  ] Running command: sudo yum -y install epel-release
[admin][DEBUG ] CentOS-9-stream - Ceph Quincy                   113 kB/s | 474 kB     00:04    
[admin][DEBUG ] Ceph aarch64                                     87  B/s | 257  B     00:02    
[admin][DEBUG ] Ceph noarch                                     2.4 kB/s | 8.8 kB     00:03    
[admin][DEBUG ] Ceph SRPMS                                      629  B/s | 1.8 kB     00:02    
[admin][DEBUG ] Extra Packages for Enterprise Linux 9 - aarch64 3.7 MB/s |  14 MB     00:03    
[admin][DEBUG ] Rocky Linux 9 - BaseOS                          544 kB/s | 1.4 MB     00:02    
[admin][DEBUG ] Rocky Linux 9 - AppStream                       2.0 MB/s | 5.5 MB     00:02    
[admin][DEBUG ] Rocky Linux 9 - Extras                          3.1 kB/s | 9.1 kB     00:02    
[admin][DEBUG ] Package epel-release-9-4.el9.noarch is already installed.
[admin][DEBUG ] Dependencies resolved.
[admin][DEBUG ] Nothing to do.
[admin][DEBUG ] Complete!
[ceph_deploy][ERROR ] RuntimeError: configparser.NoSectionError: No section: 'main'

I am not quite sure which file ceph-deploy is complaining about: it is definitely not ~/.cephdeploy.conf or ceph.conf. I could also use a debugger, although running under the debugger I lose the information about the config file's location.
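One way to find out which file raises the NoSectionError is to trace the files the tool actually opens. A minimal sketch using strace; only the ceph-deploy command line already shown above is assumed:

# record every file opened by ceph-deploy and its child processes, then filter for config files
strace -f -e trace=open,openat -o /tmp/ceph-deploy.trace \
    ceph-deploy install admin datanode_dn2
grep -E '\.conf|\.cephdeploy|\.repo' /tmp/ceph-deploy.trace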

cluster
  • 1 answer
  • 10 Views
john_smith
Asked: 2022-12-29 20:15:05 +0800 CST

Can auto-scaling be considered an advanced form of clustering?

  • 2

Since auto-scaling is being used, it does not look much different from a cluster, except that the scale changes automatically.

I think what they have in common is that both improve availability by distributing traffic across multiple instances.

The difference is that auto-scaling can add and remove instances.

I would like to know whether my understanding is correct.
I would appreciate it if you could include links for reference.

cluster
  • 1 answer
  • 32 Views
Vishal
Asked: 2022-02-26 21:37:25 +0800 CST

Can we assign cores from multiple, different hypervisors within the same cluster?

  • 0

I have configured three hypervisors using OnApp. Since all hypervisors use SAN storage, if any hypervisor fails, the VPSes hosted on that hypervisor boot up on the other two hypervisors.

Each hypervisor has 12 cores, so the main question is: can I assign CPU cores from 2 hypervisors to a single VPS?

For example, hypervisor 1 has 12 cores and hypervisor 2 has 12 cores, so can I assign 24 cores to 1 VPS on this cluster?

Any answer or clarification would be helpful.

windows cluster vps cloud hypervisor
  • 1 answer
  • 26 Views
Arun Varghese
Asked: 2022-02-09 16:16:40 +0800 CST

Node cannot leave the cluster for an eJabberd upgrade

  • 0

Environment

  • ejabberd version: 20.04
  • Erlang version: Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 9.2
  • Operating system: Linux (Debian)
  • Installed from: source

Error in crash.log

2022-02-08 22:42:45 =CRASH REPORT====
  crasher:
    initial call: pgsql_proto:init/1
    pid: <0.27318.6018>
    registered_name: []
    exception exit: {{init,{error,timeout}},
                     [{gen_server,init_it,6,[{file,"gen_server.erl"},{line,349}]},
                      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: ['ejabberd_sql_vhost1.xmpp_12','ejabberd_sql_sup_vhost1.xmpp',ejabberd_db_sup,ejabberd_sup,<0.87.0>]
    message_queue_len: 0
    messages: []
    links: []
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 997
  neighbours:

Error description: I am trying to upgrade from eJabberd 20.04 to 20.07. My cluster setup has three nodes. Two nodes were rolling-upgraded successfully. When node1 tries to leave the cluster for the upgrade, it gives the following error:

Failed RPC connection to the node '[email protected]': timeout

When I try ejabberdctl status, the following is returned: The node '[email protected]' is started with status: started Failed RPC connection to the node '[email protected]': {'EXIT', {timeout, {gen_server, call, [application_controller, which_applications]}}}

On the Erlang shell, the node still shows up as part of the cluster:

nodes(). ['[email protected]','[email protected]']

Can you help me resolve this issue?
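A minimal sketch of the cluster-membership commands involved, assuming a standard ejabberdctl installation; the node name below is a placeholder, since the real node names are redacted above:

# list the nodes this node currently considers cluster members
ejabberdctl list_cluster

# detach a (possibly unresponsive) node from the running cluster, issued from another node
ejabberdctl leave_cluster 'ejabberd@node1.example.com'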

cluster upgrade ejabberd
  • 2 answers
  • 44 Views
rgzv
Asked: 2022-01-15 20:20:54 +0800 CST

My EMR cluster terminates with an error after its status is set to Starting

  • 0

Hello, when I create an EMR cluster, the status says it is being created, but after 58 minutes it throws an error saying Master - 1: Error provisioning instances. (error screenshot attached). I have tried multiple times, but all attempts failed.

I am following the AWS documentation on how to create an EMR cluster:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html

Creating an EMR cluster on AWS (image from the documentation linked above)

Where am I going wrong? I want to create the EMR cluster successfully and attach a Jupyter notebook to it. Is there documentation for creating the cluster successfully and keeping it running instead of being terminated after 58 minutes?

Please advise what I need to do.

Thank you.
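A minimal sketch for pulling the detailed provisioning-failure reason out of EMR with the AWS CLI; the cluster ID j-XXXXXXXXXXXXX is a placeholder for the ID shown in the console:

# the state-change reason normally names the concrete cause
# (e.g. instance capacity, subnet, or IAM/VPC configuration problems)
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.StateChangeReason'

# per-instance details can narrow it down further
aws emr list-instances --cluster-id j-XXXXXXXXXXXXX \
    --query 'Instances[].Status.StateChangeReason'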

cluster amazon-web-services amazon-emr
  • 1 answer
  • 534 Views
ReaperClown
Asked: 2021-11-17 03:13:44 +0800 CST

Accessing Kubernetes services in the cluster from outside the local network - bare metal

  • 1

I am running a simple bare-metal multi-master "high availability" environment with 2 masters and 2 workers, plus another VM running HAProxy as an external load balancer.

My question is: is it possible to access the services (dashboard, nginx, mysql (especially mysql), etc.) from outside the cluster, exposing them to the network, with this setup I am running?

I have tried using MetalLB in this environment to expose services as LoadBalancer, but it did not seem to work, and since I am somewhat new to Kubernetes, I do not know why.

EDIT: It is working now. Following @c4f4t0r's suggestion, instead of an external HAProxy load balancer, that same VM became a third master node and, like the other masters, now runs an internal instance of HAProxy and Keepalived. The VM that used to be the external LB is now the endpoint host through which the other nodes join the cluster, MetalLB runs inside the cluster, and an nginx ingress controller directs requests to the requested service.
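For reference, a minimal sketch of a MetalLB layer-2 address pool as it was configured via ConfigMap in MetalLB releases of that era (v0.12 and earlier); the namespace and ConfigMap name are MetalLB's defaults, while the address range is a placeholder that must come from the local network:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.44.200-192.168.44.210
EOF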



>>> Below are the steps I followed to create the environment, along with all the configuration used in the setup.



Set up a highly available Kubernetes cluster using kubeadm

Follow this documentation to set up a highly available Kubernetes cluster using Ubuntu 20.04 LTS.

This documentation guides you in setting up a cluster with two master nodes, one worker node and a load balancer node using HAProxy.

Bare-metal environment

Role           FQDN                      IP              OS            RAM  CPU
Load balancer  loadbalancer.example.com  192.168.44.100  Ubuntu 21.04  1G   1
Master         kmaster1.example.com      10.84.44.51     Ubuntu 21.04  2G   2
Master         kmaster2.example.com      192.168.44.50   Ubuntu 21.04  2G   2
Worker         kworker1.example.com      10.84.44.50     Ubuntu 21.04  2G   2
Worker         kworker2.example.com      192.168.44.51   Ubuntu 21.04  2G   2
  • The password for the root account on all these virtual machines is kubeadmin
  • Perform all commands as the root user unless otherwise specified

Prerequisites

If you want to try this in a virtualized environment on your workstation:

  • Virtualization software installed
  • Host machine has at least 8 cores
  • Host machine has at least 8G of memory

Set up the load balancer node

Install HAProxy
apt update && apt install -y haproxy
Configure haproxy

Append the following lines to /etc/haproxy/haproxy.cfg

frontend kubernetes-frontend
    bind 192.168.44.100:6443
    mode tcp
    option tcplog
    default_backend kubernetes-backend

backend kubernetes-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server kmaster1 10.84.44.51:6443 check fall 3 rise 2
    server kmaster2 192.168.44.50:6443 check fall 3 rise 2
Restart the haproxy service
systemctl restart haproxy

On all Kubernetes nodes (kmaster1, kmaster2, kworker1)

Disable the firewall
ufw disable
Disable swap
swapoff -a; sed -i '/swap/d' /etc/fstab
Update sysctl settings for Kubernetes networking
cat >>/etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
Install the Docker engine
{
  apt install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
  add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  apt update && apt install -y docker-ce containerd.io
}

Kubernetes setup

Add the apt repository
{
  curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
  echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
}
Install the Kubernetes components
apt update && apt install -y kubeadm=1.19.2-00 kubelet=1.19.2-00 kubectl=1.19.2-00

On any one of the Kubernetes master nodes (e.g. kmaster1)

Initialize the Kubernetes cluster
kubeadm init --control-plane-endpoint="192.168.44.100:6443" --upload-certs

Copy the commands for joining the other master nodes and the worker nodes.

Deploy the Calico network (I used Weave instead of Calico)
kubectl --kubeconfig=/etc/kubernetes/admin.conf create -f https://docs.projectcalico.org/v3.15/manifests/calico.yaml

Join the other nodes to the cluster (kmaster2 & kworker1)

Use the respective kubeadm join commands you copied from the output of the kubeadm init command on the first master.

IMPORTANT: when you join another master node, you also need to pass --apiserver-advertise-address to the join command.
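A minimal sketch of what that join command looks like for the second master; the token, CA cert hash and certificate key are placeholders copied from the kubeadm init --upload-certs output, and 192.168.44.50 is kmaster2's address from the table above:

# run on kmaster2
kubeadm join 192.168.44.100:6443 \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane \
    --certificate-key <certificate-key> \
    --apiserver-advertise-address 192.168.44.50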

load-balancing cluster kubernetes haproxy bare-metal
  • 1 answer
  • 213 Views
deHaar
Asked: 2021-11-13 07:17:35 +0800 CST

No Pods reachable or schedulable on a Kubernetes cluster

  • 0

I have 2 Kubernetes clusters in the IBM Cloud, one with 2 nodes, the other with 4.

The one with 4 nodes works fine, but on the other one I had to temporarily remove the worker nodes for monetary reasons (they should not be paid for while idle).

When I reactivated the two nodes, everything seemed to start up fine, and as long as I did not try to interact with Pods it still looked fine on the surface, with no messages about unavailability or a critical health status. OK, I did delete two obsolete Namespaces that were stuck in the Terminating state, but I could resolve that issue by restarting a cluster node (I no longer know exactly which one it was).

When everything looked fine, I tried to access the Kubernetes dashboard (everything done up to that point was at the IBM management level or on the command line), but to my surprise I found it unreachable, with an error page in the browser saying:

503 Service Unavailable

At the bottom of that page there is a small JSON message that says:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": { },
  "status": "Failure",
  "message": "error trying to reach service: read tcp 172.18.190.60:39946-\u003e172.19.151.38:8090: read: connection reset by peer",
  "reason": "ServiceUnavailable",
  "code": 503
}

I issued a kubectl logs kubernetes-dashboard-54674bdd65-nf6w7 --namespace=kube-system, where the Pod is shown as Running, but the result was not the logs to inspect but this message instead:

Error from server: Get "https://10.215.17.75:10250/containerLogs/kube-system/kubernetes-dashboard-54674bdd65-nf6w7/kubernetes-dashboard":
read tcp 172.18.135.195:56882->172.19.151.38:8090:
read: connection reset by peer

I then found out that I can neither get the logs of any Pod running in that cluster, nor deploy any new custom Kubernetes object that needs to be scheduled (I can in fact apply Services or ConfigMaps, but no Pod, ReplicaSet, Deployment, or similar).

I have already tried

  • Reloading the worker nodes in the worker pool
  • Rebooting the worker nodes in the worker pool
  • Restarting the kubernetes-dashboard Deployment

Unfortunately, none of the actions above changed the accessibility of the Pods.

There is another thing that might be related (though I am not quite sure it really is):

In the other cluster, which runs fine, there are three calico Pods running and all three are up, whereas in the problematic cluster only two of the three calico Pods are up and running; the third stays in the Pending state, and a kubectl describe pod calico-blablabla-blabla reveals the reason in an Event:

Warning  FailedScheduling  13s   default-scheduler
0/2 nodes are available: 2 node(s) didn't have free ports for the requested pod ports.

Does anyone have an idea of what is going on in that cluster, and can you point me to possible solutions? I really do not want to delete the cluster and spawn a new one.
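The FailedScheduling message points at hostPort conflicts: calico-node requests fixed host ports, and the scheduler believes both nodes already have them in use. A minimal sketch, using only standard kubectl, for checking which pods claim hostPorts (the pending pod name is the placeholder already used above):

# list every pod that declares a hostPort, to see what occupies the ports calico needs
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}' \
    | awk -F'\t' '$3 != ""'

# compare with the ports the pending calico pod asks for
kubectl -n kube-system get pod calico-blablabla-blabla \
    -o jsonpath='{.spec.containers[*].ports[*].hostPort}'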

EDIT

Result of kubectl describe pod kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system:

Name:                 kubernetes-dashboard-54674bdd65-4m2ch
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 10.215.17.82/10.215.17.82
Start Time:           Mon, 15 Nov 2021 09:01:30 +0100
Labels:               k8s-app=kubernetes-dashboard
                      pod-template-hash=54674bdd65
Annotations:          cni.projectcalico.org/containerID: ca52cefaae58d8e5ce6d54883cb6a6135318c8db53d231dc645a5cf2e67d821e
                      cni.projectcalico.org/podIP: 172.30.184.2/32
                      cni.projectcalico.org/podIPs: 172.30.184.2/32
                      container.seccomp.security.alpha.kubernetes.io/kubernetes-dashboard: runtime/default
                      kubectl.kubernetes.io/restartedAt: 2021-11-10T15:47:14+01:00
                      kubernetes.io/psp: ibm-privileged-psp
Status:               Running
IP:                   172.30.184.2
IPs:
  IP:           172.30.184.2
Controlled By:  ReplicaSet/kubernetes-dashboard-54674bdd65
Containers:
  kubernetes-dashboard:
    Container ID:  containerd://bac57850055cd6bb944c4d893a5d315c659fd7d4935fe49083d9ef8ae03e5c31
    Image:         registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard:v2.3.1
    Image ID:      registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard@sha256:f14f581d36b83fc9c1cfa3b0609e7788017ecada1f3106fab1c9db35295fe523
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --auto-generate-certificates
      --namespace=kube-system
    State:          Running
      Started:      Mon, 15 Nov 2021 09:01:37 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/ delay=10s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc9kw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kubernetes-dashboard-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubernetes-dashboard-certs
    Optional:    false
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-sc9kw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 600s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 600s
Events:                      <none>
port port-forwarding scheduling cluster kubernetes
  • 1 answer
  • 129 Views
