Questions asked by Daigo - server

Daigo

Asked: 2022-08-30 19:06:37 +0800 CST

Pod is stuck in the PodInitializing state when an initContainer is OOMKilled

2

I have the following on-premises Kubernetes environment:

OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
Kubernetes: 1.23.7 (single node, built with kubeadm)
NVIDIA driver: 515.65.01
nvidia-container-toolkit: 1.10.0-1.x86_64 (rpm)
containerd: v1.6.2
nvcr.io/nvidia/k8s-device-plugin:v0.12.2

I am running the following Pod on my server. Only app2 (initContainer2) uses the GPU.

initContainer1: app1
↓
initContainer2: app2 (Uses GPU)
↓
container1: app3

When app2 uses too much RAM and is OOM-killed, the Pod should be in the OOMKilled status, but in my environment it is stuck in the PodInitializing status.

NAMESPACE     NAME       READY   STATUS            RESTARTS       AGE     IP               NODE      NOMINATED NODE   READINESS GATES
default       gpu-pod    0/1     PodInitializing   0              83m     xxx.xxx.xxx.xxx   xxxxx   <none>           <none>

The output of kubectl describe pod is as follows:

Init Containers:
  app1:
    ...
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 30 Aug 2022 10:50:38 +0900
      Finished:     Tue, 30 Aug 2022 10:50:44 +0900
      ...
  app2:
    ...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 30 Aug 2022 10:50:45 +0900
      Finished:     Tue, 30 Aug 2022 10:50:48 +0900
      ...
app3:
    ...
    State:          Waiting
      Reason:       PodInitializing
      ...
    ...

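When I replace app2 with another container that does not use the GPU, or when I launch app2 as the Pod's only container (not as an init container), this problem never occurs. In both cases the status is correctly reported as OOMKilled.

Is this a bug? If so, is there any workaround?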
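One detail in the describe output above is easy to overlook: app2 is reported with Reason OOMKilled but Exit Code 0. The following is a minimal sketch for spotting that combination, not a fix and not the kubelet's own logic. The helper name and the dictionary layout are hypothetical; the values are copied from the output in the question.

```python
from typing import Dict, List


def find_suspicious_init_containers(states: Dict[str, Dict[str, object]]) -> List[str]:
    """Return the names of containers that terminated with reason
    OOMKilled while still reporting exit code 0."""
    suspicious = []
    for name, state in states.items():
        if (
            state.get("state") == "Terminated"
            and state.get("reason") == "OOMKilled"
            and state.get("exit_code") == 0
        ):
            suspicious.append(name)
    return suspicious


# Hypothetical representation of the init container states shown above.
init_states = {
    "app1": {"state": "Terminated", "reason": "Completed", "exit_code": 0},
    "app2": {"state": "Terminated", "reason": "OOMKilled", "exit_code": 0},
}

print(find_suspicious_init_containers(init_states))
```

A container flagged this way is worth mentioning explicitly in a bug report, since an OOM kill would normally be expected to come with a non-zero exit code.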

Daigo

Asked: 2022-01-20 23:51:00 +0800 CST

Unable to run the dpkg command in the kube-proxy container

0

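I am currently trying to get a list of all installed Debian packages (dpkg) in the k8s.gcr.io/kube-proxy:v1.23.2 container.

First, I tried running dpkg -l in a running container that is part of my Kubernetes cluster, but got the following error.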

dpkg-query: error: showing package list on pager subprocess returned error exit status 127

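Then I also tried the command directly on containerd with nerdctl run -it k8s.gcr.io/kube-proxy:v1.23.2 dpkg -l, but got the same error.

Is it possible to get the list by changing some settings, or is there a different approach?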

Daigo

Asked: 2022-01-11 23:50:09 +0800 CST

Kubernetes Pod fails with the OutOfMemory status immediately after being scheduled

1

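I am testing my application on a bare-metal Kubernetes cluster (version 1.22.1) and have run into a problem when launching it as a Job.

My cluster has two nodes (a master and a worker), but the worker node is cordoned. On the master node, 21 GB of memory is available for the application.

I tried to launch my application as three different Jobs at the same time. Since I set 16 GB of memory as both the resource request and the limit, only one Job started and the other two stayed in the Pending state. I have set backoffLimit: 0 on the Jobs.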

NAME            READY   STATUS     RESTARTS   AGE
app1--1-8pp6l   0/1     Pending    0          42s
app2--1-42ssl   0/1     Pending    0          45s
app3--1-gxgwr   0/1     Running    0          46s

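After the first Pod completes, only one of the two Pending Pods should start. However, one started and the other went into the OutOfMemory status, even though no container had been started in that Pod.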

NAME            READY   STATUS        RESTARTS   AGE
app1--1-8pp6l   0/1     Running       0          90s
app2--1-42ssl   0/1     OutOfmemory   0          93s
app3--1-gxgwr   0/1     Completed     0          94s

The events of the OutOfMemory Pod are as follows:

Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  3m41s (x2 over 5m2s)  default-scheduler  0/2 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable.
  Normal   Scheduled         3m38s                 default-scheduler  Successfully assigned test/app2--1-42ssl to master
  Warning  OutOfmemory       3m38s                 kubelet            Node didn't have enough resource: memory, requested: 16000000000, used: 31946743808, capacity: 37634150400

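It seems the Pod was assigned to the node even though there was not enough room for it, because another Pod had just started.

I assume this is not the expected behavior of Kubernetes. Does anyone know the cause of this problem?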
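The numbers in the kubelet event can be checked by hand. The sketch below is a simplified model of the comparison implied by the message "requested ... used ... capacity ...", not the kubelet's actual admission code. The function name is hypothetical; the three values are copied from the event.

```python
def fits_on_node(requested: int, used: int, capacity: int) -> bool:
    """Simplified check: the request fits only if the memory already
    accounted for plus the new request stays within capacity."""
    return used + requested <= capacity


# Values in bytes, taken from the OutOfmemory event above.
requested = 16_000_000_000
used = 31_946_743_808
capacity = 37_634_150_400

free = capacity - used
print(free)                                     # 5687406592
print(fits_on_node(requested, used, capacity))  # False
```

Under this model only about 5.7 GB remained unaccounted for on the node when the 16 GB request arrived, which matches the rejection reported by the kubelet.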

Daigo

Asked: 2021-12-01 22:10:33 +0800 CST

How can I modify the CoreDNS ConfigMap before bootstrapping the cluster with kubeadm?

1

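I need to build my on-premises Kubernetes cluster with kubeadm.

Since my environment has no DNS, I have to modify the CoreDNS ConfigMap so that it does not contain the forward section.

After deploying the cluster, I can edit the ConfigMap with kubectl edit cm coredns -n kube-system, but it takes some time for CoreDNS to work properly after the change, which could be a problem in my production environment.

Is it possible to edit this ConfigMap before running kubeadm init?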
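To make the intended change concrete, here is a minimal sketch of removing the forward section from a Corefile. It only illustrates the text transformation and says nothing about whether kubeadm can apply it before init. The function name is hypothetical, and SAMPLE_COREFILE is an illustrative example rather than the exact Corefile kubeadm generates.

```python
def strip_forward(corefile: str) -> str:
    """Remove the 'forward' directive, including its optional { ... } block."""
    kept = []
    depth = 0  # greater than 0 while skipping lines inside the forward block
    for line in corefile.splitlines():
        stripped = line.strip()
        if depth > 0:
            # Still inside the forward block: track braces and drop the line.
            depth += stripped.count("{") - stripped.count("}")
            continue
        if stripped.split(" ", 1)[0] == "forward":
            # Drop the directive line; remember whether a block was opened.
            depth = stripped.count("{") - stripped.count("}")
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Illustrative Corefile (an assumption, not taken from the question).
SAMPLE_COREFILE = """.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
}
"""

print(strip_forward(SAMPLE_COREFILE))
```

The brace counter is what keeps the other blocks, such as the kubernetes section, intact while the forward block is dropped.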

Daigo

Asked: 2021-11-25 00:25:37 +0800 CST

Offline installation of Kubernetes fails when using containerd as the CRI

1

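For certain reasons, I have to build a bare-metal Kubernetes cluster without an Internet connection.

Since dockershim has been deprecated, I decided to use containerd as the CRI, but the offline installation with kubeadm fails with a timeout when running kubeadm init.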

    Unfortunately, an error has occurred:
            timed out waiting for the condition

    This error is likely caused by:
            - The kubelet is not running
            - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

    If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
            - 'systemctl status kubelet'
            - 'journalctl -xeu kubelet'

I can see a lot of error logs in the output of journalctl -u kubelet -f:

11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.473188    9299 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://133.117.20.57:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/rhel8?timeout=10s": dial tcp 133.117.20.57:6443: connect: connection refused
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.533555    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: I1124 16:25:25.588986    9299 kubelet_node_status.go:71] "Attempting to register node" node="rhel8"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.589379    9299 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://133.117.20.57:6443/api/v1/nodes\": dial tcp 133.117.20.57:6443: connect: connection refused" node="rhel8"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.634625    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.735613    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.835815    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.936552    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.036989    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.137464    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.238594    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.338704    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.394465    9299 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rhel8.16ba6aab63e58bd8", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rhel8", UID:"rhel8", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"rhel8"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc05f9812b2b227d8, ext:5706873656, loc:(*time.Location)(0x55a228f25680)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc05f9812b2b227d8, ext:5706873656, loc:(*time.Location)(0x55a228f25680)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://133.117.20.57:6443/api/v1/namespaces/default/events": dial tcp 133.117.20.57:6443: connect: connection refused'(may retry after sleeping)
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.143503    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.244526    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.302890    9299 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.302949    9299 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused" pod="kube-system/kube-scheduler-rhel8"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.302989    9299 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused" pod="kube-system/kube-scheduler-rhel8"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.303080    9299 pod_workers.go:765] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-rhel8_kube-system(e5616b23d0312e4995fcb768f04aabbb)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-rhel8_kube-system(e5616b23d0312e4995fcb768f04aabbb)\\\": rpc error: code = Unknown desc = failed to get sandbox image \\\"k8s.gcr.io/pause:3.2\\\": failed to pull image \\\"k8s.gcr.io/pause:3.2\\\": failed to pull and unpack image \\\"k8s.gcr.io/pause:3.2\\\": failed to resolve reference \\\"k8s.gcr.io/pause:3.2\\\": failed to do request: Head \\\"https://k8s.gcr.io/v2/pause/manifests/3.2\\\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused\"" pod="kube-system/kube-scheduler-rhel8" podUID=e5616b23d0312e4995fcb768f04aabbb

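When I do the same thing with an Internet connection, the installation succeeds. And when using Docker instead of containerd, the installation completes successfully even without an Internet connection.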
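The kubelet log above is long, but the image that could not be pulled is named in it. As a rough sketch, the hypothetical helper below extracts the image references that follow "failed to pull image" in such log text. It only summarizes the log and does not diagnose the cause. The sample string is an excerpt copied from the log lines in the question.

```python
import re
from typing import Set

# Matches: failed to pull image "NAME", where the quotes may be preceded
# by one or more backslashes, as they are in the kubelet log output.
_PULL_RE = re.compile(r'failed to pull image \\*"([^"\\]+)\\*"')


def images_failed_to_pull(log_text: str) -> Set[str]:
    """Return the distinct image references reported as failed pulls."""
    return set(_PULL_RE.findall(log_text))


# Excerpt from the "RunPodSandbox from runtime service failed" line above.
sample = (
    r'failed to get sandbox image \"k8s.gcr.io/pause:3.2\": '
    r'failed to pull image \"k8s.gcr.io/pause:3.2\": '
    r'failed to pull and unpack image \"k8s.gcr.io/pause:3.2\"'
)

print(sorted(images_failed_to_pull(sample)))
```

For this log the only image reported is k8s.gcr.io/pause:3.2, the sandbox image that the runtime tried to fetch over the network, so it is a natural thing to compare against the set of images loaded on the node.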

Daigo

Asked: 2021-11-10 17:24:15 +0800 CST

Can I configure Docker to manage resources in containerd's "k8s.io" namespace?

0

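I had a Kubernetes cluster running on Docker and recently migrated it to containerd, but because of some compatibility issues I would still like to use Docker to manage Kubernetes images and containers.

When Docker was the runtime, Docker could load images so that Kubernetes could use them, and the docker ps command could list the containers running as Kubernetes Pods.

Even after switching to containerd, I can still run and use Docker. However, since Docker is isolated from the Kubernetes world, the docker command cannot be used to manage Kubernetes resources.

It seems Kubernetes runs in containerd's "k8s.io" namespace, so I was hoping I could configure Docker to manage the resources in that namespace. Is this possible?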

Daigo

Asked: 2021-10-26 21:18:51 +0800 CST

/etc/resolv.conf is missing after installing RHEL 8

0

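I installed Red Hat Enterprise Linux 8.4 (minimal installation) on an on-premises server and ran into some Docker-related problems.

Then I found that /etc/resolv.conf was missing.

After running "systemctl reload NetworkManager", the file was generated and Docker worked.

I am not sure why I had to reload NetworkManager to create resolv.conf, or whether my network is working correctly. Is there a standard way to do this on RHEL 8?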

Daigo

Asked: 2021-10-26 18:04:04 +0800 CST

Kubernetes CoreDNS is in the CrashLoopBackOff state with a "no nameservers found" error

1

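I tried to build Kubernetes with kubeadm on my bare-metal server, with containerd as the CRI, but it seems CoreDNS fails to start after installing the CNI (weave-net).

Both CoreDNS containers are now in the "CrashLoopBackOff" state, and their logs show: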

plugin/forward: no nameservers found

And the output of "kubectl describe pod" is as follows:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  4m52s (x9 over 13m)    default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   Scheduled         4m7s                   default-scheduler  Successfully assigned kube-system/coredns-58cf647449-8pq7k to k8s
  Normal   Pulled            3m13s (x4 over 4m6s)   kubelet            Container image "localhost:5000/coredns:v1.8.4" already present on machine
  Normal   Created           3m13s (x4 over 4m6s)   kubelet            Created container coredns
  Normal   Started           3m13s (x4 over 4m6s)   kubelet            Started container coredns
  Warning  Unhealthy         3m13s                  kubelet            Readiness probe failed: Get "http://10.32.0.3:8181/ready": dial tcp 10.32.0.3:8181: connect: connection refused
  Warning  BackOff           2m54s (x12 over 4m5s)  kubelet            Back-off restarting failed container

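If I add some settings to /etc/resolv.conf, such as "nameserver 8.8.8.8", the CoreDNS Pods start running. However, I currently do not use any external DNS at all, and with Docker as the CRI, CoreDNS ran fine even though /etc/resolv.conf had no settings.

Is it possible to handle this problem without configuring upstream DNS servers in resolv.conf?

Server information: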

OS: RedHat Enterprise Linux 8.4
cri: containerd 1.4.11
cni: weave-net 1.16
tools: kubeadm, kubectl, kubelet 1.22.1

I also tried Calico as the CNI, but the result was the same.
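The "no nameservers found" message and the fact that adding "nameserver 8.8.8.8" makes the Pods start both point at the contents of resolv.conf. As a small sketch, assuming the CoreDNS forward plugin is reading a resolv.conf-style file, the hypothetical helper below lists the nameserver entries found in such text, which is a quick way to check a node before deploying.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the addresses from 'nameserver' lines, ignoring comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#", ";")):
            continue  # comment line
        parts = stripped.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


print(nameservers(""))                       # []
print(nameservers("nameserver 8.8.8.8\n"))   # ['8.8.8.8']
```

An empty result corresponds to the situation described in the question, where resolv.conf carries no upstream servers.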

Daigo

Asked: 2021-10-03 04:13:07 +0800 CST

Kubeadm with containerd cannot use locally loaded images

0

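I am trying to build Kubernetes with containerd on a bare-metal server (RHEL 8).

There is no Internet connection, so I manually downloaded the required images (e.g. k8s.gcr.io/kube-scheduler:v1.22.1) and loaded them with "ctr image import".

The images appear to have been loaded successfully.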

#ctr images ls -q
k8s.gcr.io/coredns/coredns:v1.8.4
k8s.gcr.io/etcd:3.5.0-0
k8s.gcr.io/kube-apiserver:v1.22.1
k8s.gcr.io/kube-controller-manager:v1.22.1
k8s.gcr.io/kube-proxy:v1.22.1
k8s.gcr.io/kube-scheduler:v1.22.1
k8s.gcr.io/pause:3.5

Then I ran "kubeadm init", but it failed with an ImagePull error.

#kubeadm init --kubernetes-version=1.22.1 --cri-socket=/run/containerd/containerd.sock
[init] Using Kubernetes version: v1.22.1
[preflight] Running pre-flight checks
        [WARNING FileExisting-tc]: tc not found in system path
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:

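How can I make kubeadm use the local images? Or is it possible to ignore these preflight errors?

Edit: This procedure (loading the images manually instead of running kubeadm config images pull) works fine with Docker on CentOS 7.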
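One check that can be scripted is whether every image kubeadm expects appears by exactly the same name in the ctr listing. The sketch below is only an illustration: the helper name is hypothetical, and the required list is an assumption that mirrors the ctr output in the question (in practice it could come from kubeadm config images list).

```python
from typing import Iterable, List


def missing_images(required: Iterable[str], loaded_output: str) -> List[str]:
    """Return the required image references absent from the listing text."""
    loaded = {line.strip() for line in loaded_output.splitlines() if line.strip()}
    return sorted(img for img in required if img not in loaded)


# Output of "ctr images ls -q" as shown in the question.
loaded_output = """k8s.gcr.io/coredns/coredns:v1.8.4
k8s.gcr.io/etcd:3.5.0-0
k8s.gcr.io/kube-apiserver:v1.22.1
k8s.gcr.io/kube-controller-manager:v1.22.1
k8s.gcr.io/kube-proxy:v1.22.1
k8s.gcr.io/kube-scheduler:v1.22.1
k8s.gcr.io/pause:3.5
"""

# Assumed list of required images (hypothetical, for illustration only).
required = [
    "k8s.gcr.io/kube-apiserver:v1.22.1",
    "k8s.gcr.io/kube-controller-manager:v1.22.1",
    "k8s.gcr.io/kube-scheduler:v1.22.1",
    "k8s.gcr.io/kube-proxy:v1.22.1",
    "k8s.gcr.io/pause:3.5",
    "k8s.gcr.io/etcd:3.5.0-0",
    "k8s.gcr.io/coredns/coredns:v1.8.4",
]

print(missing_images(required, loaded_output))
```

If the result is empty, as it is for these inputs, the image names themselves are not the problem and the cause has to be looked for elsewhere, for example in which containerd namespace the images were imported into. That is something to verify, not a confirmed diagnosis.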
