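I followed the official tutorial to give a bare-metal k8s cluster GPU access, but I ran into errors along the way.
Environment: Kubernetes 1.21, containerd 1.4.11, and Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).
The Nvidia driver is preinstalled on the host OS, version 495 Headless.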
After pasting the following config into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit 1.
System log here.
Containerd config.toml:
# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]
# NVIDIA CONFIG START HERE
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
# NVIDIA CONFIG ENDS HERE
[debug]
level = ""
[grpc]
max_recv_message_size = 16777216
max_send_message_size = 16777216
[plugins.linux]
shim = "/usr/bin/containerd-shim"
runtime = "/usr/bin/runc"
I can confirm that the Nvidia driver does detect the GPU (Nvidia GTX 750Ti): running nvidia-smi gives the following output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Edit: I modified config.toml and that made it work.
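As best I can tell, the culprit is the disabled_plugins = ["restart"] line: with version = 2, containerd expects plugin names in that list to be written in the fully qualified io.containerd URI form, and a bare "restart" does not match it.
So, if you know that the restart-ish plugin is in fact enabled, you'll need to track down its new URI syntax. I'd actually recommend just commenting out that stanza, or using disabled_plugins = [], since the containerd ansible role we use doesn't mention anything about "restart" and does have the = [] flavor.
Tangentially, you may want to restrict your journalctl invocation in the future to just containerd.service, since the unfiltered journal throws out a lot of distracting text: journalctl -u containerd.service
You can even restrict it to the last few lines, which can sometimes help further: journalctl -u containerd.service --lines=250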
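For what it's worth, here is a minimal sketch of the "just use disabled_plugins = []" suggestion as a shell function. The name reset_disabled_plugins is made up for illustration, and the sketch assumes the setting sits on a single line, as it does in the config in the question. It keeps a .bak copy next to the file so the change is easy to undo.

```shell
#!/bin/sh
# Hypothetical helper (not part of containerd): replace any single-line
# disabled_plugins setting with an empty list.
# Assumption: the setting is not spread over several lines.
reset_disabled_plugins() {
    config="${1:-/etc/containerd/config.toml}"
    # Back up the original, then rewrite the file from the backup.
    cp "$config" "$config.bak" || return 1
    sed 's/^[[:space:]]*disabled_plugins[[:space:]]*=.*/disabled_plugins = []/' \
        "$config.bak" > "$config"
}

# Example (run as root on the node; try it on a copy of the file first):
#   reset_disabled_plugins /etc/containerd/config.toml
#   systemctl restart containerd
#   journalctl -u containerd.service --lines=250
```

If containerd still refuses to start after that, the filtered journalctl output should show the next complaint without the rest of the system log getting in the way.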