Questions tagged [nvidia] - Page 1

XPLOT1ON

Asked: 2021-12-02 03:37:51 +0800 CST

Containerd fails to start after NVIDIA config

0

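I followed this official tutorial to give a bare-metal k8s cluster GPU access, but I get an error when doing so.

Kubernetes 1.21, containerd 1.4.11, and Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).

The NVIDIA driver is preinstalled on the host OS, version 495 headless.

After pasting the following configuration into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit 1.

The system journal is here.

containerd config.toml: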

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]

# NVIDIA CONFIG START HERE

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

# NVIDIA CONFIG ENDS HERE

[debug]
  level = ""

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[plugins.linux]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"

I can confirm that the NVIDIA driver does detect the GPU (Nvidia GTX 750Ti): running nvidia-smi gives the following output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I modified config.toml and got it working.
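One thing that may be worth checking in a file like the one above: it declares version = 2 but still uses the short plugin names "restart" and "linux". The sketch below assumes that containerd, for version 2 configs, expects fully qualified plugin names with at least four dot-separated parts (such as io.containerd.grpc.v1.cri). That rule is an assumption here and is not confirmed by anything in the question. The helper find_short_plugin_names is hypothetical, and it scans the text with regular expressions instead of using a real TOML parser.

```python
import re

# Excerpt of the config.toml from the question.
SAMPLE = '''
disabled_plugins = ["restart"]
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
[plugins.linux]
  shim = "/usr/bin/containerd-shim"
'''


def find_short_plugin_names(config_text):
    """Return plugin names with fewer than 4 dot-separated parts.

    Assumption: a version 2 config wants names like
    "io.containerd.grpc.v1.cri", so shorter names are suspicious.
    """
    names = []
    # Entries of the top-level disabled_plugins list.
    match = re.search(r'^\s*disabled_plugins\s*=\s*\[(.*?)\]',
                      config_text, re.M | re.S)
    if match:
        names += re.findall(r'"([^"]*)"', match.group(1))
    # First key after "plugins." in every table header.
    for header in re.findall(r'^\s*\[plugins\.([^\]]+)\]\s*$',
                             config_text, re.M):
        if header.startswith('"'):
            names.append(header[1:].split('"', 1)[0])
        else:
            names.append(header.split('.', 1)[0])
    short = []
    for name in names:
        if len(name.split('.')) < 4 and name not in short:
            short.append(name)
    return short


print(find_short_plugin_names(SAMPLE))
```

If the assumption holds, either renaming the reported entries to their fully qualified form or removing them would be the thing to try first.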

Ash

Asked: 2021-05-06 09:31:03 +0800 CST

Is the Pod Resources API disabled on Google Kubernetes Engine?

0

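Problem summary:

We are using DCGM Exporter to collect metrics about GPU workloads. When deployed on GKE, the exporter does not return GPU information about other pods or containers (when it should return that information).

The exporter runs one replica per node and queries the Pod Resources API exposed by the kubelet to get the data it needs. It seems that on GKE, compared with other Kubernetes distributions, this API is disabled or configured differently.

Problem demonstration:

Our test scenario consists of a single-node cluster on which dcgm-exporter runs, plus a single-replica deployment that uses GPU resources (called cuda-test in this demonstration).

We query the exporter through its /metrics endpoint, with the following results.

When running on Rancher k3s v1.20.4+k3s1, the container and pod labels contain a value: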

dcgm_sm_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 1860
dcgm_memory_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 9501
dcgm_gpu_temp{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 41

But when running on GKE v1.19.8-gke.1600, the container and pod labels are both empty:

dcgm_sm_clock{gpu="0",UUID="GPU-bcb71627-8b00-b4aa-70d7-ce39fd6cbb01",device="nvidia0",Hostname="dcgm-exporter-ln6x4",container="",namespace="",pod=""} 585
dcgm_memory_clock{gpu="0",UUID="GPU-bcb71627-8b00-b4aa-70d7-ce39fd6cbb01",device="nvidia0",Hostname="dcgm-exporter-ln6x4",container="",namespace="",pod=""} 5000
dcgm_gpu_temp{gpu="0",UUID="GPU-bcb71627-8b00-b4aa-70d7-ce39fd6cbb01",device="nvidia0",Hostname="dcgm-exporter-ln6x4",container="",namespace="",pod=""} 76

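I could not find any information on whether GKE disables this API (introduced in k8s 1.13) or restricts certain values from being exposed. I would like to learn more about this and find a solution so that the exporter can access and collect the information.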
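To compare the two clusters automatically instead of by eye, the /metrics output can be scanned for samples whose pod or container label is empty. The following is a minimal sketch that only handles simple sample lines like the ones shown above (no escaped quotes in label values). The function names parse_sample and unattributed are hypothetical.

```python
import re

LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# One sample from each cluster, copied from the outputs above.
K3S_LINE = ('dcgm_gpu_temp{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",'
            'device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",'
            'namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 41')
GKE_LINE = ('dcgm_gpu_temp{gpu="0",UUID="GPU-bcb71627-8b00-b4aa-70d7-ce39fd6cbb01",'
            'device="nvidia0",Hostname="dcgm-exporter-ln6x4",container="",'
            'namespace="",pod=""} 76')


def parse_sample(line):
    """Split one text-format sample into (name, labels, value)."""
    name, rest = line.strip().split("{", 1)
    label_text, value = rest.rsplit("}", 1)
    return name, dict(LABEL_RE.findall(label_text)), float(value)


def unattributed(lines):
    """Names of metrics whose pod or container label is empty or missing."""
    return [name for name, labels, _ in map(parse_sample, lines)
            if not labels.get("pod") or not labels.get("container")]


print(unattributed([K3S_LINE, GKE_LINE]))
```

A non-empty result on one cluster and an empty result on the other narrows the difference down to how the exporter obtains the pod mapping on that cluster.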

JohnA.Zoidberg

Asked: 2021-03-22 10:26:08 +0800 CST

Slurm nvidia-docker ignores CUDA_VISIBLE_DEVICES

1

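I have a problem running nvidia-docker containers on a Slurm cluster. Inside the container all GPUs are visible, so it basically ignores the CUDA_VISIBLE_DEVICES environment variable set by Slurm. Outside the container, the visible GPUs are correct.

Is there a way to restrict the container, for example with -e NVIDIA_VISIBLE_DEVICES? Or is there a way to set NVIDIA_VISIBLE_DEVICES to the value of CUDA_VISIBLE_DEVICES?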
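The second idea can be sketched as a small wrapper that copies the value Slurm exported into NVIDIA_VISIBLE_DEVICES when building the docker command line. This is only a sketch under assumptions: the image name my-image:latest and the helper docker_run_args are hypothetical, --runtime=nvidia assumes the NVIDIA runtime is registered with Docker, and whether the indices in CUDA_VISIBLE_DEVICES match the host indices the NVIDIA runtime expects depends on how Slurm is configured, which the question does not say.

```python
import os


def docker_run_args(image, env=None):
    """Build a docker command that exposes only the GPUs Slurm assigned."""
    env = os.environ if env is None else env
    devices = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    # Assumption: with no Slurm GPU allocation, expose no GPUs ("none")
    # instead of falling back to every GPU on the host.
    visible = devices or "none"
    return ["docker", "run", "--rm", "--runtime=nvidia",
            "-e", f"NVIDIA_VISIBLE_DEVICES={visible}", image]


# Example with a made-up allocation of GPUs 2 and 3.
print(" ".join(docker_run_args("my-image:latest",
                               {"CUDA_VISIBLE_DEVICES": "2,3"})))
```

Inside a job script the second argument would be omitted so that the real environment is read.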

Elras

Asked: 2020-07-18 00:19:10 +0800 CST

GKE cannot schedule newly created Pods that require GPUs on newly added GPU nodes

1

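When a new node pool with GPUs is added, Google Kubernetes Engine fails to schedule newly created Pods that require GPUs onto those new nodes. This should be automatic, but apparently it is not for GPU resources: the new Pods stay in the "Pending" state forever. How can I solve this?

Edit: here is the deployment YAML file. My goal is not to bind the deployment to a specific node: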

    ---
    apiVersion: machinelearning.seldon.io/v1alpha2
    kind: SldDeployment
    metadata:
      labels:
        app: sld
      name: trs-sld
      namespace: trs
    spec:
      annotations:
        project_name: Trs
        deployment_version: v1.0
        seldon.io/rest-connect-retries: '5'
        seldon.io/grpc-connect-retries: '5'
        seldon.io/istio-retries: '10' 
        seldon.io/istio-retries-timeout: '12' 
      name: trs
      predictors:
      - componentSpecs:
        - spec:
            containers:
            - image: eu.gcr.io/trs-141513/trs-native:latest
              imagePullPolicy: Always
              name: classifier
              resources:
                limits:
                  nvidia.com/gpu: 2
              volumeMounts:
                - mountPath: /etc/google_storage/creds
                  name: service-account-creds
                  readOnly: true
            volumes:
              - name: service-account-creds
                secret:
                  secretName: service-account-creds
            terminationGracePeriodSeconds: 20
        graph:
          children: []
          name: classifier
          endpoint:
            type: REST
          type: MODEL
        name: model
        replicas: 1
        annotations:
          predictor_version: v1.0
    ---
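The container above requests nvidia.com/gpu: 2, so it only fits on a node that advertises at least two allocatable GPUs. A node that does not advertise the resource at all never fits. A quick way to check this is to look at the allocatable resources of each node. The sketch below works on the parsed output of kubectl get nodes -o json. The node names and GPU counts in SAMPLE_NODES are made up for illustration, and nodes_that_fit is a hypothetical helper.

```python
def nodes_that_fit(node_list, gpus_requested):
    """Names of nodes whose allocatable nvidia.com/gpu covers the request."""
    fitting = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are strings in the Kubernetes API; a missing key
        # means the node does not advertise any GPU.
        gpus = int(allocatable.get("nvidia.com/gpu", "0"))
        if gpus >= gpus_requested:
            fitting.append(node["metadata"]["name"])
    return fitting


# Hypothetical data shaped like `kubectl get nodes -o json`.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "cpu-node"},
         "status": {"allocatable": {"cpu": "4"}}},
        {"metadata": {"name": "gpu-node-a"},
         "status": {"allocatable": {"nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "gpu-node-b"},
         "status": {"allocatable": {"nvidia.com/gpu": "2"}}},
    ]
}

print(nodes_that_fit(SAMPLE_NODES, 2))
```

An empty result for the real cluster would mean no node currently offers enough GPUs for this Pod, which is consistent with it staying Pending.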

Ginger

Asked: 2020-07-16 11:19:44 +0800 CST

Cannot see the BIOS/boot screen on a Dell T7500 with a Quadro 4000 card

0

I have a Dell T7500 with a Quadro 4000 card.

I have just connected a new Philips 328E1CA via DisplayPort. The new monitor has only DisplayPort and HDMI inputs. The monitor specs are here:

https://www.philips.co.uk/cp/328E1CA_00/curved-lcd-monitor-with-ultra-wide-color

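My operating system is Ubuntu.

I do not see any boot/BIOS screen. My system has an encrypted hard drive that requires a password at boot, and I do not see that screen either. But if I type my password and press Enter, it takes me to the Ubuntu desktop, where I can see things.

How can I see the boot/BIOS screen over DisplayPort?

Edit:

I see that Amazon sells DVI-to-HDMI cables. If I use one of these to connect to the monitor, is there a chance it will let me see the boot screen?

Edit:

I have not updated any drivers, but this morning I saw the password screen. I am not sure what was different. Maybe I turned on the computer and monitor in a different order?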

Travis DePrato

Asked: 2017-01-16 15:20:58 +0800 CST

Installing NVIDIA drivers for a diskless environment

2

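I am trying to set up a cluster of 8 computers and a master file server. Ideally, I would like to set it up in a PXE-boot, quasi-diskless/quasi-stateless environment (i.e., the only local storage is /var, where things like TORQUE configuration would go). Each of the 8 compute nodes has 4 NVIDIA Tesla K40m, but the root file server has no GPU.

Ideally, I would like to create the full installation on the file server (at /node) and then PXE-boot the compute nodes from it. However, I have not found a way to install the NVIDIA driver without an NVIDIA GPU on board. I found a question on NVIDIA's forums about someone trying this without success...

Alternatively, I could install the NVIDIA driver on one compute node (one currently running CentOS on its local disk) into (for example) /usr/local/nvidia, track the files it creates, and make a tarball to copy into the file server installation.

Finally, I could just maintain eight separate installations, but I do not like that from a long-term maintenance standpoint (each compute node will be running TORQUE jobs, so I want the nodes to look more or less identical).

In summary, what I am asking is:

Can I install the NVIDIA driver without an NVIDIA GPU on board?
Is there another way I can solve this problem?

For reference, we are running CentOS 7.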

[root@compute-3 /]# uname -a
Linux compute-3 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
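The "track the files it creates and make a tarball" alternative mentioned above could be sketched as a snapshot taken before the install and a diff taken after it. This is only an illustration: snapshot and archive_new_files are hypothetical helpers, the sketch only notices files that are newly created (files the installer modifies in place are not detected), and it says nothing about whether the driver's kernel module will match the kernel that the PXE image boots.

```python
import os
import tarfile


def snapshot(root):
    """Return the set of file paths (relative to root) currently under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            found.add(os.path.relpath(full, root))
    return found


def archive_new_files(root, before, tar_path):
    """Write files that appeared under root since `before` to a gzip tarball."""
    new_files = sorted(snapshot(root) - before)
    with tarfile.open(tar_path, "w:gz") as tar:
        for rel in new_files:
            # Store paths relative to root so the archive can be
            # unpacked into the same prefix on the file server.
            tar.add(os.path.join(root, rel), arcname=rel)
    return new_files


# Intended use (paths are examples, not taken from a real install):
#   before = snapshot("/usr/local/nvidia")
#   ... run the driver installer on the compute node ...
#   archive_new_files("/usr/local/nvidia", before, "/tmp/nvidia-files.tar.gz")
```

Keeping the tarball outside the tracked directory avoids archiving a previous tarball on a later run.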

Locane

Asked: 2016-09-23 13:49:14 +0800 CST

CentOS 7 w/ Gnome hangs at boot after installing NVIDIA drivers?

0

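There is a lot of separate information on these topics, but I could not find an answer for what I believe is a very common situation.

I have 2 Nvidia GTX 1080s in a server with CentOS 7 and the Gnome desktop. The GPUs will be used exclusively for CUDA compute, not for video output.

See the screenshot of the kernel loading screen.

My xorg.conf looks like this: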

[root@0cc47a8a1a10 ~]# cat /etc/X11/xorg.conf
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 367.44  (buildmeister@swio-display-x86-rhel47-01)  Wed Aug 17 22:54:35 PDT 2016

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
    FontPath        "/usr/share/fonts/default/Type1"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

[root@0cc47a8a1a10 ~]#

Here is the last part of /var/log/Xorg.5.log:

[    37.157] (==) ModulePath set to "/usr/lib64/xorg/modules"
[    37.157] (WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
[    37.157] (WW) Disabling Keyboard0
[    37.157] (WW) Disabling Mouse0
[    37.157] (II) Loader magic: 0x7fd419fc1020
[    37.157] (II) Module ABI versions:
[    37.157]    X.Org ANSI C Emulation: 0.4
[    37.157]    X.Org Video Driver: 19.0
[    37.157]    X.Org XInput driver : 21.0
[    37.157]    X.Org Server Extension : 9.0
[    37.157] (II) xfree86: Adding drm device (/dev/dri/card1)
[    37.157] (II) xfree86: Adding drm device (/dev/dri/card2)
[    37.157] (II) xfree86: Adding drm device (/dev/dri/card0)
[    37.157] (II) xfree86: Adding drm device (/dev/dri/card3)
[    37.157] (II) xfree86: Adding drm device (/dev/dri/card4)
[    37.165] (--) PCI: (0:2:0:0) 10de:1b80:10de:119e rev 161, Mem @ 0xcf000000/16777216, 0x383fe0000000/268435456, 0x383ff0000000/33554432, I/O @ 0x00006000/128, BIOS @ 0x????????/524288
[    37.165] (--) PCI: (0:3:0:0) 10de:1b80:10de:119e rev 161, Mem @ 0xcd000000/16777216, 0x383fc0000000/268435456, 0x383fd0000000/33554432, I/O @ 0x00005000/128, BIOS @ 0x????????/524288
[    37.165] (--) PCI:*(0:6:0:0) 1a03:2000:15d9:0852 rev 48, Mem @ 0xcb000000/16777216, 0xcc000000/131072, I/O @ 0x00004000/128, BIOS @ 0x????????/131072
[    37.165] (--) PCI: (0:131:0:0) 10de:1b80:10de:119e rev 161, Mem @ 0xfa000000/16777216, 0x387fe0000000/268435456, 0x387ff0000000/33554432, I/O @ 0x0000d000/128, BIOS @ 0x????????/524288
[    37.165] (--) PCI: (0:132:0:0) 10de:1b80:10de:119e rev 161, Mem @ 0xf8000000/16777216, 0x387fc0000000/268435456, 0x387fd0000000/33554432, I/O @ 0x0000c000/128, BIOS @ 0x????????/524288
[    37.165] (II) LoadModule: "glx"
[    37.165] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[    37.171] (II) Module glx: vendor="NVIDIA Corporation"
[    37.171]    compiled for 4.0.2, module version = 1.0.0
[    37.171]    Module class: X.Org Server Extension
[    37.171] (II) NVIDIA GLX Module  367.44  Wed Aug 17 21:50:26 PDT 2016
[    37.171] (II) LoadModule: "nvidia"
[    37.171] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[    37.171] (II) Module nvidia: vendor="NVIDIA Corporation"
[    37.171]    compiled for 4.0.2, module version = 1.0.0
[    37.171]    Module class: X.Org Video Driver
[    37.171] (II) NVIDIA dlloader X Driver  367.44  Wed Aug 17 21:28:13 PDT 2016
[    37.171] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[    37.171] (++) using VT number 1

[    37.171] (EE) No devices detected.
[    37.171] (EE)
Fatal server error:
[    37.171] (EE) no screens found(EE)
[    37.171] (EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
[    37.171] (EE) Please also check the log file at "/var/log/Xorg.5.log" for additional information.
[    37.171] (EE)
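In the log above, the device marked with * (the primary one) has vendor ID 1a03, not NVIDIA's 10de, and the Device section of xorg.conf has no BusID line. One thing that could be tried is pointing the Device section at a specific NVIDIA card. The sketch below extracts candidate BusID strings from the PCI lines of the log. It rests on two assumptions that should be checked against the Xorg documentation: that the four numbers in parentheses are domain:bus:device:function in decimal, and that xorg.conf accepts a BusID of the form "PCI:bus:device:function". Whether adding a BusID removes the "No devices detected" error is not established by the question. The helper pci_devices is hypothetical.

```python
import re

# Matches entries such as "PCI: (0:2:0:0) 10de:1b80:..." and
# "PCI:*(0:6:0:0) 1a03:2000:...".
PCI_RE = re.compile(
    r'PCI:([ *])\((\d+):(\d+):(\d+):(\d+)\) ([0-9a-f]{4}):([0-9a-f]{4})')

# Three lines shortened from the log above.
SAMPLE_LOG = """\
[    37.165] (--) PCI: (0:2:0:0) 10de:1b80:10de:119e rev 161
[    37.165] (--) PCI:*(0:6:0:0) 1a03:2000:15d9:0852 rev 48
[    37.165] (--) PCI: (0:131:0:0) 10de:1b80:10de:119e rev 161
"""


def pci_devices(log_text):
    """List the PCI devices reported in an Xorg log."""
    devices = []
    for star, _domain, bus, dev, func, vendor, device in PCI_RE.findall(log_text):
        devices.append({
            # Assumed xorg.conf form: "PCI:bus:device:function".
            "bus_id": f"PCI:{bus}:{dev}:{func}",
            "vendor": vendor,
            "device": device,
            "primary": star == "*",
        })
    return devices


for entry in pci_devices(SAMPLE_LOG):
    if entry["vendor"] == "10de":  # NVIDIA's PCI vendor ID
        print(f'BusID "{entry["bus_id"]}"')
```

Each printed line is a candidate for the Device section of xorg.conf, one card at a time.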

MTSS

Asked: 2016-04-26 20:16:07 +0800 CST

Installing a graphics card in a ProLiant DL580 Gen8 server

2

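We have a ProLiant DL580 Gen8 server and want to install a Gigabyte GeForce GTX 980 Ti graphics card in a PCIe slot. When we connect the 8-pin power connector, the server does not power on. When the power connector is not connected, the server boots but the graphics card is not detected. Now my question is:

Does this server support this graphics card? If so, what is the problem?

Our power cable is an ordinary 8-pin cable.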
