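I am new to multi-node data centers. Here is what happened to me.
First, I used the program from this answer to check for CUDA devices. I built it (I ran into some problems there, but that is a separate question), and the executable is named device_info8.
So I logged in to my data center and ran the file from the login node: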
[me@login01 test]$ ./device_info8
Number of devices: 1
Device Number: 0
Device name: Tesla V100-PCIE-16GB
Memory Clock Rate (MHz): 856
Memory Bus Width (bits): 4096
Peak Memory Bandwidth (GB/s): 898.0
Total global memory (Gbytes) 15.8
Shared memory per block (Kbytes) 48.0
minor-major: 0-7
Warp-size: 32
Concurrent kernels: yes
Concurrent computation/communication: yes
I cannot access the node I want to test directly, so I ran:
[me@login01 test]$ srun -p partition1 --nodelist Node-11 --gres=gpu:all --pty -u bash -i
[me@Node-11 test]$
Now I get:
[me@Node-11 test]$ ./device_info8
Number of devices: 0
However, when I run nvidia-smi, I can clearly see that I have 8 GPUs available!
[me@Node-11 test]$ nvidia-smi
Tue Dec 3 18:16:04 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:2D:00.0 Off | 0 |
| N/A 28C P0 26W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:31:00.0 Off | 0 |
| N/A 26C P0 25W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... On | 00000000:35:00.0 Off | 0 |
| N/A 26C P0 25W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... On | 00000000:39:00.0 Off | 0 |
| N/A 27C P0 24W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-PCIE... On | 00000000:A9:00.0 Off | 0 |
| N/A 26C P0 26W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-PCIE... On | 00000000:AD:00.0 Off | 0 |
| N/A 29C P0 25W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-PCIE... On | 00000000:B1:00.0 Off | 0 |
| N/A 27C P0 24W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-PCIE... On | 00000000:B5:00.0 Off | 0 |
| N/A 28C P0 27W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Why is this happening? What am I overlooking? How can I make the GPUs available to the program?
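One way to narrow this down might be to compare what each shell actually exposes to a CUDA program. The following is only a minimal sketch: `report_gpu_env` is a hypothetical helper name that is not part of the original program, and it assumes `nvidia-smi` (and optionally `nvcc`) may or may not be on the PATH. It prints the host name, the value of `CUDA_VISIBLE_DEVICES` (which Slurm normally sets for jobs that request GPUs through `--gres`), the driver version reported for each GPU, and the toolkit version if one is installed.

```shell
# Hypothetical helper (not from the original post): print what the current
# shell exposes to a CUDA program, so two nodes can be compared side by side.
report_gpu_env() {
    echo "host=$(uname -n)"

    # Prints <unset> only when the variable does not exist at all.
    # A variable that is set but empty prints nothing after the '=',
    # which is worth noticing because an empty value hides every GPU.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"

    if command -v nvidia-smi >/dev/null 2>&1; then
        # GPU index, name, and driver version as reported by the driver.
        nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi: not found"
    fi

    if command -v nvcc >/dev/null 2>&1; then
        # Toolkit version in this environment. Assumption: the binary was
        # built with this toolkit, which may not be true on every cluster.
        nvcc --version | tail -n 2
    else
        echo "nvcc: not found"
    fi
}

report_gpu_env
```

Running this once on the login node and once inside the `srun` session should show whether the two environments differ. Two things seem worth checking in the result: whether `CUDA_VISIBLE_DEVICES` is empty inside the allocation, and whether the driver on Node-11 (418.67, CUDA 10.1 according to the nvidia-smi output above) is older than the toolkit the binary was built with. In the second case the CUDA runtime could fail to initialize, and a program that does not check the return value of `cudaGetDeviceCount` might then simply report 0 devices.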