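The GPU cards in our internal environment became unusable and kept failing after the server was rebooted, so we used the gpu-burn tool to stress-test them.
Procedure
Download the source code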
git clone https://github.com/wilicc/gpu-burn.git
Install dependencies
CUDA must be installed beforehand; without it gpu-burn can neither be compiled nor run.
yum install -y gcc-c++
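Before building, it can help to confirm that the prerequisites are actually reachable. The following is a minimal sketch, not part of gpu-burn: missing_tools is a hypothetical helper name, and the list of tools checked (nvcc, nvidia-smi, g++, make) is an assumption about what a typical build host needs. A missing nvcc on PATH is only a hint, since CUDA may be installed in a directory that the build locates on its own.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print each named command that is
# not found on PATH; prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list: nvcc comes with the CUDA toolkit, nvidia-smi with the
# NVIDIA driver, g++ with the gcc-c++ package installed above.
missing=$(missing_tools nvcc nvidia-smi g++ make)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All build prerequisites found on PATH"
fi
```

If anything is reported, install it or add its directory to PATH before running make.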
Build
cd gpu-burn/
make
Usage
Once the build finishes, the binary can be run directly from the source directory.
[root@lolicp gpu-burn]# ./gpu_burn
Run length not specified in the command line. Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: Tesla T4 (UUID: GPU-91e6dfa9-4514-b9a7-d868-442a1b821a62)
GPU 1: Tesla T4 (UUID: GPU-6863b855-2e09-6ab0-a62f-e1b224a625c9)
Initialized device 1 with 15109 MB of memory (6207 MB available, using 5586 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 19 iterations
Initialized device 0 with 15109 MB of memory (14801 MB available, using 13321 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 50 iterations
100.0% proc'd: 0 (0 Gflop/s) - 19 (1805 Gflop/s) errors: 0 - 0 temps: 36 C - 59 C
Summary at: Tue May 14 13:21:17 CST 2024
Killing processes with SIGTERM (soft kill)
Freed memory for dev 1
Uninitted cublas
Freed memory for dev 0
Uninitted cublas
done
Tested 2 GPUs:
GPU 0: OK
GPU 1: OK
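The run above used the default 10 seconds, which is short for diagnosing an intermittent fault. gpu_burn accepts a run length in seconds as a positional argument, so a longer burn can be scripted and the summary checked automatically. The following is a sketch under stated assumptions: count_faulty is a hypothetical helper, burn.log is an arbitrary file name, the 60-second duration is only an example, and the parser assumes the summary keeps the "Tested N GPUs:" / "GPU N: OK" layout shown in the log above.

```shell
#!/bin/sh
# count_faulty (hypothetical helper): read gpu_burn output on stdin and
# print how many GPUs in the final summary are not reported as OK.
# Only lines after "Tested N GPUs:" are considered, so the earlier
# "GPU 0: Tesla T4 ..." device listing is ignored.
count_faulty() {
    awk '
        /Tested [0-9]+ GPUs:/ { in_summary = 1; next }
        in_summary && /GPU [0-9]+: / && !/: OK[ \t\r]*$/ { bad++ }
        END { print bad + 0 }
    '
}

# Run only if the binary is present in the current directory.
if [ -x ./gpu_burn ]; then
    ./gpu_burn 60 | tee burn.log        # 60 s burn; duration is an example
    faulty=$(count_faulty < burn.log)
    echo "GPUs not reported OK: $faulty"
fi
```

A non-zero count means at least one card did not pass and is worth a closer look with a longer run.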