PowerEdge: Practical "NVIDIA-SMI" Queries for Troubleshooting
Summary: This article presents practical "NVIDIA-SMI" queries for troubleshooting NVIDIA GPU cards.
Instructions
VBIOS version
Query the VBIOS version of each device:
$ nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
GRID K2, 0000:87:00.0, 80.04.D4.00.07
GRID K2, 0000:88:00.0, 80.04.D4.00.08
| Query | Description |
|---|---|
| timestamp | The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec". |
| gpu_name | The official product name of the GPU. This is an alphanumeric string. For all products. |
| gpu_bus_id | PCI bus ID as "domain:bus:device.function", in hex. |
| vbios_version | The BIOS version of the GPU board. |
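On multi-GPU systems, the CSV output above can be checked mechanically. A minimal sketch, assuming a hypothetical helper (the `check_vbios` name and awk logic are ours, not part of nvidia-smi), that flags any GPU whose VBIOS differs from the first one seen:

```shell
# Hypothetical helper: reads the VBIOS CSV query output on stdin and
# reports any GPU whose VBIOS version differs from the first one seen.
check_vbios() {
  awk -F', ' '
    NR == 1 { next }               # skip the CSV header row
    first == "" { first = $3 }     # remember the first VBIOS version
    $3 != first {
      printf "MISMATCH: %s (%s) has VBIOS %s, expected %s\n", $1, $2, $3, first
      bad = 1
    }
    END { exit bad }
  '
}
```

Usage: `nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv | check_vbios` prints nothing and exits 0 when all versions match.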
Querying GPU metrics logged on the host
This query is useful for monitoring hypervisor-side GPU metrics.
This query works on both ESXi and XenServer:
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
When adding further parameters to the query, make sure no spaces are added between the query options.
| Query | Description |
|---|---|
| timestamp | The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec". |
| name | The official product name of the GPU. This is an alphanumeric string. For all products. |
| pci.bus_id | PCI bus ID as "domain:bus:device.function", in hex. |
| driver_version | The version of the installed NVIDIA display driver. This is an alphanumeric string. |
| pstate | The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance). |
| pcie.link.gen.max | The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports, this reports the system PCIe generation. |
| pcie.link.gen.current | The current PCI-E link generation. This may be reduced when the GPU is not in use. |
| temperature.gpu | Core GPU temperature, in degrees C. |
| utilization.gpu | Percent of time over the past sample period during which one or more kernels was executing on the GPU. |
| utilization.memory | Percent of time over the past sample period during which global (device) memory was being read or written. |
| memory.total | Total installed GPU memory. |
| memory.free | Total free memory. |
| memory.used | Total memory allocated by active contexts. |
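Logged CSV output from this query can be post-processed with standard tools. A minimal sketch, assuming the column order of the query above (the `peak_gpu_temp` helper name is ours; temperature.gpu is the 8th field in that order), that prints the highest core temperature seen in a log:

```shell
# Hypothetical helper: print the highest temperature.gpu value in a CSV
# log produced by the query above (field 8 in that column order).
peak_gpu_temp() {
  awk -F', ' 'NR > 1 && $8 + 0 > max { max = $8 + 0 } END { print max + 0 }'
}
```

Usage: `peak_gpu_temp < gpu_metrics.csv`.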
A complete list of query parameters can be obtained by issuing: nvidia-smi --help-query-gpu
Uses of nvidia-smi logging
Short-term logging
Add the option "-f <filename>" to redirect the output to a file.
Prepend "timeout -t <seconds>" to run the query for <seconds> and then stop logging.
Make sure the query granularity suits the intended use:
| Purpose | nvidia-smi "-l" value | Interval | timeout "-t" value | Duration |
|---|---|---|---|---|
| Fine-grain GPU behavior | 5 | 5 seconds | 600 | 10 minutes |
| General GPU behavior | 60 | 1 minute | 3600 | 1 hour |
| Broad GPU behavior | 3600 | 1 hour | 86400 | 24 hours |
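Put together, a short-term logging run from the table above might look like the following sketch. The wrapper name, query list, and output file name are our choices, not fixed syntax; also note that GNU coreutils timeout takes the duration directly (`timeout 600 …`), while some implementations such as BusyBox expect `-t`.

```shell
# Sketch: time-boxed GPU logging combining the "-l" sampling interval,
# a timeout, and "-f" to redirect the output to a file.
log_gpu_metrics() {
  interval=${1:-5}               # nvidia-smi "-l" value, in seconds
  duration=${2:-600}             # timeout value, in seconds
  outfile=${3:-gpu_metrics.csv}  # log file name (our choice)
  timeout "$duration" nvidia-smi \
      --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,memory.used \
      --format=csv -l "$interval" -f "$outfile"
}
```

For example, `log_gpu_metrics 5 600 fine_grain.csv` matches the fine-grain row of the table, and `log_gpu_metrics 3600 86400 broad.csv` the broad row.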
Long-term logging
Create a shell script to automate creating the log file, adding timestamp data to the file name and the query parameters.
Add a custom cron job under /var/spool/cron/crontabs to call the script at the desired interval.
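The two steps above can be sketched as follows; the log directory, the script path in the crontab comment, and the query list are illustrative assumptions, not fixed values.

```shell
#!/bin/sh
# gpu_log.sh -- hypothetical long-term logging script: writes one
# timestamped CSV file per invocation, suitable for calling from cron.

LOGDIR=${LOGDIR:-/var/log/gpu}   # illustrative location

# Build the timestamped file name for this run.
logfile_name() {
  echo "$LOGDIR/gpu-$(date +%Y%m%d-%H%M%S).csv"
}

main() {
  mkdir -p "$LOGDIR"
  nvidia-smi \
      --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,utilization.memory,memory.used \
      --format=csv > "$(logfile_name)"
}

# Run only on hosts where the driver is actually installed.
command -v nvidia-smi >/dev/null 2>&1 && main || :

# Example crontab entry (path assumed) calling the script hourly:
#   0 * * * * /usr/local/bin/gpu_log.sh
```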
Additional low-level commands for clocks and power
Enable "persistence" mode.
Any of the following clock and power settings reset between program runs unless persistence mode (PM) is enabled for the driver.
nvidia-smi commands also execute faster when PM is enabled.
nvidia-smi -pm 1 - Keeps clock, power, and other settings persistent across program runs and driver invocations.
Clocks
| Command | Detail |
|---|---|
| nvidia-smi -q -d SUPPORTED_CLOCKS | View supported clocks |
| nvidia-smi -ac <MEM clock, Graphics clock> | Set one of the supported clocks |
| nvidia-smi -q -d CLOCK | View current clocks |
| nvidia-smi --auto-boost-default=ENABLED -i 0 | Enable boosting GPU clocks (K80 and later) |
| nvidia-smi --rac | Reset clocks back to base |
Power
| Command | Detail |
|---|---|
| nvidia-smi -pl N | Set power cap (maximum wattage the GPU will use) |
| nvidia-smi -pm 1 | Enable persistence mode |
| nvidia-smi stats -i <device#> -d pwrDraw | Continuously monitor detailed statistics such as power draw |
| nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1 | Continuously provide timestamped power and clock information |
Other useful commands
| Command | Description |
|---|---|
| nvidia-smi -q | Query all the GPUs seen by the driver and display all readable attributes for each GPU. |
| nvidia-smi | Displays current GPU status, driver information, and a host of other statistics. |
| nvidia-smi -l | Scrolls the output of nvidia-smi continuously until stopped. |
| nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv | Provides timestamped power and clock information. |
| nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv | Queries the VBIOS version of each GPU in a system. |
| lspci -n \| grep 10de | Determines whether the GPU is in compute mode or graphics mode. |
| nvidia-smi nvlink -s -i <device#> | Displays the NVLink state for a specific GPU. |
| gpuswitchmode --listgpumodes | Displays the capability of GRID 2.0 cards to switch between compute and graphics modes. The package is not part of the normal CUDA or NVIDIA driver. |
| nvidia-smi -h | Displays the smi commands and syntax. |
| nvidia-bug-report.sh | Generates a bug report that is sent to a Level 3 support technician/NVIDIA. |
| nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv | Pulls out retired pages: GPU UUID, page fault address, and the cause of the page fault. |
| nvidia-smi stats | Displays device statistics. |
| nvcc --version | Shows the installed CUDA version. |
| nvidia-smi pmon | Displays process statistics in a scrolling format. |
| nvidia-smi nvlink -c -i <device#> | Displays the NVLink capabilities of a specific GPU. |
| gpuswitchmode --gpumode graphics | Changes the personality of the GPU from compute to graphics (M6 and M60 GPUs). |
| gpuswitchmode --gpumode compute | Changes the personality of the GPU from graphics to compute (M6 and M60 GPUs). |