I have an AMD EPYC 7502P 32-Core Linux server (kernel 6.10.6) with 6 NVMe drives on which I/O performance suddenly dropped. Every operation takes a very long time. Installing package updates takes hours instead of seconds (or at most a few minutes).
I tried running fio on a filesystem sitting on top of the RAID5 array. The spread in the clat metric is huge:
clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05
The stdev value is extreme.
Full output:
$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][w=53.3MiB/s][w=13.6k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=48391: Wed Sep 25 09:17:02 2024
write: IOPS=45.5k, BW=178MiB/s (186MB/s)(10.6GiB/61165msec); 0 zone resets
slat (nsec): min=552, max=123137, avg=2016.89, stdev=468.03
clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05
lat (usec): min=10, max=359716, avg=18.13, stdev=592.03
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 14], 20.00th=[ 15],
| 30.00th=[ 15], 40.00th=[ 15], 50.00th=[ 15], 60.00th=[ 16],
| 70.00th=[ 16], 80.00th=[ 16], 90.00th=[ 17], 95.00th=[ 18],
| 99.00th=[ 20], 99.50th=[ 22], 99.90th=[ 42], 99.95th=[ 119],
| 99.99th=[ 186]
bw ( KiB/s): min=42592, max=290232, per=100.00%, avg=209653.41, stdev=46502.99, samples=105
iops : min=10648, max=72558, avg=52413.32, stdev=11625.75, samples=105
lat (nsec) : 250=0.01%, 500=0.01%, 1000=0.01%
lat (usec) : 10=0.01%, 20=99.15%, 50=0.76%, 100=0.03%, 250=0.06%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 500=0.01%
cpu : usr=12.62%, sys=30.97%, ctx=2800981, majf=0, minf=28
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2784519,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=178MiB/s (186MB/s), 178MiB/s-178MiB/s (186MB/s-186MB/s), io=10.6GiB (11.4GB), run=61165-61165msec
Disk stats (read/write):
md1: ios=0/710496, merge=0/0, ticks=0/12788992, in_queue=12788992, util=23.31%, aggrios=319833/649980, aggrmerge=0/0, aggrticks=118293/136983, aggrin_queue=255276, aggrutil=14.78%
nvme1n1: ios=318781/638009, merge=0/0, ticks=118546/131154, in_queue=249701, util=14.71%
nvme5n1: ios=321508/659460, merge=0/0, ticks=118683/138996, in_queue=257679, util=14.77%
nvme2n1: ios=320523/647922, merge=0/0, ticks=120634/134284, in_queue=254918, util=14.71%
nvme3n1: ios=320809/651642, merge=0/0, ticks=118823/135985, in_queue=254808, util=14.73%
nvme0n1: ios=316267/642934, merge=0/0, ticks=116772/143909, in_queue=260681, util=14.75%
nvme4n1: ios=321110/659918, merge=0/0, ticks=116300/137570, in_queue=253870, util=14.78%
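To see when the 359 ms outliers actually occur, one option would be to re-run the same job with fio's per-I/O latency logging and sort the completion-latency log (a sketch; randwrite-lat is just an arbitrary file prefix I picked):

$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 \
      --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1 \
      --write_lat_log=randwrite-lat
$ # each log line is "time_ms, latency_ns, direction, blocksize, offset";
$ # the file may be named randwrite-lat_clat.log on older fio versions
$ sort -t, -k2,2nr randwrite-lat_clat.1.log | head    # worst completions first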
Probably one disk is faulty. Is there any way to determine which disk is the slow one?
All disks have similar SMART attributes, nothing out of the ordinary. SAMSUNG 7T:
Model Number: SAMSUNG MZQL27T6HBLA-00A07
Firmware Version: GDC5902Q
Data Units Read: 2,121,457,831 [1.08 PB]
Data Units Written: 939,728,748 [481 TB]
Controller Busy Time: 40,224
Power Cycles: 5
Power On Hours: 6,913
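To compare the drives directly instead of through md, a read-only per-device probe might expose a single slow member (a sketch; --readonly keeps fio from writing to the raw devices, it needs root, and the reads will mix with live traffic):

$ for d in /dev/nvme{0..5}n1; do
      echo "== $d =="
      fio --name=probe --filename="$d" --readonly --direct=1 --ioengine=posixaio \
          --rw=randread --bs=4k --iodepth=1 --runtime=30 --time_based \
        | grep 'clat ('
  done

If one device shows a clat max/stdev far above the others, that is the suspect.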
Write performance also looks very similar across the drives:
iostat -xh
Linux 6.10.6+bpo-amd64 (ts01b) 25/09/24 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.0% 0.0% 4.3% 0.6% 0.0% 90.2%
r/s rkB/s rrqm/s %rrqm r_await rareq-sz Device
0.12 7.3k 0.00 0.0% 0.43 62.9k md0
6461.73 548.7M 0.00 0.0% 0.22 87.0k md1
3583.93 99.9M 9.60 0.3% 1.13 28.5k nvme0n1
3562.77 98.9M 0.80 0.0% 1.15 28.4k nvme1n1
3584.54 99.8M 9.74 0.3% 1.18 28.5k nvme2n1
3565.96 98.8M 1.06 0.0% 1.16 28.4k nvme3n1
3585.04 99.9M 9.78 0.3% 1.16 28.5k nvme4n1
3577.56 99.0M 0.86 0.0% 1.17 28.3k nvme5n1
w/s wkB/s wrqm/s %wrqm w_await wareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 4.0k md0
366.41 146.5M 0.00 0.0% 14.28 409.4k md1
8369.26 32.7M 1.18 0.0% 3.73 4.0k nvme0n1
8364.63 32.7M 1.12 0.0% 3.63 4.0k nvme1n1
8355.48 32.6M 1.10 0.0% 3.56 4.0k nvme2n1
8365.23 32.7M 1.10 0.0% 3.46 4.0k nvme3n1
8365.37 32.7M 1.25 0.0% 3.37 4.0k nvme4n1
8356.70 32.6M 1.06 0.0% 3.29 4.0k nvme5n1
d/s dkB/s drqm/s %drqm d_await dareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 0.0k md0
0.00 0.0k 0.00 0.0% 0.00 0.0k md1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme0n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme1n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme2n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme3n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme4n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme5n1
f/s f_await aqu-sz %util Device
0.00 0.00 0.00 0.0% md0
0.00 0.00 6.68 46.8% md1
0.00 0.00 35.24 14.9% nvme0n1
0.00 0.00 34.50 14.6% nvme1n1
0.00 0.00 33.98 14.9% nvme2n1
0.00 0.00 33.06 14.6% nvme3n1
0.00 0.00 32.33 14.8% nvme4n1
0.00 0.00 31.72 14.6% nvme5n1
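The averaged w_await hides the tail, so per-disk latency histograms could still single out one drive; if bcc-tools is installed, biolatency with per-disk output is an option (a sketch; the binary may be named biolatency-bpfcc on Debian):

$ biolatency-bpfcc -D -m 10 6    # per-disk block I/O latency histograms in ms, six 10-second windows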
The problem seems to be interrupt-related:
$ dstat -tf --int24 60
----system---- -------------------------------interrupts------------------------------
time | 120 128 165 199 213 342 LOC PMI IWI RES CAL TLB
25-09 10:53:45|2602 2620 2688 2695 2649 2725 136k 36 1245 2739 167k 795
25-09 10:54:45| 64 64 65 64 66 65 2235 1 26 16 2156 3
25-09 10:55:45| 33 31 32 32 32 30 2050 1 24 10 2162 20
25-09 10:56:45| 31 31 30 35 30 33 2303 1 26 63 2245 9
25-09 10:57:45| 36 29 27 34 35 35 2016 1 23 72 2645 10
25-09 10:58:45| 9 8 9 8 7 8 1766 0 27 4 1892 15
25-09 10:59:45| 59 62 59 58 60 60 1585 1 22 20 1704 9
25-09 11:00:45| 25 21 21 26 26 26 1605 0 26 10 1862 10
25-09 11:01:45| 34 32 32 33 36 31 1515 0 23 24 1948 10
25-09 11:02:45| 21 23 23 25 22 24 1772 0 27 27 1781 9
The columns with elevated interrupt counts all map to 9-edge vectors, nvme[0-5]q9 across all drives, for example:
$ cat /proc/interrupts | grep 120:
IR-PCI-MSIX-0000:01:00.0 9-edge nvme2q9
EDIT: the 9-edge entries are probably the Metadisk (software RAID) devices.
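To map each of the columns dstat reported (120, 128, 165, 199, 213, 342) to its NVMe queue and CPU affinity in one pass, something like this should work (a sketch; smp_affinity_list is standard on current kernels):

$ grep 'nvme[0-5]q9' /proc/interrupts | awk -F: '{print $1}' | tr -d ' ' \
    | while read -r irq; do
        printf 'IRQ %s -> CPU(s) %s\n' "$irq" "$(cat /proc/irq/$irq/smp_affinity_list)"
      done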