I'm trying to optimize a storage setup on some Sun hardware with Linux. Any ideas would be greatly appreciated.
We have the following hardware:
- Sun Blade X6270
- 2× LSISAS1068E SAS controllers
- 2× Sun J4400 JBODs with 1 TB disks (24 disks per JBOD)
- Fedora Core 12
- 2.6.33 release kernel from FC13 (also tried the latest 2.6.31 kernel from FC12, same results)
Here's the datasheet for the SAS hardware:
http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf
It uses PCI Express 1.0a with 8 lanes. With 250 MB/sec of bandwidth per lane, each SAS controller should be able to do 2000 MB/sec.
Each controller can do 3 Gb/sec per port and has two 4-port PHYs. We connect both PHYs from each controller to a JBOD, so between a JBOD and its controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth.
With write caching enabled and doing large writes, each disk can sustain about 80 MB/sec (near the start of the disk). With 24 disks, that means each JBOD should be able to do 1920 MB/sec.
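Spelling out the back-of-the-envelope limits from the numbers above, the disks are the tightest constraint per JBOD:

PCIe, per controller:       8 lanes * 250 MB/sec        = 2000 MB/sec
SAS links, per JBOD:        2 PHYs * 4 ports * 3 Gb/sec = 24 Gb/sec (~3000 MB/sec)
Disks, per JBOD:            24 disks * 80 MB/sec        = 1920 MB/sec
Whole system (disk-bound):  2 JBODs * 1920 MB/sec       = 3840 MB/sec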
The multipath section in my multipath.conf for each of these devices looks like:

multipath {
    rr_min_io 100
    uid 0
    path_grouping_policy multibus
    failback manual
    path_selector "round-robin 0"
    rr_weight priorities
    alias somealias
    no_path_retry queue
    mode 0644
    gid 0
    wwid somewwid
}
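To confirm that both paths to each disk are active and grouped into a single multibus path group, something like this can be used (with the placeholder alias from above):

multipath -ll somealias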
I tried values of 50, 100, and 1000 for rr_min_io, but it doesn't seem to make much difference.
Along with varying rr_min_io I tried adding some delay between starting the dd's, to keep them from all writing over the same PHY at the same time, but that made no difference either, so I think the I/O is getting spread out properly.
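One way to sanity-check that writes really are being balanced across both paths is to watch the sd devices underneath each mpath device while the dd's run, e.g.:

iostat -dm 5

If the MB_wrtn/s figures for the two sd devices behind a given multipath device stay roughly equal, the round-robin is doing its job.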
According to /proc/interrupts, the SAS controllers are using an "IR-IO-APIC-fasteoi" interrupt scheme. For some reason only core #0 in the machine is handling these interrupts. I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller:
echo 2 > /proc/irq/24/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
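To see which cores are actually picking up the controller interrupts, and to confirm the affinity masks took effect (24 and 26 being the IRQ numbers from above), something like:

egrep '^ *(24|26):' /proc/interrupts
cat /proc/irq/24/smp_affinity /proc/irq/26/smp_affinity

The smp_affinity values are CPU bitmasks, so 2 = core #1 and 4 = core #2.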
Writing to the disks with dd generates "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off that core as well.
I run 48 dd's (one per disk), assigning them to cores that aren't handling interrupts, like so:
taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M
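In practice that means a small wrapper loop, roughly like the following (a sketch rather than the exact script; the core list and the /dev/mapper/mpath* naming are assumptions based on the setup described above):

# cores 1, 2 and 4 are left free for the SAS and function-call interrupts
cores=(3 5 6 7 8 9 10 11 12 13 14 15)
i=0
for dev in /dev/mapper/mpath*; do
    core=${cores[$(( i % ${#cores[@]} ))]}
    taskset -c "$core" dd if=/dev/zero of="$dev" oflag=direct bs=128M &
    i=$(( i + 1 ))
done
wait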
oflag=direct keeps the buffer cache from getting involved.
None of my cores appear to be maxed out. The cores handling interrupts are mostly idle and all the other cores are waiting on I/O, as one would expect:
CPU0:   0.0%us,  1.0%sy,  0.0%ni, 91.2%id,  7.5%wa,  0.0%hi,  0.2%si,  0.0%st
CPU1:   0.0%us,  0.8%sy,  0.0%ni, 93.0%id,  0.2%wa,  0.0%hi,  6.0%si,  0.0%st
CPU2:   0.0%us,  0.6%sy,  0.0%ni, 94.4%id,  0.1%wa,  0.0%hi,  4.8%si,  0.0%st
CPU3:   0.0%us,  7.5%sy,  0.0%ni, 36.3%id, 56.1%wa,  0.0%hi,  0.0%si,  0.0%st
CPU4:   0.0%us,  1.3%sy,  0.0%ni, 85.7%id,  4.9%wa,  0.0%hi,  8.1%si,  0.0%st
CPU5:   0.1%us,  5.5%sy,  0.0%ni, 36.2%id, 58.3%wa,  0.0%hi,  0.0%si,  0.0%st
CPU6:   0.0%us,  5.0%sy,  0.0%ni, 36.3%id, 58.7%wa,  0.0%hi,  0.0%si,  0.0%st
CPU7:   0.0%us,  5.1%sy,  0.0%ni, 36.3%id, 58.5%wa,  0.0%hi,  0.0%si,  0.0%st
CPU8:   0.1%us,  8.3%sy,  0.0%ni, 27.2%id, 64.4%wa,  0.0%hi,  0.0%si,  0.0%st
CPU9:   0.1%us,  7.9%sy,  0.0%ni, 36.2%id, 55.8%wa,  0.0%hi,  0.0%si,  0.0%st
CPU10:  0.0%us,  7.8%sy,  0.0%ni, 36.2%id, 56.0%wa,  0.0%hi,  0.0%si,  0.0%st
CPU11:  0.0%us,  7.3%sy,  0.0%ni, 36.3%id, 56.4%wa,  0.0%hi,  0.0%si,  0.0%st
CPU12:  0.0%us,  5.6%sy,  0.0%ni, 33.1%id, 61.2%wa,  0.0%hi,  0.0%si,  0.0%st
CPU13:  0.1%us,  5.3%sy,  0.0%ni, 36.1%id, 58.5%wa,  0.0%hi,  0.0%si,  0.0%st
CPU14:  0.0%us,  4.9%sy,  0.0%ni, 36.4%id, 58.7%wa,  0.0%hi,  0.0%si,  0.0%st
CPU15:  0.1%us,  5.4%sy,  0.0%ni, 36.5%id, 58.1%wa,  0.0%hi,  0.0%si,  0.0%st
Given all this, the throughput reported by running "dstat 10" is in the range of 2200-2300 MB/sec.
Given the math above, I'd expect something in the range of 2*1920 = 3840 MB/sec, i.e. 3600+ MB/sec.
Does anyone know where my missing bandwidth went?
Thanks!
Nice, well-prepared question :)
I'm a speed-and-feeds man myself and, honestly, I think you're pretty much in the money. I was half expecting to see your throughput come in lower than it actually does; what I think you've got there is a build-up of minor, expected inefficiencies. For instance, it's very hard for a PCIe bus to run at 100% all the time; it's better to assume a low 90% overall rate. Given the jitter this causes, it also means the PHYs won't be 100% "fed" all the time, so you lose a little there, and the same goes for the cache, the disks, un-coalesced interrupts, I/O scheduling and so on. Basically it's minor inefficiency times minor inefficiency times... and so on, and it ends up being more than the 5-10% inefficiency you'd expect from any one of them on its own. I've seen this kind of thing before with HP DL servers talking to their MSA SAS boxes under W2K3 and then being NLB'ed over multiple NICs - frustrating but understandable, I guess. Anyway, that's my 2c; sorry it's not a more positive one.
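To put rough, purely illustrative numbers on that compounding: five stages each running at about 90% efficiency gives 0.9^5 ≈ 0.59, and 0.59 * 3840 MB/sec ≈ 2270 MB/sec, which is right in the range you're seeing. The 90% figure and the number of stages are made up for illustration; the point is that small losses multiply rather than add.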