Configuring keepalive for TOAD client communication

Question

nagylzs

Asked: 2024-07-26 20:59:53 +0800 CST

Intermittent TCP connection drops and timeouts over wireguard


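I have three servers connected in a full mesh over wireguard. They run Ubuntu Server 22.04 and a PostgreSQL repmgr cluster with streaming replication.

All machines have a public address, but the PostgreSQL instances and the database clients use the internal addresses (on the wireguard VPN).

On one of the machines, I see the following in the logs: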

2024-07-26 07:23:14.463 UTC [147915] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:25:56.242 UTC [148509] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:28:17.567 UTC [148818] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:33:13.234 UTC [149090] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:48:42.721 UTC [149723] FATAL:  terminating walreceiver due to timeout
2024-07-26 07:52:17.298 UTC [151521] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:01:25.141 UTC [151889] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:02:16.337 UTC [152868] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:06:13.169 UTC [152951] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:22:04.180 UTC [153377] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly

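In addition, when I try to connect to the primary database from Go or Python programs, I sometimes see "connection timed out", "connection reset by peer", "connection closed in the middle of an operation", and similar messages. Note that these messages occur on only one of the machines, not on the others.

On the server side (the primary PostgreSQL), I see the following in the logs: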

2024-07-26 12:31:36.667 UTC [3778655] telegraf@telegraf LOG:  could not receive data from client: Connection reset by peer
2024-07-26 12:31:36.897 UTC [3777638] telegraf@telegraf LOG:  could not receive data from client: Connection reset by peer
2024-07-26 12:31:39.462 UTC [3775606] telegraf@telegraf LOG:  could not receive data from client: Connection reset by peer
2024-07-26 12:31:39.480 UTC [3780628] telegraf@telegraf LOG:  could not receive data from client: Connection reset by peer

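These errors occur only a few times per hour. They are intermittent, but frequent enough to make my application unreliable. I ran this ping test between the public addresses: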

ping -c 3600 primary.public.com
# waited an hour...
--- primary.public.com ping statistics ---
3600 packets transmitted, 3600 received, 0% packet loss, time 3603052ms
rtt min/avg/max/mdev = 72.849/73.214/101.325/0.881 ms

I also ran a ping test against the private IP address:

ping -c 1008 primary.private.com
# waited...
--- primary.private.com ping statistics ---
1008 packets transmitted, 783 received, 22.3214% packet loss, time 1013304ms
rtt min/avg/max/mdev = 80.742/91.383/256.720/16.133 ms

In other words, 22% of the ping packets are lost over wireguard.

All wireguard devices have the default MTU of 1420.
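To compare the two paths repeatedly, the loss figure can be pulled out of the ping summary line. This is a minimal sketch, assuming the iputils summary format shown above; the function name `ping_loss` is made up for illustration.

```shell
# Read ping output on stdin and print only the packet loss percentage.
# Assumes the iputils summary line "..., N% packet loss, ...".
ping_loss() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For example, `ping -c 100 primary.private.com | ping_loss` prints the loss for the tunnel address, and the same command with `primary.public.com` prints it for the public path.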

3: dev0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.241.64.3/32 scope global dev0
       valid_lft forever preferred_lft forever

I also tried testing the MTU with this script:

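size=1272
# Grow the ICMP payload by 4 bytes until a ping with "don't fragment" set fails.
while ping -s $size -c1 -M do primary.internaladdress.com >&/dev/null; do
  ((size+=4))
done
# The last payload that worked is size-4; add 28 bytes for the IPv4 (20) and ICMP (8) headers.
echo "Max MTU size: $((size-4+28))"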

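It also printed 1420.

Note that the problem exists only between two of the three machines. For example, the link between A and B is bad, but the link between B and C is fine.

I should mention that the affected machines are far apart (on different continents), but that alone should not cause this.

As I understand it, wireguard encapsulates IP packets in encrypted UDP packets, and the TCP protocol takes care of retransmitting lost packets.

What is very strange is that IP packets between the public addresses show 0% loss, while the wireguard/UDP packets show more than 20% loss. Could the UDP packets be dropped by some router or switch? Could some kind of QoS be at work?

These servers are rented and far apart from each other. Obviously, I cannot do anything to eliminate the packet loss. I know that UDP is inherently unreliable. But I would like to know whether I can somehow make the TCP connections hold up. Even if they slow down at times (even if they cannot communicate for a second or two), they should not be reset. What are my options?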
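The arithmetic behind the 28 in the script, and behind the 1420 default, can be sketched as follows. The function names are hypothetical, and the 1500-byte outer link MTU used in the example below is an assumption (typical Ethernet), not something measured on these servers.

```shell
# ICMP payload size -> IPv4 packet size: 20-byte IPv4 header + 8-byte ICMP header.
payload_to_mtu() {
  echo $(( $1 + 20 + 8 ))
}

# Largest inner packet for a given outer link MTU and outer address family.
# WireGuard adds 32 bytes of its own framing on top of the 8-byte UDP header
# and the outer IP header (20 bytes for IPv4, 40 bytes for IPv6).
wg_inner_mtu() {
  case "$2" in
    ipv4) echo $(( $1 - 20 - 8 - 32 )) ;;
    ipv6) echo $(( $1 - 40 - 8 - 32 )) ;;
    *)    echo "usage: wg_inner_mtu <link-mtu> ipv4|ipv6" >&2; return 1 ;;
  esac
}
```

With a 1500-byte link, `wg_inner_mtu 1500 ipv6` gives 1420 and `wg_inner_mtu 1500 ipv4` gives 1440, so the default of 1420 leaves room for an IPv6 outer header. A result of 1420 from the script therefore says the tunnel MTU is reachable, but it does not show which address family carries the tunnel.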

1 Answer

nagylzs · Answer 1 · 2024-07-27T00:49:16+08:00

Best Answer


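Okay, here is what I found. All three servers have both IPv6 and IPv4 addresses assigned to their FQDNs. For some reason, servers B and C used IPv4 addresses when establishing the connection, but A and B used IPv6 addresses. It seems that encapsulating IPv4 packets in IPv6 UDP (wireguard) packets is problematic. I could not find the exact cause; it may lie outside my VPS servers. But the fact is that over IPv6 the packet loss was 20-70%, and of the worst kind (for example, no loss for a minute, then 100% loss for a few seconds). In addition, the response times were absurd: 20-90 ms within the same data center.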
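One way to see which address family each tunnel actually uses is to look at the peer endpoints that `wg show <interface> endpoints` reports. The sketch below assumes that IPv6 endpoints are printed as `[addr]:port` and IPv4 endpoints as `a.b.c.d:port`; the function names are made up for illustration, and `dev0` is the interface name from the question.

```shell
# Classify one endpoint string as printed by `wg show <interface> endpoints`.
# Assumption: IPv6 endpoints look like [addr]:port, IPv4 ones like a.b.c.d:port.
endpoint_family() {
  case "$1" in
    "["*"]:"*) echo ipv6 ;;
    *.*.*.*:*) echo ipv4 ;;
    *)         echo unknown ;;
  esac
}

# Print "<peer public key> <family>" for every peer of an interface.
list_endpoint_families() {
  wg show "$1" endpoints | while read -r peer endpoint; do
    echo "$peer $(endpoint_family "$endpoint")"
  done
}
```

Running `list_endpoint_families dev0` as root on each server would show whether a given pair, such as A and B, talks over IPv6 while another pair uses IPv4.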

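I then removed all IPv6 addresses from the public interfaces, which forced all wireguard traffic onto IPv4. Suddenly the response time dropped to about 2 ms on average, with 0% packet loss.

I could not determine the exact cause, but the problem was almost certainly not due to network oversubscription. This is not a real "solution", but it worked for me and may work for others as well.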
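A less invasive variant, which is not what I actually did, might be to leave the IPv6 addresses in place and point each peer at an IPv4 literal instead of the FQDN. The following is only a sketch under that assumption: the helper names are hypothetical, the address check is deliberately rough, and the address, port, and key in the usage note are placeholders.

```shell
# Rough check: only digits and dots, with at least three dots.
is_ipv4_literal() {
  case "$1" in
    ""|*[!0-9.]*) return 1 ;;
    *.*.*.*)      return 0 ;;
    *)            return 1 ;;
  esac
}

# Point a peer at an IPv4 endpoint on the running interface.
# $1 interface, $2 peer public key, $3 IPv4 address, $4 UDP port
pin_endpoint_v4() {
  if ! is_ipv4_literal "$3"; then
    echo "refusing: '$3' is not an IPv4 literal" >&2
    return 1
  fi
  wg set "$1" peer "$2" endpoint "$3:$4"
}
```

A call would look like `pin_endpoint_v4 dev0 "$PEER_PUBLIC_KEY" 203.0.113.10 51820`. `wg set` only changes the running interface, so the same IPv4 address would also have to go into the peer's `Endpoint` line in the configuration file to survive a restart.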
