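We run non-clustered, single-node Cassandra 3.11.2 instances that support many independent customer environments. Recently we have seen many cases where a client script starts its Cassandra instance successfully, but the client then cannot connect to it right away because of a "Connection refused" error. Below is an excerpt from the client log showing the startup message and the connection error. The client uses Java driver version 3.0.1.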
2023-08-11 00:21:39,669 DEBUG [main:2cb1] STDOUT - 00:21:39.668 [main] DEBUG com.datastax.driver.core.Cluster - Starting new cluster with contact points [<host>:18512]
...
2023-08-11 00:21:39,915 DEBUG [cluster1-nio-worker-0:2cb1] STDOUT - 00:21:39.915 [cluster1-nio-worker-0] DEBUG com.datastax.driver.core.Connection - Connection[<host>, inFlight=0, closed=false] Error connecting to <host>:18512 (Connection refused: <host>:18512)
2023-08-11 00:21:39,920 DEBUG [cluster1-nio-worker-0:2cb1] STDOUT - 00:21:39.920 [cluster1-nio-worker-0] DEBUG com.datastax.driver.core.Host.STATES - Defuncting Connection[<host>:18512-1, inFlight=0, closed=false] because: [<host>] Cannot connect
2023-08-11 00:21:39,921 DEBUG [cluster1-nio-worker-0:2cb1] STDOUT - 00:21:39.921 [cluster1-nio-worker-0] DEBUG com.datastax.driver.core.Host.STATES - [<host>:18512] preventing new connections for the next 1000 ms
2023-08-11 00:21:39,921 DEBUG [cluster1-nio-worker-0:2cb1] STDOUT - 00:21:39.921 [cluster1-nio-worker-0] DEBUG com.datastax.driver.core.Host.STATES - [<host>:18512] Connection[<host>:18512-1, inFlight=0, closed=false] failed, remaining = 0
2023-08-11 00:21:39,922 DEBUG [cluster1-nio-worker-0:2cb1] STDOUT - 00:21:39.921 [cluster1-nio-worker-0] DEBUG com.datastax.driver.core.Connection - Connection[<host>:18512-1, inFlight=0, closed=true] closing connection
2023-08-11 00:21:39,931 DEBUG [main:2cb1] STDOUT - 00:21:39.931 [main] DEBUG c.d.driver.core.ControlConnection - [Control connection] error on <host>:18512 connection, no more host to try
com.datastax.driver.core.exceptions.TransportException: [<host>] Cannot connect
at com.datastax.driver.core.Connection$1.operationComplete(Connection.java:158) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.driver.core.Connection$1.operationComplete(Connection.java:141) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:276) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:292) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_265]
Caused by: java.net.ConnectException: Connection refused: <host>:18512
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_265]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[na:1.8.0_265]
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) ~[cassandra-driver-core-3.0.1-shaded.jar:na]
... 6 common frames omitted
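Startup and connection worked fine for a long time before this, but we found that the error was triggered by a change in the resolution order of the multiple DNS servers used to locate the Cassandra node. One DNS server responds about 0.2 ms faster than the other, and the error occurs when the slower server comes first in the sequence. This looks like a subtle timing issue. We see no errors in the server-side system.log or debug.log.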
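To check whether the host name resolves to the same address regardless of DNS ordering, a lookup like the following could be run on the client host. This is only a diagnostic sketch: `DnsCheck` and `resolveAll` are hypothetical names, not part of the driver, and it assumes the JVM uses the same resolver configuration as the client script.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of the Cassandra driver.
final class DnsCheck {

    // Returns every address the JVM resolves for the given host name,
    // or an empty list if the name cannot be resolved.
    static List<String> resolveAll(String hostname) {
        List<String> addresses = new ArrayList<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(hostname)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Leave the list empty so the caller can see that resolution failed.
        }
        return addresses;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: DnsCheck <hostname>");
            return;
        }
        long start = System.nanoTime();
        List<String> addresses = resolveAll(args[0]);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println(args[0] + " -> " + addresses + " in " + elapsedMicros + " us");
    }
}
```

If the printed addresses differ between the two DNS orderings, the "Connection refused" may come from connecting to a different address rather than from the 0.2 ms timing difference itself.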
My questions are:
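Can such a tiny difference in DNS resolution time really cause a Cassandra connection to fail, or is something else going on? We do not want Cassandra connectivity to depend on DNS server ordering.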
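In the 3.0.1 driver I see the RetryPolicy and ReconnectionPolicy classes and how custom implementations can change their behavior. After reading the driver code, however, I do not think either of these options affects the initial connection to Cassandra; they appear to apply only to downstream recovery scenarios, when a query fails or a connection is lost. Is that right? Or would either of these options actually help?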
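I see that the newer 4.13.0 driver has more configuration options in this area, including the advanced.reconnect-on-init property, which looks relevant. Would upgrading the driver and setting this property fix the problem?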
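For the initial-connection case raised in the questions above, one workaround that does not depend on the driver version would be to wait until the native transport port accepts TCP connections before building the Cluster. The following is a minimal sketch under that assumption; `PortProbe`, `waitForPort`, and the attempt/delay parameters are hypothetical and are not driver APIs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of the Cassandra driver.
final class PortProbe {

    // Tries to open a TCP connection to host:port up to maxAttempts times,
    // sleeping delayMillis between attempts. Returns true on the first
    // successful connect, false if every attempt fails or the thread is interrupted.
    static boolean waitForPort(String host, int port, int maxAttempts,
                               long delayMillis, int connectTimeoutMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Socket socket = new Socket()) {
                // Constructing the address performs the name lookup
                // (subject to the JVM's DNS cache).
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
                return true;
            } catch (IOException e) {
                // Connection refused, timed out, or host unresolved: try again.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: PortProbe <host> <port>");
            return;
        }
        boolean ready = waitForPort(args[0], Integer.parseInt(args[1]), 10, 1000L, 2000);
        System.out.println(ready ? "port is accepting connections" : "port never became reachable");
    }
}
```

The client script would call `waitForPort` with the contact point's host and port and only create the driver's Cluster once it returns true. This only confirms that the port is listening; it does not show that Cassandra is ready to serve queries.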
Any other suggestions are welcome!