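One of our applications uses Elasticsearch (1.4.4) as an in-memory cache. The application is a Java webapp deployed on Tomcat 7 with Oracle Java 1.7. The Elasticsearch instance is a single-node setup deployed on the same server.
Since Elasticsearch 1.3.3, we see about 40 MBit/s in and out on the loopback interface between the application and the Elasticsearch node while the application is idle.
That is not much, but it causes a noticeable load on systems that are otherwise flatlining. I don't have a production system with this application at hand, so I can't really say how it behaves in production.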
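Capturing the traffic with tcpdump and analysing it in Wireshark shows that the Elasticsearch client in the application is constantly asking the node for cluster/node/info,
and each request produces a 10k answer.
It may be completely unrelated, but enabling logging on the server and the client gave us the following.
Elasticsearch server log: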
[2015-05-12 14:45:01,600][INFO ][node ] [Illyana Rasputin] initializing ...
[2015-05-12 14:45:01,608][INFO ][plugins ] [Illyana Rasputin] loaded [], sites []
[2015-05-12 14:45:06,666][INFO ][node ] [Illyana Rasputin] initialized
[2015-05-12 14:45:06,667][INFO ][node ] [Illyana Rasputin] starting ...
[2015-05-12 14:45:06,828][INFO ][transport ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.24.1.128:9300]}
[2015-05-12 14:45:06,851][INFO ][discovery ] [Illyana Rasputin] bkbo_index/TITPDFdtR6SXX5EeOXaidg
[2015-05-12 14:45:09,892][INFO ][cluster.service ] [Illyana Rasputin] new_master [Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[/10.24.1.128:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-12 14:45:09,943][INFO ][http ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.24.1.128:9200]}
[2015-05-12 14:45:09,944][INFO ][node ] [Illyana Rasputin] started
[2015-05-12 14:45:11,283][INFO ][gateway ] [Illyana Rasputin] recovered [2] indices into cluster_state
Elasticsearch client log:
2015-05-12 14:46:40,683 INFO [localhost-startStop-1] PluginsService:<init>:151 [Antiphon the Overseer] loaded [], sites []
2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms]
2015-05-12 14:46:41,594 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Antiphon the Overseer] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,625 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,655 INFO [localhost-startStop-1] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [0] timed out after [6ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,658 DEBUG [localhost-startStop-1] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,661 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,669 INFO [elasticsearch[Antiphon the Overseer][generic][T#1]] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [1] timed out after [5ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,670 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,676 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,677 WARN [elasticsearch[Antiphon the Overseer][transport_client_worker][T#2]{New I/O worker #2}] TransportService$Adapter:remove:280 [Antiphon the Overseer] Received response for a request that has timed out, sent [14ms] ago, timed out [9ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]], id [1]
2015-05-12 14:46:41,682 INFO [localhost-startStop-1] PluginsService:<init>:151 [Ricochet] loaded [], sites []
2015-05-12 14:46:41,722 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Ricochet] node_sampler_interval[5ms]
2015-05-12 14:46:41,733 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Ricochet] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,734 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,759 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,760 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
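Yes, this application has two client connections, which should be fine (according to the developers). These disconnect/reconnect cycles happen about once a minute.
Any clues? I have already set discovery.zen.ping.multicast.enabled: false.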
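Your client appears to have joined the cluster. That is fine, although if you use Kibana 4 you may get complaints from Kibana (I'm not sure whether those complaints persisted beyond the 4 beta).
From your client logs: node_sampler_interval[5ms]
Sampling the nodes in the cluster every 5ms seems pretty aggressive. I haven't checked what the default is, but my guess is that milliseconds were configured where seconds were expected.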
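A rough back-of-envelope check supports the suspicion that the sampler explains the traffic. The sketch below is only an estimate under stated assumptions: "10k" is taken to mean about 10,000 bytes per node-info response, both clients are assumed to sample strictly every 5 ms, and the size of the requests is ignored. The class and method names are made up for illustration.

```java
public class SamplerTraffic {
    // Estimated response traffic in Mbit/s for `clients` clients that each
    // request node info every `intervalMs` milliseconds and receive
    // `responseBytes` bytes per answer. Request size is ignored (assumption).
    static double estimateMbitPerSecond(int clients, double intervalMs, long responseBytes) {
        double requestsPerSecond = clients * (1000.0 / intervalMs);
        double bitsPerSecond = requestsPerSecond * responseBytes * 8;
        return bitsPerSecond / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Two clients, 10,000-byte answers (assumed), 5 ms versus 5 s sampling.
        System.out.println("5 ms interval: " + estimateMbitPerSecond(2, 5, 10_000) + " Mbit/s");
        System.out.println("5 s interval:  " + estimateMbitPerSecond(2, 5000, 10_000) + " Mbit/s");
    }
}
```

With these assumptions the 5 ms case comes out at roughly 32 Mbit/s, the same order of magnitude as the 40 MBit/s observed on the loopback interface, while a 5 s interval would be about a thousand times lower.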
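At this point you need to look at the settings for the client API, although the client may pick this setting up from the cluster (since it is becoming part of the cluster).
Presumably you are using the Java API provided by elastic.co?
Have you perhaps configured
client.transport.nodes_sampler_interval
anywhere? And are you using compatible client and server versions, as required by the documentation for the Java client API?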
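I wouldn't be surprised if the unit of measure changed between versions, although the documentation does say the default is
5s
Check your elasticsearch.yml and your code for
node_sampler_interval
. You may need to replace a bare 5 with 5s.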
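To illustrate why a bare 5 and 5s can end up three orders of magnitude apart, here is a minimal sketch of a time-value parser. It is not Elasticsearch's code: the class IntervalParser and the method toMillis are hypothetical, and the rule that a number without a unit suffix is read as milliseconds is an assumption that matches the guess above.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

public class IntervalParser {
    // Hypothetical helper: converts "5", "5ms", "5s" or "2m" to milliseconds.
    // ASSUMPTION: a value without a unit suffix is treated as milliseconds.
    static long toMillis(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("ms")) {
            // Check "ms" before "s", otherwise "5ms" would be read as seconds.
            return Long.parseLong(v.substring(0, v.length() - 2).trim());
        }
        if (v.endsWith("s")) {
            return TimeUnit.SECONDS.toMillis(Long.parseLong(v.substring(0, v.length() - 1).trim()));
        }
        if (v.endsWith("m")) {
            return TimeUnit.MINUTES.toMillis(Long.parseLong(v.substring(0, v.length() - 1).trim()));
        }
        return Long.parseLong(v); // bare number: assumed to be milliseconds
    }

    public static void main(String[] args) {
        for (String s : new String[] {"5", "5ms", "5s"}) {
            System.out.println(s + " -> " + toMillis(s) + " ms");
        }
    }
}
```

Under that assumption, a setting written as 5 would produce exactly the node_sampler_interval[5ms] seen in the client log, whereas 5s would give the documented five-second interval.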