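We have created a GKE cluster, and we are receiving errors from gke-metrics-agent. The errors appear roughly every 30 minutes, and it is always the same 62 errors.
All of the errors carry the label k8s-pod/k8s-app: "gke-metrics-agent".
The first error is: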
error exporterhelper/queued_retry.go:245 Exporting failed. Try enabling retry_on_failure config option. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."
This error is followed by these lines, in this order:
- go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
- /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:245
- go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
- /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/metrics.go:120
There are about 40 entries like these. Two errors that stand out are:
- error exporterhelper/queued_retry.go:175 Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures. {"kind": "exporter", "name": "googlecloud", "dropped_items": 19}
- warn batchprocessor/batch_processor.go:184 Sender failed {"kind": "processor", "name": "batch", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}
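In case it helps with triage, here is a minimal Python sketch for grouping lines like the ones above by level and source location. It assumes each log line has the shape `<level> <file.go:line> <message> <JSON fields>`, which is only inferred from the lines quoted here; the names `parse_line` and `summarize` are made up for this example and are not part of gke-metrics-agent.

```python
import json
import re
from collections import Counter

# Assumed line shape (inferred from the quoted log lines, not from any spec):
#   <level> <path/file.go:line> <message> {<JSON fields>}
LINE_RE = re.compile(r"^(?P<level>\w+)\s+(?P<source>\S+\.go:\d+)\s+(?P<rest>.*)$")


def parse_line(line):
    """Split one log line into level, source, message, and JSON fields.

    Returns None for lines that do not match the assumed shape,
    such as the stack-trace lines that follow the first error.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    brace = rest.find("{")
    message = rest if brace == -1 else rest[:brace]
    fields = {}
    if brace != -1:
        try:
            # raw_decode tolerates trailing characters after the JSON object.
            fields, _ = json.JSONDecoder().raw_decode(rest[brace:])
        except json.JSONDecodeError:
            fields = {}  # truncated payload: keep the text parts only
    return {
        "level": match.group("level"),
        "source": match.group("source"),
        "message": message.strip(),
        "fields": fields,
    }


def summarize(lines):
    """Count parsed log lines per (level, source) pair."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["level"], record["source"])] += 1
    return counts


# The two lines that stand out, copied from above.
SAMPLE = [
    'error exporterhelper/queued_retry.go:175 Exporting failed. Dropping data. '
    'Try enabling sending_queue to survive temporary failures. '
    '{"kind": "exporter", "name": "googlecloud", "dropped_items": 19}',
    'warn batchprocessor/batch_processor.go:184 Sender failed '
    '{"kind": "processor", "name": "batch", "error": "rpc error: code = '
    'DeadlineExceeded desc = Deadline expired before operation could complete."}',
]

if __name__ == "__main__":
    for (level, source), count in summarize(SAMPLE).items():
        print(level, source, count)
```

Running `summarize` over a full 30-minute burst exported from the log viewer should show whether the 62 entries really come from only a few distinct source locations.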
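I tried searching Google for these errors, but I could not find anything. I cannot even find any documentation for gke-metrics-agent.
Things I have tried:
- Checking quotas
- Updating GKE to a newer version (the current version is 1.21.3-gke.2001)
- Updating the nodes
- Disabling all firewall rules
- Granting all permissions to the k8s nodes
I can provide more information about our Kubernetes cluster, but I don't know which details might matter for solving this issue.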