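I'm using eksctl to set up a cluster on EKS/AWS.
Following the guide in the EKS documentation, I use the default values for pretty much everything.
The cluster is created successfully, I update the Kubernetes configuration from the cluster, and I can run the various kubectl commands successfully. For example, "kubectl get nodes" shows the nodes in the "Ready" state.
I don't touch anything else. It is a clean, out-of-the-box cluster with no other changes made, and so far everything appears to work as expected. I don't deploy any applications to it; I just leave it alone.
The problem is that after a relatively short time (roughly 30 minutes after the cluster is created), the nodes change from "Ready" to "NotReady" and never recover.
The event log shows this (I redacted the IPs):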
LAST SEEN   TYPE     REASON                    OBJECT        MESSAGE
22m         Normal   Starting                  node/ip-[x]   Starting kubelet.
22m         Normal   NodeHasSufficientMemory   node/ip-[x]   Node ip-[x] status is now: NodeHasSufficientMemory
22m         Normal   NodeHasNoDiskPressure     node/ip-[x]   Node ip-[x] status is now: NodeHasNoDiskPressure
22m         Normal   NodeHasSufficientPID      node/ip-[x]   Node ip-[x] status is now: NodeHasSufficientPID
22m         Normal   NodeAllocatableEnforced   node/ip-[x]   Updated Node Allocatable limit across pods
22m         Normal   RegisteredNode            node/ip-[x]   Node ip-[x] event: Registered Node ip-[x] in Controller
22m         Normal   Starting                  node/ip-[x]   Starting kube-proxy.
21m         Normal   NodeReady                 node/ip-[x]   Node ip-[x] status is now: NodeReady
7m34s       Normal   NodeNotReady              node/ip-[x]   Node ip-[x] status is now: NodeNotReady
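The same events appear for the other nodes in the cluster.
Connecting to the instance and inspecting /var/log/messages shows this at the same time the node goes NotReady: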
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.259207 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.385044 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621271 3896 reflector.go:270] object-"kube-system"/"aws-node-token-bdxwv": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621320 3896 reflector.go:270] object-"kube-system"/"coredns": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.638850 3896 reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1beta1.RuntimeClass: the server has asked for the client to provide credentials (get runtimeclasses.node.k8s.io)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.707074 3896 reflector.go:270] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.711386 3896 reflector.go:270] object-"kube-system"/"coredns-token-67fzd": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.714899 3896 reflector.go:270] object-"kube-system"/"kube-proxy-config": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.720884 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868003 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868067 3896 controller.go:125] failed to ensure node lease exists, will retry in 200ms, error: Get https://[X]/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/ip-[x]?timeout=10s: write tcp 192.168.91.167:50866->34.249.27.158:443: use of closed network connection
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017157 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017182 3896 kubelet_node_status.go:372] Unable to update node status: update node status exceeds retry count
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.200053 3896 controller.go:125] failed to ensure node lease exists, will retry in 400ms, error: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.517193 3896 reflector.go:270] object-"kube-system"/"kube-proxy": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.729756 3896 controller.go:125] failed to ensure node lease exists, will retry in 800ms, error: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.752267 3896 reflector.go:126] object-"kube-system"/"aws-node-token-bdxwv": Failed to list *v1.Secret: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.824988 3896 reflector.go:126] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.899566 3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963756 3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963822 3896 reflector.go:126] object-"kube-system"/"kube-proxy-config": Failed to list *v1.ConfigMap: Unauthorized
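To check that every one of these errors is really an authentication failure and not something else hiding in the noise, I put together a rough Python sketch that tallies kubelet lines by source file and by whether the message looks auth-related. The sample text is three lines copied from the log above; the regular expression and the two marker strings are my own assumptions about what to match, not anything official.

```python
import re
from collections import Counter

# Three sample lines copied from the kubelet log above.
SAMPLE = """\
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.259207 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621271 3896 reflector.go:270] object-"kube-system"/"aws-node-token-bdxwv": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017182 3896 kubelet_node_status.go:372] Unable to update node status: update node status exceeds retry count
"""

# Log header as it appears above: severity letter + MMDD, time, pid, "file:line]".
KLOG = re.compile(
    r"kubelet: ([IWEF])\d{4} (\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ ([\w.]+):(\d+)\] (.*)"
)

# Assumption: these two phrases are what mark an authentication failure here.
AUTH_MARKERS = ("Unauthorized", "asked for the client to provide credentials")


def summarize(text):
    """Count kubelet log lines per (source file, is_auth_related)."""
    counts = Counter()
    for line in text.splitlines():
        match = KLOG.search(line)
        if match is None:
            continue  # not a kubelet line in the expected format
        source_file = match.group(3)
        message = match.group(5)
        is_auth = any(marker in message for marker in AUTH_MARKERS)
        counts[(source_file, is_auth)] += 1
    return counts


if __name__ == "__main__":
    for (source_file, is_auth), n in sorted(summarize(SAMPLE).items()):
        label = "auth-related" if is_auth else "other"
        print(f"{source_file:28} {label:13} {n}")
```

Run against the full /var/log/messages, the only non-auth line I would expect is the "exceeds retry count" message, which is a consequence of the Unauthorized retries right before it.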
The CloudWatch logs for the authenticator component show many of these messages:
time="2020-03-07T10:40:37Z" level=warning msg="access denied" arn="arn:aws:iam::[ACCOUNT_ID]:role/AmazonSSMRoleForInstancesQuickSetup" client="127.0.0.1:50132" error="ARN is not mapped: arn:aws:iam::[ACCOUNT_ID]:role/amazonssmroleforinstancesquicksetup" method=POST path=/authenticate
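I confirmed via the IAM console that the role does exist.
Clearly the node is reporting NotReady because of these authentication failures.
Is this some authentication token that expires after roughly 30 minutes? If so, shouldn't a new token be requested automatically? Or am I supposed to set up something else?
I'm surprised that a fresh cluster created by eksctl would show this problem.
What am I missing?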
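To see exactly which identity the authenticator is rejecting, I also wrote a small Python sketch that splits the authenticator log line into key/value fields and compares the ARN against a list of mapped roles. As far as I understand, on EKS that list comes from the mapRoles section of the aws-auth ConfigMap in kube-system. The MAPPED_ROLES entry below is a hypothetical placeholder, not my real node role, and the case-insensitive comparison is only my guess based on the lowercased ARN in the error field.

```python
import re

# One line adapted from the authenticator log above.
LINE = (
    'time="2020-03-07T10:40:37Z" level=warning msg="access denied" '
    'arn="arn:aws:iam::[ACCOUNT_ID]:role/AmazonSSMRoleForInstancesQuickSetup" '
    'client="127.0.0.1:50132" '
    'error="ARN is not mapped: arn:aws:iam::[ACCOUNT_ID]:role/amazonssmroleforinstancesquicksetup" '
    "method=POST path=/authenticate"
)

# Hypothetical placeholder for the role ARNs listed under mapRoles.
MAPPED_ROLES = ["arn:aws:iam::[ACCOUNT_ID]:role/example-node-instance-role"]

# key=value or key="quoted value"
PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_fields(line):
    """Split a key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, raw, quoted in PAIR.findall(line):
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


def role_name(arn):
    """Return the part of a role ARN after the last slash."""
    return arn.rsplit("/", 1)[-1]


def is_mapped(arn, mapped_arns):
    """Assumption: compare case-insensitively, as the lowercased error ARN suggests."""
    wanted = arn.lower()
    return any(wanted == candidate.lower() for candidate in mapped_arns)


if __name__ == "__main__":
    fields = parse_fields(LINE)
    print("rejected role:", role_name(fields["arn"]))
    print("mapped:", is_mapped(fields["arn"], MAPPED_ROLES))
```

What stands out to me is that the rejected identity is AmazonSSMRoleForInstancesQuickSetup, which is not a role I would expect the nodes to authenticate with in the first place.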