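I have a Terraform setup in which I create a new launch template and a node group. Without the launch template, everything works fine. With the launch template, the nodes become Ready, but the node group never finishes creating.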
main.tf
...
resource "aws_launch_template" "this" {
  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_type           = var.block_device_mappings.type
      volume_size           = var.block_device_mappings.size
      iops                  = var.block_device_mappings.iops
      kms_key_id            = var.block_device_mappings.kms_key_id
      encrypted             = var.block_device_mappings.encrypted
      delete_on_termination = var.block_device_mappings.delete_on_termination
    }
  }

  user_data = base64encode(templatefile("${path.module}/user_data.tpl", {
    cluster_endpoint           = var.cluster_endpoint
    certificate_authority_data = var.certificate_authority_data
    bootstrap_extra_args       = "--use-max-pods false"
    cluster_name               = var.cluster_name
  }))
}

resource "aws_eks_node_group" "this" {
  cluster_name    = var.cluster_name
  node_group_name = var.node_group_name
  node_role_arn   = var.node_group_arn
  instance_types  = [var.instance_type]
  subnet_ids = [
    for subnet in var.subnets : subnet.id
  ]
  capacity_type = var.capacity_type

  scaling_config {
    desired_size = var.desired_capacity
    max_size     = var.max_capacity
    min_size     = var.min_capacity
  }

  update_config {
    max_unavailable = 1
  }

  labels = var.node_group_labels

  dynamic "taint" {
    for_each = toset(var.node_group_taints)
    content {
      key    = taint.value.key
      value  = taint.value.value
      effect = taint.value.effect
    }
  }

  launch_template {
    id      = aws_launch_template.this.id
    version = aws_launch_template.this.latest_version
  }
}
...
user_data.tpl
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="/:/+++"

--/:/+++
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
/etc/eks/bootstrap.sh --apiserver-endpoint '${cluster_endpoint}' --b64-cluster-ca '${certificate_authority_data}' ${bootstrap_extra_args} '${cluster_name}'
--/:/+++--
kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-1-128.us-west-1.compute.internal   Ready    <none>   13m   v1.29.0-eks-5e0fdde
ip-192-168-1-140.us-west-1.compute.internal   Ready    <none>   13m   v1.29.0-eks-5e0fdde
ip-192-168-1-157.us-west-1.compute.internal   Ready    <none>   13m   v1.29.0-eks-5e0fdde
kubectl describe node ip-192-168-1-128.us-west-1.compute.internal
Name: ip-192-168-1-128.us-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.4xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-west-1
failure-domain.beta.kubernetes.io/zone=us-west-1a
k8s.io/cloud-provider-aws=cff041cdc91d38d182baa77beef8bf9f
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-1-128.us-west-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=m5.2xlarge
topology.kubernetes.io/region=us-west-1
topology.kubernetes.io/zone=us-west-1a
Annotations: alpha.kubernetes.io/provided-node-ip: 192.168.1.128
csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"ip-192-168-1-128.us-gov-west-1.compute.internal"}
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.1.128/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.42.7.192
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 14 Mar 2024 10:40:54 -0400
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-1-128.us-west-1.compute.internal
AcquireTime: <unset>
RenewTime: Thu, 14 Mar 2024 10:54:21 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 14 Mar 2024 10:41:24 -0400 Thu, 14 Mar 2024 10:41:24 -0400 CalicoIsUp Calico is running on this node
MemoryPressure False Thu, 14 Mar 2024 10:52:09 -0400 Thu, 14 Mar 2024 10:40:54 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 14 Mar 2024 10:52:09 -0400 Thu, 14 Mar 2024 10:40:54 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 14 Mar 2024 10:52:09 -0400 Thu, 14 Mar 2024 10:40:54 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 14 Mar 2024 10:52:09 -0400 Thu, 14 Mar 2024 10:41:18 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.1.128
InternalDNS: ip-192-168-1-128.us-west-1.compute.internal
Hostname: ip-192-168-1-128.us-west-1.compute.internal
Capacity:
cpu: 16
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 64333324Ki
pods: 110
Allocatable:
cpu: 15890m
ephemeral-storage: 18242267924
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 61334028Ki
pods: 110
System Info:
Machine ID: ec2821bfac66895c1abc29a47021fe76
System UUID: ec2821bf-ac66-895c-1abc-29a47021fe76
Boot ID: 356d15db-1436-4c45-af1e-6a668eddd8e0
Kernel Version: 5.10.210-201.852.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.29.0-eks-5e0fdde
Kube-Proxy Version: v1.29.0-eks-5e0fdde
ProviderID: aws:///us-west-1a/i-0874068c9ab354407
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-apiserver calico-apiserver-5f98fdb745-cf4xg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12m
calico-system calico-node-6c98k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13m
calico-system calico-typha-695fb789b5-sfq4n 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13m
calico-system csi-node-driver-qtczs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13m
ionic-system tigera-operator-967f9fc76-tghqf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 15m
kube-system kube-proxy-cnlnc 100m (0%) 0 (0%) 0 (0%) 0 (0%) 13m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 100m (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 13m kube-proxy
Normal Synced 13m cloud-node-controller Node synced successfully
Normal Starting 13m kubelet Starting kubelet.
Warning InvalidDiskCapacity 13m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 13m (x2 over 13m) kubelet Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 13m (x2 over 13m) kubelet Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 13m (x2 over 13m) kubelet Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 13m kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 13m node-controller Node ip-192-168-1-128.us-west-1.compute.internal event: Registered Node ip-192-168-1-128.us-west-1.compute.internal in Controller
Normal NodeReady 13m kubelet Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeReady
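What is even more confusing is that if I remove --apiserver-endpoint '${cluster_endpoint}' --b64-cluster-ca '${certificate_authority_data}' from the .tpl file, everything works, except that the max pod count is wrong (it drops to 58 because of the instance type).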
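For context on the figure 58: when bootstrap.sh is left to compute max pods itself, it uses the VPC CNI limit, which to my understanding follows ENIs × (IPv4 addresses per ENI − 1) + 2. The sketch below reproduces that arithmetic in Terraform. The ENI and address counts are assumptions for an instance type with 4 ENIs and 15 addresses each (which I believe matches m5.2xlarge); the local names are hypothetical and not part of the original configuration.

```hcl
locals {
  # Assumed limits for the instance type; check the EC2 documentation for yours.
  enis_per_instance = 4
  ipv4_per_eni      = 15

  # VPC CNI-style limit: one address per ENI is reserved for the ENI itself,
  # plus 2 for pods that use host networking.
  vpc_cni_max_pods = local.enis_per_instance * (local.ipv4_per_eni - 1) + 2
}

output "vpc_cni_max_pods" {
  value = local.vpc_cni_max_pods # 4 * 14 + 2 = 58
}
```

With Calico providing pod networking, this ENI-based limit does not apply, which is why the template passes --use-max-pods false and the node reports the kubelet default of 110 pods.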
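Notes:
- We use Calico rather than the AWS node CNI. This is a project requirement, so I am stuck with it.
- The only oddity that stands out so far is that when I run this without the above parameters, the taints are populated, and when I run it with them, the taints are not populated, but that may be a red herring.
Any suggestions are appreciated.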
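It took a few days of digging, but the answer is that the launch template did not apply a label that AWS requires: eks.amazonaws.com/nodegroup. Below is the new template we are using, which supplies these labels.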
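As a rough illustration of how such a label could be supplied (a sketch under stated assumptions, not necessarily the author's template), the label can be passed to the kubelet through the --kubelet-extra-args option of bootstrap.sh, reusing the bootstrap_extra_args template variable from the question. The local names are hypothetical, and AWS may expect further eks.amazonaws.com/* labels that are not shown here.

```hcl
# Already referenced in the question's node group; declared here so the sketch is self-contained.
variable "node_group_name" {
  type = string
}

locals {
  # Label that the managed node group looks for on registered nodes.
  # Assumption: only this one label is added; extend the list if more are needed.
  node_labels = "eks.amazonaws.com/nodegroup=${var.node_group_name}"

  # Keeps the original "--use-max-pods false" and appends the kubelet label flag.
  bootstrap_extra_args = "--use-max-pods false --kubelet-extra-args '--node-labels=${local.node_labels}'"
}

output "bootstrap_extra_args" {
  value = local.bootstrap_extra_args
}
```

In the question's launch template, local.bootstrap_extra_args would take the place of the literal "--use-max-pods false" string passed to templatefile, so user_data.tpl itself would not need to change.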