I have the following external Google Cloud load balancer configuration:
- GlobalNetworkEndpointGroupToClusterByIp is an Internet NEG of type INTERNET_IP_PORT pointing to the IP of a Kubernetes cluster.
- GlobalNetworkEndpointGroupToManagedS3 is an Internet NEG of type INTERNET_FQDN_PORT pointing to an S3 service managed by Yandex.
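For some reason, some of the backend services don't work. When I try to connect to them, they respond with an HTML page showing 502 Server Error:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The logs of a failing backend service always contain the following error: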
jsonPayload: {
  cacheId: "GRU-c0ee45d8"
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  statusDetails: "failed_to_pick_backend"
}
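Requests to the backend services fail within 1 ms (as stated in the logs), so it seems they fail immediately without even trying to connect to my Kubernetes cluster's IP or to the managed S3.
At the time of posting this question, the S3 and Imgproxy backend services are healthy, but the others don't work.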
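If I redeploy everything, a different set of services may fail, for example:
- API and Docs will work, the others will fail
- API, Docs, FPS and Imgproxy will work, S3 will fail
- S3 will work, the others will fail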
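So it is completely random, and I don't understand why it happens. If I'm lucky enough, all backend services work after a redeploy. It is also possible that none of them work.
The Kubernetes cluster works fine and accepts connections, and the managed S3 works fine too. It looks like a bug, but I couldn't find anything about it on Google.
Here is what my Terraform configuration looks like: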
resource "google_compute_global_network_endpoint_group" "kubernetes-cluster" {
  name = "kubernetes-cluster-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_IP_PORT"
  depends_on = [
    module.kubernetes-resources
  ]
}
resource "google_compute_global_network_endpoint" "kubernetes-cluster" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.kubernetes-cluster.name
  port = 80
  ip_address = yandex_vpc_address.kubernetes.external_ipv4_address.0.address
}
resource "google_compute_global_network_endpoint_group" "s3" {
  name = "s3-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_FQDN_PORT"
}
resource "google_compute_global_network_endpoint" "s3" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.s3.name
  port = 443
  fqdn = trimprefix(local.s3.endpoint, "https://")
}
resource "google_compute_backend_service" "s3" {
  name = "s3-${var.ENVIRONMENT_NAME}"
  backend {
    group = google_compute_global_network_endpoint_group.s3.self_link
  }
  custom_request_headers = [
    "Host:${google_compute_global_network_endpoint.s3.fqdn}"
  ]
  cdn_policy {
    cache_key_policy {
      include_host = true
      include_protocol = false
      include_query_string = false
    }
  }
  enable_cdn = true
  load_balancing_scheme = "EXTERNAL"
  log_config {
    enable = true
    sample_rate = 1.0
  }
  port_name = "https"
  protocol = "HTTPS"
  timeout_sec = 60
}
resource "google_compute_backend_service" "imgproxy" {
  name = "imgproxy-${var.ENVIRONMENT_NAME}"
  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }
  cdn_policy {
    cache_key_policy {
      include_host = true
      include_protocol = false
      include_query_string = false
    }
  }
  enable_cdn = true
  load_balancing_scheme = "EXTERNAL"
  log_config {
    enable = true
    sample_rate = 1.0
  }
  port_name = "http"
  protocol = "HTTP"
  timeout_sec = 60
}
resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"
  custom_request_headers = [
    "Access-Control-Allow-Origin:${var.ALLOWED_CORS_ORIGIN}"
  ]
  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }
  load_balancing_scheme = "EXTERNAL"
  log_config {
    enable = true
    sample_rate = 1.0
  }
  port_name = "http"
  protocol = "HTTP"
  timeout_sec = 60
}
resource "google_compute_backend_service" "front" {
  name = "front-${var.ENVIRONMENT_NAME}"
  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }
  cdn_policy {
    cache_key_policy {
      include_host = true
      include_protocol = false
      include_query_string = true
    }
  }
  enable_cdn = true
  load_balancing_scheme = "EXTERNAL"
  log_config {
    enable = true
    sample_rate = 1.0
  }
  port_name = "http"
  protocol = "HTTP"
  timeout_sec = 60
}
resource "google_compute_url_map" "default" {
  name = "default-${var.ENVIRONMENT_NAME}"
  default_service = google_compute_backend_service.front.self_link
  host_rule {
    hosts = [
      local.hosts.api,
      local.hosts.fps
    ]
    path_matcher = "api"
  }
  host_rule {
    hosts = [
      local.hosts.s3
    ]
    path_matcher = "s3"
  }
  host_rule {
    hosts = [
      local.hosts.imgproxy
    ]
    path_matcher = "imgproxy"
  }
  path_matcher {
    default_service = google_compute_backend_service.api.self_link
    name = "api"
  }
  path_matcher {
    default_service = google_compute_backend_service.s3.self_link
    name = "s3"
  }
  path_matcher {
    default_service = google_compute_backend_service.imgproxy.self_link
    name = "imgproxy"
  }
  test {
    host = local.hosts.docs
    path = "/"
    service = google_compute_backend_service.front.self_link
  }
  test {
    host = local.hosts.api
    path = "/"
    service = google_compute_backend_service.api.self_link
  }
  test {
    host = local.hosts.fps
    path = "/"
    service = google_compute_backend_service.api.self_link
  }
  test {
    host = local.hosts.s3
    path = "/"
    service = google_compute_backend_service.s3.self_link
  }
  test {
    host = local.hosts.imgproxy
    path = "/"
    service = google_compute_backend_service.imgproxy.self_link
  }
}
# See: https://github.com/hashicorp/terraform-provider-google/issues/5356
resource "random_id" "managed-certificate-name" {
  byte_length = 4
  prefix = "default-${var.ENVIRONMENT_NAME}-"
  keepers = {
    domains = join(",", values(local.hosts))
  }
}
resource "google_compute_managed_ssl_certificate" "default" {
  name = random_id.managed-certificate-name.hex
  lifecycle {
    create_before_destroy = true
  }
  managed {
    domains = values(local.hosts)
  }
}
resource "google_compute_ssl_policy" "default" {
  name = "default-${var.ENVIRONMENT_NAME}"
  profile = "MODERN"
}
resource "google_compute_target_https_proxy" "default" {
  name = "default-${var.ENVIRONMENT_NAME}"
  url_map = google_compute_url_map.default.self_link
  ssl_policy = google_compute_ssl_policy.default.self_link
  ssl_certificates = [
    google_compute_managed_ssl_certificate.default.self_link
  ]
}
resource "google_compute_global_forwarding_rule" "default" {
  name = "default-${var.ENVIRONMENT_NAME}"
  load_balancing_scheme = "EXTERNAL"
  port_range = "443-443"
  target = google_compute_target_https_proxy.default.self_link
}
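UPD. I found that recreating the NEGs fixes the problem:
- Wait until Terraform finishes the deployment.
- Create a NEG with the same configuration through the Google Cloud Platform Console.
- Edit the backend service to use the newly created NEG.
- It works!
But this is definitely a hack, and there seems to be no way to automate it with Terraform. I will keep investigating the issue.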
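Glad to hear that your issue has been resolved. I understand that you fixed it by creating the NEG manually through the GCP Console and then editing the backend service, rather than with Terraform. The most likely cause seems to be a race condition. In Terraform we usually define resources as a chain, so each resource depends on another. Here, both the backend service creation and the network endpoint (NE) attachment depend on the NEG creation, but unless the backend service references the endpoint resource, Terraform does not order the two relative to each other.
So, when creating the backend service in Terraform, we have to declare it as dependent on the NE attachment with the depends_on meta-argument [1], i.e. the backend service should be created only after the NE has been attached.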
[1] https://www.terraform.io/docs/language/meta-arguments/depends_on.html
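As a minimal sketch of that suggestion, the api backend service from the question could declare an explicit dependency on the endpoint resource. The resource names are taken from the question's configuration, and the block is trimmed to its essential arguments. It is an assumption, not something verified here, that this ordering alone removes the failed_to_pick_backend errors.

```hcl
resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  load_balancing_scheme = "EXTERNAL"
  port_name             = "http"
  protocol              = "HTTP"
  timeout_sec           = 60

  # Assumption: creating the backend service only after the endpoint has
  # been attached to the NEG avoids the suspected race condition.
  depends_on = [
    google_compute_global_network_endpoint.kubernetes-cluster
  ]
}
```

The same depends_on entry would apply to the imgproxy and front backend services, which use the same NEG. The s3 backend service already references google_compute_global_network_endpoint.s3.fqdn in its Host header, so Terraform already creates it after that endpoint.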
Hope this clarifies your doubt.