Na minha empresa, construí um cluster K8s com Terraform e configurei um sistema de log com EFK (Elasticsearch, Fluent-bit, Kibana).
k8s e Elasticsearch usam EKS e Opensearch Servcie da AWS (ES 7.10), e o Fluent-bit foi instalado separadamente como um Daemonset por meio do Helm.
https://artifacthub.io/packages/helm/fluent/fluent-bit
No caso do Fluent-bit Config, ele está estruturado da seguinte forma.
config:
inputs: |
[INPUT]
Name tail
Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*, / var/log/containers/aws-load-balancer*, /var/log/containers/cert-manager*, /var/log/containers/coredns*, /var/log/containers/ebs-csi*, /var/ log/containers/grafana*, /var/log/containers/karpenter*, /var/log/containers/metrics-server*, /var/log/containers/prometheus*
Path /var/log/containers/*.log
Tag kube.*
Mem_Buf_Limit 50MB
multiline.parser python-custom
outputs: |
[OUTPUT]
Name es
Match kube.*
Host ${es_domain_name}
Port 443
TLS On
AWS_Auth On
AWS_Region ${region}
Retry_Limit 6
Replace_Dots On
Trace_Error On
filters: |
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Merge_Log_Key log_processed
Keep_LogOff
K8S-Logging.Parser On
K8S-Logging.Exclude On
customParsers: |
[MULTILINE_PARSER]
name python-custom
type regex
flush_timeout 1000
rule "start_state" "/stderr F(.*)/" "cont"
rule "cont" "/stderr F(.*)/" "cont"
No geral funcionou bem, mas um dia, após uma determinada situação, percebi que todos os logs coletados pelo Fluent-bit apontavam para o mesmo timestamp.
Anteriormente, a saída do log do pod era registrada com um carimbo de data/hora preciso, mas depois que o pod é recarregado, o log continua a ser registrado com um @timestamp fixo e, como os logs são acumulados no fuso horário anterior, o seguinte fenômeno aparece quando visto em kibana.
Os logs são todos transmitidos ao ES em tempo real, mas como todos são coletados em um horário fixo, as consultas adequadas não podem ser executadas.
insira a descrição da imagem aqui
// Unlike the timestamp output in the actual application, the timestamp recorded in Fluent-bit remains the same.
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:48:55...[2023-10-16 09:48:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:48:55...[2023-10-16 09:48:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:53:55...[2023-10-16 09:53:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:55...[2023-10-16 09:58:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:55...[2023-10-16 09:58:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T10:03:03...[2023-10-16 10:03:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T10:03:55...[2023-10-16 10:03:55]...
Esse problema ocorre ao atualizar e recarregar pods em execução para cada serviço. Confirmei que o log inotify_fs_add foi enviado para Fluent-bit para o pod recém-criado e nenhum outro log de erro pôde ser confirmado.
Recarregamento de pod
│ backend-api-deployment-5d9fc55496-8mppx ● 1/1 0 Running
│ backend-api-deployment-57d68dd4f5-djn8x ● 1/1 0 Terminating
Log de bits fluentes
│ Fluent Bit v2.1.8
│ * Copyright (C) 2015-2022 The Fluent Bit Authors
│ * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
│ * https://fluentbit.io
│
│ [2023/10/16 16:05:00] [ info] [fluent bit] version=2.1.8, commit=1d83649441, pid=1
│ [2023/10/16 16:05:00] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
│ [2023/10/16 16:05:00] [ info] [cmetrics] version=0.6.3
│ [2023/10/16 16:05:00] [ info] [ctraces ] version=0.3.1
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] initializing
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] multiline core started
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] token updated
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
│ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #0 started
│ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #1 started
│ [2023/10/16 16:05:00] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
│ [2023/10/16 16:05:00] [ info] [sp] stream processor started
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=41916664 watch_fd=1 name=/var/log/containers/backend-api-deployment-57d68dd4f5-djn8x_backend-dev_backend-api-f950104ab9f300586803316ad7beef5c9506db9fceeb4af0dc0575b3b9bacfd7.log
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=17488499 watch_fd=2 name=/var/log/containers/backend-api-deployment-6dc5b97645-tqwxh_backend-staging_backend-api-23f14f74af6cc163f233499ba92a2746cb2f263c9379ba53fc47cd3d76e00bbb.log
...
│ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response:
│ 429 Too Many Requests /_bulk
│
│ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.870529031.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response:
│ 429 Too Many Requests /_bulk
│
│ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.276173730.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 04:49:07] [ info] [engine] flush chunk '1-1697518140.870529031.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 04:49:08] [ info] [engine] flush chunk '1-1697518140.276173730.flb' succeeded at retry 1: task_id=0, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 05:08:50] [ info] [input:tail:tail.0] inotify_fs_add(): inode=37547666 watch_fd=8 name=/var/log/containers/backend-api-deployment-5d9fc55496-8mppx_backend-dev_backend-api-839c128d6227f797722496a992ca9149b95c366997c2b1ec28ceb65490582e3b.log
│ [2023/10/17 05:08:52] [ info] [filter:kubernetes:kubernetes.0] token updated
│ [2023/10/17 05:09:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=41916664 watch_fd=1
│ [2023/10/17 05:09:37] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=17488499 watch_fd=2
O que é interessante é que se todos os Daemon Sets de bits fluentes forem eliminados manualmente e revividos por meio de autocura, todos eles serão temporariamente coletados normalmente novamente.
Depois disso, se você atualizar e recarregar o pod de serviço novamente, o mesmo problema ocorrerá novamente.
O que causa esse problema? Como posso consertar isso? E se houver alguma configuração recomendada do Fluent-bit, conselhos adicionais serão bem-vindos.
obrigado!