Nota: Há um TL;DR no final desta pergunta.
Eu tenho um script de backup noturno que funciona de maneira confiável há 6 anos e, há dois meses, começou a falhar no upload para o meu servidor de arquivamento (apenas cerca de 1 em cada 4 tentativas são bem-sucedidas). Isso começou a acontecer quando o tamanho do arquivo de backup ultrapassou cerca de 440 MB (uploads bem-sucedidos levam de 20 a 40 segundos para serem concluídos). Outros arquivos menores (> 200 MB) são carregados com sucesso todas as vezes. O início das falhas não coincidiu com nenhuma alteração de configuração ou atualização dos servidores em questão.
Aqui estão os detalhes do arquivo que estou enviando:
-rw-r--r-- 1 backup backup 486948951 Jun 12 13:20 snapshot-20190612.zip
Esses arquivos são carregados no meu servidor com um curl PUT
(veja o comando abaixo). O servidor de origem é uma instância do Amazon Linux na Virgínia e o servidor de destino é uma instância do Amazon Linux no Ohio (ambos estão na região leste dos EUA).
O serviço da web que trata do upload é uma instância Apache HTTPD 2.4.39 + mod24_ssl/OpenSSL 1.0.2k -> mod_jk 1.2.42 -> Apache Tomcat 8.5.40 -> Artifactory 5.3.2 .
Eu capturei a página jk_status antes e depois de um desses erros ocorrer (antes/depois estão em um GIF animado para destacar as diferenças). O erro está sendo interpretado como um erro de cliente pelo mod_jk .
Meu arquivo mod/jk workers.properties :
worker.list=main_ajp13_worker
# Set properties for main_ajp13_worker (ajp13)
worker.main_ajp13_worker.type=ajp13
worker.main_ajp13_worker.host=localhost
worker.main_ajp13_worker.port=8009
A seguir estão a saída do console (detalhes de autorização limpos) de um upload com falha, seguido imediatamente por um upload bem-sucedido:
-bash-4.2$ curl -v -k -u backup:PASSWORD -H X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3 -H X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10 -T snapshot-20190612.zip https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
* Trying 123.222.100.132...
* TCP_NODELAY set
* Connected to repo.mydomain.org (123.222.100.132) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: OU=Domain Control Validated; OU=PositiveSSL Wildcard; CN=*.mydomain.org
* start date: May 28 00:00:00 2019 GMT
* expire date: May 27 23:59:59 2021 GMT
* issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
* SSL certificate verify ok.
* Server auth using Basic with user 'backup'
> PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1
> Host: repo.mydomain.org
> Authorization: Basic DELETED_TOKEN
> User-Agent: curl/7.61.1
> Accept: */*
> X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3
> X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10
> Content-Length: 486948951
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
* Closing connection 0
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
-bash-4.2$ curl -v -u backup:PASSWORD -H X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3 -H X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10 -T snapshot-20190612.zip https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
* Trying 123.222.100.132...
* TCP_NODELAY set
* Connected to repo.mydomain.org (123.222.100.132) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: OU=Domain Control Validated; OU=PositiveSSL Wildcard; CN=*.mydomain.org
* start date: May 28 00:00:00 2019 GMT
* expire date: May 27 23:59:59 2021 GMT
* issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
* SSL certificate verify ok.
* Server auth using Basic with user 'backup'
> PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1
> Host: repo.mydomain.org
> Authorization: Basic DELETED_TOKEN
> User-Agent: curl/7.61.1
> Accept: */*
> X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3
> X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10
> Content-Length: 486948951
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
* Closing connection 0
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
Com uma segunda (ou muitas vezes terceira ou quarta) tentativa de upload, ele eventualmente consegue:
-bash-4.2$ curl -v -u backup:PASSWORD -H X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3 -H X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10 -T snapshot-20190612.zip https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
* Trying 123.222.100.132...
* TCP_NODELAY set
* Connected to repo.mydomain.org (123.222.100.132) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: OU=Domain Control Validated; OU=PositiveSSL Wildcard; CN=*.mydomain.org
* start date: May 28 00:00:00 2019 GMT
* expire date: May 27 23:59:59 2021 GMT
* issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
* SSL certificate verify ok.
* Server auth using Basic with user 'backup'
> PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1
> Host: repo.mydomain.org
> Authorization: Basic DELETED_TOKEN
> User-Agent: curl/7.61.1
> Accept: */*
> X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3
> X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10
> Content-Length: 486948951
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 201 201
< Date: Wed, 12 Jun 2019 13:22:46 GMT
< Server: Apache
< Strict-Transport-Security: max-age=15768000
< X-Artifactory-Id: 9e49bf90cd4b7700:7ba45cf1:16b1db56e4d:-8000
< Location: https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
< Transfer-Encoding: chunked
< Content-Type: application/vnd.org.jfrog.artifactory.storage.itemcreated+json;charset=ISO-8859-1
<
{
"repo" : "resource-snapshot",
"path" : "/snapshot/20190612/snapshot-20190612.zip",
"created" : "2019-06-12T13:23:05.805Z",
"createdBy" : "backup",
"downloadUri" : "https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip",
"mimeType" : "application/zip",
"size" : "486948951",
"checksums" : {
"sha1" : "6b9525bf9bb572ec7f70d0d474cb6533e0091bd3",
"md5" : "6da0c444ad1b53dc1cf06784e204cb10"
},
"originalChecksums" : {
"sha1" : "6b9525bf9bb572ec7f70d0d474cb6533e0091bd3",
"md5" : "6da0c444ad1b53dc1cf06784e204cb10"
},
"uri" : "https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip"
* Connection #0 to host repo.mydomain.org left intact
}
Os logs do servidor não são particularmente informativos. Aqui estão os logs relevantes quando um upload falha:
/var/log/httpd/ssl_request_log
[24/Jun/2019:12:17:45 +0000] 52.1.100.190 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1" -
/var/log/httpd/repo.mydomain.org-ssl_access_log
140.1.100.101 - - [24/Jun/2019:12:17:45 +0000] "PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1" 400 -
catalina.out
Nenhuma entrada de registro.
/var/log/httpd/mod_jk.log
Estes são os logs quando o nível de log mod/jk é definido como info :
[Mon Jun 24 12:18:05 2019] [15679:140534117611584] [info] ajp_read_into_msg_buff::jk_ajp_common.c (1553): (main_ajp13_worker) receiving data from client failed. Connection aborted or network problems
[Mon Jun 24 12:18:05 2019] [15679:140534117611584] [info] ajp_process_callback::jk_ajp_common.c (2105): (main_ajp13_worker) Reading from client aborted or client network problems
[Mon Jun 24 12:18:05 2019] [15679:140534117611584] [info] ajp_service::jk_ajp_common.c (2778): (main_ajp13_worker) sending request to tomcat failed (unrecoverable), because of client read error (attempt=1)
[Mon Jun 24 12:18:05 2019] [15679:140534117611584] [info] jk_handler::mod_jk.c (2988): Aborting connection for worker=main_ajp13_worker
E aqui está a parte relevante das 2,4 milhões de entradas de log do mod_jk no nível de depuração , começando no HTTP 100 Continue e mostrando o início do upload do arquivo (você pode ver o número mágico do arquivo zip, 50 4B 03 04 ):
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0160 32 65 32 00 A0 08 00 09 34 39 36 32 37 38 36 30 - 2e2.....49627860
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0170 33 00 00 06 45 78 70 65 63 74 00 00 0C 31 30 30 - 3...Expect...100
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0180 2D 63 6F 6E 74 69 6E 75 65 00 0A 00 0F 41 4A 50 - -continue....AJP
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0190 5F 52 45 4D 4F 54 45 5F 50 4F 52 54 00 00 05 33 - _REMOTE_PORT...3
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 01a0 32 39 37 30 00 0A 00 0E 41 4A 50 5F 4C 4F 43 41 - 2970....AJP_LOCA
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 01b0 4C 5F 41 44 44 52 00 00 0C 31 37 32 2E 33 31 2E - L_ADDR...172.31.
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 01c0 31 33 2E 39 36 00 0A 00 10 4A 4B 5F 4C 42 5F 41 - 13.96....JK_LB_A
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 01d0 43 54 49 56 41 54 49 4F 4E 00 00 03 41 43 54 00 - CTIVATION...ACT.
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 01e0 0A 00 0F 53 53 4C 5F 43 4C 49 45 4E 54 5F 43 45 - ...SSL_CLIENT_CE
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 01f0 52 54 00 00 0F 53 53 4C 5F 43 4C 49 45 4E 54 5F - RT...SSL_CLIENT_
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0200 43 45 52 54 00 FF 00 00 00 00 00 00 00 00 00 00 - CERT............
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_send_request::jk_ajp_common.c (1782): (main_ajp13_worker) request body to send 496278603 - request body to resend 0
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_send_request::jk_ajp_common.c (1886): (main_ajp13_worker) sending 8186 bytes of request body
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): sending to ajp13 pos=4 len=8192 max=8192
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0000 12 34 1F FC 1F FA 50 4B 03 04 14 00 00 00 08 00 - .4....PK........
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0010 84 61 D8 4E 73 67 93 AF 7C 0A 00 00 E0 B0 00 00 - .a.Nsg..|.......
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0020 0A 00 1C 00 62 61 63 6B 75 70 2E 6C 6F 67 55 54 - ....backup.logUT
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0030 09 00 03 18 BE 10 5D 90 BD 10 5D 75 78 0B 00 01 - ......]...]ux...
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0040 04 1A 00 00 00 04 1A 00 00 00 ED 5C 4D 6F DC C8 - ...........\Mo..
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0050 11 BD E7 57 34 74 4A 80 D8 80 77 17 8B 45 80 1C - ...W4tJ...w..E..
.... [lots of encoded data trimmed out here, the next part shows re-filling the 8k buffer] ....
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_get_message::jk_ajp_common.c (1462): received from ajp13 pos=0 len=3 max=8192E D7 E9 E6 DD BB - 8...U?.er.>.....
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_get_message::jk_ajp_common.c (1462): 00000 06 1F FA 00 00 00 00 00 00 00 00 00 00 00 00 00 - .................
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): sending to ajp13 pos=4 len=8192 max=8192 2E 40 41 4C 94 - ..$z~YW.?1G.@AL.
[Mon Jun 24 14:37:08 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 0000 12 34 1F FC 1F FA C9 62 3E 48 C5 70 E9 05 E1 8D - .4.....b>H.p....
.... [lots of encoded data trimmed out here, the next part shows where re-filling the 8k buffer fails] ....
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 03e0 2D E5 36 F3 8C 12 8C 51 69 66 21 AE A5 58 BE 6E - -.6....Qif!..X.n
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_connection_tcp_send_message::jk_ajp_common.c (1267): 03f0 57 F3 F4 EC CB E9 49 9B 59 44 AB FE 66 6E 67 DC - W.....I.YD..fng.
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_connection_tcp_get_message::jk_ajp_common.c (1462): received from ajp13 pos=0 len=3 max=8192
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_connection_tcp_get_message::jk_ajp_common.c (1462): 0000 06 1F FA 00 00 00 00 00 00 00 00 00 00 00 00 00 - ................
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [info] ajp_read_into_msg_buff::jk_ajp_common.c (1553): (main_ajp13_worker) receiving data from client failed. Connection aborted or network problems
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [info] ajp_process_callback::jk_ajp_common.c (2105): (main_ajp13_worker) Reading from client aborted or client network problems
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [info] ajp_service::jk_ajp_common.c (2778): (main_ajp13_worker) sending request to tomcat failed (unrecoverable), because of client read error (attempt=1)
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_reset_endpoint::jk_ajp_common.c (851): (main_ajp13_worker) resetting endpoint with socket 24 (socket shutdown)
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_abort_endpoint::jk_ajp_common.c (821): (main_ajp13_worker) aborting endpoint with socket 24
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] jk_shutdown_socket::jk_connect.c (932): About to shutdown socket 24 [127.0.0.1:38678 -> 127.0.0.1:8009]
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] jk_shutdown_socket::jk_connect.c (1005): shutting down the read side of socket 24 [127.0.0.1:38678 -> 127.0.0.1:8009]
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] jk_shutdown_socket::jk_connect.c (1016): Shutdown socket 24 [127.0.0.1:38678 -> 127.0.0.1:8009] and read 265 lingering bytes in 0 sec.
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [debug] ajp_done::jk_ajp_common.c (3287): recycling connection pool for worker main_ajp13_worker and socket -1
[Mon Jun 24 14:37:28 2019] [17826:139891516274752] [info] jk_handler::mod_jk.c (2988): Aborting connection for worker=main_ajp13_worker
[Mon Jun 24 14:38:24 2019] [17825:139891516274752] [debug] wc_shutdown::jk_worker.c (390): Shutting down worker main_ajp13_worker
[Mon Jun 24 14:38:24 2019] [17825:139891516274752] [debug] jk_shm_close::jk_shm.c (745): Closed shared memory /var/log/httpd/mod_jk.shm.17823 childs=9
[Mon Jun 24 14:38:24 2019] [17877:139891516274752] [debug] wc_shutdown::jk_worker.c (390): Shutting down worker main_ajp13_worker
Observando a documentação do AJPv3 , vemos uma explicação dos pedaços de 8k:
O contêiner solicita mais dados da solicitação (se o corpo for muito grande para caber no primeiro pacote enviado ou quando a solicitação for dividida). O servidor enviará um pacote de corpo de volta com uma quantidade de dados que é o mínimo do request_length, o tamanho máximo do corpo de envio (8186 (8 Kbytes - 6)) e o número de bytes realmente restantes para enviar do corpo da solicitação. Se não houver mais dados no corpo (ou seja, o contêiner do servlet está tentando ler além do final do corpo), o servidor enviará de volta um pacote "vazio", que é um pacote do corpo com um comprimento de carga 0. ( 0x12,0x34,0x00,0x00)
Nos logs acima, contei o número de vezes que o buffer de 8k foi preenchido novamente (36.856 vezes) e descobri que 301.703.216 bytes (287 MB ou 62%) do arquivo foram transferidos antes que o erro ocorresse. Em tentativas repetidas, essa porcentagem varia muito, mas geralmente é superior a 300 MB e, às vezes, completa todos os 440 MB com êxito.
Minha interpretação dos logs acima
O handshake SSL está sendo bem-sucedido e o erro ocorre após o HTTP 100 Continue, portanto, suspeito que o arquivo tenha sido enviado parcialmente antes das seguintes mensagens em mod_jk.log :
- o recebimento de dados do cliente falhou. Conexão abortada ou problemas de rede
- o envio da solicitação para o tomcat falhou (irrecuperável), devido a um erro de leitura do cliente (tentativa=1)
Como não há erros no catalina.out e o mod_jk está reclamando de um erro de leitura do cliente, acredito que isso exclui o próprio Tomcat (e o aplicativo da web, Artifactory ).
A maioria dos comentários que vejo com uma pesquisa no Google sobre as mensagens mod_jk acima afirmam que são erros transitórios de rede, mas não acredito nisso, pois são transferências de arquivos entre instâncias do EC2 na mesma região. Também estou interpretando isso como não tendo nada a ver com um tempo limite de conexão, pois isso (eu acho) produziria uma mensagem " Conexão expirada após n milissegundos ".
Alterando as configurações do mod_jk e do Tomcat
Coisas adicionais que eu tentei (essas mudanças são cumulativas, sem mudanças observáveis nos resultados):
- mod_jk atualizado para a versão mais recente (1.2.46)
- aumento do tamanho do pacote de trabalho de 8.196 (padrão) para 16.384
- aumento do tamanho do pacote de trabalho de 16.384 para 32.768
- aumentou JkShmSize de 1kB (padrão) para 1GB (oops, o valor é especificado em kB)
- diminuiu JkShmSize de 1 GB para 1 MB
- retornou o trabalhador JkShmSize ao padrão
- retornou o tamanho do pacote do trabalhador para o padrão
- aumento de tentativas do trabalhador de 2 (padrão) para 5
- diminuiu o reply_timeout de sempre (padrão) para 1000ms (isso na verdade causou alguns problemas com o serviço subjacente durante uploads bem-sucedidos)
- trabalhador retornado tenta voltar ao padrão
- retornou reply_timeout ao padrão
- mudou socket_keepalive de false (padrão) para true
- retornou socket_keepalive para o padrão
- alterou o protocolo do conector de org.apache.coyote.http11.Http11NioProtocol para org.apache.coyote.http11.Http11Nio2Protocol
- aumentou o tempo limite de conexão do Tomcat de 20 segundos (20.000) para 60 segundos (60.000)
- definir Tomcat connectionUploadTimeout para 60 segundos (60000)
- defina Tomcat disableUploadTimeout como false
- retornou connectionUploadTimeout e disableUploadTimeout aos padrões
- Transferência em partes forçada via -H "Transfer-Encoding:chunked"
- Defina as propriedades do trabalhador como: retries=2,socket_timeout=100,reply_timeout=30000
- Propriedades do trabalhador revertidas para os padrões.
- Adicionado +DisableReuse para JkOptions.
- JkOptions revertido para os padrões.
- Criação de workers separados para os hosts virtuais de acesso ao repositório e ao repositório.
- Removidas as entradas de vhost não SSL, deixando apenas os dois hosts virtuais SSL.
Nota: Separar os trabalhadores pelo menos segregou as mensagens do trabalhador, então eu sei agora que o erro está ocorrendo inteiramente no serviço da web de repositório, e não no serviço da web de acesso ao repositório:
[Sat Jun 29 16:16:55 2019] [6174:139690136819776] [info] ajp_read_into_msg_buff::jk_ajp_common.c (1550): (repo) receiving data from client failed. Connection aborted or network problems
[Sat Jun 29 16:16:55 2019] [6174:139690136819776] [info] ajp_process_callback::jk_ajp_common.c (2102): (repo) Reading from client aborted or client network problems
[Sat Jun 29 16:16:55 2019] [6174:139690136819776] [info] ajp_service::jk_ajp_common.c (2775): (repo) sending request to tomcat failed (unrecoverable), because of client read error (attempt=1)
[Sat Jun 29 16:16:55 2019] [6174:139690136819776] [info] jk_handler::mod_jk.c (2984): Aborting connection for worker=repo
Alterando o cliente HTTP(S)
Consegui reproduzir o problema usando wget em vez de curl, o que significa que esse não é um problema esotérico com o software cliente:
-bash-4.2$ wget -v --tries=1 --user=backup --password=PASSWORD --header=X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3 --header=X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10 --method=PUT --body-file=snapshot-20190612.zip https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
--2019-06-24 19:59:16-- https://repo.mydomain.org/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
Resolving repo.mydomain.org (repo.mydomain.org)... 123.222.100.132
Connecting to repo.mydomain.org (repo.mydomain.org)|123.222.100.132|:443... connected.
HTTP request sent, awaiting response... 401 401
Authentication selected: Basic realm="Artifactory Realm"
Reusing existing connection to repo.mydomain.org:443.
Giving up.
Testando no localhost
Testei o upload de localhost no servidor de arquivamento ( https://localhost:443/ ), que requer a adição do -k
sinalizador para curl para ignorar a validação do nome do host do certificado e o upload é sempre bem-sucedido (11 segundos). No entanto, adicionar --limit-rate 20M
faz com que o upload falhe todas as vezes (funciona com --limit-rate 25M
).
Ignorar Apache HTTPD, mod_jk e SSL funciona
Os uploads sempre são bem-sucedidos se eu ignorar o Apache HTTPD, mod_jk e SSL e passar diretamente pela porta 8080:
[ec2-user@repo ~]$ curl -v -k --limit-rate 1024K -u backup:PASSWORD -H X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3 -H X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10 -T snapshot-20190612.zip http://localhost:8080/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
* Server auth using Basic with user 'backup'
> PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1
> Host: localhost:8080
> Authorization: Basic DELETED_TOKEN
> User-Agent: curl/7.61.1
> Accept: */*
> X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3
> X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10
> Content-Length: 496278603
> Expect: 100-continue
>
< HTTP/1.1 100
* We are completely uploaded and fine
< HTTP/1.1 201
< Server: Artifactory/5.3.2
...
Ignorar SSL apenas falha
Passando pelo Apache HTTPD + mod_jk sem SSL, ainda consigo replicar o problema. Isso significa que o problema não tem nada a ver com SSL.
[ec2-user@repo ~]$ curl -v -k --limit-rate 1024K -u backup:PASSWORD -H X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3 -H X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10 -T snapshot-20190612.zip http://localhost/resource-snapshot/snapshot/20190612/snapshot-20190612.zip
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
* Server auth using Basic with user 'backup'
> PUT /resource-snapshot/snapshot/20190612/snapshot-20190612.zip HTTP/1.1
> Host: localhost:8080
> Authorization: Basic DELETED_TOKEN
> User-Agent: curl/7.61.1
> Accept: */*
> X-Checksum-Sha1:6b9525bf9bb572ec7f70d0d474cb6533e0091bd3
> X-Checksum-Md5:6da0c444ad1b53dc1cf06784e204cb10
> Content-Length: 496278603
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Ignorando HTTPD e mod_jk
I tried bypassing Apache HTTPD and mod_jk by uploading directly to Tomcat via port 8443 (still using SSL). These requests succeeded every time. This leads me to believe that the problem is with Apache HTTPD and/or mod_jk, and is not caused by "transient network errors" or a firewall.
TL;DR
The upload failures only occur when going through HTTPD + mod_jk, not when communicating with Tomcat directly. SSL has been eliminated as a potential source of the problem. This is not a network problem because the error is reproducible when uploading from localhost. The issue appears to be related only to the amount of time it takes to transfer the file, not the size of the file.
Minha pergunta
Existem opções de configuração que eu deveria procurar para HTTPD e/ou mod_jk , workers.properties ou mesmo Tomcat para eliminar esses problemas?
Como alternativa, existe alguma maneira de dizer ao mod_jk para tentar lidar com essas recargas de buffer com falha de maneira tolerante a falhas, para que os uploads possam ser concluídos com êxito?
Você tentou fazer o downgrade do Apache/modSSL para 2.4.38 ?
Tivemos problemas de upload de arquivos desde que atualizamos para 2.4.39.