我注意到 Apache 有一个非常特殊的问题。我设置了非常多的虚拟主机 - 大约是 501。
在虚拟主机编号 493 之后开始出现问题。前 493 个虚拟主机按预期工作,但是一旦我添加虚拟主机编号 494,PHP 就会停止与内存缓存通信,并且每次读/写访问都会超时。
实际上,我使用 memcache 作为后端会话存储,所以,php 函数:
session_start();
只需在 30 秒后超时。
如果我删除 494 个虚拟主机中的随机一个并重新启动 apache,它会再次开始工作。
我已经将 ulimit 设置得非常高(65k),但它没有帮助。我试过完全关闭 ulimit,但没有运气。
你们有什么想法我还能尝试什么吗?
在我在浏览器中输入并等待 30 秒后,我尝试跟踪我连接到的 httpd 进程。
这是 strace 输出:
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
所以基本上apache卡在select()上,就是这样,它无限期地重复select()系统调用。
我想出的下一件事是 tcpdump,看看这个包是否真的从 apache 中通过,并且确实如此:
22:11:28.366677 IP6 ::1.51404 > ::1.11914: Flags [S], seq 2899674987, win 32752, options [mss 16376,sackOK,TS val 1384759049 ecr 0,nop,wscale 9], length 0
22:11:28.366697 IP6 ::1.11914 > ::1.51404: Flags [S.], seq 2034630080, ack 2899674988, win 32728, options [mss 16376,sackOK,TS val 1384759049 ecr 1384759049,nop,wscale 9], length 0
22:11:28.366709 IP6 ::1.51404 > ::1.11914: Flags [.], ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366752 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 1:41, ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 40
22:11:28.366758 IP6 ::1.11914 > ::1.51404: Flags [.], ack 41, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366768 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 41:90, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759049], length 49
22:11:28.366772 IP6 ::1.11914 > ::1.51404: Flags [.], ack 90, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.366779 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 90:122, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 32
22:11:28.366783 IP6 ::1.11914 > ::1.51404: Flags [.], ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367063 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 1:12, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 11
22:11:28.367070 IP6 ::1.51404 > ::1.11914: Flags [.], ack 12, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367266 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 12:20, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 8
22:11:28.367275 IP6 ::1.51404 > ::1.11914: Flags [.], ack 20, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367477 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 20:25, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 5
22:11:28.367489 IP6 ::1.51404 > ::1.11914: Flags [.], ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367629 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 122:181, ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 59
22:11:28.367859 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 25:33, ack 181, win 64, options [nop,nop,TS val 1384759051 ecr 1384759050], length 8
22:11:28.367869 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 181:230, ack 33, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 49
22:11:28.368102 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 33:41, ack 230, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 8
22:11:28.368138 IP6 ::1.51404 > ::1.11914: Flags [F.], seq 230, ack 41, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368195 IP6 ::1.11914 > ::1.51404: Flags [F.], seq 41, ack 231, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368206 IP6 ::1.51404 > ::1.11914: Flags [.], ack 42, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
当我向包含 session_start() 的页面发出 curl 调用时,我做的下一件事是 Apache 进程的 GDB,这是输出:
232 *(*new)->local_addr = *sock->local_addr;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238 (*new)->local_addr->pool = connection_context;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238 (*new)->local_addr->pool = connection_context;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
245 else if (sock->local_addr->sa.sin.sin_family == AF_INET6) {
246 (*new)->local_addr->ipaddr_ptr = &(*new)->local_addr->sa.sin6.sin6_addr;
249 (*new)->remote_addr->port = ntohs((*new)->remote_addr->sa.sin.sin_port);
250 if (sock->local_port_unknown) {
256 if (apr_is_option_set(sock, APR_TCP_NODELAY) == 1) {
257 apr_set_option(*new, APR_TCP_NODELAY, 1);
266 if (sock->local_interface_unknown ||
267 !memcmp(sock->local_addr->ipaddr_ptr,
266 if (sock->local_interface_unknown ||
276 (*new)->local_interface_unknown = 1;
293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
292 (*new)->inherit = 0;
293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
296 }
unixd_accept (accepted=0x7fff14ecddf0, lr=0x7fe93a905aa8, ptrans=<value optimized out>) at /usr/src/debug/httpd-2.2.15/os/unix/unixd.c:507
507 if (status == APR_SUCCESS) {
508 *accepted = csd;
649 }
child_main (child_num_arg=<value optimized out>) at /usr/src/debug/httpd-2.2.15/server/mpm/prefork/prefork.c:650
650 SAFE_ACCEPT(accept_mutex_off()); /* unlock after "accept" */
652 if (status == APR_EGENERAL) {
656 else if (status != APR_SUCCESS) {
665 current_conn = ap_run_create_connection(ptrans, ap_server_conf, csd, my_child_num, sbh, bucket_alloc);
666 if (current_conn) {
667 ap_process_connection(current_conn, csd);
在这个位置有一个很大的停顿(~30 秒),直到 php 超时。在那之后,我得到了这个:
668 ap_lingering_close(current_conn);
676 if (ap_mpm_pod_check(pod) == APR_SUCCESS) { /* selected as idle? */
680 ap_scoreboard_image->global->running_generation) { /* restart? */
679 else if (ap_my_generation !=
680 ap_scoreboard_image->global->running_generation) { /* restart? */
679 else if (ap_my_generation !=
551 while (!die_now && !shutdown_pending) {
559 apr_pool_clear(ptrans);
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
566 (void) ap_update_child_status(sbh, SERVER_READY, (request_rec *) NULL);
573 SAFE_ACCEPT(accept_mutex_on());
575 if (num_listensocks == 1) {
最奇怪的是我无法在另一台机器上重现它。相同的操作系统,相同的软件包,相同的配置(傀儡)相同的内核,不同的硬件。
经过几周的调试和注意问题,我终于偶然发现了一条消息:
我会尝试这个修复,但是天哪,天哪,为什么 PHP 人要这样做?这太丑陋了,硬编码 nofile 限制是完全破坏的设计。更不用说如果这是解决方案,强迫我重新编译每个 PHP 次要版本和安全补丁并维护我自己的包是一个很大的麻烦。
编辑:经过更广泛的调试后,似乎不仅是 PHP 被“设计破坏”,memcache 扩展本身也存在很多问题。
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629896
https://bugs.php.net/bug.php?id=59876
错误已经打开了很长一段时间,但没有任何反应。我想应该只是转储 memcache 扩展并找到独立于它的解决方案:-/