我在中等繁忙的生产服务器上部署 postgresql 时遇到意外的系统崩溃问题(Django 1.8 应用程序部署在带有gunicorn
应用程序服务器和nginx
反向代理的 DO droplet 上)。
我是一名偶然的 DBA,需要专家的意见来拼凑正在发生的事情。我现在不知道;我将在下面提供各种日志数据:
整个应用程序崩溃 - new relic 最终显示以下错误:
django.db.utils:OperationalError: ERROR: no more connections allowed (max_client_conn)
完整的堆栈跟踪是:
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gevent/baseserver.py", line 26, in _handle_and_close_when_done
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 155, in handle
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 56, in handle
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 114, in handle_request
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 704, in __iter__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 1080, in __call__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 108, in get_response
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/framework_django.py", line 228, in wrapper
File "/home/ubuntu/app/mproject/middleware/mycustommiddleware.py", line 6, in process_request
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/functional.py", line 225, in inner
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/functional.py", line 376, in _setup
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 22, in <lambda>
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 10, in get_user
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/__init__.py", line 174, in get_user
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/backends.py", line 93, in get_user
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 328, in get
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 144, in __len__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 965, in _fetch_all
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 238, in iterator
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 838, in execute_sql
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 164, in cursor
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 135, in _cursor
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 130, in ensure_connection
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 130, in ensure_connection
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 119, in connect
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 176, in get_new_connection
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/database_dbapi2.py", line 102, in __call__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 126, in connect
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/psycogreen/gevent.py", line 32, in gevent_wait_callback
新遗物也显示出巨大的峰值Postgres set
:
Postgresql 慢日志显示过高的 COMMIT 持续时间,如下所示:
2020-05-15 01:22:47.474 UTC [10722] ubuntu@myapp LOG: duration: 280668.902 ms statement: COMMIT
2020-05-15 01:22:47.474 UTC [11536] ubuntu@myapp LOG: duration: 49251.277 ms statement: COMMIT
2020-05-15 01:22:47.474 UTC [10719] ubuntu@myapp LOG: duration: 281609.394 ms statement: COMMIT
2020-05-15 01:22:47.474 UTC [10719] ubuntu@myapp LOG: could not send data to client: Broken pipe
2020-05-15 01:22:47.474 UTC [10719] ubuntu@myapp FATAL: connection to client lost
2020-05-15 01:22:47.475 UTC [10672] ubuntu@myapp LOG: duration: 285371.872 ms statement: COMMIT
Digital Ocean 的仪表板显示 和 的峰值Load Average
(Disk IO
但CPU
,memory usage
和disk usage
保持低位)。
该Load Average
峰值时的快照统计数据是:
syslog
有以下错误日志的无数实例:
May 15 01:22:46 main-app gunicorn[10779]: error: [Errno 111] Connection refused
May 15 01:22:46 main-app gunicorn[10779]: [2020-05-15 01:22:46 +0000] [10874] [ERROR] Socket error processing request.
May 15 01:22:46 main-app gunicorn[10779]: Traceback (most recent call last):
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 66, in handle
May 15 01:22:46 main-app gunicorn[10779]: six.reraise(*sys.exc_info())
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 56, in handle
May 15 01:22:46 main-app gunicorn[10779]: self.handle_request(listener_name, req, client, addr)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
May 15 01:22:46 main-app gunicorn[10779]: addr)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 129, in handle_request
May 15 01:22:46 main-app gunicorn[10779]: six.reraise(*sys.exc_info())
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 114, in handle_request
May 15 01:22:46 main-app gunicorn[10779]: for item in respiter:
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 704, in __iter__
May 15 01:22:46 main-app gunicorn[10779]: for item in self.generator:
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 1080, in __call__
May 15 01:22:46 main-app gunicorn[10779]: self.start_response)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
May 15 01:22:46 main-app gunicorn[10779]: response = self.get_response(request)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 218, in get_response
May 15 01:22:46 main-app gunicorn[10779]: response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/framework_django.py", line 448, in wrapper
May 15 01:22:46 main-app gunicorn[10779]: return _wrapped(*args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/framework_django.py", line 441, in _wrapped
May 15 01:22:46 main-app gunicorn[10779]: return wrapped(request, resolver, exc_info)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 256, in handle_uncaught_exception
May 15 01:22:46 main-app gunicorn[10779]: 'request': request
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/logging/__init__.py", line 1193, in error
May 15 01:22:46 main-app gunicorn[10779]: self._log(ERROR, msg, args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/logging/__init__.py", line 1286, in _log
May 15 01:22:46 main-app gunicorn[10779]: self.handle(record)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/logging/__init__.py", line 1296, in handle
May 15 01:22:46 main-app gunicorn[10779]: self.callHandlers(record)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
May 15 01:22:46 main-app gunicorn[10779]: hdlr.handle(record)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/logging/__init__.py", line 759, in handle
May 15 01:22:46 main-app gunicorn[10779]: self.emit(record)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/log.py", line 129, in emit
May 15 01:22:46 main-app gunicorn[10779]: self.send_mail(subject, message, fail_silently=True, html_message=html_message)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/log.py", line 132, in send_mail
May 15 01:22:46 main-app gunicorn[10779]: mail.mail_admins(subject, message, *args, connection=self.connection(), **kwargs)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/function_trace.py", line 110, in literal_wrapper
May 15 01:22:46 main-app gunicorn[10779]: return wrapped(*args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/__init__.py", line 98, in mail_admins
May 15 01:22:46 main-app gunicorn[10779]: mail.send(fail_silently=fail_silently)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/function_trace.py", line 110, in literal_wrapper
May 15 01:22:46 main-app gunicorn[10779]: return wrapped(*args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/message.py", line 303, in send
May 15 01:22:46 main-app gunicorn[10779]: return self.get_connection(fail_silently).send_messages([self])
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 100, in send_messages
May 15 01:22:46 main-app gunicorn[10779]: new_conn_created = self.open()
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 58, in open
May 15 01:22:46 main-app gunicorn[10779]: self.connection = connection_class(self.host, self.port, **connection_params)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/smtplib.py", line 256, in __init__
May 15 01:22:46 main-app gunicorn[10779]: (code, msg) = self.connect(host, port)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/smtplib.py", line 316, in connect
May 15 01:22:46 main-app gunicorn[10779]: self.sock = self._get_socket(host, port, self.timeout)
May 15 01:22:46 main-app gunicorn[10779]: File "/usr/lib/python2.7/smtplib.py", line 291, in _get_socket
May 15 01:22:46 main-app gunicorn[10779]: return socket.create_connection((host, port), timeout)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gevent/socket.py", line 96, in create_connection
May 15 01:22:46 main-app gunicorn[10779]: sock.connect(sa)
May 15 01:22:46 main-app gunicorn[10779]: File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gevent/_socket2.py", line 244, in connect
May 15 01:22:46 main-app gunicorn[10779]: raise error(err, strerror(err))
一些行来自vmtstat 1
:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 4698028 280600 16692992 0 0 0 14 0 0 13 1 87 0 0
3 0 0 4697204 280600 16693016 0 0 0 24 8294 8111 12 0 88 0 0
3 0 0 4698148 280600 16693044 0 0 0 204 8533 9047 15 0 85 0 0
4 0 0 4697628 280600 16693092 0 0 0 76 8695 8775 17 1 83 0 0
3 0 0 4697288 280600 16693140 0 0 8 16 9578 9720 19 1 80 0 0
3 0 0 4695768 280600 16693200 0 0 0 152 8594 8073 16 1 83 0 0
6 0 0 4686044 280600 16693272 0 0 0 32 7998 7706 12 1 87 0 0
5 0 0 4681876 280600 16693308 0 0 0 224 9130 8810 14 1 85 0 0
这就是我所拥有的。我根本无法拼凑出可能发生的事情。现在几乎每天发生一次。有专家能解释一下吗?提前致谢!
注意:在必要的情况下,SELECT version();
产量:
PostgreSQL 9.6.16 on x86_64-pc-linux-gnu (Ubuntu 9.6.16-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
4.6 分钟的 COMMIT 几乎可以确定您的 IO 系统要么严重不足,要么完全被淹没。这将解释所有其他问题。如果连接不能提交,它们通常不会被关闭,所以它们会建立起来。最终你会得到“错误:不允许更多的连接”。如果客户端放弃等待提交响应并断开连接,您可能会在服务器上收到“致命:与客户端的连接丢失”。请注意,这里没有 PostgreSQL“崩溃”,它仍在运行并(缓慢地)服务连接,直到超过 max_connections,此时它仍在为现有连接提供服务。
COMMIT 会执行 fsync 以外的其他事情,偶尔但很少有这些事情会导致问题。因此,为了更加明确,您可以对 pg_stat_activity 进行采样,并查看在这些情节中提交的 wait_event 和 wait_event_type 是什么。如果 wait_event 是 'WALWriteLock' 或 'WALSync' 那是非常确定的。
如果您为服务级别>“沙鼠”的存储系统付费,我会说这完全落在了 Digital Ocean 的手中。
较长的提交时间表明数据库服务器上的 I/O 严重过载。机器似乎陷入了爬行,因此它无法及时响应连接请求。
检查数据库工作负载并查看导致过载的原因。