我正在尝试在大型数据库上运行查询而不终止与服务器的连接。
我在具有 16gb 内存和大约 40gb 可用磁盘的 mac 上使用 Postgres 12.1。数据库为 78gb,pg_database_size
根据 do 最大表为 20gb pg_total_relation_size
。
无论我运行哪个非工作查询,我(从日志中)得到的错误是:
server process (PID xxx) was terminated by signal 9: Killed: 9
在 VS 代码中,错误是"lost connection to server"
.
两个不起作用的例子是:
UPDATE table
SET column = NULL
WHERE column = 0;
select columnA
from table1
where columnA NOT IN (
select columnB
from table2
);
LIMIT
我可以通过添加例如1,000,000来运行一些查询(例如上面的查询) 。
我怀疑由于临时文件而导致磁盘不足,但在日志中(带有log_temp_files = 0
),我看不到任何正在写入的临时文件。
我尝试增加和减少work_mem
, maintenance_work_mem
,shared_buffers
和temp_buffers
. 没有工作,性能大致相同。
我尝试删除所有索引,这降低了某些查询的“成本”,但它们仍然终止了与服务器的连接。
什么可能是我的问题,我该如何进一步解决这个问题?
此外,我读到来自超时查询的临时文件存储在 pqsql_tmp 中。我检查了文件夹,它没有很大的文件。临时文件可以存储在其他地方吗?
运行失败查询的 posgtres 日志如下所示:
2020-02-17 09:31:08.626 CET [94908] LOG: server process (PID xxx) was terminated by signal 9: Killed: 9
2020-02-17 09:31:08.626 CET [94908] DETAIL: Failed process was running: update table
set columnname = NULL
where columnname = 0;
2020-02-17 09:31:08.626 CET [94908] LOG: terminating any other active server processes
2020-02-17 09:31:08.626 CET [94919] WARNING: terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94919] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94919] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.626 CET [94914] WARNING: terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94914] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94914] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.629 CET [94908] LOG: all server processes terminated; reinitializing
2020-02-17 09:31:08.698 CET [94927] LOG: database system was interrupted; last known up at 2020-02-17 09:30:57 CET
2020-02-17 09:31:08.901 CET [94927] LOG: database system was not properly shut down; automatic recovery in progress
2020-02-17 09:31:08.906 CET [94927] LOG: invalid record length at 17/894C438: wanted 24, got 0
2020-02-17 09:31:08.906 CET [94927] LOG: redo is not required
EXPLAIN
在第二个示例查询上运行返回:
Seq Scan on gas_prices_all p (cost=459.93..5635583.33 rows=128975016 width=16)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on gas_station g (cost=0.00..423.14 rows=14714 width=16)
JIT:
Functions: 13
Options: Inlining true, Optimization true, Expressions true, Deforming true
FWIW 对于“成本”大约为零的查询,我得到了同样的错误。
更新:我发现了今天早些时候的崩溃报告:
Process: postgres [41042]
Path: /Users/USER/*/postgres
Identifier: postgres
Version: 0
Code Type: X86-64 (Native)
Parent Process: postgres [40107]
Responsible: postgres [41042]
User ID: 502
Date/Time: 2020-02-18 11:16:26.210 +0100
OS Version: Mac OS X 10.14.5 (18F132)
Report Version: 12
Anonymous UUID: F41CCD21-C558-6CB0-316D-D1FF3E279576
Sleep/Wake UUID: 5F08EAEC-373A-4D19-A243-E812E68D2697
Time Awake Since Boot: 1600000 seconds
Time Since Wake: 5700 seconds
System Integrity Protection: enabled
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes: 0x0000000000000032, 0x00000001044c4060
Exception Note: EXC_CORPSE_NOTIFY
Termination Reason: Namespace CODESIGNING, Code 0x2
kernel messages:
VM Regions Near 0x1044c4060:
__LINKEDIT 0000000104466000-00000001044c4000 [ 376K] r--/rw- SM=COW /Users/USER/*/*.dylib
--> VM_ALLOCATE 00000001044c4000-00000001044c5000 [ 4K] r-x/rwx SM=ZER
VM_ALLOCATE 00000001044c5000-00000001044c6000 [ 4K] rw-/rwx SM=ZER
Application Specific Information:
crashed on child side of fork pre-exec
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 ??? 0x00000001044c4060 0 + 4367073376
1 postgres 0x0000000103798851 ExecAgg + 1185 (executor.h:308)
2 postgres 0x0000000103785d0f standard_ExecutorRun + 287 (execMain.c:1652)
3 postgres 0x0000000103789c4e ParallelQueryMain + 670 (execParallel.c:1405)
4 postgres 0x00000001036520ce ParallelWorkerMain + 1054 (parallel.c:1434)
5 postgres 0x000000010385bec5 StartBackgroundWorker + 533 (bgworker.c:834)
6 postgres 0x000000010386acb9 maybe_start_bgworkers + 1161
7 postgres 0x00000001038696c5 sigusr1_handler + 357 (postmaster.c:5167)
8 libsystem_platform.dylib 0x00007fff76195b5d _sigtramp + 29
9 ??? 0x0000000000003200 0 + 12800
10 postgres 0x00000001037d54ae main + 1678
11 libdyld.dylib 0x00007fff75faa3d5 start + 1
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x00000001044c4060 rbx: 0x00007f98b9045600 rcx: 0x00000001072c48d8 rdx: 0x00007ffeec6242f4
rdi: 0x00007f98b9047520 rsi: 0x00007f98b9045fc0 rbp: 0x00007ffeec624320 rsp: 0x00007ffeec624288
r8: 0x00000000014bafca r9: 0xffffffff00000000 r10: 0x00000001072c48d0 r11: 0x0000000000000005
r12: 0x0000000103c51220 r13: 0x00007f98b9047510 r14: 0x00007f98b9045fc0 r15: 0x00007f98b90459a0
rip: 0x00000001044c4060 rfl: 0x0000000000010246 cr2: 0x00000001044c4060
Logical CPU: 0
Error Code: 0x00000015
Trap Number: 14
Binary Images:
**lots of stuff**
External Modification Summary:
Calls made by other processes targeting this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by all processes on this machine:
task_for_pid: 134420121
thread_create: 0
thread_set_state: 0
VM Region Summary:
ReadOnly portion of Libraries: Total=398.8M resident=0K(0%) swapped_out_or_unallocated=398.8M(100%)
Writable regions: Total=4.2G written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=4.2G(100%)
VIRTUAL REGION
REGION TYPE SIZE COUNT (non-coalesced)
=========== ======= =======
Kernel Alloc Once 8K 1
MALLOC 81.0M 19
MALLOC guard page 16K 3
MALLOC_LARGE (reserved) 72K 2 reserved VM address space (unallocated)
STACK GUARD 56.0M 1
Stack 8192K 1
VM_ALLOCATE 4.1G 4
__DATA 18.0M 200
__FONT_DATA 4K 1
__LINKEDIT 232.2M 11
__TEXT 166.6M 199
__UNICODE 564K 1
mapped file 64K 1
shared memory 552K 6
=========== ======= =======
TOTAL 4.7G 450
TOTAL, minus reserved VM space 4.7G 450
我可以在日志中找到两件可能很有趣的事情:
2020-02-18 19:01:52.044375+0100 localhost kernel[0]: CODE SIGNING: process 51528[postgres]: rejecting invalid page at address 0x1100c1000 from offset 0x0 in file "<nil>" (cs_mtime:0.0 == mtime:0.0) (signed:0 validated:0 tainted:0 nx:0 wpmapped:1 dirty:0 depth:0)
2020-02-18 19:01:52.044805+0100 localhost ReportCrash[52560]: unknown nested kcdata type: 0x1004, size: 2108
2020-02-18 19:01:55.268060+0100 localhost ReportCrash[52560]: (CrashReporterSupport) Saved crash report for postgres[51528] version 0 to postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.273159+0100 localhost ReportCrash[52560]: (CrashReporterSupport) Removing excessive log: postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.274208+0100 localhost ReportCrash[52560]: shouldDisplayUnexpectedlyQuitNotification is NO
更新:我运行了第二个 exaple 查询(上面),LIMIT
直到它从工作(LIMIT
2,200,000)到终止与服务器的连接(LIMIT
2,300,000)。EXPLAIN ANALYZE
关于LIMIT
2,200,000 的查询是:
Limit (cost=459.93..96581.42 rows=2200000 width=16) (actual time=13.228..38573.440 rows=2200000 loops=1)
-> Seq Scan table1 (cost=459.93..5635583.33 rows=128975016 width=16) (actual time=13.227..38374.070 rows=2200000 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 139729529
SubPlan 1
-> Seq Scan on table2 (cost=0.00..423.14 rows=14714 width=16) (actual time=0.350..6.925 rows=14714 loops=1)
Planning Time: 0.138 ms
Execution Time: 38685.762 ms
EXPLAIN
在LIMIT
2,300,000 上是:(EXPLAIN ANALYZE
在这里崩溃)
Limit (cost=459.93..100950.58 rows=2300000 width=16)
-> Seq Scan on table1 (cost=459.93..5635583.33 rows=128975016 width=16)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on table2 (cost=0.00..423.14 rows=14714 width=16)
JIT:
Functions: 14
Options: Inlining false, Optimization false, Expressions true, Deforming true
我正在读这篇文章,因为 JIT 在这里踢,这是有道理的给jit_above_cost
定(默认)设置为 100,000。所以也许问题出在 JIT 上?
更新 2:使用jit=off
第二个示例查询有效,并且也是 ca。快两倍。这怎么可能以及什么可能导致 JIT 在我的系统上出现问题?
“信号 9”表明这几乎可以肯定是内存不足的杀手。您可以查看
/var/log/kern.log
或在您的系统上调用任何内容进行验证。临时文件上的磁盘空间不足不会导致此错误。您应该
EXPLAIN
对查询进行查询以查看它正在使用什么计划。我通常会猜这是一个错乱的 HashAgg,除了你的第一个查询不应该做任何聚合、哈希或其他方式。您也可以尝试关闭过度使用。这样,您的进程就会出现普通的内存不足错误,然后他们可以有序地报告这些错误,而不是仅仅从轨道上被核弹而没有机会报告问题。有适用于 Linux 的说明,也许您可以将它们调整为适用于 MacOS。如果您可以在内存使用量开始大量增长时捕获该进程,但在它被杀死之前,您可以获得有关内存分配位置的报告。
感谢您的所有建议和意见。
我已经玩了足够长的时间来得出结论,禁用 JIT (
jit = off
) 解决了这个问题。我可以看到 JIT 仅在 PostgreSQL 11 中引入,这可能就是为什么网上没有太多关于它可能导致的问题的原因。